├── Implementing the batch layer of lambda architecture using S3, Redshift and Apache Kafka.ipynb
├── Implementing the serving layer of lambda architecture using Redshift.ipynb
├── Implementing the speed layer of lambda architecture using Structured Spark Streaming
├── Ingesting realtime tweets using Apache Kafka, Tweepy and Python.ipynb
├── Lambda_architecture-2.png
└── README.md
/Implementing the batch layer of lambda architecture using S3, Redshift and Apache Kafka.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Implementing the batch layer of lambda architecture using S3, Redshift and Apache Kafka\n",
8 | "\n",
9 | "### Purpose:\n",
10 | "- store all the tweets that were produced by Kafka Producer into S3\n",
11 | "- export them into Redshift\n",
12 | "- perform aggregation on the tweets to get the desired output of batch layer\n",
13 | "- achieve this by: \n",
14 | " - every couple of hours get the latest unseen tweets produced by the Kafka Producer and store them into a S3 archive\n",
15 | " - every night run a sql query to compute the result of batch layer\n",
16 | "\n",
17 | "### Contents: \n",
18 | "- [Defining the Kafka consumer](#1)\n",
19 | "- [Defining a Amazon Web Services S3 storage client](#2)\n",
20 | "- [Writing the data into a S3 bucket](#3)\n",
21 | "- [Exporting data from S3 bucket to Amazon Redshift using COPY command](#4)\n",
22 | "- [Aggregating \"raw\" tweets in Redshift](#5)\n",
23 | "- [Deployment](#6)"
24 | ]
25 | },
26 | {
27 | "cell_type": "markdown",
28 | "metadata": {},
29 | "source": [
30 | "### Required libraries"
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": 1,
36 | "metadata": {},
37 | "outputs": [],
38 | "source": [
39 | "from kafka import KafkaConsumer\n",
40 | "from io import StringIO\n",
41 | "import boto3\n",
42 | "import time\n",
43 | "import random"
44 | ]
45 | },
46 | {
47 | "cell_type": "markdown",
48 | "metadata": {},
49 | "source": [
50 | "\n",
51 | "### Defining the Kafka consumer\n",
52 | "- setting the location of Kafka Broker\n",
53 | "- specifying the group_id and consumer_timeout\n",
54 | "- subsribing to a topic"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": 2,
60 | "metadata": {},
61 | "outputs": [],
62 | "source": [
63 | "consumer = KafkaConsumer(\n",
64 | " bootstrap_servers='localhost:9092',\n",
65 | " auto_offset_reset='latest', # Reset partition offsets upon OffsetOutOfRangeError\n",
66 | " group_id='test', # must have a unique consumer group id \n",
67 | " consumer_timeout_ms=1000) \n",
68 | " # How long to listen for messages - we do it for 10 seconds \n",
69 | " # because we poll the kafka broker only each couple of hours\n",
70 | "\n",
71 | "consumer.subscribe('tweets-lambda1')"
72 | ]
73 | },
74 | {
75 | "cell_type": "markdown",
76 | "metadata": {},
77 | "source": [
78 | "\n",
79 | "### Defining a Amazon Web Services S3 storage client\n",
80 | "- setting the autohrizaition and bucket"
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": 3,
86 | "metadata": {},
87 | "outputs": [],
88 | "source": [
89 | "s3_resource = boto3.resource(\n",
90 | " 's3',\n",
91 | " aws_access_key_id='AKIAIXUPHT6ERRMQYINQ',\n",
92 | " aws_secret_access_key='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',\n",
93 | ")\n",
94 | "\n",
95 | "s3_client = s3_resource.meta.client\n",
96 | "bucket_name = 'lambda-architecture123'\n"
97 | ]
98 | },
99 | {
100 | "cell_type": "markdown",
101 | "metadata": {},
102 | "source": [
103 | "\n",
104 | "### Writing the data into a S3 bucket\n",
105 | "- polling the Kafka Broker\n",
106 | "- aggregating the latest messages into a single object in the bucket\n",
107 | "\n"
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": 4,
113 | "metadata": {},
114 | "outputs": [],
115 | "source": [
116 | "def store_twitter_data(path):\n",
117 | " csv_buffer = StringIO() # S3 storage is object storage -> our document is just a large string\n",
118 | "\n",
119 | " for message in consumer: # this acts as \"get me an iterator over the latest messages I haven't seen\"\n",
120 | " csv_buffer.write(message.value.decode() + '\\n') \n",
121 | "# print(message)\n",
122 | " s3_resource.Object(bucket_name,path).put(Body=csv_buffer.getvalue())"
123 | ]
124 | },
125 | {
126 | "cell_type": "markdown",
127 | "metadata": {},
128 | "source": [
129 | "\n",
130 | "### Exporting data from S3 bucket to Amazon Redshift using COPY command\n",
131 | "- authenticate and create a connection using psycopg module\n",
132 | "- export data using COPY command from S3 to Redshift \"raw\" table"
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": 19,
138 | "metadata": {},
139 | "outputs": [],
140 | "source": [
141 | "import psycopg2\n",
142 | "config = { 'dbname': 'lambda', \n",
143 | " 'user':'dorian',\n",
144 | " 'pwd':'Demo1234',\n",
145 | " 'host':'data-warehouse.c3glymsgdgty.us-east-1.redshift.amazonaws.com',\n",
146 | " 'port':'5439'\n",
147 | " }\n",
148 | "conn = psycopg2.connect(dbname=config['dbname'], host=config['host'], \n",
149 | " port=config['port'], user=config['user'], \n",
150 | " password=config['pwd'])"
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": 6,
156 | "metadata": {},
157 | "outputs": [],
158 | "source": [
159 | "def copy_files(conn, path):\n",
160 | " curs = conn.cursor()\n",
161 | " curs.execute(\"\"\" \n",
162 | " copy \n",
163 | " batch_raw\n",
164 | " from \n",
165 | " 's3://lambda-architecture123/\"\"\" + path + \"\"\"' \n",
166 | " access_key_id 'AKIAIXUPHT6ERRMQYINQ'\n",
167 | " secret_access_key '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'\n",
168 | " delimiter ';'\n",
169 | " region 'eu-central-1'\n",
170 | " \"\"\")\n",
171 | " curs.close()\n",
172 | " conn.commit()\n"
173 | ]
174 | },
175 | {
176 | "cell_type": "markdown",
177 | "metadata": {},
178 | "source": [
179 | "### Computing the batch layer output\n",
180 | "- querying the raw tweets stored in redshift to get the desired batch layer output"
181 | ]
182 | },
183 | {
184 | "cell_type": "code",
185 | "execution_count": 7,
186 | "metadata": {},
187 | "outputs": [],
188 | "source": [
189 | "def compute_batch_layer(conn):\n",
190 | " curs = conn.cursor()\n",
191 | " curs.execute(\"\"\" \n",
192 | " drop table if exists batch_layer;\n",
193 | "\n",
194 | " with raw_dedup as (\n",
195 | " SELECT\n",
196 | " distinct id,created_at,followers_count,location,favorite_count,retweet_count\n",
197 | " FROM\n",
198 | " batch_raw\n",
199 | " ),\n",
200 | " batch_result as (\n",
201 | " SELECT\n",
202 | " location,\n",
203 | " count(id) as count_id,\n",
204 | " sum(followers_count) as sum_followers_count,\n",
205 | " sum(favorite_count) as sum_favorite_count,\n",
206 | " sum(retweet_count) as sum_retweet_count\n",
207 | " FROM\n",
208 | " raw_dedup\n",
209 | " group by \n",
210 | " location\n",
211 | " )\n",
212 | " select \n",
213 | " *\n",
214 | " INTO\n",
215 | " batch_layer\n",
216 | " FROM\n",
217 | " batch_result\"\"\")\n",
218 | " curs.close()\n",
219 | " conn.commit()"
220 | ]
221 | },
222 | {
223 | "cell_type": "code",
224 | "execution_count": 8,
225 | "metadata": {},
226 | "outputs": [],
227 | "source": [
228 | "# compute_batch_layer(conn)"
229 | ]
230 | },
231 | {
232 | "cell_type": "markdown",
233 | "metadata": {},
234 | "source": [
235 | "\n",
236 | "### Deployment \n",
237 | "- perform the task every couple of hours and wait in between"
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "execution_count": 9,
243 | "metadata": {},
244 | "outputs": [],
245 | "source": [
246 | "def periodic_work(interval):\n",
247 | " while True:\n",
248 | " path = 'tweets/'+ time.strftime(\"%Y/%m/%d/%H\") + '_tweets_' + str(random.randint(1,1000)) + '.log'\n",
249 | " store_twitter_data(path)\n",
250 | " copy_files(conn, path)\n",
251 | " #interval should be an integer, the number of seconds to wait\n",
252 | " time.sleep(interval)"
253 | ]
254 | },
255 | {
256 | "cell_type": "code",
257 | "execution_count": 10,
258 | "metadata": {},
259 | "outputs": [],
260 | "source": [
261 | "# periodic_work(60 * 60) ## 60 minutes !"
262 | ]
263 | },
264 | {
265 | "cell_type": "code",
266 | "execution_count": 22,
267 | "metadata": {},
268 | "outputs": [],
269 | "source": [
270 | "path = 'tweets/'+ time.strftime(\"%Y/%m/%d/%H\") + '_tweets_' + str(random.randint(1,1000)) + '.log'\n",
271 | "\n",
272 | "store_twitter_data(path)"
273 | ]
274 | },
275 | {
276 | "cell_type": "code",
277 | "execution_count": 23,
278 | "metadata": {},
279 | "outputs": [
280 | {
281 | "name": "stderr",
282 | "output_type": "stream",
283 | "text": [
284 | "ERROR:root:An unexpected error occurred while tokenizing input\n",
285 | "The following traceback may be corrupted or invalid\n",
286 | "The error message is: ('EOF in multi-line string', (1, 4))\n",
287 | "\n"
288 | ]
289 | },
290 | {
291 | "ename": "DatabaseError",
292 | "evalue": "SSL SYSCALL error: Operation timed out\n",
293 | "output_type": "error",
294 | "traceback": [
295 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
296 | "\u001b[0;31mDatabaseError\u001b[0m Traceback (most recent call last)",
297 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mcopy_files\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mconn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
298 | "\u001b[0;32m\u001b[0m in \u001b[0;36mcopy_files\u001b[0;34m(conn, path)\u001b[0m\n\u001b[1;32m 10\u001b[0m \u001b[0mdelimiter\u001b[0m \u001b[0;34m';'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0mregion\u001b[0m \u001b[0;34m'eu-central-1'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 12\u001b[0;31m \"\"\")\n\u001b[0m\u001b[1;32m 13\u001b[0m \u001b[0mcurs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mclose\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 14\u001b[0m \u001b[0mconn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcommit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
299 | "\u001b[0;31mDatabaseError\u001b[0m: SSL SYSCALL error: Operation timed out\n"
300 | ]
301 | }
302 | ],
303 | "source": [
304 | "copy_files(conn, path)"
305 | ]
306 | },
307 | {
308 | "cell_type": "code",
309 | "execution_count": 21,
310 | "metadata": {},
311 | "outputs": [],
312 | "source": [
313 | "# run at the end of the day\n",
314 | "compute_batch_layer(conn)"
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "execution_count": 14,
320 | "metadata": {},
321 | "outputs": [],
322 | "source": [
323 | "#conn.close()"
324 | ]
325 | }
326 | ],
327 | "metadata": {
328 | "kernelspec": {
329 | "display_name": "Python 3",
330 | "language": "python",
331 | "name": "python3"
332 | },
333 | "language_info": {
334 | "codemirror_mode": {
335 | "name": "ipython",
336 | "version": 3
337 | },
338 | "file_extension": ".py",
339 | "mimetype": "text/x-python",
340 | "name": "python",
341 | "nbconvert_exporter": "python",
342 | "pygments_lexer": "ipython3",
343 | "version": "3.6.1"
344 | }
345 | },
346 | "nbformat": 4,
347 | "nbformat_minor": 2
348 | }
349 |
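The COPY command above loads into a `batch_raw` table whose CREATE TABLE statement does not appear in this notebook. Below is a minimal sketch of what that staging table could look like; the column names and types are inferred from the deduplication query and from the semicolon-delimited records built by the Kafka producer, and the connection placeholders are assumptions rather than the author's actual setup.

```python
import psycopg2

# Assumed DDL for the batch_raw staging table targeted by copy_files().
# Note: the producer appends a trailing ';' to every record, so the real
# table (or the COPY options) may need to account for one extra empty field.
CREATE_BATCH_RAW = """
    create table if not exists batch_raw (
        id              bigint,
        created_at      timestamp,
        followers_count int,
        location        varchar(256),
        favorite_count  int,
        retweet_count   int
    );
"""

conn = psycopg2.connect(dbname='lambda', host='<redshift-endpoint>',  # placeholders
                        port='5439', user='<user>', password='<password>')
curs = conn.cursor()
curs.execute(CREATE_BATCH_RAW)
curs.close()
conn.commit()
conn.close()
```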
--------------------------------------------------------------------------------
/Implementing the serving layer of lambda architecture using Redshift.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Implementing the serving layer of lambda architecture using Redshift\n",
8 | "\n",
9 | "### Purpose:\n",
10 | "- merge the output of speed and batch layer aggregations\n",
11 | "- achieve this by: \n",
12 | " - every couple of hours run the re-computation\n",
13 | " - use the output of batch layer as base table\n",
14 | " - upsert the up-to-date values of speed layer into the base table \n",
15 | "\n",
16 | "### Contents: \n",
17 | "- [Creating the serving layer](#1)"
18 | ]
19 | },
20 | {
21 | "cell_type": "markdown",
22 | "metadata": {},
23 | "source": [
24 | "### Requirements"
25 | ]
26 | },
27 | {
28 | "cell_type": "code",
29 | "execution_count": 1,
30 | "metadata": {},
31 | "outputs": [],
32 | "source": [
33 | "import psycopg2"
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "metadata": {},
39 | "source": [
40 | "\n",
41 | "### Creating the serving layer\n",
42 | "- authenticate and create a connection using psycopg module\n",
43 | "- create and populate a temporary table with it's base being batch layer and upserting the speed layer\n",
44 | "- drop the current serving layer and use the above mentioned temporary table for serving layer (no downtime)"
45 | ]
46 | },
47 | {
48 | "cell_type": "code",
49 | "execution_count": 2,
50 | "metadata": {},
51 | "outputs": [],
52 | "source": [
53 | "config = { 'dbname': 'lambda', \n",
54 | " 'user':'dorian',\n",
55 | " 'pwd':'Demo1234',\n",
56 | " 'host':'data-warehouse.c3glymsgdgty.us-east-1.redshift.amazonaws.com',\n",
57 | " 'port':'5439'\n",
58 | " }\n",
59 | "conn = psycopg2.connect(dbname=config['dbname'], host=config['host'], \n",
60 | " port=config['port'], user=config['user'], \n",
61 | " password=config['pwd'])"
62 | ]
63 | },
64 | {
65 | "cell_type": "code",
66 | "execution_count": 3,
67 | "metadata": {},
68 | "outputs": [],
69 | "source": [
70 | "curs = conn.cursor()\n",
71 | "curs.execute(\"\"\" \n",
72 | " DROP TABLE IF EXISTS serving_layer_temp; \n",
73 | "\n",
74 | " SELECT \n",
75 | " *\n",
76 | " INTO \n",
77 | " serving_layer_temp\n",
78 | " FROM \n",
79 | " batch_layer ;\n",
80 | "\n",
81 | "\n",
82 | "\n",
83 | " UPDATE \n",
84 | " serving_layer_temp\n",
85 | " SET\n",
86 | " count_id = count_id + speed_layer.\"count(id)\",\n",
87 | " sum_followers_count = sum_followers_count + speed_layer.\"sum(followers_count)\",\n",
88 | " sum_favorite_count = sum_favorite_count + speed_layer.\"sum(favorite_count)\",\n",
89 | " sum_retweet_count = sum_retweet_count + speed_layer.\"sum(retweet_count)\"\n",
90 | " FROM\n",
91 | " speed_layer\n",
92 | " WHERE \n",
93 | " serving_layer_temp.location = speed_layer.location ;\n",
94 | "\n",
95 | "\n",
96 | "\n",
97 | " INSERT INTO \n",
98 | " serving_layer_temp\n",
99 | " SELECT \n",
100 | " * \n",
101 | " FROM \n",
102 | " speed_layer\n",
103 | " WHERE \n",
104 | " speed_layer.location \n",
105 | " NOT IN (\n",
106 | " SELECT \n",
107 | " DISTINCT location \n",
108 | " FROM \n",
109 | " serving_layer_temp \n",
110 | " ) ;\n",
111 | " \n",
112 | " \n",
113 | " drop table serving_layer ;\n",
114 | " \n",
115 | " alter table serving_layer_temp rename to serving_layer ; \n",
116 | " \n",
117 | "\"\"\")\n",
118 | "curs.close()\n",
119 | "conn.commit()\n",
120 | "conn.close()"
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": null,
126 | "metadata": {},
127 | "outputs": [],
128 | "source": []
129 | }
130 | ],
131 | "metadata": {
132 | "kernelspec": {
133 | "display_name": "Python 3",
134 | "language": "python",
135 | "name": "python3"
136 | },
137 | "language_info": {
138 | "codemirror_mode": {
139 | "name": "ipython",
140 | "version": 3
141 | },
142 | "file_extension": ".py",
143 | "mimetype": "text/x-python",
144 | "name": "python",
145 | "nbconvert_exporter": "python",
146 | "pygments_lexer": "ipython3",
147 | "version": "3.6.1"
148 | }
149 | },
150 | "nbformat": 4,
151 | "nbformat_minor": 2
152 | }
153 |
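The purpose section states that this re-computation runs every couple of hours, but no scheduling code appears in the notebook. Below is a minimal sketch of how it could be scheduled, reusing the periodic_work pattern from the other notebooks; rebuild_serving_layer and merge_sql are hypothetical names wrapping the config dict and the multi-statement SQL executed above.

```python
import time
import psycopg2

def rebuild_serving_layer(config, merge_sql):
    # open a fresh connection per run so one dropped connection does not kill the loop
    conn = psycopg2.connect(dbname=config['dbname'], host=config['host'],
                            port=config['port'], user=config['user'],
                            password=config['pwd'])
    curs = conn.cursor()
    curs.execute(merge_sql)  # the temp-table / upsert / rename statements from the cell above
    curs.close()
    conn.commit()
    conn.close()

def periodic_work(interval, config, merge_sql):
    # interval is the number of seconds to wait between re-computations
    while True:
        rebuild_serving_layer(config, merge_sql)
        time.sleep(interval)

# periodic_work(2 * 60 * 60, config, merge_sql)  # rebuild the serving layer every 2 hours
```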
--------------------------------------------------------------------------------
/Implementing the speed layer of lambda architecture using Structured Spark Streaming:
--------------------------------------------------------------------------------
1 | {"paragraphs":[{"text":"%md \n\n# Implementing the speed layer of lambda architecture using Structured Spark Streaming\n\n### Purpose: \n- provide analytics on real time data (\"intra day\") which batch layer cannot efficiently achieve\n- achieve this by:\n - ingest latest tweets from Kafka Producer and analtze only those for the current day \n - perform aggregations over the data to get the desired output of speed layer\n\n### Contents: \n- Configuring spark\n- Spark Structured Streaming\n - Input stage - defining the data source\n - Result stage - performing transformations on the stream\n - Output stage\n- Connecting to redshift cluster\n- Exporting data to Redshift","user":"anonymous","dateUpdated":"2017-11-11T15:55:51+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Implementing the speed layer of lambda architecture using Structured Spark Streaming
\n
Purpose:
\n
\n - provide analytics on real time data (“intra day”) which batch layer cannot efficiently achieve
\n - achieve this by:\n
\n - ingest latest tweets from Kafka Producer and analyze only those for the current day
\n - perform aggregations over the data to get the desired output of speed layer
\n
\n \n
\n
Contents:
\n
\n - Configuring spark
\n - Spark Structured Streaming\n
\n - Input stage - defining the data source
\n - Result stage - performing transformations on the stream
\n - Output stage
\n
\n \n - Connecting to redshift cluster
\n - Exporting data to Redshift
\n
\n
"}]},"apps":[],"jobName":"paragraph_1509909863130_-1135889452","id":"20171105-202423_1160220584","dateCreated":"2017-11-05T20:24:23+0100","dateStarted":"2017-11-11T15:55:51+0100","dateFinished":"2017-11-11T15:55:51+0100","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:273"},{"text":"%dep\n\nz.load(\"/Volumes/SD/Downloads/RedshiftJDBC42-1.2.10.1009.jar\")","user":"anonymous","dateUpdated":"2017-11-11T16:24:36+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"res0: org.apache.zeppelin.dep.Dependency = org.apache.zeppelin.dep.Dependency@7dfb0e5a\n"}]},"apps":[],"jobName":"paragraph_1509878460764_1962878483","id":"20171105-114100_1870675802","dateCreated":"2017-11-05T11:41:00+0100","dateStarted":"2017-11-11T16:24:36+0100","dateFinished":"2017-11-11T16:24:44+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:274"},{"text":"%md \n\n### Requirements","user":"anonymous","dateUpdated":"2017-11-11T14:30:05+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Requirements
\n
"}]},"apps":[],"jobName":"paragraph_1510394086125_-1443536682","id":"20171111-105446_1758804734","dateCreated":"2017-11-11T10:54:46+0100","dateStarted":"2017-11-11T14:30:05+0100","dateFinished":"2017-11-11T14:30:05+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:275"},{"text":"import org.apache.spark.sql.SparkSession\nimport org.apache.spark.sql.streaming.ProcessingTime\nimport java.util.concurrent._","user":"anonymous","dateUpdated":"2017-11-11T16:24:59+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"import org.apache.spark.sql.SparkSession\nimport org.apache.spark.sql.streaming.ProcessingTime\nimport java.util.concurrent._\n"}]},"apps":[],"jobName":"paragraph_1510394103383_-688043084","id":"20171111-105503_1480409678","dateCreated":"2017-11-11T10:55:03+0100","dateStarted":"2017-11-11T16:25:00+0100","dateFinished":"2017-11-11T16:25:18+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:276"},{"text":"%md \n\n### Configuring Spark\n- properly configuring spark for our workload\n- defining case class for tweets which will be used later on","user":"anonymous","dateUpdated":"2017-11-11T14:30:07+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Configuring Spark
\n
\n - properly configuring spark for our workload
\n - defining case class for tweets which will be used later on
\n
\n
"}]},"apps":[],"jobName":"paragraph_1509911366990_1694144500","id":"20171105-204926_42354742","dateCreated":"2017-11-05T20:49:26+0100","dateStarted":"2017-11-11T14:30:07+0100","dateFinished":"2017-11-11T14:30:07+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:277"},{"text":"Thread.sleep(5000)\n\nval spark = SparkSession\n .builder()\n .config(\"spark.sql.shuffle.partitions\",\"2\") // we are running this on my laptop\n .appName(\"Spark Structured Streaming example\")\n .getOrCreate()\n \ncase class tweet (id: String, created_at : String, followers_count: String, location : String, favorite_count : String, retweet_count : String)\n\nThread.sleep(5000)","user":"anonymous","dateUpdated":"2017-11-11T16:25:20+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@20d06238\ndefined class tweet\n"}]},"apps":[],"jobName":"paragraph_1509779101105_500060833","id":"20171104-080501_1626533215","dateCreated":"2017-11-04T08:05:01+0100","dateStarted":"2017-11-11T16:25:20+0100","dateFinished":"2017-11-11T16:25:33+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:278"},{"text":"%md \n\n### Input stage - defining the data source\n- using Kafka as data source we specify:\n - location of kafka broker\n - relevant kafka topic\n - how to treat starting offsets","user":"anonymous","dateUpdated":"2017-11-11T14:30:09+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Input stage - defining the data source
\n
\n - using Kafka as data source we specify:\n
\n - location of kafka broker
\n - relevant kafka topic
\n - how to treat starting offsets
\n
\n \n
\n
"}]},"apps":[],"jobName":"paragraph_1509881379573_-268499860","id":"20171105-122939_1561723250","dateCreated":"2017-11-05T12:29:39+0100","dateStarted":"2017-11-11T14:30:09+0100","dateFinished":"2017-11-11T14:30:09+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:279"},{"text":"var data_stream = spark\n .readStream // contantly expanding dataframe\n .format(\"kafka\")\n .option(\"kafka.bootstrap.servers\", \"localhost:9092\")\n .option(\"subscribe\", \"tweets-lambda1\")\n .option(\"startingOffsets\",\"latest\") //or latest\n .load()\n \n// note how similar API is to the batch version","user":"anonymous","dateUpdated":"2017-11-11T16:30:07+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"data_stream: org.apache.spark.sql.DataFrame = [key: binary, value: binary ... 5 more fields]\n"}]},"apps":[],"jobName":"paragraph_1509828449824_1485333558","id":"20171104-214729_426230385","dateCreated":"2017-11-04T21:47:29+0100","dateStarted":"2017-11-11T16:30:07+0100","dateFinished":"2017-11-11T16:30:14+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:280"},{"text":"data_stream.schema","user":"anonymous","dateUpdated":"2017-11-11T16:30:20+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"res8: org.apache.spark.sql.types.StructType = StructType(StructField(key,BinaryType,true), StructField(value,BinaryType,true), StructField(topic,StringType,true), StructField(partition,IntegerType,true), StructField(offset,LongType,true), StructField(timestamp,TimestampType,true), StructField(timestampType,IntegerType,true))\n"}]},"apps":[],"jobName":"paragraph_1509896872792_453692520","id":"20171105-164752_719330437","dateCreated":"2017-11-05T16:47:52+0100","dateStarted":"2017-11-11T16:30:20+0100","dateFinished":"2017-11-11T16:30:23+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:281"},{"text":"%md \n\n### Result stage - performing transformations on the stream\n- extract the value column of kafka message\n- parse each row into a member of tweet class\n- filter to only look at todays tweets as results\n- perform aggregations","user":"anonymous","dateUpdated":"2017-11-11T14:30:12+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Result stage - performing transformations on the stream
\n
\n - extract the value column of kafka message
\n - parse each row into a member of tweet class
\n - filter to only look at today's tweets as results
\n - perform aggregations
\n
\n
"}]},"apps":[],"jobName":"paragraph_1509912237119_-1371939219","id":"20171105-210357_553784609","dateCreated":"2017-11-05T21:03:57+0100","dateStarted":"2017-11-11T14:30:12+0100","dateFinished":"2017-11-11T14:30:12+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:282"},{"text":"var data_stream_cleaned = data_stream\n .selectExpr(\"CAST(value AS STRING) as string_value\")\n .as[String]\n .map(x => (x.split(\";\"))) //wrapped array\n .map(x => tweet(x(0), x(1), x(2), x(3), x(4), x(5)))\n .selectExpr( \"cast(id as long) id\", \"CAST(created_at as timestamp) created_at\", \"cast(followers_count as int) followers_count\", \"location\", \"cast(favorite_count as int) favorite_count\", \"cast(retweet_count as int) retweet_count\")\n .toDF()\n .filter(col(\"created_at\").gt(current_date())) // kafka will retain data for last 24 hours, this is needed because we are using complete mode as output\n .groupBy(\"location\")\n .agg(count(\"id\"), sum(\"followers_count\"), sum(\"favorite_count\"), sum(\"retweet_count\"))","user":"anonymous","dateUpdated":"2017-11-11T16:32:33+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"data_stream_cleaned: org.apache.spark.sql.DataFrame = [location: string, count(id): bigint ... 3 more fields]\n"}]},"apps":[],"jobName":"paragraph_1509838694466_-674508628","id":"20171105-003814_85265595","dateCreated":"2017-11-05T00:38:14+0100","dateStarted":"2017-11-11T16:32:33+0100","dateFinished":"2017-11-11T16:32:43+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:283"},{"text":"data_stream_cleaned.schema","user":"anonymous","dateUpdated":"2017-11-11T16:32:46+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"res9: org.apache.spark.sql.types.StructType = StructType(StructField(location,StringType,true), StructField(count(id),LongType,false), StructField(sum(followers_count),LongType,true), StructField(sum(favorite_count),LongType,true), StructField(sum(retweet_count),LongType,true))\n"}]},"apps":[],"jobName":"paragraph_1509896863262_62756989","id":"20171105-164743_19706666","dateCreated":"2017-11-05T16:47:43+0100","dateStarted":"2017-11-11T16:32:46+0100","dateFinished":"2017-11-11T16:32:47+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:284"},{"text":"%md \n\n### Output stage\n- specify the following:\n - data sink - exporting to memory (table can be accessed similar to registerTempTable()/ createOrReplaceTempView() function )\n - trigger - time between running the pipeline (ie. when to do: polling for new data, data transformation)\n - output mode - complete, append or update - since in Result stage we use aggregates, we can only use Complete or Update out put mode","user":"anonymous","dateUpdated":"2017-11-11T16:33:38+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Output stage
\n
\n - specify the following:\n
\n - data sink - exporting to memory (table can be accessed similar to registerTempTable()/ createOrReplaceTempView() function )
\n - trigger - time between running the pipeline (ie. when to do: polling for new data, data transformation)
\n - output mode - complete, append or update - since in Result stage we use aggregates, we can only use Complete or Update output mode
\n
\n \n
\n
"}]},"apps":[],"jobName":"paragraph_1509912259110_1463706147","id":"20171105-210419_1857413462","dateCreated":"2017-11-05T21:04:19+0100","dateStarted":"2017-11-11T16:33:38+0100","dateFinished":"2017-11-11T16:33:42+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:285"},{"text":"val query = data_stream_cleaned.writeStream\n .format(\"memory\")\n .queryName(\"demo\")\n .trigger(ProcessingTime(\"30 seconds\")) // means that that spark will look for new data only every minute\n .outputMode(\"complete\") // could also be append or update\n .start()","user":"anonymous","dateUpdated":"2017-11-11T16:34:18+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"query: org.apache.spark.sql.streaming.StreamingQuery = Streaming Query demo [id = 824a35c8-af8d-4b57-83c4-4cbf3c1f9418, runId = 5dcdee01-de96-403e-997e-350bfd5d8f2c] [state = ACTIVE]\n"}]},"apps":[],"jobName":"paragraph_1509873755450_-691017020","id":"20171105-102235_1294551369","dateCreated":"2017-11-05T10:22:35+0100","dateStarted":"2017-11-11T16:34:18+0100","dateFinished":"2017-11-11T16:34:22+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:286"},{"text":"query.status","user":"anonymous","dateUpdated":"2017-11-11T16:34:25+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"res10: org.apache.spark.sql.streaming.StreamingQueryStatus =\n{\n \"message\" : \"Processing new data\",\n \"isDataAvailable\" : true,\n \"isTriggerActive\" : true\n}\n"}]},"apps":[],"jobName":"paragraph_1510413138022_529774344","id":"20171111-161218_1171458195","dateCreated":"2017-11-11T16:12:18+0100","dateStarted":"2017-11-11T16:34:25+0100","dateFinished":"2017-11-11T16:34:26+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:287"},{"text":"query.explain","user":"anonymous","dateUpdated":"2017-11-11T16:34:35+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"== Physical Plan ==\n*HashAggregate(keys=[location#36], functions=[count(id#40L), sum(cast(followers_count#42 as bigint)), sum(cast(favorite_count#43 as bigint)), sum(cast(retweet_count#44 as bigint))])\n+- Exchange hashpartitioning(location#36, 2)\n +- *HashAggregate(keys=[location#36], functions=[partial_count(id#40L), partial_sum(cast(followers_count#42 as bigint)), partial_sum(cast(favorite_count#43 as bigint)), partial_sum(cast(retweet_count#44 as bigint))])\n +- *Project [cast(cast(id#33 as decimal(20,0)) as bigint) AS id#40L, cast(cast(followers_count#35 as decimal(20,0)) as int) AS followers_count#42, location#36, cast(cast(favorite_count#37 as decimal(20,0)) as int) AS favorite_count#43, cast(cast(retweet_count#38 as decimal(20,0)) as int) AS retweet_count#44]\n +- *Filter (isnotnull(created_at#34) && (cast(cast(created_at#34 as timestamp) as string) > cast(current_batch_timestamp(1510414470012, DateType) as string)))\n +- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line155526911422.$read$$iw$$iw$tweet, true], top level Product input object).id, 
true) AS id#33, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line155526911422.$read$$iw$$iw$tweet, true], top level Product input object).created_at, true) AS created_at#34, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line155526911422.$read$$iw$$iw$tweet, true], top level Product input object).followers_count, true) AS followers_count#35, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line155526911422.$read$$iw$$iw$tweet, true], top level Product input object).location, true) AS location#36, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line155526911422.$read$$iw$$iw$tweet, true], top level Product input object).favorite_count, true) AS favorite_count#37, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line155526911422.$read$$iw$$iw$tweet, true], top level Product input object).retweet_count, true) AS retweet_count#38]\n +- *MapElements , obj#32: $line155526911422.$read$$iw$$iw$tweet\n +- *MapElements , obj#22: [Ljava.lang.String;\n +- *DeserializeToObject string_value#15.toString, obj#21: java.lang.String\n +- *Project [cast(value#141 as string) AS string_value#15]\n +- Scan ExistingRDD[key#140,value#141,topic#142,partition#143,offset#144L,timestamp#145,timestampType#146]\n"}]},"apps":[],"jobName":"paragraph_1510413000978_-581196590","id":"20171111-161000_1529325108","dateCreated":"2017-11-11T16:10:00+0100","dateStarted":"2017-11-11T16:34:35+0100","dateFinished":"2017-11-11T16:34:36+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:288"},{"text":"%spark.sql \n\nselect * from demo","user":"anonymous","dateUpdated":"2017-11-11T16:39:15+0100","config":{"colWidth":12,"enabled":true,"results":{"0":{"graph":{"mode":"table","height":300,"optionOpen":false},"helium":{}}},"editorSetting":{"language":"sql"},"editorMode":"ace/mode/sql"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TABLE","data":"location\tcount(id)\tsum(followers_count)\tsum(favorite_count)\tsum(retweet_count)\nYokohama\t2\t1110\t0\t115\nAngola, Luanda \t3\t1572\t0\t94221\nハッピーワールド\t4\t384\t0\t0\n推しが定まりません。そらワンいって考えます\t4\t6932\t0\t0\nTallinn, Estonia\t4\t248\t0\t464\n\t233\t310223\t0\t1344636\nOhio\t2\t28806\t0\t0\n京都市右京区\t3\t4272\t0\t0\n孤独\t3\t1059\t0\t0\nGeorgia, USA\t4\t3448\t0\t0\nManaus / São Paulo\t4\t1056\t0\t0\n当てます。\t3\t1299\t0\t0\nTriplex\t4\t7780\t0\t88\nPhilly✈️Tuskegee👉🏾D.C.\t3\t4842\t0\t3\nnew jersey\t4\t864\t0\t0\nさいたま市\t4\t3384\t0\t0\n日本 東京\t3\t2337\t0\t0\nD.O. 도경수\t3\t1428\t0\t3018\n千葉 船橋市\t4\t156\t0\t0\nEntroncamento, Portugal\t4\t1392\t0\t2473\nAntwerpen, Belgium\t3\t1629\t0\t0\nチ ン パ ン .🌷 ¦ヘッダー ⇒ フミ !!!\t3\t669\t0\t208\nEau Claire, Wisconsin\t3\t254721\t0\t1\n@無所属 お仕事依頼DMへ📩\t3\t2406\t0\t0\nJeje l'omo Eko nlo\t3\t1008\t0\t0\nstaten island, new york\t3\t348\t0\t0\nlagos\t3\t3450\t0\t234\nAtlanta GA Metro\t3\t273\t0\t0\nBerks. 
UK\t3\t13383\t0\t66\nΑνδραβιδα\t3\t7398\t0\t5172\nFinland, no polar bears here\t3\t609\t0\t0\nGuayaquil, Ecuador\t3\t1827\t0\t120\n내가 하는 말이 있다면 그건 거짓말\t4\t20\t0\t0\nLouisville, KY\t3\t2250\t0\t0\nottawa\t4\t1024\t0\t0\nً\t4\t11140\t0\t122960\ncairo\t2\t706\t0\t0\nSan Francisco, CA\t3\t1167\t0\t0\n🇧🇸 // 🇯🇲\t3\t3522\t0\t111\nNorwalk, CA\t3\t765\t0\t0\n京都 - 大阪\t4\t2968\t0\t341\nBeen all around the world\t3\t1161\t0\t43983\nUSA\t6\t31782\t0\t0\n호그와트\t3\t183\t0\t1077\nSeabrook, TX\t3\t1023\t0\t9860\nNein. \t3\t5421\t0\t0\nかるわかのストレス清掃員♪\t3\t372\t0\t0\nRochester/London, UK\t3\t7557\t0\t12\nbuenos aires argentina\t3\t444\t0\t0\nHoseok's heart~\t4\t7268\t0\t0\nNorth West, England\t3\t909\t0\t1431\nGloucester, UK\t3\t5424\t0\t0\n愛知県岡崎市\t4\t1964\t0\t0\nアップルパイの中\t3\t759\t0\t0\nMakati City\t3\t3\t0\t129771\nCorrientes\t3\t411\t0\t1521\nYら,猫好き,車好き,ゲーム好きさんと繋がりたい🗿\t3\t630\t0\t0\n{May - Sofi - Dians - Paula}\t3\t48723\t4\t1\nToronto\t4\t3744\t0\t0\nどすこいパレード\t4\t1044\t0\t98\nMalaysia\t4\t1264\t0\t48736\nJapan Tokyo\t4\t32520\t0\t0\n#Adelaido 👑\t2\t2094\t6\t2\nJapan\t3\t126603\t0\t0\nPilipínas\t3\t1362\t0\t20979\nManaus, Brasil\t3\t3222\t0\t0\n127.0.0.1\t3\t5997\t0\t4143\nFigueres\t3\t4485\t0\t0\n運動会する\t4\t256\t0\t0\nRepublic of the Philippines\t3\t1347\t0\t54\nアバリスの膝の上\t3\t630\t0\t0\nPortugal\t2\t1476\t0\t1241\nOhio, USA\t4\t1332\t0\t0\nlincoln, england\t4\t2844\t0\t4\nCalifornia, USA\t3\t1281\t0\t402\nLondon, Ontario\t3\t7467\t0\t0\nOrlando, FL\t4\t1056\t0\t0\nLos Angeles, CA\t7\t29565\t0\t775\nAtlanta, GA\t3\t279\t0\t318\nmismuertos city\t3\t1449\t0\t57932\nカルデアの冷蔵庫\t3\t141\t0\t0\nBarueri, Brasil\t3\t2607\t0\t3\nahead\t3\t831\t1\t1\nparadise with drake\t3\t3426\t0\t0\nmiami\t3\t504\t0\t0\n栃木県\t4\t316\t0\t0\nJHB/SF\t4\t11420\t0\t0\n京都府 京都市 \t3\t18\t0\t0\nMichigan \t4\t3272\t0\t0\nAvellaneda, Buenos Aires\t3\t774\t0\t10923\n대한민국\t3\t63\t0\t1068\n日本\t19\t6294\t0\t231\nether \t4\t88\t0\t16\nPunjab, Pakistan\t2\t2088\t0\t0\nUnited Kingdom\t4\t1740\t0\t0\nToday's Market Movement\t3\t586\t0\t9\nOn Melancholy Hill\t3\t498\t0\t0\nLos Angeles\t4\t612\t0\t0\nMadrid, Comunidad de Madrid\t3\t7578\t0\t375\nMakati City, National Capital \t3\t660\t0\t36\nSingapore \t3\t588\t0\t204\nIndia\t3\t342\t0\t0\nNorth Riding/Samrand\t3\t315\t0\t29427\nBinghamton University\t3\t2166\t0\t0\nGurgaon, India\t4\t18184\t0\t0\n広島県\t3\t642\t0\t206\nみんなの心の中☆\t2\t204\t0\t0\n"}]},"apps":[],"jobName":"paragraph_1509873913879_-781586253","id":"20171105-102513_1944185364","dateCreated":"2017-11-05T10:25:13+0100","dateStarted":"2017-11-11T16:39:16+0100","dateFinished":"2017-11-11T16:39:16+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:289"},{"text":"%md\n\n### Connecting to redshift cluster\n- defining JDBC connection to connect to redshift","user":"anonymous","dateUpdated":"2017-11-11T14:30:16+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Connecting to redshift cluster
\n
\n - defining JDBC connection to connect to redshift
\n
\n
"}]},"apps":[],"jobName":"paragraph_1509909816175_369050675","id":"20171105-202336_1741159108","dateCreated":"2017-11-05T20:23:36+0100","dateStarted":"2017-11-11T14:30:16+0100","dateFinished":"2017-11-11T14:30:16+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:290"},{"text":"//create properties object\nClass.forName(\"com.amazon.redshift.jdbc42.Driver\")\n\nval prop = new java.util.Properties\nprop.setProperty(\"driver\", \"com.amazon.redshift.jdbc42.Driver\")\nprop.setProperty(\"user\", \"dorian\")\nprop.setProperty(\"password\", \"Demo1234\") \n\n//jdbc mysql url - destination database is named \"data\"\nval url = \"jdbc:redshift://data-warehouse.c3glymsgdgty.us-east-1.redshift.amazonaws.com:5439/lambda\"\n\n//destination database table \nval table = \"speed_layer\"","user":"anonymous","dateUpdated":"2017-11-11T16:36:02+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"res13: Class[_] = class com.amazon.redshift.jdbc42.Driver\nprop: java.util.Properties = {}\nres15: Object = null\nres16: Object = null\nres17: Object = null\nurl: String = jdbc:redshift://data-warehouse.c3glymsgdgty.us-east-1.redshift.amazonaws.com:5439/lambda\ntable: String = speed_layer\n"}]},"apps":[],"jobName":"paragraph_1509874084488_-1803262626","id":"20171105-102804_443283701","dateCreated":"2017-11-05T10:28:04+0100","dateStarted":"2017-11-11T16:36:02+0100","dateFinished":"2017-11-11T16:36:07+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:291"},{"text":"%md \n\n### Exporting data to Redshift\n- \"overwriting\" the table with results of query stored in memory as result of the speed layer\n- scheduling the function to run every hour\n","user":"anonymous","dateUpdated":"2017-11-11T14:30:18+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Exporting data to Redshift
\n
\n - “overwriting” the table with results of query stored in memory as result of the speed layer
\n - scheduling the function to run every hour
\n
\n
"}]},"apps":[],"jobName":"paragraph_1509881434795_1558195972","id":"20171105-123034_774658214","dateCreated":"2017-11-05T12:30:34+0100","dateStarted":"2017-11-11T14:30:18+0100","dateFinished":"2017-11-11T14:30:18+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:292"},{"text":"val df = spark.sql(\"select * from demo\")\n\n//write data from spark dataframe to database\ndf.write.mode(\"overwrite\").jdbc(url, table, prop)\n","user":"anonymous","dateUpdated":"2017-11-11T16:39:19+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"df: org.apache.spark.sql.DataFrame = [location: string, count(id): bigint ... 3 more fields]\n"}]},"apps":[],"jobName":"paragraph_1509878690443_-615786910","id":"20171105-114450_965590151","dateCreated":"2017-11-05T11:44:50+0100","dateStarted":"2017-11-11T16:39:19+0100","dateFinished":"2017-11-11T16:39:30+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:293"},{"text":"","user":"anonymous","dateUpdated":"2017-11-11T16:37:05+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1509949807769_1425690320","id":"20171106-073007_1469584911","dateCreated":"2017-11-06T07:30:07+0100","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:294"}],"name":"Implementing the speed layer of lambda architecture using Structured Spark Streaming","id":"2CY6MSDVK","angularObjects":{"2CZTSJM9A:shared_process":[],"2CXD9FT1P:shared_process":[],"2CYBUJCZE:shared_process":[],"2CZXYWRND:shared_process":[],"2CWX9E9KA:shared_process":[],"2CWJS6R2N:shared_process":[],"2CW6U7X7Z:shared_process":[],"2CY9X3W1T:shared_process":[],"2CX93H291:shared_process":[],"2CZRYW3SZ:shared_process":[],"2CYT9Z9RC:shared_process":[],"2CY2R49R6:shared_process":[],"2CYQW36AU:shared_process":[],"2CWPRKMXH:shared_process":[],"2CWU95D3A:shared_process":[],"2CXJ7UBRF:shared_process":[],"2CWWTMY7M:shared_process":[],"2CY3EBJAE:shared_process":[],"2CYFQZER9:shared_process":[]},"config":{"looknfeel":"default","personalizedMode":"false"},"info":{}}
--------------------------------------------------------------------------------
/Ingesting realtime tweets using Apache Kafka, Tweepy and Python.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Ingesting realtime tweets using Apache Kafka, Tweepy and Python\n",
8 | "\n",
9 | "### Purpose:\n",
10 | "- main data source for the lambda architecture pipeline\n",
11 | "- uses twitter streaming API to simulate new events coming in every minute\n",
12 | "- Kafka Producer sends the tweets as records to the Kafka Broker\n",
13 | "\n",
14 | "### Contents: \n",
15 | "- [Twitter setup](#1)\n",
16 | "- [Defining the Kafka producer](#2)\n",
17 | "- [Producing and sending records to the Kafka Broker](#3)\n",
18 | "- [Deployment](#4)"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "### Required libraries"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": 1,
31 | "metadata": {},
32 | "outputs": [],
33 | "source": [
34 | "import tweepy\n",
35 | "import time\n",
36 | "from kafka import KafkaConsumer, KafkaProducer"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "\n",
44 | "### Twitter setup\n",
45 | "- getting the API object using authorization information\n",
46 | "- you can find more details on how to get the authorization here:\n",
47 | "https://developer.twitter.com/en/docs/basics/authentication/overview"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": 2,
53 | "metadata": {},
54 | "outputs": [],
55 | "source": [
56 | "# twitter setup\n",
57 | "consumer_key = \"1\"\n",
58 | "consumer_secret = \"2\"\n",
59 | "access_token = \"3\"\n",
60 | "access_token_secret = \"4\"\n",
61 | "# Creating the authentication object\n",
62 | "auth = tweepy.OAuthHandler(consumer_key, consumer_secret)\n",
63 | "# Setting your access token and secret\n",
64 | "auth.set_access_token(access_token, access_token_secret)\n",
65 | "# Creating the API object by passing in auth information\n",
66 | "api = tweepy.API(auth) \n"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "A helper function to normalize the time a tweet was created with the time of our system"
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": 3,
79 | "metadata": {},
80 | "outputs": [],
81 | "source": [
82 | "from datetime import datetime, timedelta\n",
83 | "\n",
84 | "def normalize_timestamp(time):\n",
85 | " mytime = datetime.strptime(time, \"%Y-%m-%d %H:%M:%S\")\n",
86 | " mytime += timedelta(hours=1) # the tweets are timestamped in GMT timezone, while I am in +1 timezone\n",
87 | " return (mytime.strftime(\"%Y-%m-%d %H:%M:%S\")) "
88 | ]
89 | },
90 | {
91 | "cell_type": "markdown",
92 | "metadata": {},
93 | "source": [
94 | "\n",
95 | "### Defining the Kafka producer\n",
96 | "- specify the Kafka Broker\n",
97 | "- specify the topic name\n",
98 | "- optional: specify partitioning strategy"
99 | ]
100 | },
101 | {
102 | "cell_type": "code",
103 | "execution_count": 4,
104 | "metadata": {},
105 | "outputs": [],
106 | "source": [
107 | "producer = KafkaProducer(bootstrap_servers='localhost:9092')\n",
108 | "topic_name = 'tweets-lambda1'"
109 | ]
110 | },
111 | {
112 | "cell_type": "markdown",
113 | "metadata": {},
114 | "source": [
115 | "\n",
116 | "### Producing and sending records to the Kafka Broker\n",
117 | "- querying the Twitter API Object\n",
118 | "- extracting relevant information from the response\n",
119 | "- formatting and sending the data to proper topic on the Kafka Broker\n",
120 | "- resulting tweets have following attributes:\n",
121 | " - id \n",
122 | " - created_at\n",
123 | " - followers_count\n",
124 | " - location\n",
125 | " - favorite_count\n",
126 | " - retweet_count"
127 | ]
128 | },
129 | {
130 | "cell_type": "code",
131 | "execution_count": 6,
132 | "metadata": {},
133 | "outputs": [],
134 | "source": [
135 | "def get_twitter_data():\n",
136 | " res = api.search(\"Apple OR iphone OR iPhone\")\n",
137 | " for i in res:\n",
138 | " record = ''\n",
139 | " record += str(i.user.id_str)\n",
140 | " record += ';'\n",
141 | " record += str(normalize_timestamp(str(i.created_at)))\n",
142 | " record += ';'\n",
143 | " record += str(i.user.followers_count)\n",
144 | " record += ';'\n",
145 | " record += str(i.user.location)\n",
146 | " record += ';'\n",
147 | " record += str(i.favorite_count)\n",
148 | " record += ';'\n",
149 | " record += str(i.retweet_count)\n",
150 | " record += ';'\n",
151 | " producer.send(topic_name, str.encode(record))"
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": 9,
157 | "metadata": {},
158 | "outputs": [],
159 | "source": [
160 | "get_twitter_data()"
161 | ]
162 | },
163 | {
164 | "cell_type": "markdown",
165 | "metadata": {},
166 | "source": [
167 | "\n",
168 | "### Deployment \n",
169 | "- perform the task every couple of minutes and wait in between"
170 | ]
171 | },
172 | {
173 | "cell_type": "code",
174 | "execution_count": 11,
175 | "metadata": {},
176 | "outputs": [],
177 | "source": [
178 | "def periodic_work(interval):\n",
179 | " while True:\n",
180 | " get_twitter_data()\n",
181 | " #interval should be an integer, the number of seconds to wait\n",
182 | " time.sleep(interval)\n"
183 | ]
184 | },
185 | {
186 | "cell_type": "code",
187 | "execution_count": null,
188 | "metadata": {},
189 | "outputs": [],
190 | "source": [
191 | "periodic_work(60 * 0.1) # get data every couple of minutes"
192 | ]
193 | },
194 | {
195 | "cell_type": "code",
196 | "execution_count": null,
197 | "metadata": {},
198 | "outputs": [],
199 | "source": []
200 | }
201 | ],
202 | "metadata": {
203 | "kernelspec": {
204 | "display_name": "Python 3",
205 | "language": "python",
206 | "name": "python3"
207 | },
208 | "language_info": {
209 | "codemirror_mode": {
210 | "name": "ipython",
211 | "version": 3
212 | },
213 | "file_extension": ".py",
214 | "mimetype": "text/x-python",
215 | "name": "python",
216 | "nbconvert_exporter": "python",
217 | "pygments_lexer": "ipython3",
218 | "version": "3.6.1"
219 | }
220 | },
221 | "nbformat": 4,
222 | "nbformat_minor": 2
223 | }
224 |
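Each record produced by get_twitter_data() is a single semicolon-delimited line, which is exactly what the Redshift COPY (delimiter ';') and the Spark split(";") rely on downstream. The sketch below shows the record layout with a hypothetical example value and adds an explicit flush, since kafka-python buffers sends in the background:

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')
topic_name = 'tweets-lambda1'

# Hypothetical example of one record in the format built by get_twitter_data():
# id;created_at;followers_count;location;favorite_count;retweet_count;  (note the trailing ';')
example_record = '903246544564240384;2017-11-11 16:30:00;1572;Angola, Luanda;0;94221;'

producer.send(topic_name, str.encode(example_record))
producer.flush()  # block until buffered records have been sent to the broker
```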
--------------------------------------------------------------------------------
/Lambda_architecture-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dorianbg/lambda-architecture-demo/879a4a2d7b1cb3496c0bb5c63eed07a66d834b33/Lambda_architecture-2.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Developing Lambda Architecture
2 |
3 | This is a repository for the code found in my series of blog posts on implementing the Lambda Architecture:
4 |