├── Implementing the batch layer of lambda architecture using S3, Redshift and Apache Kafka.ipynb ├── Implementing the serving layer of lambda architecture using Redshift.ipynb ├── Implementing the speed layer of lambda architecture using Structured Spark Streaming ├── Ingesting realtime tweets using Apache Kafka, Tweepy and Python.ipynb ├── Lambda_architecture-2.png └── README.md /Implementing the batch layer of lambda architecture using S3, Redshift and Apache Kafka.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Implementing the batch layer of lambda architecture using S3, Redshift and Apache Kafka\n", 8 | "\n", 9 | "### Purpose:\n", 10 | "- store all the tweets that were produced by Kafka Producer into S3\n", 11 | "- export them into Redshift\n", 12 | "- perform aggregation on the tweets to get the desired output of batch layer\n", 13 | "- achieve this by: \n", 14 | " - every couple of hours get the latest unseen tweets produced by the Kafka Producer and store them into an S3 archive\n", 15 | " - every night run a SQL query to compute the result of the batch layer\n", 16 | "\n", 17 | "### Contents: \n", 18 | "- [Defining the Kafka consumer](#1)\n", 19 | "- [Defining an Amazon Web Services S3 storage client](#2)\n", 20 | "- [Writing the data into an S3 bucket](#3)\n", 21 | "- [Exporting data from S3 bucket to Amazon Redshift using COPY command](#4)\n", 22 | "- [Computing the batch layer output](#5)\n", 23 | "- [Deployment](#6)" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "### Required libraries" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 1, 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "from kafka import KafkaConsumer\n", 40 | "from io import StringIO\n", 41 | "import boto3\n", 42 | "import time\n", 43 | "import random" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "\n", 51 | "### Defining the Kafka consumer\n", 52 | "- setting the location of the Kafka Broker\n", 53 | "- specifying the group_id and consumer_timeout\n", 54 | "- subscribing to a topic" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 2, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "consumer = KafkaConsumer(\n", 64 | " bootstrap_servers='localhost:9092',\n", 65 | " auto_offset_reset='latest', # Reset partition offsets upon OffsetOutOfRangeError\n", 66 | " group_id='test', # must have a unique consumer group id \n", 67 | " consumer_timeout_ms=1000) \n", 68 | " # How long to listen for messages - 1 second is enough here \n", 69 | " # because we poll the kafka broker only every couple of hours\n", 70 | "\n", 71 | "consumer.subscribe('tweets-lambda1')" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "\n", 79 | "### Defining an Amazon Web Services S3 storage client\n", 80 | "- setting the authorization and bucket" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 3, 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "s3_resource = boto3.resource(\n", 90 | " 's3',\n", 91 | " aws_access_key_id='AKIAIXUPHT6ERRMQYINQ',\n", 92 | " aws_secret_access_key='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',\n", 93 | ")\n", 94 | "\n", 95 | "s3_client = s3_resource.meta.client\n", 96 | "bucket_name = 'lambda-architecture123'\n" 97 | ] 98 | }, 99 | { 100
| "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "\n", 104 | "### Writing the data into a S3 bucket\n", 105 | "- polling the Kafka Broker\n", 106 | "- aggregating the latest messages into a single object in the bucket\n", 107 | "\n" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": 4, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "def store_twitter_data(path):\n", 117 | " csv_buffer = StringIO() # S3 storage is object storage -> our document is just a large string\n", 118 | "\n", 119 | " for message in consumer: # this acts as \"get me an iterator over the latest messages I haven't seen\"\n", 120 | " csv_buffer.write(message.value.decode() + '\\n') \n", 121 | "# print(message)\n", 122 | " s3_resource.Object(bucket_name,path).put(Body=csv_buffer.getvalue())" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "\n", 130 | "### Exporting data from S3 bucket to Amazon Redshift using COPY command\n", 131 | "- authenticate and create a connection using psycopg module\n", 132 | "- export data using COPY command from S3 to Redshift \"raw\" table" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 19, 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "import psycopg2\n", 142 | "config = { 'dbname': 'lambda', \n", 143 | " 'user':'dorian',\n", 144 | " 'pwd':'Demo1234',\n", 145 | " 'host':'data-warehouse.c3glymsgdgty.us-east-1.redshift.amazonaws.com',\n", 146 | " 'port':'5439'\n", 147 | " }\n", 148 | "conn = psycopg2.connect(dbname=config['dbname'], host=config['host'], \n", 149 | " port=config['port'], user=config['user'], \n", 150 | " password=config['pwd'])" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": 6, 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [ 159 | "def copy_files(conn, path):\n", 160 | " curs = conn.cursor()\n", 161 | " curs.execute(\"\"\" \n", 162 | " copy \n", 163 | " batch_raw\n", 164 | " from \n", 165 | " 's3://lambda-architecture123/\"\"\" + path + \"\"\"' \n", 166 | " access_key_id 'AKIAIXUPHT6ERRMQYINQ'\n", 167 | " secret_access_key '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'\n", 168 | " delimiter ';'\n", 169 | " region 'eu-central-1'\n", 170 | " \"\"\")\n", 171 | " curs.close()\n", 172 | " conn.commit()\n" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "### Computing the batch layer output\n", 180 | "- querying the raw tweets stored in redshift to get the desired batch layer output" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 7, 186 | "metadata": {}, 187 | "outputs": [], 188 | "source": [ 189 | "def compute_batch_layer(conn):\n", 190 | " curs = conn.cursor()\n", 191 | " curs.execute(\"\"\" \n", 192 | " drop table if exists batch_layer;\n", 193 | "\n", 194 | " with raw_dedup as (\n", 195 | " SELECT\n", 196 | " distinct id,created_at,followers_count,location,favorite_count,retweet_count\n", 197 | " FROM\n", 198 | " batch_raw\n", 199 | " ),\n", 200 | " batch_result as (\n", 201 | " SELECT\n", 202 | " location,\n", 203 | " count(id) as count_id,\n", 204 | " sum(followers_count) as sum_followers_count,\n", 205 | " sum(favorite_count) as sum_favorite_count,\n", 206 | " sum(retweet_count) as sum_retweet_count\n", 207 | " FROM\n", 208 | " raw_dedup\n", 209 | " group by \n", 210 | " location\n", 211 | " )\n", 212 | " select \n", 213 | " *\n", 214 | " INTO\n", 215 | " batch_layer\n", 216 | " 
FROM\n", 217 | " batch_result\"\"\")\n", 218 | " curs.close()\n", 219 | " conn.commit()" 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": 8, 225 | "metadata": {}, 226 | "outputs": [], 227 | "source": [ 228 | "# compute_batch_layer(conn)" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "\n", 236 | "### Deployment \n", 237 | "- perform the task every couple of hours and wait in between" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": 9, 243 | "metadata": {}, 244 | "outputs": [], 245 | "source": [ 246 | "def periodic_work(interval):\n", 247 | " while True:\n", 248 | " path = 'tweets/'+ time.strftime(\"%Y/%m/%d/%H\") + '_tweets_' + str(random.randint(1,1000)) + '.log'\n", 249 | " store_twitter_data(path)\n", 250 | " copy_files(conn, path)\n", 251 | " #interval should be an integer, the number of seconds to wait\n", 252 | " time.sleep(interval)" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 10, 258 | "metadata": {}, 259 | "outputs": [], 260 | "source": [ 261 | "# periodic_work(60 * 60) ## 60 minutes !" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 22, 267 | "metadata": {}, 268 | "outputs": [], 269 | "source": [ 270 | "path = 'tweets/'+ time.strftime(\"%Y/%m/%d/%H\") + '_tweets_' + str(random.randint(1,1000)) + '.log'\n", 271 | "\n", 272 | "store_twitter_data(path)" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": 23, 278 | "metadata": {}, 279 | "outputs": [ 280 | { 281 | "name": "stderr", 282 | "output_type": "stream", 283 | "text": [ 284 | "ERROR:root:An unexpected error occurred while tokenizing input\n", 285 | "The following traceback may be corrupted or invalid\n", 286 | "The error message is: ('EOF in multi-line string', (1, 4))\n", 287 | "\n" 288 | ] 289 | }, 290 | { 291 | "ename": "DatabaseError", 292 | "evalue": "SSL SYSCALL error: Operation timed out\n", 293 | "output_type": "error", 294 | "traceback": [ 295 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 296 | "\u001b[0;31mDatabaseError\u001b[0m Traceback (most recent call last)", 297 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mcopy_files\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mconn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 298 | "\u001b[0;32m\u001b[0m in \u001b[0;36mcopy_files\u001b[0;34m(conn, path)\u001b[0m\n\u001b[1;32m 10\u001b[0m \u001b[0mdelimiter\u001b[0m \u001b[0;34m';'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0mregion\u001b[0m \u001b[0;34m'eu-central-1'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 12\u001b[0;31m \"\"\")\n\u001b[0m\u001b[1;32m 13\u001b[0m \u001b[0mcurs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mclose\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 14\u001b[0m \u001b[0mconn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcommit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 299 | "\u001b[0;31mDatabaseError\u001b[0m: SSL SYSCALL error: Operation timed out\n" 300 | ] 301 | } 302 | ], 303 | "source": [ 304 | "copy_files(conn, path)" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": 21, 310 | "metadata": {}, 311 | "outputs": [], 312 | "source": [ 313 | "# run at the end of the day\n", 
314 | "compute_batch_layer(conn)" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": 14, 320 | "metadata": {}, 321 | "outputs": [], 322 | "source": [ 323 | "#conn.close()" 324 | ] 325 | } 326 | ], 327 | "metadata": { 328 | "kernelspec": { 329 | "display_name": "Python 3", 330 | "language": "python", 331 | "name": "python3" 332 | }, 333 | "language_info": { 334 | "codemirror_mode": { 335 | "name": "ipython", 336 | "version": 3 337 | }, 338 | "file_extension": ".py", 339 | "mimetype": "text/x-python", 340 | "name": "python", 341 | "nbconvert_exporter": "python", 342 | "pygments_lexer": "ipython3", 343 | "version": "3.6.1" 344 | } 345 | }, 346 | "nbformat": 4, 347 | "nbformat_minor": 2 348 | } 349 | -------------------------------------------------------------------------------- /Implementing the serving layer of lambda architecture using Redshift.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Implementing the serving layer of lambda architecture using Redshift\n", 8 | "\n", 9 | "### Purpose:\n", 10 | "- merge the output of speed and batch layer aggregations\n", 11 | "- achieve this by: \n", 12 | " - every couple of hours run the re-computation\n", 13 | " - use the output of batch layer as base table\n", 14 | " - upsert the up-to-date values of speed layer into the base table \n", 15 | "\n", 16 | "### Contents: \n", 17 | "- [Creating the serving layer](#1)" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "### Requirements" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 1, 30 | "metadata": {}, 31 | "outputs": [], 32 | "source": [ 33 | "import psycopg2" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "\n", 41 | "### Creating the serving layer\n", 42 | "- authenticate and create a connection using psycopg module\n", 43 | "- create and populate a temporary table with it's base being batch layer and upserting the speed layer\n", 44 | "- drop the current serving layer and use the above mentioned temporary table for serving layer (no downtime)" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 2, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "config = { 'dbname': 'lambda', \n", 54 | " 'user':'dorian',\n", 55 | " 'pwd':'Demo1234',\n", 56 | " 'host':'data-warehouse.c3glymsgdgty.us-east-1.redshift.amazonaws.com',\n", 57 | " 'port':'5439'\n", 58 | " }\n", 59 | "conn = psycopg2.connect(dbname=config['dbname'], host=config['host'], \n", 60 | " port=config['port'], user=config['user'], \n", 61 | " password=config['pwd'])" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 3, 67 | "metadata": {}, 68 | "outputs": [], 69 | "source": [ 70 | "curs = conn.cursor()\n", 71 | "curs.execute(\"\"\" \n", 72 | " DROP TABLE IF EXISTS serving_layer_temp; \n", 73 | "\n", 74 | " SELECT \n", 75 | " *\n", 76 | " INTO \n", 77 | " serving_layer_temp\n", 78 | " FROM \n", 79 | " batch_layer ;\n", 80 | "\n", 81 | "\n", 82 | "\n", 83 | " UPDATE \n", 84 | " serving_layer_temp\n", 85 | " SET\n", 86 | " count_id = count_id + speed_layer.\"count(id)\",\n", 87 | " sum_followers_count = sum_followers_count + speed_layer.\"sum(followers_count)\",\n", 88 | " sum_favorite_count = sum_favorite_count + speed_layer.\"sum(favorite_count)\",\n", 89 | " sum_retweet_count = sum_retweet_count + 
speed_layer.\"sum(retweet_count)\"\n", 90 | " FROM\n", 91 | " speed_layer\n", 92 | " WHERE \n", 93 | " serving_layer_temp.location = speed_layer.location ;\n", 94 | "\n", 95 | "\n", 96 | "\n", 97 | " INSERT INTO \n", 98 | " serving_layer_temp\n", 99 | " SELECT \n", 100 | " * \n", 101 | " FROM \n", 102 | " speed_layer\n", 103 | " WHERE \n", 104 | " speed_layer.location \n", 105 | " NOT IN (\n", 106 | " SELECT \n", 107 | " DISTINCT location \n", 108 | " FROM \n", 109 | " serving_layer_temp \n", 110 | " ) ;\n", 111 | " \n", 112 | " \n", 113 | " drop table serving_layer ;\n", 114 | " \n", 115 | " alter table serving_layer_temp rename to serving_layer ; \n", 116 | " \n", 117 | "\"\"\")\n", 118 | "curs.close()\n", 119 | "conn.commit()\n", 120 | "conn.close()" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": {}, 127 | "outputs": [], 128 | "source": [] 129 | } 130 | ], 131 | "metadata": { 132 | "kernelspec": { 133 | "display_name": "Python 3", 134 | "language": "python", 135 | "name": "python3" 136 | }, 137 | "language_info": { 138 | "codemirror_mode": { 139 | "name": "ipython", 140 | "version": 3 141 | }, 142 | "file_extension": ".py", 143 | "mimetype": "text/x-python", 144 | "name": "python", 145 | "nbconvert_exporter": "python", 146 | "pygments_lexer": "ipython3", 147 | "version": "3.6.1" 148 | } 149 | }, 150 | "nbformat": 4, 151 | "nbformat_minor": 2 152 | } 153 | -------------------------------------------------------------------------------- /Implementing the speed layer of lambda architecture using Structured Spark Streaming: -------------------------------------------------------------------------------- 1 | {"paragraphs":[{"text":"%md \n\n# Implementing the speed layer of lambda architecture using Structured Spark Streaming\n\n### Purpose: \n- provide analytics on real time data (\"intra day\") which batch layer cannot efficiently achieve\n- achieve this by:\n - ingest latest tweets from Kafka Producer and analtze only those for the current day \n - perform aggregations over the data to get the desired output of speed layer\n\n### Contents: \n- Configuring spark\n- Spark Structured Streaming\n - Input stage - defining the data source\n - Result stage - performing transformations on the stream\n - Output stage\n- Connecting to redshift cluster\n- Exporting data to Redshift","user":"anonymous","dateUpdated":"2017-11-11T15:55:51+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
<h1>Implementing the speed layer of lambda architecture using Structured Spark Streaming</h1>\n<h3>Purpose:</h3>\n<ul>\n<li>provide analytics on real time data (“intra day”) which batch layer cannot efficiently achieve</li>\n<li>achieve this by:\n<ul>\n<li>ingest latest tweets from Kafka Producer and analyze only those for the current day</li>\n<li>perform aggregations over the data to get the desired output of speed layer</li>\n</ul>\n</li>\n</ul>\n<h3>Contents:</h3>\n<ul>\n<li>Configuring spark</li>\n<li>Spark Structured Streaming\n<ul>\n<li>Input stage - defining the data source</li>\n<li>Result stage - performing transformations on the stream</li>\n<li>Output stage</li>\n</ul>\n</li>\n<li>Connecting to redshift cluster</li>\n<li>Exporting data to Redshift</li>\n</ul>\n
"}]},"apps":[],"jobName":"paragraph_1509909863130_-1135889452","id":"20171105-202423_1160220584","dateCreated":"2017-11-05T20:24:23+0100","dateStarted":"2017-11-11T15:55:51+0100","dateFinished":"2017-11-11T15:55:51+0100","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:273"},{"text":"%dep\n\nz.load(\"/Volumes/SD/Downloads/RedshiftJDBC42-1.2.10.1009.jar\")","user":"anonymous","dateUpdated":"2017-11-11T16:24:36+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"res0: org.apache.zeppelin.dep.Dependency = org.apache.zeppelin.dep.Dependency@7dfb0e5a\n"}]},"apps":[],"jobName":"paragraph_1509878460764_1962878483","id":"20171105-114100_1870675802","dateCreated":"2017-11-05T11:41:00+0100","dateStarted":"2017-11-11T16:24:36+0100","dateFinished":"2017-11-11T16:24:44+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:274"},{"text":"%md \n\n### Requirements","user":"anonymous","dateUpdated":"2017-11-11T14:30:05+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
<h3>Requirements</h3>\n
"}]},"apps":[],"jobName":"paragraph_1510394086125_-1443536682","id":"20171111-105446_1758804734","dateCreated":"2017-11-11T10:54:46+0100","dateStarted":"2017-11-11T14:30:05+0100","dateFinished":"2017-11-11T14:30:05+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:275"},{"text":"import org.apache.spark.sql.SparkSession\nimport org.apache.spark.sql.streaming.ProcessingTime\nimport java.util.concurrent._","user":"anonymous","dateUpdated":"2017-11-11T16:24:59+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"import org.apache.spark.sql.SparkSession\nimport org.apache.spark.sql.streaming.ProcessingTime\nimport java.util.concurrent._\n"}]},"apps":[],"jobName":"paragraph_1510394103383_-688043084","id":"20171111-105503_1480409678","dateCreated":"2017-11-11T10:55:03+0100","dateStarted":"2017-11-11T16:25:00+0100","dateFinished":"2017-11-11T16:25:18+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:276"},{"text":"%md \n\n### Configuring Spark\n- properly configuring spark for our workload\n- defining case class for tweets which will be used later on","user":"anonymous","dateUpdated":"2017-11-11T14:30:07+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
<h3>Configuring Spark</h3>\n<ul>\n<li>properly configuring spark for our workload</li>\n<li>defining case class for tweets which will be used later on</li>\n</ul>\n
"}]},"apps":[],"jobName":"paragraph_1509911366990_1694144500","id":"20171105-204926_42354742","dateCreated":"2017-11-05T20:49:26+0100","dateStarted":"2017-11-11T14:30:07+0100","dateFinished":"2017-11-11T14:30:07+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:277"},{"text":"Thread.sleep(5000)\n\nval spark = SparkSession\n .builder()\n .config(\"spark.sql.shuffle.partitions\",\"2\") // we are running this on my laptop\n .appName(\"Spark Structured Streaming example\")\n .getOrCreate()\n \ncase class tweet (id: String, created_at : String, followers_count: String, location : String, favorite_count : String, retweet_count : String)\n\nThread.sleep(5000)","user":"anonymous","dateUpdated":"2017-11-11T16:25:20+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@20d06238\ndefined class tweet\n"}]},"apps":[],"jobName":"paragraph_1509779101105_500060833","id":"20171104-080501_1626533215","dateCreated":"2017-11-04T08:05:01+0100","dateStarted":"2017-11-11T16:25:20+0100","dateFinished":"2017-11-11T16:25:33+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:278"},{"text":"%md \n\n### Input stage - defining the data source\n- using Kafka as data source we specify:\n - location of kafka broker\n - relevant kafka topic\n - how to treat starting offsets","user":"anonymous","dateUpdated":"2017-11-11T14:30:09+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
<h3>Input stage - defining the data source</h3>\n<ul>\n<li>using Kafka as data source we specify:\n<ul>\n<li>location of kafka broker</li>\n<li>relevant kafka topic</li>\n<li>how to treat starting offsets</li>\n</ul>\n</li>\n</ul>\n
"}]},"apps":[],"jobName":"paragraph_1509881379573_-268499860","id":"20171105-122939_1561723250","dateCreated":"2017-11-05T12:29:39+0100","dateStarted":"2017-11-11T14:30:09+0100","dateFinished":"2017-11-11T14:30:09+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:279"},{"text":"var data_stream = spark\n .readStream // contantly expanding dataframe\n .format(\"kafka\")\n .option(\"kafka.bootstrap.servers\", \"localhost:9092\")\n .option(\"subscribe\", \"tweets-lambda1\")\n .option(\"startingOffsets\",\"latest\") //or latest\n .load()\n \n// note how similar API is to the batch version","user":"anonymous","dateUpdated":"2017-11-11T16:30:07+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"data_stream: org.apache.spark.sql.DataFrame = [key: binary, value: binary ... 5 more fields]\n"}]},"apps":[],"jobName":"paragraph_1509828449824_1485333558","id":"20171104-214729_426230385","dateCreated":"2017-11-04T21:47:29+0100","dateStarted":"2017-11-11T16:30:07+0100","dateFinished":"2017-11-11T16:30:14+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:280"},{"text":"data_stream.schema","user":"anonymous","dateUpdated":"2017-11-11T16:30:20+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"res8: org.apache.spark.sql.types.StructType = StructType(StructField(key,BinaryType,true), StructField(value,BinaryType,true), StructField(topic,StringType,true), StructField(partition,IntegerType,true), StructField(offset,LongType,true), StructField(timestamp,TimestampType,true), StructField(timestampType,IntegerType,true))\n"}]},"apps":[],"jobName":"paragraph_1509896872792_453692520","id":"20171105-164752_719330437","dateCreated":"2017-11-05T16:47:52+0100","dateStarted":"2017-11-11T16:30:20+0100","dateFinished":"2017-11-11T16:30:23+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:281"},{"text":"%md \n\n### Result stage - performing transformations on the stream\n- extract the value column of kafka message\n- parse each row into a member of tweet class\n- filter to only look at todays tweets as results\n- perform aggregations","user":"anonymous","dateUpdated":"2017-11-11T14:30:12+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
<h3>Result stage - performing transformations on the stream</h3>\n<ul>\n<li>extract the value column of kafka message</li>\n<li>parse each row into a member of tweet class</li>\n<li>filter to only look at today's tweets as results</li>\n<li>perform aggregations</li>\n</ul>\n
"}]},"apps":[],"jobName":"paragraph_1509912237119_-1371939219","id":"20171105-210357_553784609","dateCreated":"2017-11-05T21:03:57+0100","dateStarted":"2017-11-11T14:30:12+0100","dateFinished":"2017-11-11T14:30:12+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:282"},{"text":"var data_stream_cleaned = data_stream\n .selectExpr(\"CAST(value AS STRING) as string_value\")\n .as[String]\n .map(x => (x.split(\";\"))) //wrapped array\n .map(x => tweet(x(0), x(1), x(2), x(3), x(4), x(5)))\n .selectExpr( \"cast(id as long) id\", \"CAST(created_at as timestamp) created_at\", \"cast(followers_count as int) followers_count\", \"location\", \"cast(favorite_count as int) favorite_count\", \"cast(retweet_count as int) retweet_count\")\n .toDF()\n .filter(col(\"created_at\").gt(current_date())) // kafka will retain data for last 24 hours, this is needed because we are using complete mode as output\n .groupBy(\"location\")\n .agg(count(\"id\"), sum(\"followers_count\"), sum(\"favorite_count\"), sum(\"retweet_count\"))","user":"anonymous","dateUpdated":"2017-11-11T16:32:33+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"data_stream_cleaned: org.apache.spark.sql.DataFrame = [location: string, count(id): bigint ... 3 more fields]\n"}]},"apps":[],"jobName":"paragraph_1509838694466_-674508628","id":"20171105-003814_85265595","dateCreated":"2017-11-05T00:38:14+0100","dateStarted":"2017-11-11T16:32:33+0100","dateFinished":"2017-11-11T16:32:43+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:283"},{"text":"data_stream_cleaned.schema","user":"anonymous","dateUpdated":"2017-11-11T16:32:46+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"res9: org.apache.spark.sql.types.StructType = StructType(StructField(location,StringType,true), StructField(count(id),LongType,false), StructField(sum(followers_count),LongType,true), StructField(sum(favorite_count),LongType,true), StructField(sum(retweet_count),LongType,true))\n"}]},"apps":[],"jobName":"paragraph_1509896863262_62756989","id":"20171105-164743_19706666","dateCreated":"2017-11-05T16:47:43+0100","dateStarted":"2017-11-11T16:32:46+0100","dateFinished":"2017-11-11T16:32:47+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:284"},{"text":"%md \n\n### Output stage\n- specify the following:\n - data sink - exporting to memory (table can be accessed similar to registerTempTable()/ createOrReplaceTempView() function )\n - trigger - time between running the pipeline (ie. when to do: polling for new data, data transformation)\n - output mode - complete, append or update - since in Result stage we use aggregates, we can only use Complete or Update out put mode","user":"anonymous","dateUpdated":"2017-11-11T16:33:38+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
<h3>Output stage</h3>\n<ul>\n<li>specify the following:\n<ul>\n<li>data sink - exporting to memory (table can be accessed similar to registerTempTable()/ createOrReplaceTempView() function)</li>\n<li>trigger - time between running the pipeline (i.e. when to do: polling for new data, data transformation)</li>\n<li>output mode - complete, append or update - since in Result stage we use aggregates, we can only use Complete or Update output mode</li>\n</ul>\n</li>\n</ul>\n
"}]},"apps":[],"jobName":"paragraph_1509912259110_1463706147","id":"20171105-210419_1857413462","dateCreated":"2017-11-05T21:04:19+0100","dateStarted":"2017-11-11T16:33:38+0100","dateFinished":"2017-11-11T16:33:42+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:285"},{"text":"val query = data_stream_cleaned.writeStream\n .format(\"memory\")\n .queryName(\"demo\")\n .trigger(ProcessingTime(\"30 seconds\")) // means that that spark will look for new data only every minute\n .outputMode(\"complete\") // could also be append or update\n .start()","user":"anonymous","dateUpdated":"2017-11-11T16:34:18+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"query: org.apache.spark.sql.streaming.StreamingQuery = Streaming Query demo [id = 824a35c8-af8d-4b57-83c4-4cbf3c1f9418, runId = 5dcdee01-de96-403e-997e-350bfd5d8f2c] [state = ACTIVE]\n"}]},"apps":[],"jobName":"paragraph_1509873755450_-691017020","id":"20171105-102235_1294551369","dateCreated":"2017-11-05T10:22:35+0100","dateStarted":"2017-11-11T16:34:18+0100","dateFinished":"2017-11-11T16:34:22+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:286"},{"text":"query.status","user":"anonymous","dateUpdated":"2017-11-11T16:34:25+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"res10: org.apache.spark.sql.streaming.StreamingQueryStatus =\n{\n \"message\" : \"Processing new data\",\n \"isDataAvailable\" : true,\n \"isTriggerActive\" : true\n}\n"}]},"apps":[],"jobName":"paragraph_1510413138022_529774344","id":"20171111-161218_1171458195","dateCreated":"2017-11-11T16:12:18+0100","dateStarted":"2017-11-11T16:34:25+0100","dateFinished":"2017-11-11T16:34:26+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:287"},{"text":"query.explain","user":"anonymous","dateUpdated":"2017-11-11T16:34:35+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"== Physical Plan ==\n*HashAggregate(keys=[location#36], functions=[count(id#40L), sum(cast(followers_count#42 as bigint)), sum(cast(favorite_count#43 as bigint)), sum(cast(retweet_count#44 as bigint))])\n+- Exchange hashpartitioning(location#36, 2)\n +- *HashAggregate(keys=[location#36], functions=[partial_count(id#40L), partial_sum(cast(followers_count#42 as bigint)), partial_sum(cast(favorite_count#43 as bigint)), partial_sum(cast(retweet_count#44 as bigint))])\n +- *Project [cast(cast(id#33 as decimal(20,0)) as bigint) AS id#40L, cast(cast(followers_count#35 as decimal(20,0)) as int) AS followers_count#42, location#36, cast(cast(favorite_count#37 as decimal(20,0)) as int) AS favorite_count#43, cast(cast(retweet_count#38 as decimal(20,0)) as int) AS retweet_count#44]\n +- *Filter (isnotnull(created_at#34) && (cast(cast(created_at#34 as timestamp) as string) > cast(current_batch_timestamp(1510414470012, DateType) as string)))\n +- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line155526911422.$read$$iw$$iw$tweet, true], top level Product input object).id, 
true) AS id#33, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line155526911422.$read$$iw$$iw$tweet, true], top level Product input object).created_at, true) AS created_at#34, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line155526911422.$read$$iw$$iw$tweet, true], top level Product input object).followers_count, true) AS followers_count#35, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line155526911422.$read$$iw$$iw$tweet, true], top level Product input object).location, true) AS location#36, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line155526911422.$read$$iw$$iw$tweet, true], top level Product input object).favorite_count, true) AS favorite_count#37, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line155526911422.$read$$iw$$iw$tweet, true], top level Product input object).retweet_count, true) AS retweet_count#38]\n +- *MapElements , obj#32: $line155526911422.$read$$iw$$iw$tweet\n +- *MapElements , obj#22: [Ljava.lang.String;\n +- *DeserializeToObject string_value#15.toString, obj#21: java.lang.String\n +- *Project [cast(value#141 as string) AS string_value#15]\n +- Scan ExistingRDD[key#140,value#141,topic#142,partition#143,offset#144L,timestamp#145,timestampType#146]\n"}]},"apps":[],"jobName":"paragraph_1510413000978_-581196590","id":"20171111-161000_1529325108","dateCreated":"2017-11-11T16:10:00+0100","dateStarted":"2017-11-11T16:34:35+0100","dateFinished":"2017-11-11T16:34:36+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:288"},{"text":"%spark.sql \n\nselect * from demo","user":"anonymous","dateUpdated":"2017-11-11T16:39:15+0100","config":{"colWidth":12,"enabled":true,"results":{"0":{"graph":{"mode":"table","height":300,"optionOpen":false},"helium":{}}},"editorSetting":{"language":"sql"},"editorMode":"ace/mode/sql"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TABLE","data":"location\tcount(id)\tsum(followers_count)\tsum(favorite_count)\tsum(retweet_count)\nYokohama\t2\t1110\t0\t115\nAngola, Luanda \t3\t1572\t0\t94221\nハッピーワールド\t4\t384\t0\t0\n推しが定まりません。そらワンいって考えます\t4\t6932\t0\t0\nTallinn, Estonia\t4\t248\t0\t464\n\t233\t310223\t0\t1344636\nOhio\t2\t28806\t0\t0\n京都市右京区\t3\t4272\t0\t0\n孤独\t3\t1059\t0\t0\nGeorgia, USA\t4\t3448\t0\t0\nManaus / São Paulo\t4\t1056\t0\t0\n当てます。\t3\t1299\t0\t0\nTriplex\t4\t7780\t0\t88\nPhilly✈️Tuskegee👉🏾D.C.\t3\t4842\t0\t3\nnew jersey\t4\t864\t0\t0\nさいたま市\t4\t3384\t0\t0\n日本 東京\t3\t2337\t0\t0\nD.O. 도경수\t3\t1428\t0\t3018\n千葉 船橋市\t4\t156\t0\t0\nEntroncamento, Portugal\t4\t1392\t0\t2473\nAntwerpen, Belgium\t3\t1629\t0\t0\nチ ン パ ン .🌷 ¦ヘッダー ⇒ フミ !!!\t3\t669\t0\t208\nEau Claire, Wisconsin\t3\t254721\t0\t1\n@無所属 お仕事依頼DMへ📩\t3\t2406\t0\t0\nJeje l'omo Eko nlo\t3\t1008\t0\t0\nstaten island, new york\t3\t348\t0\t0\nlagos\t3\t3450\t0\t234\nAtlanta GA Metro\t3\t273\t0\t0\nBerks. 
UK\t3\t13383\t0\t66\nΑνδραβιδα\t3\t7398\t0\t5172\nFinland, no polar bears here\t3\t609\t0\t0\nGuayaquil, Ecuador\t3\t1827\t0\t120\n내가 하는 말이 있다면 그건 거짓말\t4\t20\t0\t0\nLouisville, KY\t3\t2250\t0\t0\nottawa\t4\t1024\t0\t0\n‏ً\t4\t11140\t0\t122960\ncairo\t2\t706\t0\t0\nSan Francisco, CA\t3\t1167\t0\t0\n🇧🇸 // 🇯🇲\t3\t3522\t0\t111\nNorwalk, CA\t3\t765\t0\t0\n京都 - 大阪\t4\t2968\t0\t341\nBeen all around the world\t3\t1161\t0\t43983\nUSA\t6\t31782\t0\t0\n호그와트\t3\t183\t0\t1077\nSeabrook, TX\t3\t1023\t0\t9860\nNein. \t3\t5421\t0\t0\nかるわかのストレス清掃員♪\t3\t372\t0\t0\nRochester/London, UK\t3\t7557\t0\t12\nbuenos aires argentina\t3\t444\t0\t0\nHoseok's heart~\t4\t7268\t0\t0\nNorth West, England\t3\t909\t0\t1431\nGloucester, UK\t3\t5424\t0\t0\n愛知県岡崎市\t4\t1964\t0\t0\nアップルパイの中\t3\t759\t0\t0\nMakati City\t3\t3\t0\t129771\nCorrientes\t3\t411\t0\t1521\nYら,猫好き,車好き,ゲーム好きさんと繋がりたい🗿\t3\t630\t0\t0\n{May - Sofi - Dians - Paula}\t3\t48723\t4\t1\nToronto\t4\t3744\t0\t0\nどすこいパレード\t4\t1044\t0\t98\nMalaysia\t4\t1264\t0\t48736\nJapan Tokyo\t4\t32520\t0\t0\n#Adelaido 👑\t2\t2094\t6\t2\nJapan\t3\t126603\t0\t0\nPilipínas\t3\t1362\t0\t20979\nManaus, Brasil\t3\t3222\t0\t0\n127.0.0.1\t3\t5997\t0\t4143\nFigueres\t3\t4485\t0\t0\n運動会する\t4\t256\t0\t0\nRepublic of the Philippines\t3\t1347\t0\t54\nアバリスの膝の上\t3\t630\t0\t0\nPortugal\t2\t1476\t0\t1241\nOhio, USA\t4\t1332\t0\t0\nlincoln, england\t4\t2844\t0\t4\nCalifornia, USA\t3\t1281\t0\t402\nLondon, Ontario\t3\t7467\t0\t0\nOrlando, FL\t4\t1056\t0\t0\nLos Angeles, CA\t7\t29565\t0\t775\nAtlanta, GA\t3\t279\t0\t318\nmismuertos city\t3\t1449\t0\t57932\nカルデアの冷蔵庫\t3\t141\t0\t0\nBarueri, Brasil\t3\t2607\t0\t3\nahead\t3\t831\t1\t1\nparadise with drake\t3\t3426\t0\t0\nmiami\t3\t504\t0\t0\n栃木県\t4\t316\t0\t0\nJHB/SF\t4\t11420\t0\t0\n京都府 京都市 \t3\t18\t0\t0\nMichigan \t4\t3272\t0\t0\nAvellaneda, Buenos Aires\t3\t774\t0\t10923\n대한민국\t3\t63\t0\t1068\n日本\t19\t6294\t0\t231\nether \t4\t88\t0\t16\nPunjab, Pakistan\t2\t2088\t0\t0\nUnited Kingdom\t4\t1740\t0\t0\nToday's Market Movement\t3\t586\t0\t9\nOn Melancholy Hill\t3\t498\t0\t0\nLos Angeles\t4\t612\t0\t0\nMadrid, Comunidad de Madrid\t3\t7578\t0\t375\nMakati City, National Capital \t3\t660\t0\t36\nSingapore \t3\t588\t0\t204\nIndia\t3\t342\t0\t0\nNorth Riding/Samrand\t3\t315\t0\t29427\nBinghamton University\t3\t2166\t0\t0\nGurgaon, India\t4\t18184\t0\t0\n広島県\t3\t642\t0\t206\nみんなの心の中☆\t2\t204\t0\t0\n"}]},"apps":[],"jobName":"paragraph_1509873913879_-781586253","id":"20171105-102513_1944185364","dateCreated":"2017-11-05T10:25:13+0100","dateStarted":"2017-11-11T16:39:16+0100","dateFinished":"2017-11-11T16:39:16+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:289"},{"text":"%md\n\n### Connecting to redshift cluster\n- defining JDBC connection to connect to redshift","user":"anonymous","dateUpdated":"2017-11-11T14:30:16+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
<h3>Connecting to redshift cluster</h3>\n<ul>\n<li>defining JDBC connection to connect to redshift</li>\n</ul>\n
"}]},"apps":[],"jobName":"paragraph_1509909816175_369050675","id":"20171105-202336_1741159108","dateCreated":"2017-11-05T20:23:36+0100","dateStarted":"2017-11-11T14:30:16+0100","dateFinished":"2017-11-11T14:30:16+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:290"},{"text":"//create properties object\nClass.forName(\"com.amazon.redshift.jdbc42.Driver\")\n\nval prop = new java.util.Properties\nprop.setProperty(\"driver\", \"com.amazon.redshift.jdbc42.Driver\")\nprop.setProperty(\"user\", \"dorian\")\nprop.setProperty(\"password\", \"Demo1234\") \n\n//jdbc mysql url - destination database is named \"data\"\nval url = \"jdbc:redshift://data-warehouse.c3glymsgdgty.us-east-1.redshift.amazonaws.com:5439/lambda\"\n\n//destination database table \nval table = \"speed_layer\"","user":"anonymous","dateUpdated":"2017-11-11T16:36:02+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"res13: Class[_] = class com.amazon.redshift.jdbc42.Driver\nprop: java.util.Properties = {}\nres15: Object = null\nres16: Object = null\nres17: Object = null\nurl: String = jdbc:redshift://data-warehouse.c3glymsgdgty.us-east-1.redshift.amazonaws.com:5439/lambda\ntable: String = speed_layer\n"}]},"apps":[],"jobName":"paragraph_1509874084488_-1803262626","id":"20171105-102804_443283701","dateCreated":"2017-11-05T10:28:04+0100","dateStarted":"2017-11-11T16:36:02+0100","dateFinished":"2017-11-11T16:36:07+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:291"},{"text":"%md \n\n### Exporting data to Redshift\n- \"overwriting\" the table with results of query stored in memory as result of the speed layer\n- scheduling the function to run every hour\n","user":"anonymous","dateUpdated":"2017-11-11T14:30:18+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
<h3>Exporting data to Redshift</h3>\n<ul>\n<li>“overwriting” the table with results of query stored in memory as result of the speed layer</li>\n<li>scheduling the function to run every hour</li>\n</ul>\n
"}]},"apps":[],"jobName":"paragraph_1509881434795_1558195972","id":"20171105-123034_774658214","dateCreated":"2017-11-05T12:30:34+0100","dateStarted":"2017-11-11T14:30:18+0100","dateFinished":"2017-11-11T14:30:18+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:292"},{"text":"val df = spark.sql(\"select * from demo\")\n\n//write data from spark dataframe to database\ndf.write.mode(\"overwrite\").jdbc(url, table, prop)\n","user":"anonymous","dateUpdated":"2017-11-11T16:39:19+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"df: org.apache.spark.sql.DataFrame = [location: string, count(id): bigint ... 3 more fields]\n"}]},"apps":[],"jobName":"paragraph_1509878690443_-615786910","id":"20171105-114450_965590151","dateCreated":"2017-11-05T11:44:50+0100","dateStarted":"2017-11-11T16:39:19+0100","dateFinished":"2017-11-11T16:39:30+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:293"},{"text":"","user":"anonymous","dateUpdated":"2017-11-11T16:37:05+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1509949807769_1425690320","id":"20171106-073007_1469584911","dateCreated":"2017-11-06T07:30:07+0100","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:294"}],"name":"Implementing the speed layer of lambda architecture using Structured Spark Streaming","id":"2CY6MSDVK","angularObjects":{"2CZTSJM9A:shared_process":[],"2CXD9FT1P:shared_process":[],"2CYBUJCZE:shared_process":[],"2CZXYWRND:shared_process":[],"2CWX9E9KA:shared_process":[],"2CWJS6R2N:shared_process":[],"2CW6U7X7Z:shared_process":[],"2CY9X3W1T:shared_process":[],"2CX93H291:shared_process":[],"2CZRYW3SZ:shared_process":[],"2CYT9Z9RC:shared_process":[],"2CY2R49R6:shared_process":[],"2CYQW36AU:shared_process":[],"2CWPRKMXH:shared_process":[],"2CWU95D3A:shared_process":[],"2CXJ7UBRF:shared_process":[],"2CWWTMY7M:shared_process":[],"2CY3EBJAE:shared_process":[],"2CYFQZER9:shared_process":[]},"config":{"looknfeel":"default","personalizedMode":"false"},"info":{}} -------------------------------------------------------------------------------- /Ingesting realtime tweets using Apache Kafka, Tweepy and Python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Ingesting realtime tweets using Apache Kafka, Tweepy and Python\n", 8 | "\n", 9 | "### Purpose:\n", 10 | "- main data source for the lambda architecture pipeline\n", 11 | "- uses twitter streaming API to simulate new events coming in every minute\n", 12 | "- Kafka Producer sends the tweets as records to the Kafka Broker\n", 13 | "\n", 14 | "### Contents: \n", 15 | "- [Twitter setup](#1)\n", 16 | "- [Defining the Kafka producer](#2)\n", 17 | "- [Producing and sending records to the Kafka Broker](#3)\n", 18 | "- [Deployment](#4)" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "### Required libraries" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 1, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "import tweepy\n", 35 | "import time\n", 36 | "from kafka import KafkaConsumer, 
KafkaProducer" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "\n", 44 | "### Twitter setup\n", 45 | "- getting the API object using authorization information\n", 46 | "- you can find more details on how to get the authorization here:\n", 47 | "https://developer.twitter.com/en/docs/basics/authentication/overview" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 2, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "# twitter setup\n", 57 | "consumer_key = \"1\"\n", 58 | "consumer_secret = \"2\"\n", 59 | "access_token = \"3\"\n", 60 | "access_token_secret = \"4\"\n", 61 | "# Creating the authentication object\n", 62 | "auth = tweepy.OAuthHandler(consumer_key, consumer_secret)\n", 63 | "# Setting your access token and secret\n", 64 | "auth.set_access_token(access_token, access_token_secret)\n", 65 | "# Creating the API object by passing in auth information\n", 66 | "api = tweepy.API(auth) \n" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "A helper function to normalize the time a tweet was created with the time of our system" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 3, 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "from datetime import datetime, timedelta\n", 83 | "\n", 84 | "def normalize_timestamp(time):\n", 85 | " mytime = datetime.strptime(time, \"%Y-%m-%d %H:%M:%S\")\n", 86 | " mytime += timedelta(hours=1) # the tweets are timestamped in GMT timezone, while I am in +1 timezone\n", 87 | " return (mytime.strftime(\"%Y-%m-%d %H:%M:%S\")) " 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "\n", 95 | "### Defining the Kafka producer\n", 96 | "- specify the Kafka Broker\n", 97 | "- specify the topic name\n", 98 | "- optional: specify partitioning strategy" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 4, 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "producer = KafkaProducer(bootstrap_servers='localhost:9092')\n", 108 | "topic_name = 'tweets-lambda1'" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "\n", 116 | "### Producing and sending records to the Kafka Broker\n", 117 | "- querying the Twitter API Object\n", 118 | "- extracting relevant information from the response\n", 119 | "- formatting and sending the data to proper topic on the Kafka Broker\n", 120 | "- resulting tweets have following attributes:\n", 121 | " - id \n", 122 | " - created_at\n", 123 | " - followers_count\n", 124 | " - location\n", 125 | " - favorite_count\n", 126 | " - retweet_count" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 6, 132 | "metadata": {}, 133 | "outputs": [], 134 | "source": [ 135 | "def get_twitter_data():\n", 136 | " res = api.search(\"Apple OR iphone OR iPhone\")\n", 137 | " for i in res:\n", 138 | " record = ''\n", 139 | " record += str(i.user.id_str)\n", 140 | " record += ';'\n", 141 | " record += str(normalize_timestamp(str(i.created_at)))\n", 142 | " record += ';'\n", 143 | " record += str(i.user.followers_count)\n", 144 | " record += ';'\n", 145 | " record += str(i.user.location)\n", 146 | " record += ';'\n", 147 | " record += str(i.favorite_count)\n", 148 | " record += ';'\n", 149 | " record += str(i.retweet_count)\n", 150 | " record += ';'\n", 151 | " producer.send(topic_name, str.encode(record))" 152 | ] 153 | }, 154 | { 155 | "cell_type": 
"code", 156 | "execution_count": 9, 157 | "metadata": {}, 158 | "outputs": [], 159 | "source": [ 160 | "get_twitter_data()" 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "\n", 168 | "### Deployment \n", 169 | "- perform the task every couple of minutes and wait in between" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": 11, 175 | "metadata": {}, 176 | "outputs": [], 177 | "source": [ 178 | "def periodic_work(interval):\n", 179 | " while True:\n", 180 | " get_twitter_data()\n", 181 | " #interval should be an integer, the number of seconds to wait\n", 182 | " time.sleep(interval)\n" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "metadata": {}, 189 | "outputs": [], 190 | "source": [ 191 | "periodic_work(60 * 0.1) # get data every couple of minutes" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": null, 197 | "metadata": {}, 198 | "outputs": [], 199 | "source": [] 200 | } 201 | ], 202 | "metadata": { 203 | "kernelspec": { 204 | "display_name": "Python 3", 205 | "language": "python", 206 | "name": "python3" 207 | }, 208 | "language_info": { 209 | "codemirror_mode": { 210 | "name": "ipython", 211 | "version": 3 212 | }, 213 | "file_extension": ".py", 214 | "mimetype": "text/x-python", 215 | "name": "python", 216 | "nbconvert_exporter": "python", 217 | "pygments_lexer": "ipython3", 218 | "version": "3.6.1" 219 | } 220 | }, 221 | "nbformat": 4, 222 | "nbformat_minor": 2 223 | } 224 | -------------------------------------------------------------------------------- /Lambda_architecture-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dorianbg/lambda-architecture-demo/879a4a2d7b1cb3496c0bb5c63eed07a66d834b33/Lambda_architecture-2.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Developing Lambda Architecture 2 | 3 |

This is a repository for the code found in my series of blog posts on implementing the Lambda Architecture:

4 |
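
## Note: creating the `batch_raw` staging table

The batch-layer notebook loads semicolon-delimited tweet records into a Redshift table named `batch_raw` with the COPY command, but the table's DDL is not included in this repository. The sketch below is one possible way to create it with psycopg2; the column types and the placeholder connection details are assumptions inferred from the six fields the Kafka producer emits (id, created_at, followers_count, location, favorite_count, retweet_count), not code from the original notebooks.

```python
# Hypothetical one-off setup script (not part of the original notebooks).
# It creates the batch_raw staging table that the batch-layer notebook's
# COPY command loads into. Column types are guesses based on the producer's
# record format: id;created_at;followers_count;location;favorite_count;retweet_count
import psycopg2

conn = psycopg2.connect(dbname='lambda', host='<your-redshift-endpoint>',
                        port='5439', user='<user>', password='<password>')
curs = conn.cursor()
curs.execute("""
    CREATE TABLE IF NOT EXISTS batch_raw (
        id               BIGINT,
        created_at       TIMESTAMP,
        followers_count  INT,
        location         VARCHAR(256),
        favorite_count   INT,
        retweet_count    INT
    );
""")
conn.commit()
curs.close()
conn.close()
```

The serving-layer notebook likewise assumes that `batch_layer`, `speed_layer` and an initial `serving_layer` table already exist before its first run, since it drops `serving_layer` without an `IF EXISTS` guard.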