├── Implementing the batch layer of lambda architecture using S3, Redshift and Apache Kafka.ipynb ├── Implementing the serving layer of lambda architecture using Redshift.ipynb ├── Implementing the speed layer of lambda architecture using Structured Spark Streaming ├── Ingesting realtime tweets using Apache Kafka, Tweepy and Python.ipynb ├── Lambda_architecture-2.png └── README.md /Implementing the batch layer of lambda architecture using S3, Redshift and Apache Kafka.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Implementing the batch layer of lambda architecture using S3, Redshift and Apache Kafka\n", 8 | "\n", 9 | "### Purpose:\n", 10 | "- store all the tweets that were produced by Kafka Producer into S3\n", 11 | "- export them into Redshift\n", 12 | "- perform aggregation on the tweets to get the desired output of batch layer\n", 13 | "- achieve this by: \n", 14 | " - every couple of hours get the latest unseen tweets produced by the Kafka Producer and store them into an S3 archive\n", 15 | " - every night run a SQL query to compute the result of the batch layer\n", 16 | "\n", 17 | "### Contents: \n", 18 | "- [Defining the Kafka consumer](#1)\n", 19 | "- [Defining an Amazon Web Services S3 storage client](#2)\n", 20 | "- [Writing the data into an S3 bucket](#3)\n", 21 | "- [Exporting data from S3 bucket to Amazon Redshift using COPY command](#4)\n", 22 | "- [Computing the batch layer output](#5)\n", 23 | "- [Deployment](#6)" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "### Required libraries" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 1, 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "from kafka import KafkaConsumer\n", 40 | "from io import StringIO\n", 41 | "import boto3\n", 42 | "import time\n", 43 | "import random" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "\n", 51 | "### Defining the Kafka consumer\n", 52 | "- setting the location of the Kafka Broker\n", 53 | "- specifying the group_id and consumer_timeout\n", 54 | "- subscribing to a topic" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 2, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "consumer = KafkaConsumer(\n", 64 | " bootstrap_servers='localhost:9092',\n", 65 | " auto_offset_reset='latest', # Reset partition offsets upon OffsetOutOfRangeError\n", 66 | " group_id='test', # must have a unique consumer group id \n", 67 | " consumer_timeout_ms=1000) \n", 68 | " # How long to listen for messages - 1 second is enough here \n", 69 | " # because we poll the kafka broker only every couple of hours\n", 70 | "\n", 71 | "consumer.subscribe('tweets-lambda1')" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "\n", 79 | "### Defining an Amazon Web Services S3 storage client\n", 80 | "- setting the authorization and bucket" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 3, 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "s3_resource = boto3.resource(\n", 90 | " 's3',\n", 91 | " aws_access_key_id='AKIAIXUPHT6ERRMQYINQ',\n", 92 | " aws_secret_access_key='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',\n", 93 | ")\n", 94 | "\n", 95 | "s3_client = s3_resource.meta.client\n", 96 | "bucket_name = 'lambda-architecture123'\n" 97 | ] 98 | }, 99 | { 100
| "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "\n", 104 | "### Writing the data into a S3 bucket\n", 105 | "- polling the Kafka Broker\n", 106 | "- aggregating the latest messages into a single object in the bucket\n", 107 | "\n" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": 4, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "def store_twitter_data(path):\n", 117 | " csv_buffer = StringIO() # S3 storage is object storage -> our document is just a large string\n", 118 | "\n", 119 | " for message in consumer: # this acts as \"get me an iterator over the latest messages I haven't seen\"\n", 120 | " csv_buffer.write(message.value.decode() + '\\n') \n", 121 | "# print(message)\n", 122 | " s3_resource.Object(bucket_name,path).put(Body=csv_buffer.getvalue())" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "\n", 130 | "### Exporting data from S3 bucket to Amazon Redshift using COPY command\n", 131 | "- authenticate and create a connection using psycopg module\n", 132 | "- export data using COPY command from S3 to Redshift \"raw\" table" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 19, 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "import psycopg2\n", 142 | "config = { 'dbname': 'lambda', \n", 143 | " 'user':'dorian',\n", 144 | " 'pwd':'Demo1234',\n", 145 | " 'host':'data-warehouse.c3glymsgdgty.us-east-1.redshift.amazonaws.com',\n", 146 | " 'port':'5439'\n", 147 | " }\n", 148 | "conn = psycopg2.connect(dbname=config['dbname'], host=config['host'], \n", 149 | " port=config['port'], user=config['user'], \n", 150 | " password=config['pwd'])" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": 6, 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [ 159 | "def copy_files(conn, path):\n", 160 | " curs = conn.cursor()\n", 161 | " curs.execute(\"\"\" \n", 162 | " copy \n", 163 | " batch_raw\n", 164 | " from \n", 165 | " 's3://lambda-architecture123/\"\"\" + path + \"\"\"' \n", 166 | " access_key_id 'AKIAIXUPHT6ERRMQYINQ'\n", 167 | " secret_access_key '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'\n", 168 | " delimiter ';'\n", 169 | " region 'eu-central-1'\n", 170 | " \"\"\")\n", 171 | " curs.close()\n", 172 | " conn.commit()\n" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "### Computing the batch layer output\n", 180 | "- querying the raw tweets stored in redshift to get the desired batch layer output" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 7, 186 | "metadata": {}, 187 | "outputs": [], 188 | "source": [ 189 | "def compute_batch_layer(conn):\n", 190 | " curs = conn.cursor()\n", 191 | " curs.execute(\"\"\" \n", 192 | " drop table if exists batch_layer;\n", 193 | "\n", 194 | " with raw_dedup as (\n", 195 | " SELECT\n", 196 | " distinct id,created_at,followers_count,location,favorite_count,retweet_count\n", 197 | " FROM\n", 198 | " batch_raw\n", 199 | " ),\n", 200 | " batch_result as (\n", 201 | " SELECT\n", 202 | " location,\n", 203 | " count(id) as count_id,\n", 204 | " sum(followers_count) as sum_followers_count,\n", 205 | " sum(favorite_count) as sum_favorite_count,\n", 206 | " sum(retweet_count) as sum_retweet_count\n", 207 | " FROM\n", 208 | " raw_dedup\n", 209 | " group by \n", 210 | " location\n", 211 | " )\n", 212 | " select \n", 213 | " *\n", 214 | " INTO\n", 215 | " batch_layer\n", 216 | " 
FROM\n", 217 | " batch_result\"\"\")\n", 218 | " curs.close()\n", 219 | " conn.commit()" 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": 8, 225 | "metadata": {}, 226 | "outputs": [], 227 | "source": [ 228 | "# compute_batch_layer(conn)" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "\n", 236 | "### Deployment \n", 237 | "- perform the task every couple of hours and wait in between" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": 9, 243 | "metadata": {}, 244 | "outputs": [], 245 | "source": [ 246 | "def periodic_work(interval):\n", 247 | " while True:\n", 248 | " path = 'tweets/'+ time.strftime(\"%Y/%m/%d/%H\") + '_tweets_' + str(random.randint(1,1000)) + '.log'\n", 249 | " store_twitter_data(path)\n", 250 | " copy_files(conn, path)\n", 251 | " #interval should be an integer, the number of seconds to wait\n", 252 | " time.sleep(interval)" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 10, 258 | "metadata": {}, 259 | "outputs": [], 260 | "source": [ 261 | "# periodic_work(60 * 60) ## 60 minutes !" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 22, 267 | "metadata": {}, 268 | "outputs": [], 269 | "source": [ 270 | "path = 'tweets/'+ time.strftime(\"%Y/%m/%d/%H\") + '_tweets_' + str(random.randint(1,1000)) + '.log'\n", 271 | "\n", 272 | "store_twitter_data(path)" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": 23, 278 | "metadata": {}, 279 | "outputs": [ 280 | { 281 | "name": "stderr", 282 | "output_type": "stream", 283 | "text": [ 284 | "ERROR:root:An unexpected error occurred while tokenizing input\n", 285 | "The following traceback may be corrupted or invalid\n", 286 | "The error message is: ('EOF in multi-line string', (1, 4))\n", 287 | "\n" 288 | ] 289 | }, 290 | { 291 | "ename": "DatabaseError", 292 | "evalue": "SSL SYSCALL error: Operation timed out\n", 293 | "output_type": "error", 294 | "traceback": [ 295 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 296 | "\u001b[0;31mDatabaseError\u001b[0m Traceback (most recent call last)", 297 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mcopy_files\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mconn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 298 | "\u001b[0;32m\u001b[0m in \u001b[0;36mcopy_files\u001b[0;34m(conn, path)\u001b[0m\n\u001b[1;32m 10\u001b[0m \u001b[0mdelimiter\u001b[0m \u001b[0;34m';'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0mregion\u001b[0m \u001b[0;34m'eu-central-1'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 12\u001b[0;31m \"\"\")\n\u001b[0m\u001b[1;32m 13\u001b[0m \u001b[0mcurs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mclose\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 14\u001b[0m \u001b[0mconn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcommit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 299 | "\u001b[0;31mDatabaseError\u001b[0m: SSL SYSCALL error: Operation timed out\n" 300 | ] 301 | } 302 | ], 303 | "source": [ 304 | "copy_files(conn, path)" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": 21, 310 | "metadata": {}, 311 | "outputs": [], 312 | "source": [ 313 | "# run at the end of the day\n", 
314 | "compute_batch_layer(conn)" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": 14, 320 | "metadata": {}, 321 | "outputs": [], 322 | "source": [ 323 | "#conn.close()" 324 | ] 325 | } 326 | ], 327 | "metadata": { 328 | "kernelspec": { 329 | "display_name": "Python 3", 330 | "language": "python", 331 | "name": "python3" 332 | }, 333 | "language_info": { 334 | "codemirror_mode": { 335 | "name": "ipython", 336 | "version": 3 337 | }, 338 | "file_extension": ".py", 339 | "mimetype": "text/x-python", 340 | "name": "python", 341 | "nbconvert_exporter": "python", 342 | "pygments_lexer": "ipython3", 343 | "version": "3.6.1" 344 | } 345 | }, 346 | "nbformat": 4, 347 | "nbformat_minor": 2 348 | } 349 | -------------------------------------------------------------------------------- /Implementing the serving layer of lambda architecture using Redshift.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Implementing the serving layer of lambda architecture using Redshift\n", 8 | "\n", 9 | "### Purpose:\n", 10 | "- merge the output of speed and batch layer aggregations\n", 11 | "- achieve this by: \n", 12 | " - every couple of hours run the re-computation\n", 13 | " - use the output of batch layer as base table\n", 14 | " - upsert the up-to-date values of speed layer into the base table \n", 15 | "\n", 16 | "### Contents: \n", 17 | "- [Creating the serving layer](#1)" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "### Requirements" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 1, 30 | "metadata": {}, 31 | "outputs": [], 32 | "source": [ 33 | "import psycopg2" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "\n", 41 | "### Creating the serving layer\n", 42 | "- authenticate and create a connection using psycopg module\n", 43 | "- create and populate a temporary table with it's base being batch layer and upserting the speed layer\n", 44 | "- drop the current serving layer and use the above mentioned temporary table for serving layer (no downtime)" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 2, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "config = { 'dbname': 'lambda', \n", 54 | " 'user':'dorian',\n", 55 | " 'pwd':'Demo1234',\n", 56 | " 'host':'data-warehouse.c3glymsgdgty.us-east-1.redshift.amazonaws.com',\n", 57 | " 'port':'5439'\n", 58 | " }\n", 59 | "conn = psycopg2.connect(dbname=config['dbname'], host=config['host'], \n", 60 | " port=config['port'], user=config['user'], \n", 61 | " password=config['pwd'])" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 3, 67 | "metadata": {}, 68 | "outputs": [], 69 | "source": [ 70 | "curs = conn.cursor()\n", 71 | "curs.execute(\"\"\" \n", 72 | " DROP TABLE IF EXISTS serving_layer_temp; \n", 73 | "\n", 74 | " SELECT \n", 75 | " *\n", 76 | " INTO \n", 77 | " serving_layer_temp\n", 78 | " FROM \n", 79 | " batch_layer ;\n", 80 | "\n", 81 | "\n", 82 | "\n", 83 | " UPDATE \n", 84 | " serving_layer_temp\n", 85 | " SET\n", 86 | " count_id = count_id + speed_layer.\"count(id)\",\n", 87 | " sum_followers_count = sum_followers_count + speed_layer.\"sum(followers_count)\",\n", 88 | " sum_favorite_count = sum_favorite_count + speed_layer.\"sum(favorite_count)\",\n", 89 | " sum_retweet_count = sum_retweet_count + 
speed_layer.\"sum(retweet_count)\"\n", 90 | " FROM\n", 91 | " speed_layer\n", 92 | " WHERE \n", 93 | " serving_layer_temp.location = speed_layer.location ;\n", 94 | "\n", 95 | "\n", 96 | "\n", 97 | " INSERT INTO \n", 98 | " serving_layer_temp\n", 99 | " SELECT \n", 100 | " * \n", 101 | " FROM \n", 102 | " speed_layer\n", 103 | " WHERE \n", 104 | " speed_layer.location \n", 105 | " NOT IN (\n", 106 | " SELECT \n", 107 | " DISTINCT location \n", 108 | " FROM \n", 109 | " serving_layer_temp \n", 110 | " ) ;\n", 111 | " \n", 112 | " \n", 113 | " drop table serving_layer ;\n", 114 | " \n", 115 | " alter table serving_layer_temp rename to serving_layer ; \n", 116 | " \n", 117 | "\"\"\")\n", 118 | "curs.close()\n", 119 | "conn.commit()\n", 120 | "conn.close()" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": {}, 127 | "outputs": [], 128 | "source": [] 129 | } 130 | ], 131 | "metadata": { 132 | "kernelspec": { 133 | "display_name": "Python 3", 134 | "language": "python", 135 | "name": "python3" 136 | }, 137 | "language_info": { 138 | "codemirror_mode": { 139 | "name": "ipython", 140 | "version": 3 141 | }, 142 | "file_extension": ".py", 143 | "mimetype": "text/x-python", 144 | "name": "python", 145 | "nbconvert_exporter": "python", 146 | "pygments_lexer": "ipython3", 147 | "version": "3.6.1" 148 | } 149 | }, 150 | "nbformat": 4, 151 | "nbformat_minor": 2 152 | } 153 | -------------------------------------------------------------------------------- /Implementing the speed layer of lambda architecture using Structured Spark Streaming: -------------------------------------------------------------------------------- 1 | {"paragraphs":[{"text":"%md \n\n# Implementing the speed layer of lambda architecture using Structured Spark Streaming\n\n### Purpose: \n- provide analytics on real time data (\"intra day\") which batch layer cannot efficiently achieve\n- achieve this by:\n - ingest latest tweets from Kafka Producer and analtze only those for the current day \n - perform aggregations over the data to get the desired output of speed layer\n\n### Contents: \n- Configuring spark\n- Spark Structured Streaming\n - Input stage - defining the data source\n - Result stage - performing transformations on the stream\n - Output stage\n- Connecting to redshift cluster\n- Exporting data to Redshift","user":"anonymous","dateUpdated":"2017-11-11T15:55:51+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
<h1>Implementing the speed layer of lambda architecture using Structured Spark Streaming</h1>\n<h3>Purpose:</h3>\n<ul>\n<li>provide analytics on real time data (“intra day”) which batch layer cannot efficiently achieve</li>\n<li>achieve this by:\n<ul>\n<li>ingest latest tweets from Kafka Producer and analyze only those for the current day</li>\n<li>perform aggregations over the data to get the desired output of speed layer</li>\n</ul>\n</li>\n</ul>\n<h3>Contents:</h3>\n<ul>\n<li>Configuring spark</li>\n<li>Spark Structured Streaming\n<ul>\n<li>Input stage - defining the data source</li>\n<li>Result stage - performing transformations on the stream</li>\n<li>Output stage</li>\n</ul>\n</li>\n<li>Connecting to redshift cluster</li>\n<li>Exporting data to Redshift</li>\n</ul>\n
"}]},"apps":[],"jobName":"paragraph_1509909863130_-1135889452","id":"20171105-202423_1160220584","dateCreated":"2017-11-05T20:24:23+0100","dateStarted":"2017-11-11T15:55:51+0100","dateFinished":"2017-11-11T15:55:51+0100","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:273"},{"text":"%dep\n\nz.load(\"/Volumes/SD/Downloads/RedshiftJDBC42-1.2.10.1009.jar\")","user":"anonymous","dateUpdated":"2017-11-11T16:24:36+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"res0: org.apache.zeppelin.dep.Dependency = org.apache.zeppelin.dep.Dependency@7dfb0e5a\n"}]},"apps":[],"jobName":"paragraph_1509878460764_1962878483","id":"20171105-114100_1870675802","dateCreated":"2017-11-05T11:41:00+0100","dateStarted":"2017-11-11T16:24:36+0100","dateFinished":"2017-11-11T16:24:44+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:274"},{"text":"%md \n\n### Requirements","user":"anonymous","dateUpdated":"2017-11-11T14:30:05+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
<h3>Requirements</h3>\n
"}]},"apps":[],"jobName":"paragraph_1510394086125_-1443536682","id":"20171111-105446_1758804734","dateCreated":"2017-11-11T10:54:46+0100","dateStarted":"2017-11-11T14:30:05+0100","dateFinished":"2017-11-11T14:30:05+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:275"},{"text":"import org.apache.spark.sql.SparkSession\nimport org.apache.spark.sql.streaming.ProcessingTime\nimport java.util.concurrent._","user":"anonymous","dateUpdated":"2017-11-11T16:24:59+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"import org.apache.spark.sql.SparkSession\nimport org.apache.spark.sql.streaming.ProcessingTime\nimport java.util.concurrent._\n"}]},"apps":[],"jobName":"paragraph_1510394103383_-688043084","id":"20171111-105503_1480409678","dateCreated":"2017-11-11T10:55:03+0100","dateStarted":"2017-11-11T16:25:00+0100","dateFinished":"2017-11-11T16:25:18+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:276"},{"text":"%md \n\n### Configuring Spark\n- properly configuring spark for our workload\n- defining case class for tweets which will be used later on","user":"anonymous","dateUpdated":"2017-11-11T14:30:07+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
<h3>Configuring Spark</h3>\n<ul>\n<li>properly configuring spark for our workload</li>\n<li>defining case class for tweets which will be used later on</li>\n</ul>\n
"}]},"apps":[],"jobName":"paragraph_1509911366990_1694144500","id":"20171105-204926_42354742","dateCreated":"2017-11-05T20:49:26+0100","dateStarted":"2017-11-11T14:30:07+0100","dateFinished":"2017-11-11T14:30:07+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:277"},{"text":"Thread.sleep(5000)\n\nval spark = SparkSession\n .builder()\n .config(\"spark.sql.shuffle.partitions\",\"2\") // we are running this on my laptop\n .appName(\"Spark Structured Streaming example\")\n .getOrCreate()\n \ncase class tweet (id: String, created_at : String, followers_count: String, location : String, favorite_count : String, retweet_count : String)\n\nThread.sleep(5000)","user":"anonymous","dateUpdated":"2017-11-11T16:25:20+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@20d06238\ndefined class tweet\n"}]},"apps":[],"jobName":"paragraph_1509779101105_500060833","id":"20171104-080501_1626533215","dateCreated":"2017-11-04T08:05:01+0100","dateStarted":"2017-11-11T16:25:20+0100","dateFinished":"2017-11-11T16:25:33+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:278"},{"text":"%md \n\n### Input stage - defining the data source\n- using Kafka as data source we specify:\n - location of kafka broker\n - relevant kafka topic\n - how to treat starting offsets","user":"anonymous","dateUpdated":"2017-11-11T14:30:09+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
<h3>Input stage - defining the data source</h3>\n<ul>\n<li>using Kafka as data source we specify:\n<ul>\n<li>location of kafka broker</li>\n<li>relevant kafka topic</li>\n<li>how to treat starting offsets</li>\n</ul>\n</li>\n</ul>\n
"}]},"apps":[],"jobName":"paragraph_1509881379573_-268499860","id":"20171105-122939_1561723250","dateCreated":"2017-11-05T12:29:39+0100","dateStarted":"2017-11-11T14:30:09+0100","dateFinished":"2017-11-11T14:30:09+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:279"},{"text":"var data_stream = spark\n .readStream // contantly expanding dataframe\n .format(\"kafka\")\n .option(\"kafka.bootstrap.servers\", \"localhost:9092\")\n .option(\"subscribe\", \"tweets-lambda1\")\n .option(\"startingOffsets\",\"latest\") //or latest\n .load()\n \n// note how similar API is to the batch version","user":"anonymous","dateUpdated":"2017-11-11T16:30:07+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"data_stream: org.apache.spark.sql.DataFrame = [key: binary, value: binary ... 5 more fields]\n"}]},"apps":[],"jobName":"paragraph_1509828449824_1485333558","id":"20171104-214729_426230385","dateCreated":"2017-11-04T21:47:29+0100","dateStarted":"2017-11-11T16:30:07+0100","dateFinished":"2017-11-11T16:30:14+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:280"},{"text":"data_stream.schema","user":"anonymous","dateUpdated":"2017-11-11T16:30:20+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"res8: org.apache.spark.sql.types.StructType = StructType(StructField(key,BinaryType,true), StructField(value,BinaryType,true), StructField(topic,StringType,true), StructField(partition,IntegerType,true), StructField(offset,LongType,true), StructField(timestamp,TimestampType,true), StructField(timestampType,IntegerType,true))\n"}]},"apps":[],"jobName":"paragraph_1509896872792_453692520","id":"20171105-164752_719330437","dateCreated":"2017-11-05T16:47:52+0100","dateStarted":"2017-11-11T16:30:20+0100","dateFinished":"2017-11-11T16:30:23+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:281"},{"text":"%md \n\n### Result stage - performing transformations on the stream\n- extract the value column of kafka message\n- parse each row into a member of tweet class\n- filter to only look at todays tweets as results\n- perform aggregations","user":"anonymous","dateUpdated":"2017-11-11T14:30:12+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
<h3>Result stage - performing transformations on the stream</h3>\n<ul>\n<li>extract the value column of kafka message</li>\n<li>parse each row into a member of tweet class</li>\n<li>filter to only look at today's tweets as results</li>\n<li>perform aggregations</li>\n</ul>\n
"}]},"apps":[],"jobName":"paragraph_1509912237119_-1371939219","id":"20171105-210357_553784609","dateCreated":"2017-11-05T21:03:57+0100","dateStarted":"2017-11-11T14:30:12+0100","dateFinished":"2017-11-11T14:30:12+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:282"},{"text":"var data_stream_cleaned = data_stream\n .selectExpr(\"CAST(value AS STRING) as string_value\")\n .as[String]\n .map(x => (x.split(\";\"))) //wrapped array\n .map(x => tweet(x(0), x(1), x(2), x(3), x(4), x(5)))\n .selectExpr( \"cast(id as long) id\", \"CAST(created_at as timestamp) created_at\", \"cast(followers_count as int) followers_count\", \"location\", \"cast(favorite_count as int) favorite_count\", \"cast(retweet_count as int) retweet_count\")\n .toDF()\n .filter(col(\"created_at\").gt(current_date())) // kafka will retain data for last 24 hours, this is needed because we are using complete mode as output\n .groupBy(\"location\")\n .agg(count(\"id\"), sum(\"followers_count\"), sum(\"favorite_count\"), sum(\"retweet_count\"))","user":"anonymous","dateUpdated":"2017-11-11T16:32:33+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"data_stream_cleaned: org.apache.spark.sql.DataFrame = [location: string, count(id): bigint ... 3 more fields]\n"}]},"apps":[],"jobName":"paragraph_1509838694466_-674508628","id":"20171105-003814_85265595","dateCreated":"2017-11-05T00:38:14+0100","dateStarted":"2017-11-11T16:32:33+0100","dateFinished":"2017-11-11T16:32:43+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:283"},{"text":"data_stream_cleaned.schema","user":"anonymous","dateUpdated":"2017-11-11T16:32:46+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"res9: org.apache.spark.sql.types.StructType = StructType(StructField(location,StringType,true), StructField(count(id),LongType,false), StructField(sum(followers_count),LongType,true), StructField(sum(favorite_count),LongType,true), StructField(sum(retweet_count),LongType,true))\n"}]},"apps":[],"jobName":"paragraph_1509896863262_62756989","id":"20171105-164743_19706666","dateCreated":"2017-11-05T16:47:43+0100","dateStarted":"2017-11-11T16:32:46+0100","dateFinished":"2017-11-11T16:32:47+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:284"},{"text":"%md \n\n### Output stage\n- specify the following:\n - data sink - exporting to memory (table can be accessed similar to registerTempTable()/ createOrReplaceTempView() function )\n - trigger - time between running the pipeline (ie. when to do: polling for new data, data transformation)\n - output mode - complete, append or update - since in Result stage we use aggregates, we can only use Complete or Update out put mode","user":"anonymous","dateUpdated":"2017-11-11T16:33:38+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
<h3>Output stage</h3>\n<ul>\n<li>specify the following:\n<ul>\n<li>data sink - exporting to memory (table can be accessed similar to registerTempTable()/ createOrReplaceTempView() function)</li>\n<li>trigger - time between running the pipeline (i.e. when to do: polling for new data, data transformation)</li>\n<li>output mode - complete, append or update - since in Result stage we use aggregates, we can only use Complete or Update output mode</li>\n</ul>\n</li>\n</ul>\n
"}]},"apps":[],"jobName":"paragraph_1509912259110_1463706147","id":"20171105-210419_1857413462","dateCreated":"2017-11-05T21:04:19+0100","dateStarted":"2017-11-11T16:33:38+0100","dateFinished":"2017-11-11T16:33:42+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:285"},{"text":"val query = data_stream_cleaned.writeStream\n .format(\"memory\")\n .queryName(\"demo\")\n .trigger(ProcessingTime(\"30 seconds\")) // means that that spark will look for new data only every minute\n .outputMode(\"complete\") // could also be append or update\n .start()","user":"anonymous","dateUpdated":"2017-11-11T16:34:18+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"query: org.apache.spark.sql.streaming.StreamingQuery = Streaming Query demo [id = 824a35c8-af8d-4b57-83c4-4cbf3c1f9418, runId = 5dcdee01-de96-403e-997e-350bfd5d8f2c] [state = ACTIVE]\n"}]},"apps":[],"jobName":"paragraph_1509873755450_-691017020","id":"20171105-102235_1294551369","dateCreated":"2017-11-05T10:22:35+0100","dateStarted":"2017-11-11T16:34:18+0100","dateFinished":"2017-11-11T16:34:22+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:286"},{"text":"query.status","user":"anonymous","dateUpdated":"2017-11-11T16:34:25+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"res10: org.apache.spark.sql.streaming.StreamingQueryStatus =\n{\n \"message\" : \"Processing new data\",\n \"isDataAvailable\" : true,\n \"isTriggerActive\" : true\n}\n"}]},"apps":[],"jobName":"paragraph_1510413138022_529774344","id":"20171111-161218_1171458195","dateCreated":"2017-11-11T16:12:18+0100","dateStarted":"2017-11-11T16:34:25+0100","dateFinished":"2017-11-11T16:34:26+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:287"},{"text":"query.explain","user":"anonymous","dateUpdated":"2017-11-11T16:34:35+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"== Physical Plan ==\n*HashAggregate(keys=[location#36], functions=[count(id#40L), sum(cast(followers_count#42 as bigint)), sum(cast(favorite_count#43 as bigint)), sum(cast(retweet_count#44 as bigint))])\n+- Exchange hashpartitioning(location#36, 2)\n +- *HashAggregate(keys=[location#36], functions=[partial_count(id#40L), partial_sum(cast(followers_count#42 as bigint)), partial_sum(cast(favorite_count#43 as bigint)), partial_sum(cast(retweet_count#44 as bigint))])\n +- *Project [cast(cast(id#33 as decimal(20,0)) as bigint) AS id#40L, cast(cast(followers_count#35 as decimal(20,0)) as int) AS followers_count#42, location#36, cast(cast(favorite_count#37 as decimal(20,0)) as int) AS favorite_count#43, cast(cast(retweet_count#38 as decimal(20,0)) as int) AS retweet_count#44]\n +- *Filter (isnotnull(created_at#34) && (cast(cast(created_at#34 as timestamp) as string) > cast(current_batch_timestamp(1510414470012, DateType) as string)))\n +- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line155526911422.$read$$iw$$iw$tweet, true], top level Product input object).id, 
true) AS id#33, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line155526911422.$read$$iw$$iw$tweet, true], top level Product input object).created_at, true) AS created_at#34, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line155526911422.$read$$iw$$iw$tweet, true], top level Product input object).followers_count, true) AS followers_count#35, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line155526911422.$read$$iw$$iw$tweet, true], top level Product input object).location, true) AS location#36, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line155526911422.$read$$iw$$iw$tweet, true], top level Product input object).favorite_count, true) AS favorite_count#37, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line155526911422.$read$$iw$$iw$tweet, true], top level Product input object).retweet_count, true) AS retweet_count#38]\n +- *MapElements , obj#32: $line155526911422.$read$$iw$$iw$tweet\n +- *MapElements , obj#22: [Ljava.lang.String;\n +- *DeserializeToObject string_value#15.toString, obj#21: java.lang.String\n +- *Project [cast(value#141 as string) AS string_value#15]\n +- Scan ExistingRDD[key#140,value#141,topic#142,partition#143,offset#144L,timestamp#145,timestampType#146]\n"}]},"apps":[],"jobName":"paragraph_1510413000978_-581196590","id":"20171111-161000_1529325108","dateCreated":"2017-11-11T16:10:00+0100","dateStarted":"2017-11-11T16:34:35+0100","dateFinished":"2017-11-11T16:34:36+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:288"},{"text":"%spark.sql \n\nselect * from demo","user":"anonymous","dateUpdated":"2017-11-11T16:39:15+0100","config":{"colWidth":12,"enabled":true,"results":{"0":{"graph":{"mode":"table","height":300,"optionOpen":false},"helium":{}}},"editorSetting":{"language":"sql"},"editorMode":"ace/mode/sql"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TABLE","data":"location\tcount(id)\tsum(followers_count)\tsum(favorite_count)\tsum(retweet_count)\nYokohama\t2\t1110\t0\t115\nAngola, Luanda \t3\t1572\t0\t94221\nハッピーワールド\t4\t384\t0\t0\n推しが定まりません。そらワンいって考えます\t4\t6932\t0\t0\nTallinn, Estonia\t4\t248\t0\t464\n\t233\t310223\t0\t1344636\nOhio\t2\t28806\t0\t0\n京都市右京区\t3\t4272\t0\t0\n孤独\t3\t1059\t0\t0\nGeorgia, USA\t4\t3448\t0\t0\nManaus / São Paulo\t4\t1056\t0\t0\n当てます。\t3\t1299\t0\t0\nTriplex\t4\t7780\t0\t88\nPhilly✈️Tuskegee👉🏾D.C.\t3\t4842\t0\t3\nnew jersey\t4\t864\t0\t0\nさいたま市\t4\t3384\t0\t0\n日本 東京\t3\t2337\t0\t0\nD.O. 도경수\t3\t1428\t0\t3018\n千葉 船橋市\t4\t156\t0\t0\nEntroncamento, Portugal\t4\t1392\t0\t2473\nAntwerpen, Belgium\t3\t1629\t0\t0\nチ ン パ ン .🌷 ¦ヘッダー ⇒ フミ !!!\t3\t669\t0\t208\nEau Claire, Wisconsin\t3\t254721\t0\t1\n@無所属 お仕事依頼DMへ📩\t3\t2406\t0\t0\nJeje l'omo Eko nlo\t3\t1008\t0\t0\nstaten island, new york\t3\t348\t0\t0\nlagos\t3\t3450\t0\t234\nAtlanta GA Metro\t3\t273\t0\t0\nBerks. 
UK\t3\t13383\t0\t66\nΑνδραβιδα\t3\t7398\t0\t5172\nFinland, no polar bears here\t3\t609\t0\t0\nGuayaquil, Ecuador\t3\t1827\t0\t120\n내가 하는 말이 있다면 그건 거짓말\t4\t20\t0\t0\nLouisville, KY\t3\t2250\t0\t0\nottawa\t4\t1024\t0\t0\n‏ً\t4\t11140\t0\t122960\ncairo\t2\t706\t0\t0\nSan Francisco, CA\t3\t1167\t0\t0\n🇧🇸 // 🇯🇲\t3\t3522\t0\t111\nNorwalk, CA\t3\t765\t0\t0\n京都 - 大阪\t4\t2968\t0\t341\nBeen all around the world\t3\t1161\t0\t43983\nUSA\t6\t31782\t0\t0\n호그와트\t3\t183\t0\t1077\nSeabrook, TX\t3\t1023\t0\t9860\nNein. \t3\t5421\t0\t0\nかるわかのストレス清掃員♪\t3\t372\t0\t0\nRochester/London, UK\t3\t7557\t0\t12\nbuenos aires argentina\t3\t444\t0\t0\nHoseok's heart~\t4\t7268\t0\t0\nNorth West, England\t3\t909\t0\t1431\nGloucester, UK\t3\t5424\t0\t0\n愛知県岡崎市\t4\t1964\t0\t0\nアップルパイの中\t3\t759\t0\t0\nMakati City\t3\t3\t0\t129771\nCorrientes\t3\t411\t0\t1521\nYら,猫好き,車好き,ゲーム好きさんと繋がりたい🗿\t3\t630\t0\t0\n{May - Sofi - Dians - Paula}\t3\t48723\t4\t1\nToronto\t4\t3744\t0\t0\nどすこいパレード\t4\t1044\t0\t98\nMalaysia\t4\t1264\t0\t48736\nJapan Tokyo\t4\t32520\t0\t0\n#Adelaido 👑\t2\t2094\t6\t2\nJapan\t3\t126603\t0\t0\nPilipínas\t3\t1362\t0\t20979\nManaus, Brasil\t3\t3222\t0\t0\n127.0.0.1\t3\t5997\t0\t4143\nFigueres\t3\t4485\t0\t0\n運動会する\t4\t256\t0\t0\nRepublic of the Philippines\t3\t1347\t0\t54\nアバリスの膝の上\t3\t630\t0\t0\nPortugal\t2\t1476\t0\t1241\nOhio, USA\t4\t1332\t0\t0\nlincoln, england\t4\t2844\t0\t4\nCalifornia, USA\t3\t1281\t0\t402\nLondon, Ontario\t3\t7467\t0\t0\nOrlando, FL\t4\t1056\t0\t0\nLos Angeles, CA\t7\t29565\t0\t775\nAtlanta, GA\t3\t279\t0\t318\nmismuertos city\t3\t1449\t0\t57932\nカルデアの冷蔵庫\t3\t141\t0\t0\nBarueri, Brasil\t3\t2607\t0\t3\nahead\t3\t831\t1\t1\nparadise with drake\t3\t3426\t0\t0\nmiami\t3\t504\t0\t0\n栃木県\t4\t316\t0\t0\nJHB/SF\t4\t11420\t0\t0\n京都府 京都市 \t3\t18\t0\t0\nMichigan \t4\t3272\t0\t0\nAvellaneda, Buenos Aires\t3\t774\t0\t10923\n대한민국\t3\t63\t0\t1068\n日本\t19\t6294\t0\t231\nether \t4\t88\t0\t16\nPunjab, Pakistan\t2\t2088\t0\t0\nUnited Kingdom\t4\t1740\t0\t0\nToday's Market Movement\t3\t586\t0\t9\nOn Melancholy Hill\t3\t498\t0\t0\nLos Angeles\t4\t612\t0\t0\nMadrid, Comunidad de Madrid\t3\t7578\t0\t375\nMakati City, National Capital \t3\t660\t0\t36\nSingapore \t3\t588\t0\t204\nIndia\t3\t342\t0\t0\nNorth Riding/Samrand\t3\t315\t0\t29427\nBinghamton University\t3\t2166\t0\t0\nGurgaon, India\t4\t18184\t0\t0\n広島県\t3\t642\t0\t206\nみんなの心の中☆\t2\t204\t0\t0\n"}]},"apps":[],"jobName":"paragraph_1509873913879_-781586253","id":"20171105-102513_1944185364","dateCreated":"2017-11-05T10:25:13+0100","dateStarted":"2017-11-11T16:39:16+0100","dateFinished":"2017-11-11T16:39:16+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:289"},{"text":"%md\n\n### Connecting to redshift cluster\n- defining JDBC connection to connect to redshift","user":"anonymous","dateUpdated":"2017-11-11T14:30:16+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
<h3>Connecting to redshift cluster</h3>\n<ul>\n<li>defining JDBC connection to connect to redshift</li>\n</ul>\n
"}]},"apps":[],"jobName":"paragraph_1509909816175_369050675","id":"20171105-202336_1741159108","dateCreated":"2017-11-05T20:23:36+0100","dateStarted":"2017-11-11T14:30:16+0100","dateFinished":"2017-11-11T14:30:16+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:290"},{"text":"//create properties object\nClass.forName(\"com.amazon.redshift.jdbc42.Driver\")\n\nval prop = new java.util.Properties\nprop.setProperty(\"driver\", \"com.amazon.redshift.jdbc42.Driver\")\nprop.setProperty(\"user\", \"dorian\")\nprop.setProperty(\"password\", \"Demo1234\") \n\n//jdbc mysql url - destination database is named \"data\"\nval url = \"jdbc:redshift://data-warehouse.c3glymsgdgty.us-east-1.redshift.amazonaws.com:5439/lambda\"\n\n//destination database table \nval table = \"speed_layer\"","user":"anonymous","dateUpdated":"2017-11-11T16:36:02+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"res13: Class[_] = class com.amazon.redshift.jdbc42.Driver\nprop: java.util.Properties = {}\nres15: Object = null\nres16: Object = null\nres17: Object = null\nurl: String = jdbc:redshift://data-warehouse.c3glymsgdgty.us-east-1.redshift.amazonaws.com:5439/lambda\ntable: String = speed_layer\n"}]},"apps":[],"jobName":"paragraph_1509874084488_-1803262626","id":"20171105-102804_443283701","dateCreated":"2017-11-05T10:28:04+0100","dateStarted":"2017-11-11T16:36:02+0100","dateFinished":"2017-11-11T16:36:07+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:291"},{"text":"%md \n\n### Exporting data to Redshift\n- \"overwriting\" the table with results of query stored in memory as result of the speed layer\n- scheduling the function to run every hour\n","user":"anonymous","dateUpdated":"2017-11-11T14:30:18+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
<h3>Exporting data to Redshift</h3>\n<ul>\n<li>“overwriting” the table with results of query stored in memory as result of the speed layer</li>\n<li>scheduling the function to run every hour</li>\n</ul>\n
"}]},"apps":[],"jobName":"paragraph_1509881434795_1558195972","id":"20171105-123034_774658214","dateCreated":"2017-11-05T12:30:34+0100","dateStarted":"2017-11-11T14:30:18+0100","dateFinished":"2017-11-11T14:30:18+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:292"},{"text":"val df = spark.sql(\"select * from demo\")\n\n//write data from spark dataframe to database\ndf.write.mode(\"overwrite\").jdbc(url, table, prop)\n","user":"anonymous","dateUpdated":"2017-11-11T16:39:19+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"df: org.apache.spark.sql.DataFrame = [location: string, count(id): bigint ... 3 more fields]\n"}]},"apps":[],"jobName":"paragraph_1509878690443_-615786910","id":"20171105-114450_965590151","dateCreated":"2017-11-05T11:44:50+0100","dateStarted":"2017-11-11T16:39:19+0100","dateFinished":"2017-11-11T16:39:30+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:293"},{"text":"","user":"anonymous","dateUpdated":"2017-11-11T16:37:05+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1509949807769_1425690320","id":"20171106-073007_1469584911","dateCreated":"2017-11-06T07:30:07+0100","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:294"}],"name":"Implementing the speed layer of lambda architecture using Structured Spark Streaming","id":"2CY6MSDVK","angularObjects":{"2CZTSJM9A:shared_process":[],"2CXD9FT1P:shared_process":[],"2CYBUJCZE:shared_process":[],"2CZXYWRND:shared_process":[],"2CWX9E9KA:shared_process":[],"2CWJS6R2N:shared_process":[],"2CW6U7X7Z:shared_process":[],"2CY9X3W1T:shared_process":[],"2CX93H291:shared_process":[],"2CZRYW3SZ:shared_process":[],"2CYT9Z9RC:shared_process":[],"2CY2R49R6:shared_process":[],"2CYQW36AU:shared_process":[],"2CWPRKMXH:shared_process":[],"2CWU95D3A:shared_process":[],"2CXJ7UBRF:shared_process":[],"2CWWTMY7M:shared_process":[],"2CY3EBJAE:shared_process":[],"2CYFQZER9:shared_process":[]},"config":{"looknfeel":"default","personalizedMode":"false"},"info":{}} -------------------------------------------------------------------------------- /Ingesting realtime tweets using Apache Kafka, Tweepy and Python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Ingesting realtime tweets using Apache Kafka, Tweepy and Python\n", 8 | "\n", 9 | "### Purpose:\n", 10 | "- main data source for the lambda architecture pipeline\n", 11 | "- uses twitter streaming API to simulate new events coming in every minute\n", 12 | "- Kafka Producer sends the tweets as records to the Kafka Broker\n", 13 | "\n", 14 | "### Contents: \n", 15 | "- [Twitter setup](#1)\n", 16 | "- [Defining the Kafka producer](#2)\n", 17 | "- [Producing and sending records to the Kafka Broker](#3)\n", 18 | "- [Deployment](#4)" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "### Required libraries" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 1, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "import tweepy\n", 35 | "import time\n", 36 | "from kafka import KafkaConsumer, 
KafkaProducer" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "\n", 44 | "### Twitter setup\n", 45 | "- getting the API object using authorization information\n", 46 | "- you can find more details on how to get the authorization here:\n", 47 | "https://developer.twitter.com/en/docs/basics/authentication/overview" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 2, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "# twitter setup\n", 57 | "consumer_key = \"1\"\n", 58 | "consumer_secret = \"2\"\n", 59 | "access_token = \"3\"\n", 60 | "access_token_secret = \"4\"\n", 61 | "# Creating the authentication object\n", 62 | "auth = tweepy.OAuthHandler(consumer_key, consumer_secret)\n", 63 | "# Setting your access token and secret\n", 64 | "auth.set_access_token(access_token, access_token_secret)\n", 65 | "# Creating the API object by passing in auth information\n", 66 | "api = tweepy.API(auth) \n" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "A helper function to normalize the time a tweet was created with the time of our system" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 3, 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "from datetime import datetime, timedelta\n", 83 | "\n", 84 | "def normalize_timestamp(time):\n", 85 | " mytime = datetime.strptime(time, \"%Y-%m-%d %H:%M:%S\")\n", 86 | " mytime += timedelta(hours=1) # the tweets are timestamped in GMT timezone, while I am in +1 timezone\n", 87 | " return (mytime.strftime(\"%Y-%m-%d %H:%M:%S\")) " 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "\n", 95 | "### Defining the Kafka producer\n", 96 | "- specify the Kafka Broker\n", 97 | "- specify the topic name\n", 98 | "- optional: specify partitioning strategy" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 4, 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "producer = KafkaProducer(bootstrap_servers='localhost:9092')\n", 108 | "topic_name = 'tweets-lambda1'" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "\n", 116 | "### Producing and sending records to the Kafka Broker\n", 117 | "- querying the Twitter API Object\n", 118 | "- extracting relevant information from the response\n", 119 | "- formatting and sending the data to proper topic on the Kafka Broker\n", 120 | "- resulting tweets have following attributes:\n", 121 | " - id \n", 122 | " - created_at\n", 123 | " - followers_count\n", 124 | " - location\n", 125 | " - favorite_count\n", 126 | " - retweet_count" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 6, 132 | "metadata": {}, 133 | "outputs": [], 134 | "source": [ 135 | "def get_twitter_data():\n", 136 | " res = api.search(\"Apple OR iphone OR iPhone\")\n", 137 | " for i in res:\n", 138 | " record = ''\n", 139 | " record += str(i.user.id_str)\n", 140 | " record += ';'\n", 141 | " record += str(normalize_timestamp(str(i.created_at)))\n", 142 | " record += ';'\n", 143 | " record += str(i.user.followers_count)\n", 144 | " record += ';'\n", 145 | " record += str(i.user.location)\n", 146 | " record += ';'\n", 147 | " record += str(i.favorite_count)\n", 148 | " record += ';'\n", 149 | " record += str(i.retweet_count)\n", 150 | " record += ';'\n", 151 | " producer.send(topic_name, str.encode(record))" 152 | ] 153 | }, 154 | { 155 | "cell_type": 
"code", 156 | "execution_count": 9, 157 | "metadata": {}, 158 | "outputs": [], 159 | "source": [ 160 | "get_twitter_data()" 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "\n", 168 | "### Deployment \n", 169 | "- perform the task every couple of minutes and wait in between" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": 11, 175 | "metadata": {}, 176 | "outputs": [], 177 | "source": [ 178 | "def periodic_work(interval):\n", 179 | " while True:\n", 180 | " get_twitter_data()\n", 181 | " #interval should be an integer, the number of seconds to wait\n", 182 | " time.sleep(interval)\n" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "metadata": {}, 189 | "outputs": [], 190 | "source": [ 191 | "periodic_work(60 * 0.1) # get data every couple of minutes" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": null, 197 | "metadata": {}, 198 | "outputs": [], 199 | "source": [] 200 | } 201 | ], 202 | "metadata": { 203 | "kernelspec": { 204 | "display_name": "Python 3", 205 | "language": "python", 206 | "name": "python3" 207 | }, 208 | "language_info": { 209 | "codemirror_mode": { 210 | "name": "ipython", 211 | "version": 3 212 | }, 213 | "file_extension": ".py", 214 | "mimetype": "text/x-python", 215 | "name": "python", 216 | "nbconvert_exporter": "python", 217 | "pygments_lexer": "ipython3", 218 | "version": "3.6.1" 219 | } 220 | }, 221 | "nbformat": 4, 222 | "nbformat_minor": 2 223 | } 224 | -------------------------------------------------------------------------------- /Lambda_architecture-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dorianbg/lambda-architecture-demo/879a4a2d7b1cb3496c0bb5c63eed07a66d834b33/Lambda_architecture-2.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Developing Lambda Architecture 2 | 3 |

This is a repository for the code found in my series of blog posts on implementing the Lambda Architecture:

4 |
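
## Note: creating the `batch_raw` staging table

The batch-layer notebook loads semicolon-delimited tweet records into a Redshift table named `batch_raw` with the COPY command, but the table's DDL is not included in this repository. The sketch below is one possible way to create it with psycopg2; the column types and the placeholder connection details are assumptions inferred from the six fields the Kafka producer emits (id, created_at, followers_count, location, favorite_count, retweet_count), not code from the original notebooks.

```python
# Hypothetical one-off setup script (not part of the original notebooks).
# It creates the batch_raw staging table that the batch-layer notebook's
# COPY command loads into. Column types are guesses based on the producer's
# record format: id;created_at;followers_count;location;favorite_count;retweet_count
import psycopg2

conn = psycopg2.connect(dbname='lambda', host='<your-redshift-endpoint>',
                        port='5439', user='<user>', password='<password>')
curs = conn.cursor()
curs.execute("""
    CREATE TABLE IF NOT EXISTS batch_raw (
        id               BIGINT,
        created_at       TIMESTAMP,
        followers_count  INT,
        location         VARCHAR(256),
        favorite_count   INT,
        retweet_count    INT
    );
""")
conn.commit()
curs.close()
conn.close()
```

The serving-layer notebook likewise assumes that `batch_layer`, `speed_layer` and an initial `serving_layer` table already exist before its first run, since it drops `serving_layer` without an `IF EXISTS` guard.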