├── README.md
├── header.jpg
├── run.sh
└── stream.py

/README.md:
--------------------------------------------------------------------------------
# Apache Spark Kinesis Consumer

> Example project for consuming an AWS Kinesis stream and saving the data to Amazon Redshift using Apache Spark.

Code from: [Processing IoT realtime data - Medium](https://medium.com/@iamvsouza/processing-grandparents-realtime-data-d6b8c99e0b43)

## Usage example

You need to set your Amazon credentials in your environment:

```shell
export AWS_ACCESS_KEY_ID=""
export AWS_ACCESS_KEY=""
export AWS_SECRET_ACCESS_KEY=""
export AWS_SECRET_KEY=""
```

## Dependencies

The Kinesis connector must be included via the `--packages` flag when submitting the job (see `run.sh`):

`org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.1`

## Setup

__How do I run Kinesis locally?__

A few months ago I created a Docker image with Kinesalite (an amazing project that simulates Amazon Kinesis). You can use this image, or run [Kinesalite](https://github.com/mhart/kinesalite) directly:

`docker run -d -p 4567:4567 vsouza/kinesis-local -p 4567 --createStreamMs 5`

Check the [project](https://github.com/vsouza/docker-Kinesis-local) for details.

__Do I need DynamoDB too?__

Yes, :cry:. The Kinesis Client Library used by the Spark receiver checkpoints your position in the stream and stores it in DynamoDB. You don't need to create the table yourself; the library creates it for you.

*Remember to configure the throughput of that DynamoDB table correctly.*
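To smoke-test the local setup, the stream has to exist before `stream.py` can attach to it. Below is a minimal sketch, assuming `boto3` is installed (it is not a dependency of this project), Kinesalite is listening on `localhost:4567` as started above, and the stream name matches `kinesis_stream_name` in `stream.py` (whose `kinesis_endpoint` would also need to point at the local endpoint):

```python
import boto3  # assumption: installed separately, not part of this repo

# Kinesalite speaks the Kinesis API, so an ordinary client pointed at the local
# endpoint is enough. Kinesalite does not validate credentials, so any values work.
kinesis = boto3.client(
    "kinesis",
    region_name="us-east-1",
    endpoint_url="http://localhost:4567",
    aws_access_key_id="fake",
    aws_secret_access_key="fake",
)

kinesis.create_stream(StreamName="stream_name", ShardCount=1)
kinesis.get_waiter("stream_exists").wait(StreamName="stream_name")
print(kinesis.describe_stream(StreamName="stream_name")["StreamDescription"]["StreamStatus"])
```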
## License

[MIT License](http://vsouza.mit-license.org/) © Vinicius Souza
--------------------------------------------------------------------------------
/header.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vsouza/spark-kinesis-redshift/f979954e982865966e20403dbe9b0857df18d7ea/header.jpg
--------------------------------------------------------------------------------
/run.sh:
--------------------------------------------------------------------------------
export AWS_ACCESS_KEY_ID=""
export AWS_ACCESS_KEY=""
export AWS_SECRET_ACCESS_KEY=""
export AWS_SECRET_KEY=""

../spark-1.6.2/bin/spark-submit --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.1,com.databricks:spark-redshift_2.10:0.6.0 --jars RedshiftJDBC41-1.1.10.1010.jar stream.py
--------------------------------------------------------------------------------
/stream.py:
--------------------------------------------------------------------------------
from __future__ import print_function

import json

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

# Kinesis / Spark configuration
aws_region = 'us-east-1'
kinesis_stream_name = 'stream_name'
kinesis_endpoint = 'https://kinesis.us-east-1.amazonaws.com/'
kinesis_app_name = 'app_name'  # also used as the name of the DynamoDB checkpoint table
kinesis_initial_position = InitialPositionInStream.LATEST
kinesis_checkpoint_interval = 5  # seconds
spark_batch_interval = 5  # seconds


def getSqlContextInstance(spark_context):
    """Lazily create a single SQLContext that is reused across batches."""
    if 'sqlContextSingletonInstance' not in globals():
        globals()['sqlContextSingletonInstance'] = SQLContext(spark_context)
    return globals()['sqlContextSingletonInstance']


if __name__ == "__main__":
    spark_context = SparkContext(appName=kinesis_app_name)
    spark_streaming_context = StreamingContext(spark_context, spark_batch_interval)

    # AWS credentials are picked up from the environment (see run.sh).
    # Each received record is the raw Kinesis payload decoded as a UTF-8 string.
    kinesis_stream = KinesisUtils.createStream(
        spark_streaming_context, kinesis_app_name, kinesis_stream_name, kinesis_endpoint,
        aws_region, kinesis_initial_position, kinesis_checkpoint_interval)

    kinesis_stream.pprint()
    py_rdd = kinesis_stream.map(lambda x: json.loads(x))

    def process(time, rdd):
        print("========= %s =========" % str(time))
        try:
            sqlContext = getSqlContextInstance(rdd.context)
            schema = StructType([
                StructField('user_id', IntegerType(), True),
                StructField('username', StringType(), True),
                StructField('first_name', StringType(), True),
                StructField('surname', StringType(), True),
                StructField('age', IntegerType(), True),
            ])
            df = sqlContext.createDataFrame(rdd, schema)
            df.registerTempTable("activity_log")
            # spark-redshift stages the batch in S3 (tempdir) and issues a COPY into Redshift.
            df.write \
                .format("com.databricks.spark.redshift") \
                .option("url", "jdbc:redshift://redshiftURL.com:5439/database?user=USERNAME&password=PASSWORD") \
                .option("dbtable", "activity_log") \
                .option("tempdir", "s3n://spark-temp-data/") \
                .mode("append") \
                .save()
        except Exception as e:
            print(e)

    py_rdd.foreachRDD(process)
    spark_streaming_context.start()
    spark_streaming_context.awaitTermination()
    spark_streaming_context.stop()
--------------------------------------------------------------------------------
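A note on the record format: `process()` in `stream.py` applies `json.loads` to each record and binds it to the declared schema, so every Kinesis record should be a single JSON object with `user_id`, `username`, `first_name`, `surname`, and `age` fields. A hypothetical producer sketch under those assumptions (uses `boto3`, which is not part of this repo, with AWS credentials taken from the environment and the same stream name as the consumer):

```python
import json
import random

import boto3  # assumption: installed separately

kinesis = boto3.client("kinesis", region_name="us-east-1")

for i in range(10):
    # Field names and types mirror the StructType declared in stream.py.
    record = {
        "user_id": i,
        "username": "user_%d" % i,
        "first_name": "Jane",
        "surname": "Doe",
        "age": random.randint(18, 90),
    }
    kinesis.put_record(
        StreamName="stream_name",  # assumption: same stream as kinesis_stream_name in stream.py
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=str(record["user_id"]),
    )
```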