├── README.md
├── header.jpg
├── run.sh
└── stream.py
/README.md:
--------------------------------------------------------------------------------
1 | # Apache Spark Kinesis Consumer
2 |
3 | > Example project for consuming an AWS Kinesis stream and saving the data to Amazon Redshift using Apache Spark
4 |
5 | Code from: [Processing IoT realtime data - Medium](https://medium.com/@iamvsouza/processing-grandparents-realtime-data-d6b8c99e0b43)
6 |
7 |
8 |
9 |
10 |
11 |
12 | ## Usage example
13 |
14 | You need to set your Amazon credentials in your environment.
15 |
16 | ```shell
17 | export AWS_ACCESS_KEY_ID=""
18 | export AWS_ACCESS_KEY=""
19 | export AWS_SECRET_ACCESS_KEY=""
20 | export AWS_SECRET_KEY=""
21 | ```
22 |
23 | ## Dependencies
24 |
25 | The following must be included via the `--packages` flag, as shown in the example below.
26 |
27 | `org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.1`
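
For example (a sketch based on `run.sh`; the spark-redshift connector, the Redshift JDBC driver jar, and the script name are the ones used in this project, so adjust the paths to your environment):

```shell
spark-submit \
  --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.1,com.databricks:spark-redshift_2.10:0.6.0 \
  --jars RedshiftJDBC41-1.1.10.1010.jar \
  stream.py
```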
28 |
29 | ## Setup
30 |
31 | __How to run Kinesis locally?__
32 |
33 | A few months ago I created a Docker image with Kinesalite (an amazing project that simulates Amazon Kinesis). You can use
34 | this image, or run [Kinesalite](https://github.com/mhart/kinesalite) directly.
35 |
36 | `docker run -d -p 4567:4567 vsouza/kinesis-local -p 4567 --createStreamMs 5`
37 |
38 | Check the [project](https://github.com/vsouza/docker-Kinesis-local) for more details.
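
If you use the local stream, point the consumer at the Kinesalite endpoint instead of the AWS one. A minimal sketch, assuming the port mapping from the `docker run` command above and the `kinesis_endpoint` variable from `stream.py`:

```python
# stream.py -- use the local Kinesalite endpoint instead of AWS
kinesis_endpoint = 'http://localhost:4567'
```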
39 |
40 | __Do I need DynamoDB too?__
41 |
42 | Yes, :cry:. The Kinesis client in the AWS SDK checkpoints your progress through the stream and stores it in DynamoDB. You don't
43 | need to create the tables yourself; the SDK creates them for you.
44 |
45 | *Remember to configure the provisioned throughput of the DynamoDB checkpoint table correctly.*
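
For example, the AWS CLI can raise the table's throughput. A sketch, assuming the checkpoint table is named after the Kinesis application name (`app_name` in `stream.py`); the capacity values are placeholders:

```shell
aws dynamodb update-table \
  --table-name app_name \
  --provisioned-throughput ReadCapacityUnits=10,WriteCapacityUnits=10
```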
46 |
47 | ## License
48 |
49 | [MIT License](http://vsouza.mit-license.org/) © Vinicius Souza
50 |
--------------------------------------------------------------------------------
/header.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vsouza/spark-kinesis-redshift/f979954e982865966e20403dbe9b0857df18d7ea/header.jpg
--------------------------------------------------------------------------------
/run.sh:
--------------------------------------------------------------------------------
1 | export AWS_ACCESS_KEY_ID=""
2 | export AWS_ACCESS_KEY=""
3 | export AWS_SECRET_ACCESS_KEY=""
4 | export AWS_SECRET_KEY=""
5 |
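# Submit the streaming job: the Kinesis ASL and spark-redshift connectors are fetched via --packages,
# and the Redshift JDBC driver jar (downloaded separately) is passed with --jars.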
6 | ../spark-1.6.2/bin/spark-submit --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.1,com.databricks:spark-redshift_2.10:0.6.0 --jars RedshiftJDBC41-1.1.10.1010.jar stream.py
7 |
--------------------------------------------------------------------------------
/stream.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 | from pyspark import SparkContext
3 | from pyspark.streaming import StreamingContext
4 | from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream
5 | import datetime
6 | import json
7 | from pyspark.sql import SQLContext, Row
8 | from pyspark.sql.types import *
9 |
10 | aws_region = 'us-east-1'
11 | kinesis_stream_name = 'stream_name'
12 | kinesis_endpoint = 'https://kinesis.us-east-1.amazonaws.com/'
13 | kinesis_app_name = 'app_name'
14 | kinesis_initial_position = InitialPositionInStream.LATEST
15 | kinesis_checkpoint_interval = 5  # seconds
16 | spark_batch_interval = 5  # seconds
17 |
18 |
19 |
20 | if __name__ == "__main__":
21 | spark_context = SparkContext(appName=kinesis_app_name)
22 | spark_streaming_context = StreamingContext(spark_context, spark_batch_interval)
23 | sql_context = SQLContext(spark_context)
24 |
25 |     kinesis_stream = KinesisUtils.createStream(
26 |         spark_streaming_context, kinesis_app_name, kinesis_stream_name, kinesis_endpoint,
27 |         aws_region, kinesis_initial_position, kinesis_checkpoint_interval)
28 |
29 |     kinesis_stream.pprint()  # print a sample of incoming records for debugging
30 |     py_rdd = kinesis_stream.map(lambda x: json.loads(x))  # each record is a JSON string; parse it into a dict
31 |
32 | def process(time, rdd):
33 | print("========= %s =========" % str(time))
34 | try:
35 |
36 |             sqlContext = SQLContext.getOrCreate(rdd.context)  # reuse a single SQLContext across batches
37 | schema = StructType([
38 | StructField('user_id', IntegerType(), True),
39 | StructField('username', StringType(), True),
40 | StructField('first_name', StringType(), True),
41 | StructField('surname', StringType(), True),
42 | StructField('age', IntegerType(), True),
43 | ])
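            # Example of a record this schema expects (hypothetical payload):
            # {"user_id": 1, "username": "jdoe", "first_name": "John", "surname": "Doe", "age": 30}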
44 | df = sqlContext.createDataFrame(rdd, schema)
45 |             df.registerTempTable("activity_log")  # optional: expose this batch for ad-hoc SQL queries
46 | df.write \
47 | .format("com.databricks.spark.redshift") \
48 | .option("url", "jdbc:redshiftURL.com:5439/database?user=USERNAME&password=PASSWORD") \
49 | .option("dbtable", "activity_log") \
50 | .option("tempdir", "s3n://spark-temp-data/") \
51 | .mode("append") \
52 | .save()
53 | except Exception as e:
54 | print(e)
55 | pass
56 |
57 |
58 | py_rdd.foreachRDD(process)
59 | spark_streaming_context.start()
60 | spark_streaming_context.awaitTermination()
61 | spark_streaming_context.stop()
62 |
--------------------------------------------------------------------------------