├── README.md ├── conf ├── master │ └── spark-defaults.conf └── worker │ └── spark-defaults.conf ├── docker-compose.yml ├── images ├── project-architecture.png └── screenshot.png ├── kafka_producer ├── Dockerfile ├── requirements.txt ├── start.sh └── twitter_kafka_producer.py ├── spark-streaming-kafka-cassandra ├── Dockerfile ├── build.sbt ├── project │ ├── build.properties │ └── plugins.sbt ├── src │ ├── main │ │ └── scala │ │ │ └── org │ │ │ └── sevenmob │ │ │ ├── geocode │ │ │ ├── Formats.scala │ │ │ ├── Geocode.scala │ │ │ ├── Parameters.scala │ │ │ └── Response.scala │ │ │ └── spark │ │ │ └── streaming │ │ │ ├── CustomUUIDSerializer.scala │ │ │ ├── Formats.scala │ │ │ ├── RealtimeIpProcessing.scala │ │ │ ├── SampleTwitterData.scala │ │ │ └── StreamingExamples.scala │ └── test │ │ ├── resources │ │ └── loremipsum.txt │ │ └── scala │ │ └── org │ │ └── sevenmob │ │ └── geocode │ │ └── GeocodeSpec.scala ├── start.sh └── version.sbt └── webserver ├── Dockerfile ├── requirements.txt ├── start.sh ├── templates └── index.html └── webserver.py /README.md: -------------------------------------------------------------------------------- 1 | DATA-PROCESSING-PIPELINE 2 | ============== 3 | 4 | ## Description 5 | 6 | Build a powerful *Real-Time Data Processing Pipeline & Visualization* solution using Docker Machine and Compose, Kafka, Cassandra and Spark in 5 steps. 7 | 8 | The project's architecture is shown below: 9 | 10 | ![Docker Architecture](images/project-architecture.png "Project Architecture") 11 | 12 | ## What's happening under the hood? 13 | We connect to the Twitter Streaming API (https://dev.twitter.com/streaming/overview) and start listening to events based on a list of keywords; these events are forwarded directly to Kafka (no parsing). In the middle, a Spark job collects those events and converts them into a Spark SQL context (http://spark.apache.org/sql/), which filters each Kafka message and extracts only the fields of interest, in this case *user.location, text and user.profile_image_url*. Once we have those, we convert the *location* into coordinates (lat,lng) using the Google Geocoding API (https://developers.google.com/maps/documentation/geocoding/intro) and persist the data into Cassandra. 14 | 15 | Finally, there is a web application that fetches data from Cassandra and renders the tweets of interest on a world map.
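The geocoding step described above is implemented in Scala under `org/sevenmob/geocode` (see further down in this repo). As a rough illustration of what that step does, here is a minimal Python sketch (a hypothetical helper, not part of the project) that calls the same Google Geocoding endpoint the Spark job uses, following the project's convention of treating "." as "no API key":

```python
import requests

def geocode(location, api_key="."):
    """Resolve a free-text location to (lat, lng).

    Illustrative only: mirrors what the Scala geocode package does.
    The "." value means "no API key", the same convention used in
    docker-compose.yml for GOOGLE_GEOCODING_API_KEY.
    """
    params = {"address": location, "sensor": "false"}
    if api_key and api_key != ".":
        params["key"] = api_key
    resp = requests.get("https://maps.googleapis.com/maps/api/geocode/json",
                        params=params)
    results = resp.json().get("results", [])
    if not results:
        return 0.0, 0.0  # the Spark job also falls back to (0, 0) on failure
    point = results[0]["geometry"]["location"]
    return point["lat"], point["lng"]

print(geocode("San Francisco, CA"))  # roughly (37.77, -122.42)
```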
16 | 17 | ![Project Screenshot](images/screenshot.png "Docker Hackday Project") 18 | 19 | ### Some Interesting Project Stats 20 | ##### Number of Containers: 8 21 | ##### Number of Open Source Projects Used: 8 22 | ##### Number of Programming Languages Used: 4 (Python, Bash, Scala, Java) 23 | 24 | ## Prerequisites 25 | 26 | ### Docker (https://docs.docker.com/installation/) 27 | ``` 28 | $ wget -qO- https://get.docker.com/ | sh 29 | ``` 30 | 31 | ### Docker Machine (https://docs.docker.com/machine/install-machine/) 32 | ``` 33 | $ curl -L https://github.com/docker/machine/releases/download/v0.8.2/docker-machine_`uname -s`-amd64 > /usr/local/bin/docker-machine 34 | $ chmod +x /usr/local/bin/docker-machine 35 | ``` 36 | 37 | ### Docker Compose (https://docs.docker.com/compose/install/) 38 | ``` 39 | $ curl -L https://github.com/docker/compose/releases/download/1.8.1/docker-compose-`uname -s`-`uname -m` > /usr/local/bin/docker-compose 40 | $ chmod +x /usr/local/bin/docker-compose 41 | ``` 42 | 43 | ## Project Installation / Usage 44 | ### Step 1: Create a VM with Docker 45 | If you already have a VM running or if you are on Linux, you can skip this step. Otherwise, the steps are the following: 46 | 47 | #### On Digital Ocean 48 | ##### a) Create a Digital Ocean Token 49 | You need to create a personal access token under “Apps & API” in the Digital Ocean Control Panel. 50 | 51 | ##### b) Grab your access token, then run docker-machine create with these details: 52 | ``` 53 | $ docker-machine create --driver digitalocean --digitalocean-access-token= Docker-VM 54 | ``` 55 | #### On VirtualBox 56 | You just need to run: 57 | ``` 58 | $ docker-machine create -d virtualbox --virtualbox-memory 2048 Docker-VM 59 | ``` 60 | #### On Microsoft Azure 61 | ##### a) Create a certificate 62 | ``` 63 | $ openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem 64 | $ openssl pkcs12 -export -out mycert.pfx -in mycert.pem -name "My Certificate" 65 | $ openssl x509 -inform pem -in mycert.pem -outform der -out mycert.cer 66 | ``` 67 | ##### b) Upload the certificate to Microsoft Azure 68 | In the Azure portal, go to the “Settings” page (you can find the link at the bottom of the left sidebar - you need to scroll), then “Management Certificates”, and upload mycert.cer. 69 | ##### c) Grab your subscription ID from the portal (SUBSCRIPTIONS tab), then run docker-machine create with these details: 70 | ``` 71 | $ docker-machine create -d azure --azure-subscription-id="SUB_ID" --azure-subscription-cert="mycert.pem" --azure-size="Medium" Docker-VM 72 | ``` 73 | ##### d) Expose Port 80 74 | When viewing your VM in the resource group you've created, scroll down and click Endpoints to view the endpoints on the VM. Add a new *endpoint* that exposes port 80 and give it a name. 75 | 76 | #### Access the VM 77 | By default, *docker-machine* spins up an Ubuntu 14.04 instance on all cloud providers. Since we are running multiple Java-based applications that consume a lot of memory, the *docker-machine create* commands above include extra parameters to reserve at least 2GB of memory. The command below will ssh into the VM using your *ssh public key*: 78 | ``` 79 | $ docker-machine ssh Docker-VM 80 | ``` 81 | 82 | ### Step 2: Getting Twitter API keys 83 | 84 | In order to access the Twitter Streaming API, we need to get 4 pieces of information from Twitter: API key, API secret, Access token and Access token secret. Follow the steps below to get all 4 elements (a quick way to verify the keys is shown right after the list): 85 |
86 | Create a Twitter account if you do not already have one.
87 | Go to https://apps.twitter.com/ and log in with your Twitter credentials.
 88 | Click "Create New App"
 89 | Fill out the form, agree to the terms, and click "Create your Twitter application"
90 | On the next page, click on the "API keys" tab, and copy your "API key" and "API secret".
 91 | Scroll down and click "Create my access token", and copy your "Access token" and "Access token secret".
 92 | 
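Before plugging the four values into `docker-compose.yml` (Step 3 below), you can optionally sanity-check them with a few lines of Python. This is a hypothetical snippet, not part of the repo; it only assumes `tweepy` is installed (the same library the Kafka producer uses):

```python
import tweepy

# Paste the four values from apps.twitter.com here
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

api = tweepy.API(auth)
user = api.verify_credentials()  # returns False (or raises) if the keys are wrong
print("Authenticated as @%s" % user.screen_name)
```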
93 | 94 | ### Step 3: Clone this repo and update the docker-compose.yml file (https://docs.docker.com/compose/yml/) 95 | First you need to clone this repo: 96 | ``` 97 | $ git clone git@github.com:rogaha/data-processing-pipeline.git 98 | ``` 99 | Then, we need to update the Kafka advertised host name, the Twitter API credentials and the keywords you want to track. Below are the environment variables that need to be updated: 100 | ``` 101 | KAFKA_ADVERTISED_HOST_NAME: "" (public IP or the IP of your local VM) 102 | ACCESS_TOKEN: "" 103 | ACCESS_TOKEN_SECRET: "" 104 | CONSUMER_KEY: "" 105 | CONSUMER_SECRET: "" 106 | KEYWORDS_LIST: "" 107 | GOOGLE_GEOCODING_API_KEY: "." (use "." to ignore it) 108 | ``` 109 | In order to get the *public IP* of your Digital Ocean droplet, you can run this from the VM: 110 | ``` 111 | $ /sbin/ifconfig eth0 | grep 'inet addr:' | cut -d: -f2 | awk '{ print $1}' 112 | ``` 113 | 114 | The *KEYWORDS_LIST* should be a comma-separated string, such as: "python, scala, golang" 115 | 116 | ### Step 4: Start All the Containers 117 | With docker-compose you can just run: 118 | ``` 119 | $ docker-compose up -d 120 | ``` 121 | The output should be: 122 | ``` 123 | Creating dataprocessingpipeline_zookeeper_1... 124 | Creating dataprocessingpipeline_sparkmaster_1... 125 | Creating dataprocessingpipeline_kafka_1... 126 | Creating dataprocessingpipeline_twitterkafkaproducer_1... 127 | Creating dataprocessingpipeline_cassandra_1... 128 | Creating dataprocessingpipeline_sparkjob_1... 129 | Creating dataprocessingpipeline_webserver_1... 130 | Creating dataprocessingpipeline_sparkworker_1... 131 | ``` 132 | 133 | After that, wait a few seconds: the spark-job, kafka-producer and webserver containers sleep for 10 to 15 seconds before starting, to make sure all their dependencies are up and running. 134 | ### Step 5: Access the IP/Hostname of your machine from your browser 135 | As an example, I've cloned this repo, updated the environment variables and started the containers on Azure. 136 | 137 | ## Open Source Projects Used 138 | 139 | #### Docker (https://github.com/docker/docker) 140 | An open platform for distributed applications for developers and sysadmins 141 | #### Docker Machine (https://github.com/docker/machine) 142 | Lets you create Docker hosts on your computer, on cloud providers, and inside your own data center 143 | #### Docker Compose (https://github.com/docker/compose) 144 | A tool for defining and running multi-container applications with Docker 145 | #### Apache Spark / Spark SQL (https://github.com/apache/spark) 146 | A fast, in-memory data processing engine. Spark SQL lets you query structured data as a resilient distributed dataset (RDD) 147 | #### Apache Kafka (https://github.com/apache/kafka) 148 | A fast and scalable pub-sub messaging service 149 | #### Apache ZooKeeper (https://github.com/apache/zookeeper) 150 | A distributed configuration service, synchronization service, and naming registry for large distributed systems 151 | #### Apache Cassandra (https://github.com/apache/cassandra) 152 | A scalable, highly available, distributed columnar NoSQL database 153 | #### D3 (https://github.com/mbostock/d3) 154 | A JavaScript visualization library for HTML and SVG. 155 | -------------------------------------------------------------------------------- /conf/master/spark-defaults.conf: -------------------------------------------------------------------------------- 1 | # Default system properties included when running spark-submit.
2 | # This is useful for setting default environmental settings. 3 | 4 | spark.driver.port 7001 5 | spark.fileserver.port 7002 6 | spark.broadcast.port 7003 7 | spark.replClassServer.port 7004 8 | spark.blockManager.port 7005 9 | spark.executor.port 7006 10 | 11 | spark.broadcast.factory=org.apache.spark.broadcast.HttpBroadcastFactory 12 | spark.port.maxRetries 4 -------------------------------------------------------------------------------- /conf/worker/spark-defaults.conf: -------------------------------------------------------------------------------- 1 | # Default system properties included when running spark-submit. 2 | # This is useful for setting default environmental settings. 3 | 4 | #spark.driver.port 7101 5 | spark.fileserver.port 7012 6 | spark.broadcast.port 7013 7 | spark.replClassServer.port 7014 8 | spark.blockManager.port 7015 9 | spark.executor.port 7016 10 | 11 | spark.broadcast.factory=org.apache.spark.broadcast.HttpBroadcastFactory 12 | spark.port.maxRetries 4 -------------------------------------------------------------------------------- /docker-compose.yml: -------------------------------------------------------------------------------- 1 | zookeeper: 2 | image: wurstmeister/zookeeper 3 | ports: 4 | - "2181" 5 | 6 | kafka: 7 | image: wurstmeister/kafka:0.8.2.1 8 | ports: 9 | - "9092:9092" 10 | links: 11 | - zookeeper:zk 12 | environment: 13 | # constraint:com.docker.network.driver.overlay.bind_interface=eth0 14 | KAFKA_ADVERTISED_HOST_NAME: "" 15 | KAFKA_CREATE_TOPICS: "tweets" 16 | volumes: 17 | - /var/run/docker.sock:/var/run/docker.sock 18 | 19 | sparkmaster: 20 | image: gettyimages/spark:1.4.1-hadoop-2.6 21 | command: /usr/spark/bin/spark-class org.apache.spark.deploy.master.Master --ip master 22 | hostname: master 23 | environment: 24 | SPARK_CONF_DIR: /conf 25 | ports: 26 | - "4040:4040" 27 | - "6066:6066" 28 | - "7077:7077" 29 | - "8080:8080" 30 | volumes: 31 | - ./conf/master:/conf 32 | - ./data:/tmp/data 33 | 34 | cassandra: 35 | image: cassandra:2.2.0 36 | hostname: cassandra 37 | ports: 38 | - "9042:9042" 39 | 40 | sparkworker: 41 | image: gettyimages/spark:1.4.1-hadoop-2.6 42 | command: /usr/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077 43 | hostname: worker 44 | environment: 45 | SPARK_CONF_DIR: /conf 46 | SPARK_WORKER_CORES: 1 47 | SPARK_WORKER_MEMORY: 1g 48 | SPARK_WORKER_PORT: 8881 49 | SPARK_WORKER_WEBUI_PORT: 8081 50 | links: 51 | - kafka 52 | - sparkmaster 53 | - cassandra 54 | ports: 55 | - "8081" 56 | volumes: 57 | - ./conf/worker:/conf 58 | - ./data:/tmp/data 59 | 60 | twitterkafkaproducer: 61 | image: rogaha/twitter-kafka-producer 62 | restart: always 63 | command: /start.sh 64 | hostname: twitterkafkaproducer 65 | environment: 66 | SPARK_CONF_DIR: /conf 67 | ACCESS_TOKEN: "" 68 | ACCESS_TOKEN_SECRET: "" 69 | CONSUMER_KEY: "" 70 | CONSUMER_SECRET: "" 71 | KEYWORDS_LIST: "" 72 | KAFKA_TOPIC_NAME: "tweets" 73 | links: 74 | - kafka 75 | 76 | webserver: 77 | image: rogaha/twitter-demo-webserver 78 | restart: always 79 | command: /start.sh 80 | hostname: webserver 81 | links: 82 | - cassandra 83 | ports: 84 | - "80:5000" 85 | 86 | sparkjob: 87 | image: rogaha/spark-job 88 | restart: always 89 | command: /spark-job/start.sh 90 | hostname: spark-job 91 | environment: 92 | SPARK_CONF_DIR: /conf 93 | KAFKA_TOPIC_NAME: "tweets" 94 | GOOGLE_GEOCODING_API_KEY: "." 
95 | links: 96 | - kafka 97 | - sparkmaster 98 | - cassandra 99 | 100 | -------------------------------------------------------------------------------- /images/project-architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rogaha/data-processing-pipeline/051473115d367090ab65b59dd25dc482c03c10fe/images/project-architecture.png -------------------------------------------------------------------------------- /images/screenshot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rogaha/data-processing-pipeline/051473115d367090ab65b59dd25dc482c03c10fe/images/screenshot.png -------------------------------------------------------------------------------- /kafka_producer/Dockerfile: -------------------------------------------------------------------------------- 1 | # Author: Roberto Gandolfo Hashioka 2 | # Date: 07/22/2015 3 | 4 | FROM stackbrew/ubuntu:14.04 5 | MAINTAINER Roberto G. Hashioka "roberto_hashioka@hotmail.com" 6 | 7 | # Install Pip 8 | RUN apt-get update 9 | RUN apt-get install -y python-pip 10 | 11 | # Install and configure python packages 12 | ADD requirements.txt /build/ 13 | RUN pip install -r /build/requirements.txt 14 | 15 | # Copy python app 16 | ADD ./twitter_kafka_producer.py / 17 | ADD ./start.sh / 18 | 19 | # Start the Kafka producer process 20 | CMD ["/usr/bin/python","/twitter_kafka_producer.py"] -------------------------------------------------------------------------------- /kafka_producer/requirements.txt: -------------------------------------------------------------------------------- 1 | kafka-python 2 | tweepy -------------------------------------------------------------------------------- /kafka_producer/start.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Wait 10 seconds before trying to connect to kafka 4 | echo "Sleeping 10 seconds..." 5 | sleep 10 6 | 7 | exec /twitter_kafka_producer.py -------------------------------------------------------------------------------- /kafka_producer/twitter_kafka_producer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import os 4 | from kafka.client import KafkaClient 5 | from kafka.producer import SimpleProducer 6 | from time import sleep 7 | 8 | #Import the necessary methods from tweepy library 9 | from tweepy.streaming import StreamListener 10 | from tweepy import OAuthHandler 11 | from tweepy import Stream 12 | 13 | # List of the interested topics 14 | KEYWORDS_LIST = [] 15 | KEYWORDS_LIST += filter( 16 | bool, os.environ.get('KEYWORDS_LIST', '').split(',')) 17 | 18 | #Variables that contains the user credentials to access Twitter API 19 | ACCESS_TOKEN = os.environ.get('ACCESS_TOKEN') 20 | ACCESS_TOKEN_SECRET = os.environ.get('ACCESS_TOKEN_SECRET') 21 | CONSUMER_KEY = os.environ.get('CONSUMER_KEY') 22 | CONSUMER_SECRET = os.environ.get('CONSUMER_SECRET') 23 | 24 | # Kafka Configurations 25 | KAFKA_ENV_KAFKA_ADVERTISED_HOST_NAME = os.environ.get('KAFKA_ENV_KAFKA_ADVERTISED_HOST_NAME') 26 | KAFKA_PORT_9092_TCP_PORT = os.environ.get('KAFKA_PORT_9092_TCP_PORT') 27 | KAFKA_TOPIC_NAME = os.environ.get('KAFKA_TOPIC_NAME') 28 | 29 | client = KafkaClient('{0}:{1}'.format(KAFKA_ENV_KAFKA_ADVERTISED_HOST_NAME, KAFKA_PORT_9092_TCP_PORT)) 30 | producer = SimpleProducer(client) 31 | 32 | #This is a basic listener that just prints received tweets to stdout. 
33 | class StdOutListener(StreamListener): 34 | 35 | def on_data(self, data): 36 | print "message sent to Kafka" 37 | producer.send_messages(KAFKA_TOPIC_NAME, str(data)) 38 | return True 39 | 40 | 41 | class Producer(object): 42 | 43 | def run(self, filters): 44 | #This handles Twitter authetification and the connection to Twitter Streaming API 45 | l = StdOutListener() 46 | auth = OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET) 47 | auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET) 48 | stream = Stream(auth, l) 49 | #This line filter Twitter Streams to capture data by the keywords: 'python', 'javascript', 'ruby' 50 | stream.filter(track=filters) 51 | 52 | 53 | def main(): 54 | twitter_producer = Producer() 55 | twitter_producer.run(KEYWORDS_LIST) 56 | 57 | 58 | if __name__ == "__main__": 59 | main() 60 | -------------------------------------------------------------------------------- /spark-streaming-kafka-cassandra/Dockerfile: -------------------------------------------------------------------------------- 1 | # Author: Roberto Gandolfo Hashioka 2 | # Date: 07/22/2015 3 | 4 | FROM gettyimages/spark:1.4.0-hadoop-2.6 5 | MAINTAINER Roberto G. Hashioka "roberto_hashioka@hotmail.com" 6 | 7 | # Add SBT package for Spark development 8 | RUN echo "deb http://dl.bintray.com/sbt/debian /" | tee -a /etc/apt/sources.list.d/sbt.list 9 | RUN apt-get update 10 | RUN apt-get install -y --force-yes sbt git \ 11 | && apt-get clean \ 12 | && rm -rf /var/lib/apt/lists/* 13 | 14 | # Copy the project source code 15 | ADD ./project /spark-job/project 16 | ADD ./src /spark-job/src 17 | ADD ./build.sbt /spark-job/ 18 | ADD ./version.sbt /spark-job/ 19 | ADD ./start.sh /spark-job/ 20 | 21 | WORKDIR /spark-job 22 | 23 | # Compile the spark job 24 | RUN sbt assembly 25 | 26 | CMD ["./start.sh"] -------------------------------------------------------------------------------- /spark-streaming-kafka-cassandra/build.sbt: -------------------------------------------------------------------------------- 1 | import AssemblyKeys._ 2 | 3 | name := "TwitterProcessingPipeLine" 4 | 5 | version := "0.1" 6 | 7 | scalaVersion := "2.10.4" 8 | 9 | resolvers += "scalaz-bintray" at "https://dl.bintray.com/scalaz/releases" 10 | 11 | // kafka streaming related dependencies 12 | libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0" 13 | libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.4.0" 14 | libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.4.0" 15 | libraryDependencies += "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.4.0" 16 | libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.4.0-M1" 17 | libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.11" 18 | libraryDependencies += "com.eaio.uuid" % "uuid" % "3.2" 19 | 20 | // geocode related dependencies 21 | libraryDependencies ++= Seq( 22 | "net.databinder.dispatch" %% "dispatch-core" % "0.11.2", 23 | "io.spray" %% "spray-json" % "1.2.6", 24 | "org.specs2" %% "specs2-core" % "3.6.1" % "test" 25 | ) 26 | 27 | assemblySettings 28 | 29 | mergeStrategy in assembly := { 30 | case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard 31 | case m if m.toLowerCase.matches("meta-inf.*\\.sf$") => MergeStrategy.discard 32 | case "log4j.properties" => MergeStrategy.discard 33 | case m if m.toLowerCase.startsWith("meta-inf/services/") => MergeStrategy.filterDistinctLines 34 | case "reference.conf" => MergeStrategy.concat 35 | case _ => MergeStrategy.first 36 | } 37 | 
-------------------------------------------------------------------------------- /spark-streaming-kafka-cassandra/project/build.properties: -------------------------------------------------------------------------------- 1 | sbt.version=0.13.8 2 | -------------------------------------------------------------------------------- /spark-streaming-kafka-cassandra/project/plugins.sbt: -------------------------------------------------------------------------------- 1 | resolvers += "TNM" at "http://nexus.thenewmotion.com/content/groups/public" 2 | 3 | addSbtPlugin("com.thenewmotion" % "sbt-build-seed" % "0.9.1") 4 | addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2") 5 | -------------------------------------------------------------------------------- /spark-streaming-kafka-cassandra/src/main/scala/org/sevenmob/geocode/Formats.scala: -------------------------------------------------------------------------------- 1 | package org.sevenmob.geocode 2 | 3 | import spray.json._, DefaultJsonProtocol._ 4 | 5 | 6 | private[geocode] object Formats { 7 | 8 | implicit val ResponseStatusFmt: JsonFormat[Status] = 9 | lift((js: JsValue) => js match { 10 | case JsString("OK") => Ok 11 | case JsString("ZERO_RESULTS") => ZeroResults 12 | case JsString("OVER_QUERY_LIMIT") => OverQuotaLimit 13 | case JsString("REQUEST_DENIED") => Denied 14 | case JsString("INVALID_REQUEST") => InvalidRequest 15 | case x => deserializationError(s"Expected response status, but got $x") 16 | }) 17 | 18 | implicit val AddressFmt = jsonFormat3(Address) 19 | implicit val PointFmt = jsonFormat2(Point) 20 | implicit val RectangleFmt = jsonFormat2(Rectangle) 21 | implicit val GeometryFmt = jsonFormat4(Geometry) 22 | implicit val ResponseResultFmt = jsonFormat4(Result) 23 | implicit val GeocodeResponseFmt = jsonFormat2(Response) 24 | 25 | val read = safeReader[Response].read _ 26 | 27 | def parseResult(s: String): Either[Error, List[Result]] = 28 | read(s.parseJson) match { 29 | case Right(response) => response.allResults 30 | case Left(e) => Left(OtherError(e.getMessage)) 31 | } 32 | } 33 | -------------------------------------------------------------------------------- /spark-streaming-kafka-cassandra/src/main/scala/org/sevenmob/geocode/Geocode.scala: -------------------------------------------------------------------------------- 1 | package org.sevenmob.geocode 2 | 3 | import scala.concurrent.ExecutionContext.Implicits.global 4 | import scala.concurrent.ExecutionContext 5 | import dispatch._ 6 | import Formats._ 7 | import scala.concurrent.duration._ 8 | 9 | 10 | class Geocode(http: Http = Http) { 11 | 12 | /** 13 | * This call to google service is limited 14 | * @see https://developers.google.com/maps/documentation/geocoding/#Limits 15 | */ 16 | def ?(p: Parameters)(implicit ec: ExecutionContext): Future[Either[Error, List[Result]]] = { 17 | import p._ 18 | val parameters = List( 19 | "address" -> s"$address", 20 | "key" -> s"$apikey", 21 | "sensor" -> "false" 22 | ) 23 | val req = url("https://maps.googleapis.com/maps/api/geocode/json") < x.head.geometry.location 34 | case Left(x) => Point(0,0) 35 | } 36 | } 37 | 38 | /** 39 | * This trait is a collection methods that perform a call 40 | * to the Google geocode WebService. 41 | * Just import an ExecutionContext and create an implicit GeocodeClient 42 | * to call this functions. 
43 | */ 44 | trait GeocodeCalls { 45 | 46 | import scala.concurrent.Await 47 | import scala.concurrent.duration._ 48 | 49 | /** 50 | * This call to google service is limited 51 | * @see https://developers.google.com/maps/documentation/geocoding/#Limits 52 | */ 53 | def callGeocode(p: Parameters, d: Duration) 54 | (implicit ec: ExecutionContext, client: Geocode) 55 | : Either[Error, List[Result]] = { 56 | Await.result(client ? p, d) 57 | } 58 | 59 | } 60 | -------------------------------------------------------------------------------- /spark-streaming-kafka-cassandra/src/main/scala/org/sevenmob/geocode/Parameters.scala: -------------------------------------------------------------------------------- 1 | package org.sevenmob.geocode 2 | 3 | 4 | case class Parameters(address: String, apikey: String) 5 | 6 | -------------------------------------------------------------------------------- /spark-streaming-kafka-cassandra/src/main/scala/org/sevenmob/geocode/Response.scala: -------------------------------------------------------------------------------- 1 | package org.sevenmob.geocode 2 | 3 | case class Address( 4 | long_name: String, 5 | short_name: String, 6 | types: List[String] 7 | ) 8 | 9 | case class Point( 10 | lat: Double, 11 | lng: Double 12 | ) 13 | 14 | case class Rectangle( 15 | northeast: Point, 16 | southwest: Point 17 | ) 18 | 19 | case class Geometry ( 20 | bounds: Option[Rectangle], 21 | location: Point, 22 | location_type: String, 23 | viewport: Rectangle 24 | ) 25 | 26 | case class Result( 27 | address_components: List[Address], 28 | formatted_address: String, 29 | geometry: Geometry, 30 | types: List[String] 31 | ) 32 | 33 | case class Response( 34 | results: List[Result], 35 | status: Status 36 | ) { 37 | def allResults = status match { 38 | case Ok => Right(results) 39 | case e: Error => Left(e) 40 | } 41 | } 42 | 43 | /** @see https://developers.google.com/maps/documentation/geocoding/#StatusCodes */ 44 | sealed trait Status 45 | 46 | case object Ok extends Status 47 | 48 | sealed trait Error extends Status 49 | case object ZeroResults extends Error 50 | case object OverQuotaLimit extends Error 51 | case object Denied extends Error 52 | case object InvalidRequest extends Error 53 | case class OtherError(description: String) extends Error 54 | -------------------------------------------------------------------------------- /spark-streaming-kafka-cassandra/src/main/scala/org/sevenmob/spark/streaming/CustomUUIDSerializer.scala: -------------------------------------------------------------------------------- 1 | package org.sevenmob.spark.streaming 2 | 3 | import org.json4s.CustomSerializer 4 | import org.json4s.JsonAST.{JString, JNull} 5 | 6 | import java.util.UUID 7 | 8 | case object UUIDSerialiser extends CustomSerializer[java.util.UUID](format => ( 9 | { 10 | case JString(s) => UUID.fromString(s) 11 | case JNull => null 12 | }, 13 | { 14 | case x: UUID => JString(x.toString) 15 | } 16 | ) 17 | ) 18 | -------------------------------------------------------------------------------- /spark-streaming-kafka-cassandra/src/main/scala/org/sevenmob/spark/streaming/Formats.scala: -------------------------------------------------------------------------------- 1 | package org.sevenmob.spark.streaming 2 | 3 | import java.util.UUID 4 | 5 | 6 | // Define the tweet class fields 7 | case class Tweet(title:String, 8 | retweets:BigInt, 9 | favorites:BigInt, 10 | location:String, 11 | id: UUID, 12 | lat: Double, 13 | lon: Double, 14 | id_str: String, 15 | profile_image_url: String) 16 | 
-------------------------------------------------------------------------------- /spark-streaming-kafka-cassandra/src/main/scala/org/sevenmob/spark/streaming/RealtimeIpProcessing.scala: -------------------------------------------------------------------------------- 1 | /* 2 | * Licensed to the Apache Software Foundation (ASF) under one or more 3 | * contributor license agreements. See the NOTICE file distributed with 4 | * this work for additional information regarding copyright ownership. 5 | * The ASF licenses this file to You under the Apache License, Version 2.0 6 | * (the "License"); you may not use this file except in compliance with 7 | * the License. You may obtain a copy of the License at 8 | * 9 | * http://www.apache.org/licenses/LICENSE-2.0 10 | * 11 | * Unless required by applicable law or agreed to in writing, software 12 | * distributed under the License is distributed on an "AS IS" BASIS, 13 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | * See the License for the specific language governing permissions and 15 | * limitations under the License. 16 | */ 17 | 18 | // scalastyle:off println 19 | package org.sevenmob.spark.streaming 20 | 21 | import kafka.serializer.StringDecoder 22 | 23 | import org.apache.spark.streaming._ 24 | import com.datastax.spark.connector._ 25 | import com.datastax.spark.connector.streaming._ 26 | import com.datastax.spark.connector.cql.CassandraConnector 27 | import org.apache.spark.streaming.kafka._ 28 | import org.apache.spark.SparkConf 29 | import org.apache.spark.rdd.RDD 30 | import org.apache.spark.{SparkContext, SparkConf} 31 | import org.apache.spark.sql.{Row, SQLContext} 32 | 33 | //import org.json4s_ 34 | //import org.json4s.jackson.JsonMethods._ 35 | //import org.json4s.native.JsonParser 36 | 37 | import com.eaio.uuid.UUIDGen 38 | 39 | import org.sevenmob.geocode._ 40 | 41 | /* 42 | * Consumes messages from one or more topics in Kafka process and send them to cassandra. 43 | * Usage: DirectKafkaProcessing 44 | * is a list of one or more Kafka brokers 45 | * is a list of one or more kafka topics to consume from 46 | * is hostname or IP to any of the cassandra nodes 47 | * Google Geoconding API Key 48 | * 49 | * Example: 50 | * $ bin/run-example org.sevenmob.spark.streaming.DirectKafkaProcessing broker1-host:port,broker2-host:port \ 51 | * topic1,topic2 cassandra-host apikey123 52 | */ 53 | object DirectKafkaProcessing { 54 | 55 | def main(args: Array[String]) { 56 | if (args.length < 4) { 57 | System.err.println(s""" 58 | |Usage: DirectKafkaProcessing 59 | | is a list of one or more Kafka brokers 60 | | is a list of one or more kafka topics to consume from 61 | | is hostname or IP to any of the cassandra nodes 62 | | Google geoconding API Key 63 | | 64 | """.stripMargin) 65 | System.exit(1) 66 | } 67 | 68 | //implicit val formats = DefaultFormats + UUIDSerialiser 69 | 70 | StreamingExamples.setStreamingLogLevels() 71 | 72 | val Array(brokers, topics, cassandraHost, googleAPIKey) = args 73 | 74 | // if googleAPIKey is equal to "." convert it to empty string 75 | val apiKey = googleAPIKey.replaceAll(".", "") 76 | 77 | // Define a simple cache to avoid unnecessary API calls 78 | val cache = collection.mutable.Map[String, Point]() 79 | def cachedLocation(location: String) = cache.getOrElseUpdate(location, GeocodeObj ? 
(Parameters(location, apiKey))) 80 | 81 | // Create context with 2 second batch interval 82 | val conf = new SparkConf().setAppName("DirectKafkaProcessing") 83 | .setMaster("local[*]") 84 | .set("spark.cassandra.connection.host", cassandraHost) 85 | .set("spark.cleaner.ttl", "5000") 86 | val sc = new SparkContext(conf) 87 | val ssc = new StreamingContext(sc, Seconds(2)) 88 | val sqlContext = new SQLContext(sc) 89 | 90 | val keySpaceName = "twitter" 91 | val tableName = "tweets" 92 | 93 | /* Cassandra setup */ 94 | CassandraConnector(conf).withSessionDo { session => 95 | session.execute("CREATE KEYSPACE IF NOT EXISTS " + keySpaceName + " WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }") 96 | session.execute("CREATE TABLE IF NOT EXISTS " + keySpaceName + "." + tableName + " (id timeuuid, title text, favorites int, retweets int, location text, lat double, lon double, profile_image_url text, id_str text, PRIMARY KEY (id_str, id))") 97 | } 98 | 99 | // Create direct kafka stream with brokers and topics and save the results to Cassandra 100 | val topicsSet = topics.split(",").toSet 101 | val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers) 102 | val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder]( 103 | ssc, kafkaParams, topicsSet) 104 | .map(_._2) 105 | 106 | //val parquetTable = "tweets8" 107 | 108 | stream.foreachRDD { rdd => 109 | /* this check is here to handle the empty collection error 110 | after the 3 items in the static sample data set are processed */ 111 | if (rdd.toLocalIterator.nonEmpty) { 112 | sqlContext.jsonRDD(rdd).registerTempTable("Tweets") 113 | //jsonData.write.parquet("data/" + parquetTable + "/key=" + java.lang.System.currentTimeMillis()) 114 | //val extraFieldsRDD = sc.parallelize(""" {"lat":0.0,"lon":0.0} """ :: Nil) 115 | //val extraJsonFields = sqlContext.read.json(extraFieldsRDD) 116 | //extraJsonFields.write.parquet("data/" + parquetTable + "/key=" + java.lang.System.currentTimeMillis()) 117 | //val parquetData = sqlContext.read.parquet("data/" + parquetTable) 118 | //parquetData.registerTempTable("Tweets") 119 | val tweetData = sqlContext.sql("""SELECT user.location, 120 | text, 121 | user.profile_image_url FROM Tweets""") 122 | val address = tweetData.map(t => t(0)).collect().head 123 | val id = java.util.UUID.fromString(new com.eaio.uuid.UUID().toString()) 124 | val text = tweetData.map(t => t(1)).collect().head 125 | val profileUrl = tweetData.map(t => t(2)).collect().head 126 | val p = cachedLocation(address.toString) 127 | tweetData.show() 128 | val collection = sc.parallelize(Seq(Tweet(text.toString, 129 | 0,0,address.toString, id, 130 | p.lat,p.lng, "id_str", profileUrl.toString))) 131 | collection.saveToCassandra("twitter","tweets") 132 | } 133 | } 134 | 135 | // .map { case (_, v) => 136 | // import org.sevenmob.spark.streaming.UUIDSerialiser 137 | // implicit val formats = DefaultFormats + UUIDSerialiser 138 | // JsonParser.parse(v) 139 | // } 140 | // val address = for { 141 | // JObject(child) <- jsonData 142 | // JField("location", JString(location)) <- child 143 | // } yield location 144 | // jsonData.print() 145 | // address.print() 146 | // val p = GeocodeObj ? 
(Parameters(address.toString, "")) 147 | // val extraJsonFields = parse("{\"lat\":" + p.lat + ", \"lon\": " + p.lng + "}") 148 | //.saveToCassandra("twitter","tweets") 149 | //val finalJson = jsonData ~ ("height" -> 175) 150 | // val finalJson = jsonData merge extraJsonFields 151 | //println(extraJsonFields) 152 | 153 | // Start the computation 154 | ssc.start() 155 | ssc.awaitTermination() 156 | } 157 | } 158 | 159 | -------------------------------------------------------------------------------- /spark-streaming-kafka-cassandra/src/main/scala/org/sevenmob/spark/streaming/SampleTwitterData.scala: -------------------------------------------------------------------------------- 1 | package org.sevenmob.spark.streaming 2 | 3 | object SampleTwitterData { 4 | val jsonStr = """ { 5 | "lat": 0.0, 6 | "lon": 0.0, 7 | "retweeted_status": { 8 | "created_at": "Sun Jul 19 02:04:44 +0000 2015", 9 | "id": 622587848269520896, 10 | "id_str": "622587848269520896", 11 | "text": "Awesome contributions! Thanks everyone. https:\/\/t.co\/t7ygZUl1lS", 12 | "source": "\u003ca href=\"http:\/\/twitter.com\" rel=\"nofollow\"\u003eTwitter Web Client\u003c\/a\u003e", 13 | "truncated": false, 14 | "in_reply_to_status_id": null, 15 | "in_reply_to_status_id_str": null, 16 | "in_reply_to_user_id": null, 17 | "in_reply_to_user_id_str": null, 18 | "in_reply_to_screen_name": null, 19 | "user": { 20 | "id": 776322822, 21 | "id_str": "776322822", 22 | "name": "Christian Smith", 23 | "screen_name": "anvilhacks", 24 | "location": "pacific northwest", 25 | "url": "http:\/\/anvil.io", 26 | "description": "Founder of @AnvilResearch, hacker, musician, nature boy, contrarian. Cohost @readthesource + @TheWebPlatform. #OpenID #OAuth #IoT #golang #nodejs", 27 | "protected": false, 28 | "verified": false, 29 | "followers_count": 256, 30 | "friends_count": 316, 31 | "listed_count": 34, 32 | "favourites_count": 464, 33 | "statuses_count": 913, 34 | "created_at": "Thu Aug 23 16:43:11 +0000 2012", 35 | "utc_offset": -25200, 36 | "time_zone": "Arizona", 37 | "geo_enabled": false, 38 | "lang": "en", 39 | "contributors_enabled": false, 40 | "is_translator": false, 41 | "profile_background_color": "C0DEED", 42 | "profile_background_image_url": "http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png", 43 | "profile_background_image_url_https": "https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png", 44 | "profile_background_tile": false, 45 | "profile_link_color": "0084B4", 46 | "profile_sidebar_border_color": "C0DEED", 47 | "profile_sidebar_fill_color": "DDEEF6", 48 | "profile_text_color": "333333", 49 | "profile_use_background_image": true, 50 | "profile_image_url": "http:\/\/pbs.twimg.com\/profile_images\/608392782554775552\/DPzmgNXJ_normal.jpg", 51 | "profile_image_url_https": "https:\/\/pbs.twimg.com\/profile_images\/608392782554775552\/DPzmgNXJ_normal.jpg", 52 | "profile_banner_url": "https:\/\/pbs.twimg.com\/profile_banners\/776322822\/1401989083", 53 | "default_profile": true, 54 | "default_profile_image": false, 55 | "following": null, 56 | "follow_request_sent": null, 57 | "notifications": null 58 | }, 59 | "geo": null, 60 | "coordinates": null, 61 | "place": null, 62 | "contributors": null, 63 | "quoted_status_id": 622587083497033728, 64 | "quoted_status_id_str": "622587083497033728", 65 | "quoted_status": { 66 | "created_at": "Sun Jul 19 02:01:42 +0000 2015", 67 | "id": 622587083497033728, 68 | "id_str": "622587083497033728", 69 | "text": "Weekend #hackathons w #opensource team kick ass! 
Thx @VartanSimonian @oreng @adi_ads #nodejs #docker #anvilconnect http:\/\/t.co\/v8wDozwqil", 70 | "source": "\u003ca href=\"http:\/\/www.hootsuite.com\" rel=\"nofollow\"\u003eHootsuite\u003c\/a\u003e", 71 | "truncated": false, 72 | "in_reply_to_status_id": null, 73 | "in_reply_to_status_id_str": null, 74 | "in_reply_to_user_id": null, 75 | "in_reply_to_user_id_str": null, 76 | "in_reply_to_screen_name": null, 77 | "user": { 78 | "id": 3161807689, 79 | "id_str": "3161807689", 80 | "name": "Anvil Research, Inc.", 81 | "screen_name": "AnvilResearch", 82 | "location": "", 83 | "url": "http:\/\/anvil.io", 84 | "description": "We're building the definitive identity and access hub. It's the only thing you'll need to connect everything. And it's completely open source.", 85 | "protected": false, 86 | "verified": false, 87 | "followers_count": 44, 88 | "friends_count": 179, 89 | "listed_count": 12, 90 | "favourites_count": 10, 91 | "statuses_count": 28, 92 | "created_at": "Sat Apr 18 02:04:50 +0000 2015", 93 | "utc_offset": -25200, 94 | "time_zone": "Pacific Time (US & Canada)", 95 | "geo_enabled": false, 96 | "lang": "en", 97 | "contributors_enabled": false, 98 | "is_translator": false, 99 | "profile_background_color": "C0DEED", 100 | "profile_background_image_url": "http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png", 101 | "profile_background_image_url_https": "https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png", 102 | "profile_background_tile": false, 103 | "profile_link_color": "0084B4", 104 | "profile_sidebar_border_color": "C0DEED", 105 | "profile_sidebar_fill_color": "DDEEF6", 106 | "profile_text_color": "333333", 107 | "profile_use_background_image": true, 108 | "profile_image_url": "http:\/\/pbs.twimg.com\/profile_images\/618931858579939329\/PEloKln1_normal.png", 109 | "profile_image_url_https": "https:\/\/pbs.twimg.com\/profile_images\/618931858579939329\/PEloKln1_normal.png", 110 | "profile_banner_url": "https:\/\/pbs.twimg.com\/profile_banners\/3161807689\/1436400902", 111 | "default_profile": true, 112 | "default_profile_image": false, 113 | "following": null, 114 | "follow_request_sent": null, 115 | "notifications": null 116 | }, 117 | "geo": null, 118 | "coordinates": null, 119 | "place": null, 120 | "contributors": null, 121 | "retweet_count": 0, 122 | "favorite_count": 0, 123 | "entities": { 124 | "hashtags": [ 125 | { 126 | "text": "hackathons", 127 | "indices": [ 128 | 8, 129 | 19 130 | ] 131 | }, 132 | { 133 | "text": "opensource", 134 | "indices": [ 135 | 22, 136 | 33 137 | ] 138 | }, 139 | { 140 | "text": "nodejs", 141 | "indices": [ 142 | 85, 143 | 92 144 | ] 145 | }, 146 | { 147 | "text": "docker", 148 | "indices": [ 149 | 93, 150 | 100 151 | ] 152 | }, 153 | { 154 | "text": "anvilconnect", 155 | "indices": [ 156 | 101, 157 | 114 158 | ] 159 | } 160 | ], 161 | "trends": [ 162 | 163 | ], 164 | "urls": [ 165 | { 166 | "url": "http:\/\/t.co\/v8wDozwqil", 167 | "expanded_url": "http:\/\/bit.ly\/1Thxhl3", 168 | "display_url": "bit.ly\/1Thxhl3", 169 | "indices": [ 170 | 115, 171 | 137 172 | ] 173 | } 174 | ], 175 | "user_mentions": [ 176 | { 177 | "screen_name": "VartanSimonian", 178 | "name": "Vartan Simonian", 179 | "id": 12768142, 180 | "id_str": "12768142", 181 | "indices": [ 182 | 53, 183 | 68 184 | ] 185 | }, 186 | { 187 | "screen_name": "oreng", 188 | "name": "oreng", 189 | "id": 8399312, 190 | "id_str": "8399312", 191 | "indices": [ 192 | 69, 193 | 75 194 | ] 195 | }, 196 | { 197 | "screen_name": "adi_ads", 198 | "name": "Adi", 199 | "id": 51087762, 200 | 
"id_str": "51087762", 201 | "indices": [ 202 | 76, 203 | 84 204 | ] 205 | } 206 | ], 207 | "symbols": [ 208 | 209 | ] 210 | }, 211 | "favorited": false, 212 | "retweeted": false, 213 | "possibly_sensitive": false, 214 | "filter_level": "low", 215 | "lang": "en" 216 | }, 217 | "retweet_count": 2, 218 | "favorite_count": 0, 219 | "entities": { 220 | "hashtags": [ 221 | 222 | ], 223 | "trends": [ 224 | 225 | ], 226 | "urls": [ 227 | { 228 | "url": "https:\/\/t.co\/t7ygZUl1lS", 229 | "expanded_url": "https:\/\/twitter.com\/AnvilResearch\/status\/622587083497033728", 230 | "display_url": "twitter.com\/AnvilResearch\/\u2026", 231 | "indices": [ 232 | 40, 233 | 63 234 | ] 235 | } 236 | ], 237 | "user_mentions": [ 238 | 239 | ], 240 | "symbols": [ 241 | 242 | ] 243 | }, 244 | "favorited": false, 245 | "retweeted": false, 246 | "possibly_sensitive": false, 247 | "filter_level": "low", 248 | "lang": "en" 249 | } 250 | } 251 | """ 252 | } 253 | 254 | -------------------------------------------------------------------------------- /spark-streaming-kafka-cassandra/src/main/scala/org/sevenmob/spark/streaming/StreamingExamples.scala: -------------------------------------------------------------------------------- 1 | /* 2 | * Licensed to the Apache Software Foundation (ASF) under one or more 3 | * contributor license agreements. See the NOTICE file distributed with 4 | * this work for additional information regarding copyright ownership. 5 | * The ASF licenses this file to You under the Apache License, Version 2.0 6 | * (the "License"); you may not use this file except in compliance with 7 | * the License. You may obtain a copy of the License at 8 | * 9 | * http://www.apache.org/licenses/LICENSE-2.0 10 | * 11 | * Unless required by applicable law or agreed to in writing, software 12 | * distributed under the License is distributed on an "AS IS" BASIS, 13 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | * See the License for the specific language governing permissions and 15 | * limitations under the License. 16 | */ 17 | 18 | package org.sevenmob.spark.streaming 19 | 20 | import org.apache.spark.Logging 21 | 22 | import org.apache.log4j.{Level, Logger} 23 | 24 | /** Utility functions for Spark Streaming examples. */ 25 | object StreamingExamples extends Logging { 26 | 27 | /** Set reasonable logging levels for streaming if the user has not configured log4j. */ 28 | def setStreamingLogLevels() { 29 | val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements 30 | if (!log4jInitialized) { 31 | // We first log something to initialize Spark's default logging, then we override the 32 | // logging level. 33 | logInfo("Setting log level to [WARN] for streaming example." + 34 | " To override add a custom log4j.properties to the classpath.") 35 | Logger.getRootLogger.setLevel(Level.WARN) 36 | } 37 | } 38 | } 39 | -------------------------------------------------------------------------------- /spark-streaming-kafka-cassandra/src/test/resources/loremipsum.txt: -------------------------------------------------------------------------------- 1 | Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus nec egestas tellus. Nunc efficitur nunc nunc. Fusce quis tortor sapien. Cras finibus nisl eu eros tincidunt, eget laoreet velit porta. Morbi pellentesque volutpat mollis. Quisque maximus tellus ut magna vulputate, at pharetra turpis ultricies. Donec eu quam justo. Suspendisse sit amet sollicitudin orci. Vivamus pulvinar sem in risus pulvinar dignissim. 
Nulla sit amet laoreet eros. Nullam sit amet erat dignissim, vulputate sapien at, tincidunt enim. Etiam nunc neque, condimentum eu dui at, vestibulum ornare odio. 2 | Fusce sed dolor pulvinar, euismod mauris eu, elementum purus. In gravida sollicitudin quam nec ultricies. Aenean vel nisl eget metus lobortis luctus a at erat. Suspendisse ut ipsum quam. Mauris id justo non ligula aliquam tristique. Phasellus volutpat quam at neque fringilla, sed condimentum diam maximus. Proin ut quam aliquet, convallis elit at, dignissim sem. Nam eu arcu purus. 3 | Cras et ligula ac mauris fringilla semper. Mauris interdum magna rhoncus pretium varius. Nulla fermentum est erat, eu interdum erat sodales nec. Quisque ornare suscipit eros, at tempus diam dapibus tristique. Morbi malesuada nibh ac justo faucibus volutpat. Curabitur nec lacus non neque euismod pharetra. Suspendisse odio ipsum, sodales vitae sapien ut, porta feugiat enim. Aliquam erat volutpat. Fusce elementum posuere dolor id auctor. Donec in ante pulvinar, malesuada purus non, tincidunt dui. Maecenas mollis in augue vitae vulputate. Donec condimentum fringilla auctor. 4 | Aenean efficitur metus justo, posuere placerat urna efficitur eget. Nullam et est eu nibh dapibus fringilla. Praesent lobortis tincidunt odio, nec dapibus odio faucibus sit amet. In faucibus, magna eu tincidunt consequat, velit risus bibendum ligula, nec aliquam nisl dui sodales ligula. Integer at dapibus metus, id pellentesque mauris. Vivamus eleifend nisi id mollis dapibus. Donec ut ex sed mauris consectetur feugiat. Quisque viverra quam purus, eu ornare massa iaculis vitae. Praesent fringilla dui nec arcu feugiat, ac posuere dui ullamcorper. Suspendisse nec velit a ipsum euismod malesuada eu non nibh. Mauris aliquam quis quam sit amet condimentum. 5 | Donec a sem dapibus, pretium elit at, fermentum dui. Etiam arcu ex, imperdiet tempor ex a, 6 | convallis condimentum erat. Aliquam ullamcorper ultricies eros, vitae cursus ligula viverra in. Quisque et viverra sem, eget vehicula metus. Nam rutrum leo quam, a vestibulum diam auctor at. Integer diam leo, consectetur eget rhoncus ac, facilisis sit amet tellus. Duis mattis placerat vulputate. Nunc eu aliquet tellus, in varius erat. Pellentesque elementum cursus dolor, condimentum consectetur enim sagittis ac. Donec vehicula ut mauris non porttitor. Vivamus rutrum nunc et egestas vulputate. Proin nec tempor velit. Aliquam eget augue mollis, cursus arcu sed, tincidunt nulla. Aenean feugiat arcu eu mauris cursus gravida.# 7 | -------------------------------------------------------------------------------- /spark-streaming-kafka-cassandra/src/test/scala/org/sevenmob/geocode/GeocodeSpec.scala: -------------------------------------------------------------------------------- 1 | package org.sevenmob.geocode 2 | 3 | import org.specs2.mutable.Spec 4 | import scala.concurrent._, duration._, ExecutionContext.Implicits.global 5 | 6 | 7 | class GeocodeSpec extends Spec { 8 | "Geocode" should { 9 | "find data by address" in { 10 | val geocode = new Geocode() 11 | 12 | def ?(x: Parameters) = Await.result(geocode ? x, 3.seconds) 13 | 14 | ?(Parameters("London", "")) must beRight 15 | } 16 | } 17 | } 18 | -------------------------------------------------------------------------------- /spark-streaming-kafka-cassandra/start.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Wait 15 seconds before trying to connect to cassandra 4 | echo "Sleeping 15 seconds..." 
5 | sleep 15 6 | 7 | # Start spark job 8 | spark-submit --class org.sevenmob.spark.streaming.DirectKafkaProcessing \ 9 | --master local[*] target/scala-2.10/TwitterProcessingPipeLine-assembly-0.1.jar \ 10 | $KAFKA_ENV_KAFKA_ADVERTISED_HOST_NAME:$KAFKA_PORT_9092_TCP_PORT \ 11 | $KAFKA_TOPIC_NAME \ 12 | $CASSANDRA_PORT_9042_TCP_ADDR \ 13 | $GOOGLE_GEOCODING_API_KEY 14 | -------------------------------------------------------------------------------- /spark-streaming-kafka-cassandra/version.sbt: -------------------------------------------------------------------------------- 1 | version in ThisBuild := "2.1.2-SNAPSHOT" -------------------------------------------------------------------------------- /webserver/Dockerfile: -------------------------------------------------------------------------------- 1 | # Author: Roberto Gandolfo Hashioka 2 | # Date: 07/22/2015 3 | 4 | FROM stackbrew/ubuntu:14.04 5 | MAINTAINER Roberto G. Hashioka "roberto_hashioka@hotmail.com" 6 | 7 | # Install Pip 8 | RUN apt-get update 9 | RUN apt-get install -y python-pip 10 | 11 | # Install and configure python packages 12 | ADD requirements.txt /build/ 13 | RUN pip install -r /build/requirements.txt 14 | 15 | # Copy python app 16 | ADD ./webserver.py / 17 | ADD ./start.sh / 18 | ADD ./templates /templates 19 | 20 | EXPOSE 5000 21 | 22 | # Start the Kafka producer process 23 | CMD ["./webserver.py"] 24 | 25 | 26 | -------------------------------------------------------------------------------- /webserver/requirements.txt: -------------------------------------------------------------------------------- 1 | requests 2 | flask 3 | cassandra-driver 4 | -------------------------------------------------------------------------------- /webserver/start.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Wait 15 seconds before trying to connect to cassandra 4 | echo "Sleeping 15 seconds..." 5 | sleep 15 6 | 7 | exec /webserver.py -------------------------------------------------------------------------------- /webserver/templates/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Docker HackDay Project 7 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 |
67 |
68 |

Last Minutes

69 |
70 |
71 | {#
72 | 73 |
#} 74 |
    75 |
    76 | 77 |
    78 | 79 | 233 | 234 | -------------------------------------------------------------------------------- /webserver/webserver.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import os 3 | from time import sleep 4 | import datetime 5 | import logging, sys 6 | from flask.json import JSONEncoder 7 | from flask import jsonify, request 8 | from uuid import UUID 9 | 10 | logging.basicConfig(stream=sys.stderr) 11 | 12 | from flask import Flask, render_template, Response 13 | from cassandra.cluster import Cluster 14 | from cassandra.query import dict_factory 15 | 16 | CASSANDRA_PORT_9042_TCP_ADDR = os.environ.get('CASSANDRA_PORT_9042_TCP_ADDR') 17 | 18 | cluster = Cluster([CASSANDRA_PORT_9042_TCP_ADDR]) 19 | session = cluster.connect() 20 | 21 | class UUIDEncoder(JSONEncoder): 22 | """ JSONEconder subclass used by the json render function. 23 | This is different from BaseJSONEoncoder since it also addresses 24 | encoding of UUID 25 | """ 26 | 27 | def default(self, obj): 28 | if isinstance(obj, UUID): 29 | return str(obj) 30 | else: 31 | # delegate rendering to base class method (the base class 32 | # will properly render ObjectIds, datetimes, etc.) 33 | return super(UUIDEncoder, self).default(obj) 34 | 35 | 36 | app = Flask(__name__) 37 | app.json_encoder = UUIDEncoder 38 | 39 | @app.route('/') 40 | def index(): 41 | return render_template('index.html') 42 | 43 | @app.route('/get_values') 44 | def get_values(): 45 | lookback = request.args.get('lookback', 10) 46 | 47 | session.set_keyspace('twitter') 48 | session.row_factory = dict_factory 49 | date_time = datetime.datetime.now() - datetime.timedelta(minutes=int(lookback)) 50 | date_str = date_time.strftime("%Y-%m-%d %H:%M:%S-0000") 51 | rows = session.execute("select id, title,lat,lon, location, profile_image_url from tweets where id >= maxTimeuuid('{0}') and id_str = 'id_str'".format(date_str)) 52 | return jsonify(results=rows) 53 | 54 | if __name__ == '__main__': 55 | app.run(host='0.0.0.0') 56 | --------------------------------------------------------------------------------
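Once the stack is up, the map page served at `http://<your-VM-IP>/` is fed by the `/get_values` endpoint defined in `webserver.py` above. A quick way to confirm that tweets are flowing end-to-end is to query that endpoint directly. This is a hypothetical check, not part of the repo; it assumes `requests` is installed locally and that port 80 is reachable as mapped in `docker-compose.yml`:

```python
import requests

# Replace with the output of `docker-machine ip Docker-VM` (or your server's IP)
HOST = "192.168.99.100"

# lookback is in minutes; the webserver defaults to 10
resp = requests.get("http://%s/get_values" % HOST, params={"lookback": 10})
resp.raise_for_status()
for tweet in resp.json()["results"]:
    print("%s (%s, %s)" % (tweet["title"], tweet["lat"], tweet["lon"]))
```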