├── .gitignore
├── README.md
├── SparkTuning.md
├── Vagrantfile
├── build.sbt
├── project
│   ├── assembly.sbt
│   ├── build.properties
│   └── plugins.sbt
└── src
    ├── main
    │   ├── resources
    │   │   ├── GeoLite2-City.mmdb
    │   │   ├── GeoLite2-Country.mmdb
    │   │   ├── broker-defaults.properties
    │   │   ├── consumer-defaults.properties
    │   │   ├── log4j.properties
    │   │   ├── producer-defaults.properties
    │   │   └── reference.conf
    │   └── scala
    │       └── com
    │           └── cloudwick
    │               ├── cassandra
    │               │   ├── Cassandra.scala
    │               │   ├── CassandraExecutionContext.scala
    │               │   ├── CassandraLocationVisitServiceModule.scala
    │               │   ├── CassandraLogVolumeServiceModule.scala
    │               │   ├── CassandraService.scala
    │               │   ├── CassandraStatusCountServiceModule.scala
    │               │   ├── ConfigurableCassandraManager.scala
    │               │   ├── schema
    │               │   │   ├── LocationVisit.scala
    │               │   │   ├── LogVolume.scala
    │               │   │   └── StatusCount.scala
    │               │   └── service
    │               │       ├── LocationVisitServiceModule.scala
    │               │       ├── LogVolumeServiceModule.scala
    │               │       └── StatusCountServiceModule.scala
    │               ├── logging
    │               │   └── Logging.scala
    │               └── spark
    │                   ├── embedded
    │                   │   ├── KafkaServer.scala
    │                   │   └── ZookeeperServer.scala
    │                   ├── examples
    │                   │   ├── core
    │                   │   │   ├── WordCount.scala
    │                   │   │   ├── WordCountRunner.scala
    │                   │   │   └── package.scala
    │                   │   └── streaming
    │                   │       ├── kafka
    │                   │       │   ├── KafkaWordCount.scala
    │                   │       │   └── StatefulKafkaWordCount.scala
    │                   │       ├── kinesis
    │                   │       │   └── KinesisWordCount.scala
    │                   │       └── local
    │                   │           ├── NetworkWordCount.scala
    │                   │           ├── NetworkWordCountRunner.scala
    │                   │           ├── NetworkWordCountWindowed.scala
    │                   │           ├── NetworkWordCountWindowedRunner.scala
    │                   │           └── RecoverableNetworkWordCount.scala
    │                   └── loganalysis
    │                       ├── LogAnalyzer.scala
    │                       ├── LogAnalyzerRunner.scala
    │                       ├── LogAnalyzerStreamingRunner.scala
    │                       └── LogEvent.scala
    └── test
        └── scala
            ├── com
            │   └── cloudwick
            │       └── spark
            │           ├── examples
            │           │   ├── core
            │           │   │   └── WordCountSpec.scala
            │           │   └── streaming
            │           │       └── local
            │           │           ├── NetworkWordCountSpec.scala
            │           │           └── NetworkWordCountWindowedSpec.scala
            │           ├── loganalysis
            │           │   └── LogAnalyzerSpec.scala
            │           └── sparkspec
            │               ├── SparkSpec.scala
            │               ├── SparkSqlSpec.scala
            │               └── SparkStreamingSpec.scala
            └── org
                └── apache
                    └── spark
                        └── streaming
                            └── ClockWrapper.scala
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
*.class
*.log
wiki
*.sc
src/main/scala/com/cloudwick/Random.scala

# sbt specific
dist/*
target/
lib_managed/
src_managed/
project/boot/
project/plugins/project/
project/project
project/target
.project/
.cache
.classpath
.settings

# IDE specific
.idea
.idea_modules
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Cloudwick Spark CodeBase

This repository is a collection of Spark examples and use-case implementations for various components of the Spark ecosystem, including Spark Core, Spark Streaming, Spark SQL, and Spark MLlib.

## What does this repository contain?

* Spark core examples
  * [WordCount](src/main/scala/com/cloudwick/spark/examples/core/WordCountRunner.scala)
* Spark streaming examples
  * [NetworkWordCount](src/main/scala/com/cloudwick/spark/examples/streaming/local/NetworkWordCount.scala)
  * [NetworkWordCountWindowed](src/main/scala/com/cloudwick/spark/examples/streaming/local/NetworkWordCountWindowed.scala)
  * [StatefulKafkaWordCount](src/main/scala/com/cloudwick/spark/examples/streaming/kafka/StatefulKafkaWordCount.scala)
* Spark core use-cases
  * [LogAnalytics](src/main/scala/com/cloudwick/spark/loganalysis/LogAnalyzerRunner.scala)
* Spark streaming use-cases
  * [LogAnalytics](src/main/scala/com/cloudwick/spark/loganalysis/LogAnalyzerStreamingRunner.scala)

    A simple Spark Streaming use-case that performs Apache log analysis: it can read data from Kafka or Kinesis, performs some analysis, and persists the results to Cassandra.
* Testing
  * ScalaTest spec traits for the Spark core, streaming, and SQL APIs
  * Embedded [Kafka](src/main/scala/com/cloudwick/spark/embedded/KafkaServer.scala) and [Zookeeper](src/main/scala/com/cloudwick/spark/embedded/ZookeeperServer.scala) server instances for testing

## How to download?

The simplest way is to clone the repository:

```
git clone https://github.com/cloudwicklabs/spark_codebase.git
```

## How to run these?

To run any of these examples or use-cases, you have to package them as an uber-jar (most of the examples depend on external dependencies, hence they have to be packaged as an assembly jar).

### Building an assembly jar

From the project's home directory:

```
sbt assembly
```

### Running using `spark-submit`

[`spark-submit`](https://spark.apache.org/docs/latest/submitting-applications.html) is the simplest way to submit a Spark application to a cluster, and it supports all of the cluster managers: standalone, YARN, and Mesos.

Each main class has documentation on how to run it.
--------------------------------------------------------------------------------
/SparkTuning.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cloudwicklabs/spark_codebase/e75b066165056f0a169690cf3f4e1b311a3bae8b/SparkTuning.md
--------------------------------------------------------------------------------
/Vagrantfile:
--------------------------------------------------------------------------------
# -*- mode: ruby -*-
# vi: set ft=ruby :

$script = <