├── .gitignore
├── README.md
├── build.sbt
├── images
│   └── screenshot.png
└── src
    └── main
        └── scala
            └── com
                └── slouc
                    └── sparkintro
                        └── Main.scala

/.gitignore:
--------------------------------------------------------------------------------
.idea
.cache
.classpath
project
target

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Quick guide on setting up Spark (using Scala)

This guide doesn't explain things like:
- how and why Spark works
- what RDDs, actions, and transformations are
- how to set up the JDK on your machine or how to install Scala

It is focused on getting Spark up and running in 10-15 minutes, with as few digressions as possible. There are more detailed tutorials out there, in case that's what you need (you can start with [this one](https://github.com/mbonaci/mbo-spark) or [this one](https://github.com/deanwampler/spark-scala-tutorial)).

## Setting up the project

First of all, download [sbt](https://github.com/sbt/sbt). Now you can import the project into your favourite IDE.

#### Eclipse

You will need to "eclipsify" the project. Download [sbteclipse](https://github.com/typesafehub/sbteclipse). I would recommend getting sbt 0.13+ and adding the following to the global sbt file at `~/.sbt/plugins/plugins.sbt` (instead of editing the project-specific file):

`addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "2.5.0")`

Now create a folder which you will use for your Spark project and place the `build.sbt` found in this repo inside it. Then simply run sbt and use the *eclipse* command to turn it into an Eclipse project (I am assuming that you have Scala set up in your Eclipse IDE):

    > eclipse

You can now import the created project into your Eclipse workspace as an existing project.

#### IntelliJ

If you're using IntelliJ, all you need to do is import the project. Select `File -> New -> Project from existing sources` and import it as an SBT project.

Note: You will need Scala 2.10 for these dependencies to work. At the time of writing, Scala 2.11 was still not supported by Spark.

## Just run the code

Spark is meant to be run on clusters, but to get you going, you can simply run it on your local machine with some ad-hoc configuration. Make sure to change the Spark master address to "local" in `Main.scala`:

    val conf = new SparkConf().setAppName("sparkintro").setMaster("local")
    val sc = new SparkContext(conf)

Also remove the `% "provided"` part from `build.sbt` if you want `sbt` to fetch the library for you. Later on, when we have our own binary distribution of Spark for running the master and slaves, we will need the "provided" part again, because we will be running a JAR on the Spark cluster (where the Spark dependency is already available).
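
For reference, without the `"provided"` scope the dependency line in `build.sbt` would look something like this (same coordinates as in this repo, just fetched by sbt instead of being supplied by the cluster):

    // build.sbt for local runs: sbt downloads spark-core itself
    // instead of expecting the Spark cluster to provide it
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"
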
The code that performs the calculation in the `Main.scala` example is taken from the [official examples page](https://spark.apache.org/examples.html).

## Setting up the local cluster

Of course, there's no fun in simply running a Spark job without a dedicated cluster, so we will now see how to run a cluster (master + 4 slaves). It's still not the full power of Spark, since everything runs on a single local machine, but once you get the hang of this, it should be fairly easy to scale out into the cloud or wherever you want.

First of all, you will need a binary distribution of Spark for this; you can get it [here](http://spark.apache.org/downloads.html) (make sure to select the "pre-built" version). Note that the version you choose doesn't have to be the same as the one defined in `build.sbt`, but I'm sure you're aware of the issues that can arise if you code against one version and then run against another.

The next step is to prepare the Spark environment. You can use the template config file provided with the distribution. We will also set the default number of workers to four:

    cp ./conf/spark-env.sh.template ./conf/spark-env.sh
    echo "export SPARK_WORKER_INSTANCES=4" >> ./conf/spark-env.sh

Now we can start the master and the slaves. Note that you will need an SSH daemon running on your system. Linux users can check whether it's running with:

    service ssh status

while on OS X it's:

    launchctl list | grep ssh

If the service is not running, make sure to install and start it. For Mac users it's probably enough to enable System Preferences -> Sharing -> Remote Login. Anyway, setting up the SSH daemon is not the point of this text; I'm sure you'll figure it out.

Once the SSH daemon is up, we can run the master and slave scripts:

    ./sbin/start-master.sh
    ./sbin/start-slaves.sh

There's a chance you will be asked for a password, since you're attempting to SSH into your own machine.

Once that's done, you can navigate to http://localhost:8080 to check out the state of your brand new Spark cluster. You should see something like:

![Screenshot](./images/screenshot.png)

There are some other convenient scripts for working with the daemons, such as `start-all.sh` and `stop-all.sh`, which start/stop all daemons (both master and slaves).

## Running the app on the cluster

Now we need to package our app and feed it to the running Spark cluster.

First you need to set the address of your master node in `Main.scala`:

    .setMaster("[SPARK_ADDRESS]")

(you can see your `SPARK_ADDRESS` when you navigate to the Spark console at http://localhost:8080; in my case it was `spark://sinisas-mbp-2:7077`, as seen in the screenshot)

Also, if you removed the `"provided"` part in `build.sbt`, now is a good time to bring it back, otherwise you will see tons of errors about duplicate versions.

Alright, you can now create the JAR. The easiest way to do it is with [sbt-assembly](https://github.com/sbt/sbt-assembly). You just need to add an `assembly.sbt` file with the needed plugin dependency to the `project` folder (this is already done for you in this repo), run the `sbt assembly` task, and your shiny app will be created somewhere in the `target` folder (if you didn't change anything, it should be something like `target/scala-2.10/sparkintro-assembly-1.0.jar`).
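
If you are setting this up from scratch rather than cloning this repo, `assembly.sbt` typically contains nothing but the plugin declaration, along these lines (the plugin version shown here is illustrative, not necessarily the one used in this repo):

    // project/assembly.sbt - registers the sbt-assembly plugin
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")
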
**Just one more step left**. We need to feed the app to the Spark machinery.

In the Spark binary directory, issue the following command:

    PATH-TO-SPARK/bin/spark-submit --class CLASSFILE PATH-TO-JAR

Using the default values from this repo, it would be something like this (assuming it's run from the project folder):

    ./bin/spark-submit --class com.slouc.sparkintro.Main ./target/scala-2.10/sparkintro-assembly-1.0.jar

And voilà! Keep an eye on that console on port 8080 and you'll notice your hard-working slaves calculating the value of Pi for you.

--------------------------------------------------------------------------------
/build.sbt:
--------------------------------------------------------------------------------
name := "sparkintro"

version := "1.0"

scalaVersion := "2.10.6"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"

--------------------------------------------------------------------------------
/images/screenshot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/slouc/setting-up-spark/bf2a962ff1fddb661150dd0629743ce378f7af8d/images/screenshot.png

--------------------------------------------------------------------------------
/src/main/scala/com/slouc/sparkintro/Main.scala:
--------------------------------------------------------------------------------
package com.slouc.sparkintro

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Basic Spark implementation of Pi estimation
 * (calculation code available at https://spark.apache.org/examples.html)
 *
 * @author slouc
 *
 */
object Main {

  def main(args: Array[String]) {

    val numSamples = 10 * 1000 * 1000 // ten million samples
    val conf = new SparkConf().setAppName("sparkintro").setMaster("spark://sinisas-mbp-2:7077")
    val sc = new SparkContext(conf)

    // sample random points in the unit square and count how many fall
    // inside the quarter circle x * x + y * y < 1
    val count = sc.parallelize(1 to numSamples).map { i =>
      val x = Math.random()
      val y = Math.random()
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)

    sc.stop()

    println("Pi is roughly " + 4.0 * count / numSamples)

  }
}
--------------------------------------------------------------------------------
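
As a side note on why the estimate works: a point sampled uniformly from the unit square falls inside the quarter circle `x * x + y * y < 1` with probability Pi/4, so `4.0 * count / numSamples` converges to Pi as the sample count grows. A minimal plain-Scala sketch of the same idea (no Spark involved, purely for intuition; the object name is made up for illustration):

    // Monte Carlo Pi estimate without Spark; numSamples is kept small
    // since this runs on a single thread
    object PiLocal extends App {
      val numSamples = 1000000
      val count = (1 to numSamples).count { _ =>
        val x = Math.random()
        val y = Math.random()
        x * x + y * y < 1
      }
      println("Pi is roughly " + 4.0 * count / numSamples)
    }
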