├── .gitignore
├── spark_word_count.ipynb
└── README.md

/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover

# Translations
*.mo
*.pot

# Django stuff:
*.log

# Sphinx documentation
docs/_build/

# PyBuilder
target/
--------------------------------------------------------------------------------
/spark_word_count.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark import SparkContext\n",
    "sc = SparkContext()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2015-07-09 17:21:37-- http://www.gutenberg.org/cache/epub/100/pg100.txt\n",
      "Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47\n",
      "Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 5589889 (5,3M) [text/plain]\n",
      "Saving to: ‘pg100.txt’\n",
      "\n",
      "pg100.txt 100%[=====================>] 5,33M 1,12MB/s in 7,3s\n",
      "\n",
      "2015-07-09 17:21:47 (748 KB/s) - ‘pg100.txt’ saved [5589889/5589889]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget 'http://www.gutenberg.org/cache/epub/100/pg100.txt'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load the Shakespeare data into an [RDD](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD) by using [`textFile`](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.textFile)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "first line of raw_text:\t \"The Project Gutenberg EBook of The Complete Works of William Shakespeare, by\"\n",
      "total number of lines:\t 124787\n"
     ]
    }
   ],
   "source": [
    "raw_text = sc.textFile('pg100.txt', 4)\n",
    "\n",
    "# check whether the data was loaded properly:\n",
    "print u'first line of raw_text:\\t \"{}\"'.format(raw_text.first())\n",
    "print u'total number of lines:\\t {}'.format(raw_text.count())\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Spark + PySpark setup guide

This is a guide for installing and configuring an instance of Apache Spark and its Python API, PySpark, on a single machine running Ubuntu 15.04.

-- *Kristian Holsheimer, July 2015*

---

## Table of Contents

1. [Install Requirements](#requirements)

    1.1 [Install Java](#requirements-java)

    1.2 [Install Scala](#requirements-scala)

    1.3 [Install git](#requirements-git)

    1.4 [Install py4j](#requirements-py4j)

2. [Set Up Apache Spark](#spark)

    2.1 [Download source](#spark-tarball)

    2.2 [Compile source](#spark-compile)

    2.3 [Install files](#spark-install)

3. [Examples](#examples)

    3.1 [Hello World: Word Count](#examples-helloworld)

---

In order to run Spark, we need Scala, which in turn requires Java. So, let's install these requirements first.

<a name="requirements"></a>

## 1 | Install Requirements

<a name="requirements-java"></a>

### 1.1 | Install Java

```bash
$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
```

Check whether the installation was successful by running:

```bash
$ java -version
```

The output should be something like:

```bash
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
```

<a name="requirements-scala"></a>

### 1.2 | Install Scala

Download and install the deb package from scala-lang.org:

```bash
$ cd ~/Downloads
$ wget http://www.scala-lang.org/files/archive/scala-2.11.7.deb
$ sudo dpkg -i scala-2.11.7.deb
```

***Note:*** *You may want to check if there's a more recent version. At the time of this writing, 2.11.7 was the most recent stable release. Visit the [Scala download page](http://www.scala-lang.org/download/all.html) to check for updates.*

Again, let's check whether the installation was successful by running:

```bash
$ scala -version
```

which should return something like:

```bash
Scala code runner version 2.11.7 -- Copyright 2002-2013, LAMP/EPFL
```

<a name="requirements-git"></a>

### 1.3 | Install git

We shall install Apache Spark by building it from source. This procedure depends implicitly on git, so be sure to install git if you haven't already:

```bash
$ sudo apt-get -y install git
```

<a name="requirements-py4j"></a>

### 1.4 | Install py4j

PySpark requires the `py4j` python package. If you're running a virtual environment, run:

```bash
$ pip install py4j
```

otherwise, run:

```bash
$ sudo pip install py4j
```

<a name="spark"></a>

## 2 | Set Up Apache Spark

<a name="spark-tarball"></a>

### 2.1 | Download and extract source tarball

```bash
$ cd ~/Downloads
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0.tgz
$ tar xvf spark-1.6.0.tgz
```

***Note:*** *Also here, you may want to check if there's a more recent version: visit the [Spark download page](http://spark.apache.org/downloads.html).*

<a name="spark-compile"></a>

### 2.2 | Compile source

```bash
$ cd ~/Downloads/spark-1.6.0
$ sbt/sbt assembly
```

This will take a while... (approximately 20-30 minutes)

After the dust settles, you can check whether Spark installed correctly by running the following example, which should return the number π ≈ 3.14159...

```bash
$ ./bin/run-example SparkPi 10
```

This should return the line:

```bash
Pi is roughly 3.14042
```

***Note:*** *You may want to lower the verbosity level of the log4j logger. You can do so by editing the log4j properties file (assuming we're still inside the `~/Downloads/spark-1.6.0` folder):*

```bash
$ cp conf/log4j.properties.template conf/log4j.properties
$ nano conf/log4j.properties
```

*and replace the line:*

    log4j.rootCategory=INFO, console

*with:*

    log4j.rootCategory=ERROR, console

<a name="spark-install"></a>

### 2.3 | Install files

```bash
$ sudo mv ~/Downloads/spark-1.6.0 /opt/
$ sudo ln -s /opt/spark-1.6.0 /opt/spark
```

Add Spark to your environment by editing your bashrc file:

```bash
$ nano ~/.bashrc
```

Add the following lines at the bottom of this file:

```bash
# needed for Apache Spark
export SPARK_HOME=/opt/spark
export PYTHONPATH=$SPARK_HOME/python
```

Reload your bashrc to pick up these changes by running:

```bash
$ . ~/.bashrc
```

If your ipython instance somehow doesn't find these environment variables for whatever reason, you can also make sure they are set when ipython spins up. Let's add this to our ipython settings by creating a new python script named `load_spark_environment_variables.py` in the default profile startup folder:

```bash
$ nano ~/.ipython/profile_default/startup/load_spark_environment_variables.py
```

and paste the following lines in this file:

```python
import os
import sys

if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = '/opt/spark'

if '/opt/spark/python' not in sys.path:
    sys.path.insert(0, '/opt/spark/python')
```

<a name="examples"></a>

## 3 | Examples

Now we're finally ready to start running our first PySpark application. Load the Spark context by opening up a python interpreter (or ipython / ipython notebook) and running:

```python
>>> from pyspark import SparkContext
>>> sc = SparkContext()
```

The Spark context variable `sc` is your gateway to everything sparkly.

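For a quick smoke test of the freshly created context, you can distribute a small local list and run a computation on it. This is just an illustrative snippet (any small RDD job will do); the result should simply be the sum of the squares 0*0 + 1*1 + ... + 999*999:

```python
>>> rdd = sc.parallelize(range(1000))   # distribute a local list as an RDD
>>> rdd.map(lambda x: x * x).sum()      # square each element and add them up
332833500
```

If that returns 332833500 without errors, Spark is wired up correctly and you're ready for a slightly bigger job.
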
<a name="examples-helloworld"></a>

### 3.1 | Hello World: Word Count

Check out the notebook [spark_word_count.ipynb](spark_word_count.ipynb).
--------------------------------------------------------------------------------
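The notebook loads the complete works of Shakespeare into an RDD named `raw_text` but leaves its final cell empty, so the counting step itself isn't spelled out there. For reference, here is a minimal sketch of that step; it assumes the `sc` and `raw_text` variables defined in the notebook, the notebook's Python 2 kernel, and one simple choice of tokenization (lowercased words with punctuation stripped):

```python
import re

# split each line into lowercased words, dropping punctuation and empty tokens
words = raw_text.flatMap(lambda line: re.findall(r"[a-z']+", line.lower()))

# classic word count: emit (word, 1) pairs and sum the counts per word
word_counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# print the ten most frequent words
for word, count in word_counts.takeOrdered(10, key=lambda wc: -wc[1]):
    print u'{}\t{}'.format(word, count)
```

Note that Spark evaluates the `flatMap` / `map` / `reduceByKey` chain lazily: nothing is actually computed until `takeOrdered` forces the job to run.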