├── .gitignore
├── spark_word_count.ipynb
└── README.md

/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover

# Translations
*.mo
*.pot

# Django stuff:
*.log

# Sphinx documentation
docs/_build/

# PyBuilder
target/
--------------------------------------------------------------------------------
/spark_word_count.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark import SparkContext\n",
    "sc = SparkContext()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2015-07-09 17:21:37-- http://www.gutenberg.org/cache/epub/100/pg100.txt\n",
      "Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47\n",
      "Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 5589889 (5,3M) [text/plain]\n",
      "Saving to: ‘pg100.txt’\n",
      "\n",
      "pg100.txt 100%[=====================>] 5,33M 1,12MB/s in 7,3s\n",
      "\n",
      "2015-07-09 17:21:47 (748 KB/s) - ‘pg100.txt’ saved [5589889/5589889]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget 'http://www.gutenberg.org/cache/epub/100/pg100.txt'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load the Shakespeare data into an [RDD](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD) by using [`textFile`](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.textFile)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "first line of raw_text:\t \"The Project Gutenberg EBook of The Complete Works of William Shakespeare, by\"\n",
      "total number of lines:\t 124787\n"
     ]
    }
   ],
   "source": [
    "raw_text = sc.textFile('pg100.txt', 4)\n",
    "\n",
    "# check whether the data was loaded properly:\n",
    "print u'first line of raw_text:\\t \"{}\"'.format(raw_text.first())\n",
    "print u'total number of lines:\\t {}'.format(raw_text.count())\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Spark + PySpark setup guide

This is a guide for installing and configuring an instance of Apache Spark and its Python API, PySpark, on a single machine running Ubuntu 15.04.

-- *Kristian Holsheimer, July 2015*

---

## Table of Contents

1. [Install Requirements](#requirements)

    1.1 [Install Java](#requirements-java)

    1.2 [Install Scala](#requirements-scala)

    1.3 [Install git](#requirements-git)

    1.4 [Install py4j](#requirements-py4j)

2. [Set Up Apache Spark](#spark)

    2.1 [Download source](#spark-tarball)

    2.2 [Compile source](#spark-compile)

    2.3 [Install files](#spark-install)

3. [Examples](#examples)

    3.1 [Hello World: Word Count](#examples-helloworld)

---

In order to run Spark, we need Scala, which in turn requires Java. So, let's install these requirements first.

<a name="requirements"></a>

## 1 | Install Requirements

<a name="requirements-java"></a>

### 1.1 | Install Java

```bash
$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
```

Check whether the installation was successful by running:

```bash
$ java -version
```

The output should be something like:

```bash
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
```

<a name="requirements-scala"></a>

### 1.2 | Install Scala

Download and install the deb package from scala-lang.org:

```bash
$ cd ~/Downloads
$ wget http://www.scala-lang.org/files/archive/scala-2.11.7.deb
$ sudo dpkg -i scala-2.11.7.deb
```

***Note:*** *You may want to check if there's a more recent version. At the time of this writing, 2.11.7 was the most recent stable release. Visit the [Scala download page](http://www.scala-lang.org/download/all.html) to check for updates.*

Again, let's check whether the installation was successful by running:

```bash
$ scala -version
```

which should return something like:

```bash
Scala code runner version 2.11.7 -- Copyright 2002-2013, LAMP/EPFL
```

<a name="requirements-git"></a>

### 1.3 | Install git

We shall install Apache Spark by building it from source. This procedure depends implicitly on git, so be sure to install git if you haven't already:

```bash
$ sudo apt-get -y install git
```

<a name="requirements-py4j"></a>

### 1.4 | Install py4j

PySpark requires the `py4j` python package. If you're running a virtual environment, run:

```bash
$ pip install py4j
```

otherwise, run:

```bash
$ sudo pip install py4j
```

<a name="spark"></a>

## 2 | Set Up Apache Spark

<a name="spark-tarball"></a>

### 2.1 | Download and extract source tarball

```bash
$ cd ~/Downloads
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0.tgz
$ tar xvf spark-1.6.0.tgz
```

***Note:*** *Also here, you may want to check if there's a more recent version: visit the [Spark download page](http://spark.apache.org/downloads.html).*

<a name="spark-compile"></a>

### 2.2 | Compile source

```bash
$ cd ~/Downloads/spark-1.6.0
$ sbt/sbt assembly
```

This will take a while... (approximately 20-30 minutes)

After the dust settles, you can check whether Spark installed correctly by running the following example, which should return the number π ≈ 3.14159...

```bash
$ ./bin/run-example SparkPi 10
```

This should return the line:

```bash
Pi is roughly 3.14042
```

***Note:*** *You may want to lower the verbosity level of the log4j logger. You can do so by editing the log4j properties file (assuming we're still inside the `~/Downloads/spark-1.6.0` folder):*

```bash
$ cp conf/log4j.properties.template conf/log4j.properties
$ nano conf/log4j.properties
```

*and replace the line:*

    log4j.rootCategory=INFO, console

*with:*

    log4j.rootCategory=ERROR, console

<a name="spark-install"></a>

### 2.3 | Install files

```bash
$ sudo mv ~/Downloads/spark-1.6.0 /opt/
$ sudo ln -s /opt/spark-1.6.0 /opt/spark
```

Add Spark to your environment by editing your bashrc file:

```bash
$ nano ~/.bashrc
```

Add the following lines at the bottom of this file:

```bash
# needed for Apache Spark
export SPARK_HOME=/opt/spark
export PYTHONPATH=$SPARK_HOME/python
```

Reload your bashrc to pick up these changes by running:

```bash
$ . ~/.bashrc
```

If your ipython instance somehow doesn't find these environment variables for whatever reason, you can also make sure they are set when ipython spins up. Let's add this to our ipython settings by creating a new python script named `load_spark_environment_variables.py` in the default profile startup folder:

```bash
$ nano ~/.ipython/profile_default/startup/load_spark_environment_variables.py
```

and paste the following lines in this file:

```python
import os
import sys

if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = '/opt/spark'

if '/opt/spark/python' not in sys.path:
    sys.path.insert(0, '/opt/spark/python')
```

<a name="examples"></a>

## 3 | Examples

Now we're finally ready to start running our first PySpark application. Load the Spark context by opening up a python interpreter (or ipython / ipython notebook) and running:

```python
>>> from pyspark import SparkContext
>>> sc = SparkContext()
```

The Spark context variable `sc` is your gateway to everything sparkly.

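For a quick smoke test of the freshly created context, you can distribute a small local list and run a computation on it. This is just an illustrative snippet (any small RDD job will do); the result should simply be the sum of the squares 0*0 + 1*1 + ... + 999*999:

```python
>>> rdd = sc.parallelize(range(1000))   # distribute a local list as an RDD
>>> rdd.map(lambda x: x * x).sum()      # square each element and add them up
332833500
```

If that returns 332833500 without errors, Spark is wired up correctly and you're ready for a slightly bigger job.
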
<a name="examples-helloworld"></a>

### 3.1 | Hello World: Word Count

Check out the notebook [spark_word_count.ipynb](spark_word_count.ipynb).
--------------------------------------------------------------------------------
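The notebook loads the complete works of Shakespeare into an RDD named `raw_text` but leaves its final cell empty, so the counting step itself isn't spelled out there. For reference, here is a minimal sketch of that step; it assumes the `sc` and `raw_text` variables defined in the notebook, the notebook's Python 2 kernel, and one simple choice of tokenization (lowercased words with punctuation stripped):

```python
import re

# split each line into lowercased words, dropping punctuation and empty tokens
words = raw_text.flatMap(lambda line: re.findall(r"[a-z']+", line.lower()))

# classic word count: emit (word, 1) pairs and sum the counts per word
word_counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# print the ten most frequent words
for word, count in word_counts.takeOrdered(10, key=lambda wc: -wc[1]):
    print u'{}\t{}'.format(word, count)
```

Note that Spark evaluates the `flatMap` / `map` / `reduceByKey` chain lazily: nothing is actually computed until `takeOrdered` forces the job to run.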