├── .gitignore
├── README.md
├── images
    └── spark-install-22.png
├── setup_aws.md
└── spark-install.sh


/.gitignore:
--------------------------------------------------------------------------------
 1 | # Byte-compiled / optimized / DLL files
 2 | __pycache__/
 3 | *.py[cod]
 4 | *$py.class
 5 | 
 6 | # C extensions
 7 | *.so
 8 | 
 9 | # Distribution / packaging
10 | .Python
11 | env/
12 | build/
13 | develop-eggs/
14 | dist/
15 | downloads/
16 | eggs/
17 | .eggs/
18 | lib/
19 | lib64/
20 | parts/
21 | sdist/
22 | var/
23 | *.egg-info/
24 | .installed.cfg
25 | *.egg
26 | 
27 | # PyInstaller
28 | #  Usually these files are written by a python script from a template
29 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
30 | *.manifest
31 | *.spec
32 | 
33 | # Installer logs
34 | pip-log.txt
35 | pip-delete-this-directory.txt
36 | 
37 | # Unit test / coverage reports
38 | htmlcov/
39 | .tox/
40 | .coverage
41 | .coverage.*
42 | .cache
43 | nosetests.xml
44 | coverage.xml
45 | *,cover
46 | .hypothesis/
47 | 
48 | # Translations
49 | *.mo
50 | *.pot
51 | 
52 | # Django stuff:
53 | *.log
54 | local_settings.py
55 | 
56 | # Flask stuff:
57 | instance/
58 | .webassets-cache
59 | 
60 | # Scrapy stuff:
61 | .scrapy
62 | 
63 | # Sphinx documentation
64 | docs/_build/
65 | 
66 | # PyBuilder
67 | target/
68 | 
69 | # IPython Notebook
70 | .ipynb_checkpoints
71 | 
72 | # pyenv
73 | .python-version
74 | 
75 | # celery beat schedule file
76 | celerybeat-schedule
77 | 
78 | # dotenv
79 | .env
80 | 
81 | # virtualenv
82 | venv/
83 | ENV/
84 | 
85 | # Spyder project settings
86 | .spyderproject
87 | 
88 | # Rope project settings
89 | .ropeproject
90 | 
91 | .DS_Store
92 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | ## Before doing anything [Requirements]
  2 | 
  3 | **Step 1: AWS Account Setup** Before installing Spark on your computer, be sure to [set up an Amazon Web Services account](setup_aws.md). If you already have an AWS account, make sure that you can log into the [AWS Console](https://console.aws.amazon.com) with your username and password.
  4 | 
  5 |  * [AWS Setup Instructions](setup_aws.md)
  6 | 
  7 | **Step 2: Software Installation** Before you dive into these installation instructions, you need to have some software installed. Here's a table of all the software you need to install, plus the online tutorials to do so.
  8 | 
  9 | ### Requirements for Mac
 10 | 
 11 | | Name | Description | Installation Guide |
 12 | | :-- | :-- | :-- |
 13 | | Brew | The package installation soft for mac. Very helpful for this installation and in life in general. | [brew install](http://brew.sh/) |
 14 | | Anaconda | A distribution of python, with packaged modules and libraries. **Note: we recommend installing Anaconda 2 (for python 2.7)** | [anaconda install](https://docs.continuum.io/anaconda/install) |
 15 | | JDK 8 | Java Development Kit, used in both Hadoop and Spark. | just use `brew cask install java` |
 16 | 
 17 | ### Requirements for Linux
 18 | 
 19 | | Name | Description | Installation Guide |
 20 | | :-- | :-- | :-- |
 21 | | Anaconda | A distribution of python, with packaged modules and libraries. **Note: we recommend installing Anaconda 2 (for python 2.7)** | [anaconda install](https://docs.continuum.io/anaconda/install) |
 22 | | JDK 8 | Java Development Kit, used in both Hadoop and Spark. | [install for linux](https://docs.oracle.com/javase/8/docs/technotes/guides/install/linux_jdk.html) |
 23 | 
 24 | 
 25 | 
 26 | # 1. Spark / PySpark Installation
 27 | 
 28 | We are going to install Spark+Hadoop. Use the Part that corresponds to your configuration:
 29 | - 1.1. Installing Spark+Hadoop on Mac with no prior installation
 30 | - 1.2. Installing Spark+Hadoop on Linux with no prior installation
 31 | - 1.3. Use Spark+Hadoop from a prior installation
 32 | 
 33 | We'll do most of these steps from the command line. So, open a terminal and jump in !
 34 | 
 35 | **NOTE**: If you would prefer to jump right into using spark you can use the `spark-install.sh` script provided in this repo which will automatically perform the installation and set any necessary environment variables for you.  This script will install `spark-2.2.0-bin-hadoop2.7`.
 36 | 
 37 | ## 1.1. Installing Spark+Hadoop on MAC with no prior installation (using brew)
 38 | 
 39 | Be sure you have brew updated before starting: use `brew update` to update brew and brew packages to their last version.
 40 | 
 41 | 1\. Use `brew install hadoop` to install Hadoop (version 2.8.0 as of July 2017)
 42 | 
 43 | 2\. Check the hadoop installation directory by using the command:
 44 | 
 45 | ```bash
 46 | brew info hadoop
 47 | ```
 48 | 
 49 | 3\. Use `brew install apache-spark` to install Spark (version 2.2.0 as of July 2017)
 50 | 
 51 | 4\. Check the installation directory by using the command:
 52 | 
 53 | ```
 54 | brew info apache-spark
 55 | ```
 56 | 
 57 | 5\. You're done ! You can now go to the section 2 for setting up your environment and run your Spark scripts.
 58 | 
 59 | 
 60 | ## 1.2. Installing Spark+Hadoop on Linux with no prior installation
 61 | 
 62 | 1\. Go to [Apache Spark Download page](http://spark.apache.org/downloads.html). Choose the latest Spark release (2.2.0), and the package type "Pre-built for Hadoop 2.7 and later". Click on the link "Download Spark" to get the `tgz` package of the latest Spark release. On July 2017 this file was `spark-2.2.0-bin-hadoop2.7.tgz` so we will be using that in the rest of these guidelines but feel free to adapt to your version.
 63 | 
 64 | > ![spark installation website](images/spark-install-22.png)
 65 | 
 66 | 2\. Uncompress that file into `/usr/local` by typing:
 67 | 
 68 | ```
 69 | sudo tar xvzf spark-2.2.0-bin-hadoop2.7.tgz -C /usr/local/
 70 | ```
 71 | 
 72 | 3\. Create a shorter symlink of the directory that was just created using:
 73 | 
 74 | ```
 75 | sudo ln -s /usr/local/spark-2.2.0-bin-hadoop2.7 /usr/local/spark
 76 | ```
 77 | 
 78 | 4\. Go to [Apache Hadoop Download page](http://hadoop.apache.org/releases.html#Download). On the table above, click on the latest version below 3 (2.8.1 as of July 2017). Click as to download the *binary* version `tar.gz` archive, choose a mirror and download the file unto your computer.
 79 | 
 80 | 5\. Uncompress that file into `/usr/local` by typing:
 81 | 
 82 | ```
 83 | sudo tar xvzf /path_to_file/hadoop-2.8.1.tar.gz -C /usr/local/
 84 | ```
 85 | 
 86 | 6\. Create a shorter symlink of this directory using:
 87 | 
 88 | ```
 89 | sudo ln -s /usr/local/hadoop-2.8.1 /usr/local/hadoop
 90 | ```
 91 | 
 92 | 
 93 | ## 1.3. Using a prior installation of Spark+Hadoop
 94 | 
 95 | We strongly recommend you update your installation to the must recent version of Spark. As of July 2017 we used Spark 2.2.0 and Hadoop 2.8.0.
 96 | 
 97 | If you want to use another version there, all you have to do is to locate your installation directories for Spark and Hadoop, and use that in the next section 2.1 for setting up your environment.
 98 | 
 99 | 
100 | # 2. Setting up your environment
101 | 
102 | ## 2.1. Environment variables
103 | 
104 | To run Spark scripts you have to properly setup your shell environment: setting up environment variables, verifying your AWS credentials, etc.
105 | 
106 | 1\. Edit your `~/.bash_profile` to add/edit the following lines depending on your configuration. This addition will setup your environment variables `SPARK_HOME` and `HADOOP_HOME` to point out to the directories used to install Spark and Hadoop.
107 | 
108 | **For a Mac/Brew installation**, copy/paste the following lines into your `~/.bash_profile`:
109 | ```bash
110 | export SPARK_HOME=`brew info apache-spark | grep /usr | tail -n 1 | cut -f 1 -d " "`/libexec
111 | export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
112 | 
113 | export HADOOP_HOME=`brew info hadoop | grep /usr | head -n 1 | cut -f 1 -d " "`/libexec
114 | export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/:$LD_LIBRARY_PATH
115 | ```
116 | 
117 | **For the Linux installation described above**, copy/paste the following lines into your `~/.bash_profile`:
118 | ```bash
119 | export SPARK_HOME=/usr/local/spark
120 | export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
121 | 
122 | export HADOOP_HOME=/usr/local/hadoop
123 | export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/:$LD_LIBRARY_PATH
124 | ```
125 | 
126 | **For any other installation**, find what directories your Spark and Hadoop installation where, adapt the following lines to your configuration and put that into your `~/.bash_profile`:
127 | ```bash
128 | export SPARK_HOME=###COMPLETE HERE###
129 | export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
130 | 
131 | export HADOOP_HOME=###COMPLETE HERE###
132 | export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/:$LD_LIBRARY_PATH
133 | ```
134 | 
135 | While you're in `~/.bash_profile`, be sure to have two environment variables for your AWS keys. We'll use that in the assignments. Be sure you have the following lines set up (with the actual values of your AWS credentials):
136 | 
137 | ```bash
138 | export AWS_ACCESS_KEY_ID='put your access key here'
139 | export AWS_SECRET_ACCESS_KEY='put your secret access key here'
140 | ```
141 | 
142 | **Note**: After any modification to your `.bash_profile`, for your terminal to take these changes into account, you need to run `source ~/.bash_profile` from the command line. They will be automatically taken into account next time you open a new terminal.
143 | 
144 | ## 2.2. Python environment
145 | 
146 | 1\. Back to the command line, install py4j using `pip install py4j`.
147 | 
148 | 2\. To check if everything's ok, start an `ipython` console and type `import pyspark`. This will do nothing in practice, that's ok: **if it did not throw any error, then you are good to go.**
149 | 
150 | 
151 | # 3. How to run Spark python scripts
152 | 
153 | ## 3.1. How to run Spark/Python from a Jupyter Notebook
154 | 
155 | Running Spark from a jupyter notebook can require you to launch jupyter with a specific setup so that it connects seamlessly with the Spark Driver. We recommend you create a shell script `jupyspark.sh` designed specifically for doing that.
156 | 
157 | 1\. Create a file called `jupyspark.sh` somewhere under your `$PATH`, or in a directory of your liking (I usually use a `scripts/` directory under my home directory). In this file, you'll copy/paste the following lines:
158 | 
159 | ```bash
160 | #!/bin/bash
161 | export PYSPARK_DRIVER_PYTHON=jupyter
162 | export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=True --NotebookApp.ip='localhost' --NotebookApp.port=8888"
163 | 
164 | ${SPARK_HOME}/bin/pyspark \
165 | --master local[4] \
166 | --executor-memory 1G \
167 | --driver-memory 1G \
168 | --conf spark.sql.warehouse.dir="file:///tmp/spark-warehouse" \
169 | --packages com.databricks:spark-csv_2.11:1.5.0 \
170 | --packages com.amazonaws:aws-java-sdk-pom:1.10.34 \
171 | --packages org.apache.hadoop:hadoop-aws:2.8.0
172 | ```
173 | 
174 | 
175 | Save the file. Make it executable by doing `chmod 711 jupyspark.sh`. Now, whenever you want to launch a spark jupyter notebook run this script by typing `jupyspark.sh` in your terminal.
176 | 
177 | Here's how to read that script... Basically, we are going to use `pyspark` (an executable from your Spark installation) to run `jupyter` with a proper Spark context.
178 | 
179 | The first two lines :
180 | ```bash
181 | export PYSPARK_DRIVER_PYTHON=jupyter
182 | export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=True --NotebookApp.ip='localhost' --NotebookApp.port=8888"
183 | ```
184 | will set up two environment variables for `pyspark` to execute `jupyter`.
185 | 
186 | Note: If you are installing Spark on a Virtual Machine and would like to access jupyter from your host browser, you should set the NotebookApp.ip flag to `--NotebookApp.ip='0.0.0.0'` so that your VM's jupyter server will accept external connections. You can then access jupyter notebook from the host machine on port 8888.
187 | 
188 | The next line:
189 | ```bash
190 | ${SPARK_HOME}/bin/pyspark \
191 | ```
192 | is part of a multiline long command to run `pyspark` with all necessary packages and options.
193 | 
194 | The next 3 lines:
195 | ```bash
196 | --master local[4] \
197 | --executor-memory 1G \
198 | --driver-memory 1G \
199 | ```
200 | Set up the options for `pyspark` to execute locally, using all 4 cores of your computer and setting up the memory usage for Spark Driver and Executor.
201 | 
202 | The next line:
203 | ```bash
204 | --conf spark.sql.warehouse.dir="file:///tmp/spark-warehouse" \
205 | ```
206 | Is creating a directory to store Sparl SQL dataframes. This is non-necessary but has been solving a common error for loading and processing S3 data into Spark Dataframes.
207 | 
208 | The final 3 lines:
209 | ```bash
210 | --packages com.databricks:spark-csv_2.11:1.5.0 \
211 | --packages com.amazonaws:aws-java-sdk-pom:1.10.34 \
212 | --packages org.apache.hadoop:hadoop-aws:2.8.0
213 | ```
214 | add specific packages to `pyspark` to load. These packages are necessary to access AWS S3 repositories from Spark/Python and to read csv files.
215 | 
216 | **Note**: You can adapt these parameters to your own liking. See Spark page on [Submitting applications](http://spark.apache.org/docs/latest/submitting-applications.html) to tune these parameters.
217 | 
218 | 2\. Now run this script. It will open a notebook home page in your browser. From there, create a new notebook and copy/pate the following commands in your notebook:
219 | 
220 | ```python
221 | import pyspark as ps
222 | 
223 | spark = ps.sql.SparkSession.builder.getOrCreate()
224 | ```
225 | 
226 | What these lines do is to try to connect to the Spark Driver by creating a new [`SparkSession` instance](http://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/SparkSession.html).
227 | 
228 | After this point, you can use `spark` as a unique entry point for [reading files and doing spark things](https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html).
229 | 
230 | 
231 | ## 3.2. How to run Spark/Python from command line with via `spark-submit`
232 | 
233 | Instead of using jupyter notebook, if you want to run your python script (using Spark) from the command line, you will need to use an executable from the Spark suite called `spark-submit`. Again, this executable requires some options that we propose to put into a script to use whenever you need to launch a Spark-based python script.
234 | 
235 | 
236 | 1\. Create a script you would call `localsparksubmit.sh`, put it somewhere handy. Copy/paste the following content in this file:
237 | 
238 | ```bash
239 | #!/bin/bash
240 | ${SPARK_HOME}/bin/spark-submit \
241 | --master local[4] \
242 | --executor-memory 1G \
243 | --driver-memory 1G \
244 | --conf spark.sql.warehouse.dir="file:///tmp/spark-warehouse" \
245 | --packages com.databricks:spark-csv_2.11:1.5.0 \
246 | --packages com.amazonaws:aws-java-sdk-pom:1.10.34 \
247 | --packages org.apache.hadoop:hadoop-aws:2.8.0 \
248 | $@
249 | ```
250 | 
251 | See the previous section 2.1 for an explanation of these values. The final line here `$@` means that whatever you gave as an argument to this `localsparksubmit.sh` script will be used as a last argument in this command.
252 | 
253 | 2\. Whenever you want to run your script (called for instance `script.py`), you would do it by typing `localsparksubmit.sh script.py` from the command line. Make sure you put `localsparksubmit.sh` somewhere under your `$PATH`, or in a directory of your linking.
254 | 
255 | **Note**: You can adapt these parameters to your own setup. See Spark page on [Submitting applications](http://spark.apache.org/docs/latest/submitting-applications.html) to tune these parameters.
256 | 
257 | 
258 | # 4. Testing your installation
259 | 
260 | 1\. Open a new jupyter notebook (from the `jupyspark.sh` script provided above) and paste the following code:
261 | 
262 | ```python
263 | import pyspark as ps
264 | import random
265 | 
266 | spark = ps.sql.SparkSession.builder \
267 |         .appName("rdd test") \
268 |         .getOrCreate()
269 | 
270 | random.seed(1)
271 | 
272 | def sample(p):
273 |     x, y = random.random(), random.random()
274 |     return 1 if x*x + y*y < 1 else 0
275 | 
276 | count = spark.sparkContext.parallelize(range(0, 10000000)).map(sample) \
277 |              .reduce(lambda a, b: a + b)
278 | 
279 | print("Pi is (very) roughly {}".format(4.0 * count / 10000000))
280 | ```
281 | 
282 | It should output the following result :
283 | 
284 | ```
285 | Pi is (very) roughly 3.141317
286 | ```
287 | 
288 | 2\. Create a python script called `testspark.py` and paste the same lines above in it. Run this script from the command line using `localsparksubmit.sh testspark.py`. It should output the same result as above.
289 | 


--------------------------------------------------------------------------------
/images/spark-install-22.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GalvanizeDataScience/spark-install/96952f3daa3634d2353b42a31d0aa143e59c2409/images/spark-install-22.png


--------------------------------------------------------------------------------
/setup_aws.md:
--------------------------------------------------------------------------------
 1 | ## Setting Up Amazon Web Services (AWS)
 2 | 
 3 | To use Spark on data that is too big to fit on our personal computers, we can use Amazon Web Services (AWS).
 4 | 
 5 | **If you are setting up AWS for a Galvanize workshop:**
 6 | It is essential to set up your AWS account *before* the workshop begins. Amazon may take several hours (up to 24 hours in extreme cases) to approve your new AWS account, so you may not be able to participate fully if you wait until the day of the event.
 7 | 
 8 | <br>
 9 | 
10 | ## Part 1: Create an AWS account
11 | 
12 | Go to [http://aws.amazon.com/](http://aws.amazon.com/) and sign up: 
13 | 
14 | - You may sign in using your existing Amazon account or you can create a new account by selecting
15 |   **Create a free account** using the button at the right, then selecting **I am a new user**.
16 |   
17 | - Enter your contact information and confirm your acceptance of the AWS Customer Agreement.
18 | 
19 | - Once you have created an Amazon Web Services Account, you may need to accept a telephone call to verify your identity. Some people have used Google Voice successfully if you don't have or don't want to give a mobile number.
20 |   
21 | - Once you have an account, go to [http://aws.amazon.com/](http://aws.amazon.com/) and sign in. You will work primarily from the Amazon Management Console.
22 | 
23 | ## Part 2: Verify your AWS account
24 | 
25 | - Log into the [Amazon EC2 Console](https://console.aws.amazon.com/ec2) and click on the "Launch an Instance" button. If you are taken to **Step 1: Choose an Amazon Machine Image (AMI)** then your account is verified and ready to go. If not, please follow any additional verification steps indicated.
26 | 
27 | ## Part 3: Notify Instructors
28 | 
29 | - If you are setting up Spark to prepare for a Galvanize workshop, please notify your instructors in the workshop Slack channel (or by e-mail) when you have successfully created your AWS account. If you have any questions about completing the process, please reach out to us and we will help.
30 | 


--------------------------------------------------------------------------------
/spark-install.sh:
--------------------------------------------------------------------------------
  1 | #!/bin/bash
  2 | 
  3 | # Run this by typing `bash spark-install.sh`
  4 | 
  5 | echo "DOWNLOADING SPARK"
  6 | 
  7 | # Specify your shell config file
  8 | # Aliases will be appended to this file
  9 | SHELL_PROFILE="$HOME/.bashrc"
 10 | 
 11 | # Set the install location, $HOME is set by default
 12 | SPARK_INSTALL_LOCATION=$HOME
 13 | 
 14 | # Specify the URL to download Spark from
 15 | SPARK_URL=http://apache.mirrors.tds.net/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
 16 | 
 17 | # The Spark folder name should be the same as the name of the file being downloaded as specified in the SPARK_URL
 18 | SPARK_FOLDER_NAME=spark-2.1.0-bin-hadoop2.7.tgz
 19 | 
 20 | # Find the proper md5 hash from the Apache site
 21 | SPARK_MD5=50e73f255f9bde50789ad5bd657c7a71
 22 | 
 23 | # Print Disclaimer prior to running script
 24 | echo "DISCLAIMER: This is an automated script for installing Spark but you should feel responsible for what you're doing!"
 25 | echo "This script will install Spark to your home directory, modify your PATH, and add environment variables to your SHELL config file"
 26 | read -r -p "Proceed? [y/N] " response
 27 | if [[ ! $response =~ ^([yY][eE][sS]|[yY])$ ]]
 28 | then
 29 |     echo "Aborting..."
 30 |     exit 1
 31 | fi
 32 | 
 33 | # Verify that $SHELL_PROFILE is pointing to the proper file
 34 | read -r -p "Is $SHELL_PROFILE your shell profile? [y/N] " response
 35 | if [[ $response =~ ^([yY][eE][sS]|[yY])$ ]]
 36 | then
 37 |     echo "All relevent aliases will be added to this file; THIS IS IMPORTANT!"
 38 |     read -r -p "To verify, please type in the name of your shell profile (just the file name):  " response
 39 |     if [[ ! $response == $(basename $SHELL_PROFILE) ]]
 40 |     then
 41 |         echo "What you typed doesn't match $(basename $SHELL_PROFILE)!"
 42 |         echo "Please double check what shell profile you are using and alter spark-install accordingly!"
 43 |         exit 1
 44 |     fi
 45 | else
 46 |     echo "Please alter the spark-install.sh script to specify the correct file"
 47 |     exit 1
 48 | fi
 49 | 
 50 | # Create scripts folder for storing jupyspark.sh and localsparksubmit.sh
 51 | read -r -p "Would you like to create a scripts folder in your Home directory? [y/N] " response
 52 | if [[ $response =~ ^([yY][eE][sS]|[yY])$ ]]; then
 53 |     if [[ ! -d $HOME/scripts ]]
 54 |     then
 55 |         mkdir $HOME/scripts
 56 |         echo "export PATH=\$PATH:$HOME/scripts" >> $SHELL_PROFILE
 57 |     else
 58 |         echo "scripts folder already exists! Verify this folder has been added to your PATH"
 59 |     fi
 60 | else
 61 |     if [[ ! -d $HOME/scripts ]]
 62 |     then
 63 |         echo "Installing without installing jupyspark.sh and localsparksubmit.sh"
 64 |     fi
 65 | fi
 66 | 
 67 | if [[ -d $HOME/scripts ]];
 68 | then
 69 |     # Create jupyspark.sh script in your scripts folder
 70 |     read -r -p "Would you like to create the jupyspark.sh script for launching a local jupyter spark server? [y/N] " response
 71 |     if [[ $response =~ ^([yY][eE][sS]|[yY])$ ]]; then
 72 |         echo "#!/bin/bash
 73 |         export PYSPARK_DRIVER_PYTHON=jupyter
 74 |         export PYSPARK_DRIVER_PYTHON_OPTS=\"notebook --NotebookApp.open_browser=True --NotebookApp.ip='localhost' --NotebookApp.port=8888\"
 75 | 
 76 |         \${SPARK_HOME}/bin/pyspark \
 77 |         --master local[4] \
 78 |         --executor-memory 1G \
 79 |         --driver-memory 1G \
 80 |         --conf spark.sql.warehouse.dir=\"file:///tmp/spark-warehouse\" \
 81 |         --packages com.databricks:spark-csv_2.11:1.5.0 \
 82 |         --packages com.amazonaws:aws-java-sdk-pom:1.10.34 \
 83 |         --packages org.apache.hadoop:hadoop-aws:2.7.3" > $HOME/scripts/jupyspark.sh
 84 | 
 85 |         chmod +x $HOME/scripts/jupyspark.sh
 86 |     fi
 87 | 
 88 |     # Create localsparksubmit.sh script in your scripts folder
 89 |     read -r -p "Would you like to create the localsparksubmit.sh script for submittiing local python scripts through spark-submit? [y/N] " response
 90 |     if [[ $response =~ ^([yY][eE][sS]|[yY])$ ]]; then
 91 |         echo "#!/bin/bash
 92 |         \${SPARK_HOME}/bin/spark-submit \
 93 |         --master local[4] \
 94 |         --executor-memory 1G \
 95 |         --driver-memory 1G \
 96 |         --conf spark.sql.warehouse.dir=\"file:///tmp/spark-warehouse\" \
 97 |         --packages com.databricks:spark-csv_2.11:1.5.0 \
 98 |         --packages com.amazonaws:aws-java-sdk-pom:1.10.34 \
 99 |         --packages org.apache.hadoop:hadoop-aws:2.7.3 \
100 |         \$1" > $HOME/scripts/localsparksubmit.sh
101 | 
102 |         chmod +x $HOME/scripts/localsparksubmit.sh
103 |     fi
104 | fi
105 | 
106 | # Check to see if JDK is installed
107 | javac -version 2> /dev/null
108 | if [ ! $? -eq 0 ]
109 | then
110 |     # Install JDK
111 |     if [[ $(uname -s) = "Darwin" ]]
112 |     then
113 |         echo "Downloading JDK..."
114 |         brew install Caskroom/cask/java
115 |     elif [[ $(uname -s) = "Linux" ]]
116 |     then
117 |         echo "Downloading JDK..."
118 |         sudo add-apt-repository ppa:webupd8team/java
119 |         sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys EEA14886
120 |         sudo apt-get update
121 |         sudo apt-get install oracle-java8-installer
122 |     fi
123 | fi
124 | 
125 | SUCCESSFUL_SPARK_INSTALL=0
126 | SPARK_INSTALL_TRY=0
127 | 
128 | if [[ $(uname -s) = "Darwin" ]]
129 | then
130 |     echo -e "\n\tDetected Mac OS X as the Operating System\n"
131 | 
132 |     while [ $SUCCESSFUL_SPARK_INSTALL -eq 0 ]
133 |     do
134 |         curl $SPARK_URL > $SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME
135 |         # Check MD5 Hash
136 |         if [[ $(openssl md5 $SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME | sed -e "s/^.* //") == "$SPARK_MD5" ]]
137 |         then
138 |             # Unzip
139 |             tar -xzf $SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME -C $SPARK_INSTALL_LOCATION
140 |             # Remove the compressed file
141 |             rm $SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME
142 |             # Install py4j
143 |             pip install py4j
144 |             SUCCESSFUL_SPARK_INSTALL=1
145 |         else
146 |             echo 'ERROR: Spark MD5 Hash does not match'
147 |             echo "$(openssl md5 $SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME | sed -e "s/^.* //") != $SPARK_MD5"
148 |             if [ $SPARK_INSTALL_TRY -lt 3 ]
149 |             then
150 |                 echo -e '\nTrying Spark Install Again...\n'
151 |                 SPARK_INSTALL_TRY=$[$SPARK_INSTALL_TRY+1]
152 |                 echo $SPARK_INSTALL_TRY
153 |             else
154 |                 echo -e '\nSPARK INSTALL FAILED\n'
155 |                 echo -e 'Check the MD5 Hash and run again'
156 |                 exit 1
157 |             fi
158 |         fi
159 |     done
160 | elif [[ $(uname -s) = "Linux" ]]
161 | then
162 |     echo -e "\n\tDetected Linux as the Operating System\n"
163 | 
164 |     while [ $SUCCESSFUL_SPARK_INSTALL -eq 0 ]
165 |     do
166 |         curl $SPARK_URL > $SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME
167 |         # Check MD5 Hash
168 |         if [[ $(md5sum $SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME | sed -e "s/ .*$//") == "$SPARK_MD5" ]]
169 |         then
170 |             # Unzip
171 |             tar -xzf $SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME -C $SPARK_INSTALL_LOCATION
172 |             # Remove the compressed file
173 |             rm $SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME
174 |             # Install py4j
175 |             pip install py4j
176 |             SUCCESSFUL_SPARK_INSTALL=1
177 |         else
178 |             echo 'ERROR: Spark MD5 Hash does not match'
179 |             echo "$(md5sum $SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME | sed -e "s/ .*$//") != $SPARK_MD5"
180 |             if [ $SPARK_INSTALL_TRY -lt 3 ]
181 |             then
182 |                 echo -e '\nTrying Spark Install Again...\n'
183 |                 SPARK_INSTALL_TRY=$[$SPARK_INSTALL_TRY+1]
184 |                 echo $SPARK_INSTALL_TRY
185 |             else
186 |                 echo -e '\nSPARK INSTALL FAILED\n'
187 |                 echo -e 'Check the MD5 Hash and run again'
188 |                 exit 1
189 |             fi
190 |         fi
191 |     done
192 | else
193 |     echo "Unable to detect Operating System"
194 |     exit 1
195 | fi
196 | 
197 | # Remove extension from spark folder name
198 | SPARK_FOLDER_NAME=$(echo $SPARK_FOLDER_NAME | sed -e "s/.tgz$//")
199 | 
200 | echo "
201 | # Spark variables
202 | export SPARK_HOME=\"$SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME\"
203 | export PYTHONPATH=\"$SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME/python/:$PYTHONPATH\"
204 | 
205 | # Spark 2
206 | export PYSPARK_DRIVER_PYTHON=ipython
207 | export PATH=\$SPARK_HOME/bin:\$PATH
208 | alias pyspark=\"$SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME/bin/pyspark \
209 |     --conf spark.sql.warehouse.dir='file:///tmp/spark-warehouse' \
210 |     --packages com.databricks:spark-csv_2.11:1.5.0 \
211 |     --packages com.amazonaws:aws-java-sdk-pom:1.10.34 \
212 |     --packages org.apache.hadoop:hadoop-aws:2.7.3\"" >> $SHELL_PROFILE
213 | 
214 | source $SHELL_PROFILE
215 | 
216 | echo "INSTALL COMPLETE"
217 | echo "Please refer to Step 4 at https://github.com/zipfian/spark-install for testing your installation"
218 | 


--------------------------------------------------------------------------------