├── .gitignore
├── ApacheSparkTutorial-01-SetupLearningEnvironment.ipynb
├── LICENSE
├── README.md
└── data
    └── people.json
/.gitignore:
--------------------------------------------------------------------------------
1 | *.class
2 | *.log
3 |
--------------------------------------------------------------------------------
/ApacheSparkTutorial-01-SetupLearningEnvironment.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "Apache Spark Tutorial @ Learning Journal\n",
8 | "\n",
9 | "---"
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "## Spark-01 - Setup Learning Environment\n",
17 | "\n",
18 | "This tutorial outlines the step-by-step process to set up standalone Spark on a Linux machine. It then integrates Jupyter Notebook with Spark. By the end of this tutorial, you will be able to access Apache Spark using the following methods:\n",
19 | "1. Spark Shell\n",
20 | "2. PySpark\n",
21 | "3. Scala Notebook\n",
22 | "4. Python Notebook\n",
23 | "\n",
24 | "A complete video compilation of this tutorial is available on [YouTube](https://www.youtube.com/watch?v=AYZCpxYVxH4&list=PLkz1SCf5iB4dXiPdFD4hXwheRGRwhmd6K&index=1)."
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "### 1. Download and Install JDK"
32 | ]
33 | },
34 | {
35 | "cell_type": "markdown",
36 | "metadata": {
37 | "collapsed": true
38 | },
39 | "source": [
40 | "``` bash\n",
41 | "wget -c --header \"Cookie: oraclelicense=accept-securebackup-cookie\" http://download.oracle.com/otn-pub/java/jdk/8u144-b01/090f390dda5b47b9b721c7dfaa008135/jdk-8u144-linux-x64.rpm\n",
42 | "yum localinstall jdk-8u144-linux-x64.rpm\n",
43 | "```"
44 | ]
45 | },
46 | {
47 | "cell_type": "markdown",
48 | "metadata": {},
49 | "source": [
50 | "### 2. Download and Install Spark"
51 | ]
52 | },
53 | {
54 | "cell_type": "markdown",
55 | "metadata": {
56 | "collapsed": true
57 | },
58 | "source": [
59 | "``` bash\n",
60 | "wget -c https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz\n",
61 | "mkdir ~/spark\n",
62 | "tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz -C ~/spark/\n",
63 | "```"
64 | ]
65 | },
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {},
69 | "source": [
70 | "### 3. Set Environment variables"
71 | ]
72 | },
73 | {
74 | "cell_type": "markdown",
75 | "metadata": {},
76 | "source": [
77 | "Add the following lines to your ***.bash_profile*** or ***.bashrc***:"
78 | ]
79 | },
80 | {
81 | "cell_type": "markdown",
82 | "metadata": {
83 | "collapsed": true
84 | },
85 | "source": [
86 | "```bash\n",
87 | "export SPARK_HOME=~/spark/spark-2.2.0-bin-hadoop2.7\n",
88 | "export PATH=$PATH:$SPARK_HOME/bin\n",
89 | "```"
90 | ]
91 | },
92 | {
93 | "cell_type": "markdown",
94 | "metadata": {},
95 | "source": [
96 | "### 4. Test Spark Installation"
97 | ]
98 | },
99 | {
100 | "cell_type": "markdown",
101 | "metadata": {},
102 | "source": [
103 | "You can start the Spark shell (Scala) and the PySpark shell (Python) by typing the commands below."
104 | ]
105 | },
106 | {
107 | "cell_type": "markdown",
108 | "metadata": {
109 | "collapsed": true
110 | },
111 | "source": [
112 | "```bash\n",
113 | "# Start the Spark shell (Scala)\n",
114 | "spark-shell\n",
115 | "# Start the PySpark shell (Python)\n",
116 | "pyspark\n",
117 | "```"
118 | ]
119 | },
120 | {
121 | "cell_type": "markdown",
122 | "metadata": {},
123 | "source": [
124 | "Test Spark by typing the code below in the Spark shell. The data file used in this tutorial is available in the *data* folder of this repo."
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": 4,
130 | "metadata": {},
131 | "outputs": [
132 | {
133 | "name": "stdout",
134 | "output_type": "stream",
135 | "text": [
136 | "+-----+---+\n",
137 | "| name|age|\n",
138 | "+-----+---+\n",
139 | "| Andy| 30|\n",
140 | "|Abdul| 22|\n",
141 | "|Robin| 45|\n",
142 | "|Larry| 37|\n",
143 | "+-----+---+\n",
144 | "\n"
145 | ]
146 | }
147 | ],
148 | "source": [
149 | "val df = spark.read.json(\"data/people.json\")\n",
150 | "df.filter(\"age > 21\").select(\"name\",\"age\").show()"
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": 5,
156 | "metadata": {},
157 | "outputs": [
158 | {
159 | "name": "stdout",
160 | "output_type": "stream",
161 | "text": [
162 | "+---+-----+\n",
163 | "|age| name|\n",
164 | "+---+-----+\n",
165 | "| 30| Andy|\n",
166 | "| 22|Abdul|\n",
167 | "| 45|Robin|\n",
168 | "| 37|Larry|\n",
169 | "+---+-----+\n",
170 | "\n"
171 | ]
172 | }
173 | ],
174 | "source": [
175 | "df.createOrReplaceTempView(\"people\")\n",
176 | "spark.sql(\"SELECT * FROM people where age > 21\").show()"
177 | ]
178 | },
179 | {
180 | "cell_type": "markdown",
181 | "metadata": {},
182 | "source": [
183 | "### 5. Download and Install Anaconda"
184 | ]
185 | },
186 | {
187 | "cell_type": "markdown",
188 | "metadata": {
189 | "collapsed": true
190 | },
191 | "source": [
192 | "```bash\n",
193 | "wget -c https://repo.continuum.io/archive/Anaconda3-5.0.0.1-Linux-x86_64.sh\n",
194 | "bash Anaconda3-5.0.0.1-Linux-x86_64.sh\n",
195 | "```"
196 | ]
197 | },
198 | {
199 | "cell_type": "markdown",
200 | "metadata": {},
201 | "source": [
202 | "### 6. Download and Install Apache Toree"
203 | ]
204 | },
205 | {
206 | "cell_type": "markdown",
207 | "metadata": {
208 | "collapsed": true
209 | },
210 | "source": [
211 | "```bash\n",
212 | "wget -c https://dist.apache.org/repos/dist/dev/incubator/toree/0.2.0/snapshots/dev1/toree-pip/toree-0.2.0.dev1.tar.gz\n",
213 | "pip install toree-0.2.0.dev1.tar.gz\n",
214 | "jupyter toree install --spark_home=$SPARK_HOME --interpreters=Scala,PySpark,SQL --user\n",
215 | "```"
216 | ]
217 | },
218 | {
219 | "cell_type": "markdown",
220 | "metadata": {},
221 | "source": [
222 | "### 7. Start Jupyter Server (for a local machine)"
223 | ]
224 | },
225 | {
226 | "cell_type": "markdown",
227 | "metadata": {
228 | "collapsed": true
229 | },
230 | "source": [
231 | "```bash\n",
232 | "jupyter notebook --no-browser\n",
233 | "```"
234 | ]
235 | },
236 | {
237 | "cell_type": "markdown",
238 | "metadata": {},
239 | "source": [
240 | "The Jupyter server will show you an HTTP link with a token. Copy and paste the link into your browser."
241 | ]
242 | },
243 | {
244 | "cell_type": "markdown",
245 | "metadata": {},
246 | "source": [
247 | "### 8. Start Jupyter Server (for a VM in Google Cloud)"
248 | ]
249 | },
250 | {
251 | "cell_type": "markdown",
252 | "metadata": {},
253 | "source": [
254 | "If you are using a VM in Google Cloud, perform the following steps before starting your Jupyter server.\n",
255 | "1. Upgrade the external IP address of your VM to a static IP\n",
256 | "2. Add a firewall rule to open TCP port 8888\n",
257 | "\n",
258 | "Start your Jupyter server using the command below."
259 | ]
260 | },
261 | {
262 | "cell_type": "markdown",
263 | "metadata": {
264 | "collapsed": true
265 | },
266 | "source": [
267 | "```bash\n",
268 | "jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser\n",
269 | "```"
270 | ]
271 | },
272 | {
273 | "cell_type": "markdown",
274 | "metadata": {
275 | "collapsed": true
276 | },
277 | "source": [
278 | "The Jupyter server will give you a URL.\n",
279 | "\n",
280 | "Copy the URL and replace 0.0.0.0 with your VM's external IP address. Paste the new URL into your browser."
281 | ]
282 | },
283 | {
284 | "cell_type": "markdown",
285 | "metadata": {},
286 | "source": [
287 | "### 9. Test your connection using the code below"
288 | ]
289 | },
290 | {
291 | "cell_type": "code",
292 | "execution_count": 6,
293 | "metadata": {},
294 | "outputs": [
295 | {
296 | "name": "stdout",
297 | "output_type": "stream",
298 | "text": [
299 | "+-----+---+\n",
300 | "| name|age|\n",
301 | "+-----+---+\n",
302 | "| Andy| 30|\n",
303 | "|Abdul| 22|\n",
304 | "|Robin| 45|\n",
305 | "|Larry| 37|\n",
306 | "+-----+---+\n",
307 | "\n"
308 | ]
309 | }
310 | ],
311 | "source": [
312 | "val df = spark.read.json(\"data/people.json\")\n",
313 | "df.filter(\"age > 21\").select(\"name\",\"age\").show()"
314 | ]
315 | },
316 | {
317 | "cell_type": "code",
318 | "execution_count": 7,
319 | "metadata": {
320 | "collapsed": true
321 | },
322 | "outputs": [],
323 | "source": [
324 | "df.createOrReplaceTempView(\"people\")"
325 | ]
326 | },
327 | {
328 | "cell_type": "code",
329 | "execution_count": 8,
330 | "metadata": {},
331 | "outputs": [
332 | {
333 | "data": {
334 | "text/plain": [
335 | "+---+-----+\n",
336 | "|age| name|\n",
337 | "+---+-----+\n",
338 | "| 30| Andy|\n",
339 | "| 22|Abdul|\n",
340 | "+---+-----+\n",
341 | "\n"
342 | ]
343 | },
344 | "execution_count": 8,
345 | "metadata": {},
346 | "output_type": "execute_result"
347 | }
348 | ],
349 | "source": [
350 | "%%sql\n",
351 | "select * from people where name like 'A%'"
352 | ]
353 | },
354 | {
355 | "cell_type": "markdown",
356 | "metadata": {},
357 | "source": [
358 | "### 10. What's Next"
359 | ]
360 | },
361 | {
362 | "cell_type": "markdown",
363 | "metadata": {},
364 | "source": [
365 | "1. Get more details about [how to use Jupyter Notebooks](https://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/examples_index.html)\n",
366 | "2. Download this notebook from [NB Viewer](https://nbviewer.jupyter.org/github/LearningJournal/Spark-Tutorials/blob/master/ApacheSparkTutorial-01-SetupLearningEnvironment.ipynb)\n",
367 | "3. Get more video tutorials - [Learning Journal @ YouTube](http://www.youtube.com/learningjournalin)"
368 | ]
369 | }
370 | ],
371 | "metadata": {
372 | "kernelspec": {
373 | "display_name": "Apache Toree - Scala",
374 | "language": "scala",
375 | "name": "apache_toree_scala"
376 | },
377 | "language_info": {
378 | "file_extension": ".scala",
379 | "name": "scala",
380 | "version": "2.11.8"
381 | }
382 | },
383 | "nbformat": 4,
384 | "nbformat_minor": 2
385 | }
386 |
--------------------------------------------------------------------------------
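Steps 2 and 3 of the notebook above repeat the Spark archive name in three places (the `wget` URL, the `tar` command, and `SPARK_HOME`), so a mismatch between them is easy to introduce. The following is a minimal sketch, not part of the original tutorial, that derives the archive and directory names from the download URL once and adds a small check for the environment variables. The helper names `install_spark` and `check_spark_env` and the variables `SPARK_URL`, `SPARK_TGZ`, and `SPARK_DIR` are hypothetical, introduced here only for illustration.

```shell
# Sketch only: derive every name from the download URL used in step 2,
# so the wget, tar, and SPARK_HOME steps cannot drift apart.
SPARK_URL="https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz"
SPARK_TGZ="${SPARK_URL##*/}"      # archive file name (last URL component)
SPARK_DIR="${SPARK_TGZ%.tgz}"     # directory created when the archive is unpacked
# In ~/.bashrc this corresponds to: export SPARK_HOME=~/spark/$SPARK_DIR

# Hypothetical helper: download and unpack Spark under ~/spark (steps 2).
# It is only defined here, not called, so nothing is downloaded on load.
install_spark() {
    wget -c "$SPARK_URL" &&
    mkdir -p "$HOME/spark" &&
    tar -zxf "$SPARK_TGZ" -C "$HOME/spark/"
}

# Hypothetical helper: report whether the variables from step 3 look usable.
# Prints one status line and returns non-zero on the first problem found.
check_spark_env() {
    if [ -z "${SPARK_HOME:-}" ]; then
        echo "SPARK_HOME is not set"
        return 1
    fi
    if [ ! -d "$SPARK_HOME/bin" ]; then
        echo "SPARK_HOME/bin not found: $SPARK_HOME/bin"
        return 1
    fi
    case ":$PATH:" in
        *":$SPARK_HOME/bin:"*) echo "Spark environment looks OK" ;;
        *) echo "SPARK_HOME/bin is not on PATH"; return 1 ;;
    esac
}
```

After sourcing this in a new shell, running `check_spark_env` before `spark-shell` or `pyspark` gives a quick hint about which step to revisit if the shells do not start.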
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "{}"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright {yyyy} {name of copyright owner}
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Apache Spark Tutorials
2 | Example Code & Notebooks for Apache Spark Tutorials @ Learning Journal
3 | Visit https://learningjournal.guru/ for Video Tutorials.
4 |
--------------------------------------------------------------------------------
/data/people.json:
--------------------------------------------------------------------------------
1 | {"name":"Michael"}
2 | {"name":"Andy", "age":30}
3 | {"name":"Justin", "age":19}
4 | {"name":"Abdul", "age":22}
5 | {"name":"Robin", "age":45}
6 | {"name":"Tina", "age":18}
7 | {"name":"Larry", "age":37}
--------------------------------------------------------------------------------
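To see which rows the notebook's `age > 21` filter should return before Spark is installed, the same selection can be approximated with standard shell tools. This is a rough sketch under stated assumptions, not how Spark evaluates the query: it assumes one flat JSON record per line with `name` before `age`, names without spaces, and integer ages, as in `data/people.json`. The function name `people_over` is hypothetical. Records without an `age` field (such as Michael) are skipped, which matches the notebook output, where rows with a missing age do not pass the filter.

```shell
# Hypothetical helper: print "name age" for records whose age exceeds $1.
# Reads line-delimited JSON records on stdin (sketch, not a JSON parser).
people_over() {
    # sed keeps only lines that have both a name and a numeric age,
    # rewriting each to "name age"; awk then applies the numeric filter.
    sed -n 's/.*"name":"\([^"]*\)".*"age":\([0-9][0-9]*\).*/\1 \2/p' |
        awk -v min="$1" '$2 > min { print $1, $2 }'
}

# Usage with the contents of data/people.json supplied inline.
people_over 21 <<'EOF'
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
{"name":"Abdul", "age":22}
{"name":"Robin", "age":45}
{"name":"Tina", "age":18}
{"name":"Larry", "age":37}
EOF
```

With the repository checked out, `people_over 21 < data/people.json` should list Andy, Abdul, Robin, and Larry, the same four rows shown in the notebook's output cells.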