├── .gitignore
├── LICENSE
├── README.md
├── images
│   ├── WatchingGHRepo.png
│   └── spark-dashboard.png
├── notebooks
│   ├── 2D7GTTE62
│   │   └── note.json
│   ├── 2D87SDVHH
│   │   └── note.json
│   ├── SparkWorkshopCHOData.json
│   └── SparkWorkshopShellIntro.json
├── sample-data
│   ├── cmoa
│   │   ├── .DS_Store
│   │   ├── README
│   │   └── cmoa.csv
│   ├── penn museum
│   │   ├── README
│   │   └── all-20180121.zip
│   └── small-sample.csv
├── slides
│   └── SparkInTheDark101.pdf
├── survey-responses.csv
└── worksheets
    ├── functions-list.md
    ├── working-with-cho-data.md
    ├── working-with-spark-shell.md
    └── working-with-zeppelin.md
/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
2 |
3 | # Byte-compiled / optimized / DLL files
4 | __pycache__/
5 | *.py[cod]
6 | *$py.class
7 | metastore_db/
8 |
9 | # C extensions
10 | *.so
11 |
12 | # Distribution / packaging
13 | .Python
14 | env/
15 | build/
16 | develop-eggs/
17 | dist/
18 | downloads/
19 | eggs/
20 | .eggs/
21 | lib/
22 | lib64/
23 | parts/
24 | sdist/
25 | var/
26 | wheels/
27 | *.egg-info/
28 | .installed.cfg
29 | *.egg
30 |
31 | # PyInstaller
32 | # Usually these files are written by a python script from a template
33 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
34 | *.manifest
35 | *.spec
36 |
37 | # Installer logs
38 | pip-log.txt
39 | pip-delete-this-directory.txt
40 |
41 | # Unit test / coverage reports
42 | htmlcov/
43 | .tox/
44 | .coverage
45 | .coverage.*
46 | .cache
47 | nosetests.xml
48 | coverage.xml
49 | *.cover
50 | .hypothesis/
51 |
52 | # Translations
53 | *.mo
54 | *.pot
55 |
56 | # Django stuff:
57 | *.log
58 | local_settings.py
59 |
60 | # Flask stuff:
61 | instance/
62 | .webassets-cache
63 |
64 | # Scrapy stuff:
65 | .scrapy
66 |
67 | # Sphinx documentation
68 | docs/_build/
69 |
70 | # PyBuilder
71 | target/
72 |
73 | # Jupyter Notebook
74 | .ipynb_checkpoints
75 |
76 | # pyenv
77 | .python-version
78 |
79 | # celery beat schedule file
80 | celerybeat-schedule
81 |
82 | # SageMath parsed files
83 | *.sage.py
84 |
85 | # dotenv
86 | .env
87 |
88 | # virtualenv
89 | .venv
90 | venv/
91 | ENV/
92 |
93 | # Spyder project settings
94 | .spyderproject
95 | .spyproject
96 |
97 | # Rope project settings
98 | .ropeproject
99 |
100 | # mkdocs documentation
101 | /site
102 |
103 | # mypy
104 | .mypy_cache/
105 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2017 spark4lib
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Code4Lib 2018 Spark in the Dark 101 Workshop
2 |
3 | Welcome to the open repository, documentation and materials for the Code4Lib 2018 Spark in the Dark 101 Workshop!
4 |
5 | * When: Tuesday, February 13th, 2018, 9:00 AM - 12:00 PM
6 | * Where: [**West End Neighborhood Library**](http://2018.code4lib.org/workshops/spark-in-the-dark-101)
7 | * Workshop Materials: [github.com/spark4lib/code4lib2018](https://github.com/spark4lib/code4lib2018)
8 | * Workshop Slides: [bit.ly/C4L18SparkSlides](http://bit.ly/C4L18SparkSlides) or [in this Repository](slides/SparkInTheDark101.pdf)
9 |
10 | ## About the Workshop
11 |
12 | This is an introductory session on [Apache Spark](https://spark.apache.org/), a framework for large-scale data processing. We will introduce high-level concepts around Spark, including how Spark execution works and its relationship to other technologies for working with Big Data. Following this introduction to the theory and background, we will walk workshop participants through hands-on usage of spark-shell, Zeppelin notebooks, and Spark SQL for processing library data. The workshop will wrap up with use cases and demos for leveraging Spark within cultural heritage institutions and information organizations, connecting the building blocks learned to current projects in the real world.
13 |
14 | This workshop is a registration-only workshop as part of [Code4Lib 2018 in Washington, D.C.](http://2018.code4lib.org/). For registration information or other conference-level logistical questions, please check the [Code4Lib 2018 Registration page](http://2018.code4lib.org/general-info/attend). If you have questions about the workshop specifically, you can [contact the workshop leaders using the information below](#contact-before-during-after-the-workshop).
15 |
16 | We ask that all participants come to the workshop ready to dive in by reviewing the information in this document. If you want a sneak peek of the workshop's contents, feel free to also watch this repository (you'll be notified of updates). To watch this repository, sign into GitHub, go to this repository's home URL, and click the following button:
17 |
18 | 
19 |
20 | ## Workshop Schedule
21 |
22 | Time | Topic | Leader(s)
23 | ------------------ | -------------------------------------------------------------- | ------------------------------------------
24 | **9-9:10 AM** | Workshop Introduction, Logistics, Goals (10 minutes) | [Christina](mailto:cmharlow@stanford.edu)
25 | **9:10-9:25 AM** | Spark Theory: Optimal use cases for Spark (15 minutes) | [Audrey](mailto:audrey@dp.la)
26 | **9:25-9:40 AM** | Spark Theory: Spark Architecture (sparkitecture?) (15 minutes) | [Michael](mailto:michael@dp.la)
27 | **9:40-9:55 AM** | Spark Theory: RDD vs. DataFrame APIs (15 minutes) | [Scott](mailto:scott@dp.la)
28 | **9:55-10:15 AM** | Spark Practice: Env/setup (20 minutes) | [Mark](mailto:mb@dp.la) & [Justin](mailto:jcoyne@stanford.edu)
29 | **10:15-10:30 AM** | break (15 minutes) | n/a
30 | **10:30-10:50 AM** | Spark Practice: Working with spark-shell (20 minutes) | [Christina](mailto:cmharlow@stanford.edu) & [Audrey](mailto:audrey@dp.la)
31 | **10:50-11:10 AM** | Spark Practice: Working with zeppelin (20 minutes) | [Christina](mailto:cmharlow@stanford.edu) & [Audrey](mailto:audrey@dp.la)
32 | **11:10-11:45 AM** | Spark Practice: Interacting with Real World Data (35 minutes)  | Whole Group
33 | **11:45-Noon**     | Examples & Wrap-Up (15 minutes)                                | Whole Group
34 |
35 | ## Contact Before, During, After the Workshop
36 |
37 | If you have questions or concerns leading up to or after the workshop, please open an issue on this GitHub repository, particularly with any questions dealing with workshop preparation or any installation issues. This allows multiple workshop leaders to respond as able, and other participants can also learn (since we're sure the same questions will come up multiple times): https://github.com/spark4lib/code4lib2018/issues (this will require that you log in or create a free account with GitHub).
38 |
39 | During the workshop, we will indicate the best ways to get help or communicate a question/comment - however, this workshop is intended to be informal, so feel free to speak up or indicate you have a question at any time.
40 |
41 | ## Our Expectations of You
42 |
43 | To keep this workshop a safe and inclusive space, we ask that you review and follow the [Code4Lib 2018 Code of Conduct](http://2018.code4lib.org/conduct/) and [the Recurse Center Social Rules (aka Hacker School Rules)](https://www.recurse.com/manual#sub-sec-social-rules).
44 |
45 | ## Participant Requirements
46 |
47 | We request that all participants:
48 | - Fill out [this pre-workshop survey](https://goo.gl/forms/Ps9KhjnsauMbGdpv2) by **February 9th** to help the facilitators plan;
49 | - Bring a laptop with a modern web browser and at least 4 GB of memory (we strongly recommend at least 6 GB if you want to play with more than the intro data);
50 | - On that laptop, have the latest stable version of [Docker Community Edition](https://www.docker.com/community-edition) already installed;
51 | - Also on that laptop, have bookmarked or pulled down the latest version of this GitHub repository.
52 |
53 | We will be sending out an email with the specific Docker image information before Monday.
54 |
55 | If you have any issues with the above, please contact us ASAP using the [communication methods detailed above](#contact-before-during-after-the-workshop).
56 |
57 | ### Running the Docker Container
58 | 1. You need this GitHub repository locally. Clone it (and run `git pull origin master`) or download it, and make sure you have the latest copy.
59 | 2. Change into the top level directory of this Git repository.
60 | 3. With the latest, stable version of Docker Community Edition installed, go to your favorite shell and run: `docker pull mbdpla/sparkworkshop:latest`
61 | 4. In that same top level directory of this Git repository, now run: `docker run -p 8080:8080 -v $PWD:/code4lib2018 -e ZEPPELIN_NOTEBOOK_DIR='/code4lib2018/notebooks' mbdpla/sparkworkshop:latest`
62 |
63 | This should download and start up our Zeppelin Docker image on your machine. Check that it is running by opening a web browser and going to http://localhost:8080. This should show the Zeppelin notebook homepage with 2 notebooks loaded. **It will save notebooks directly to your GitHub repository directory, so be aware of that!**
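
Once the notebook homepage loads, a quick sanity check is to run a small PySpark paragraph against the sample data in this repository. The sketch below mirrors the first paragraphs of the included notebooks; it assumes the default PySpark interpreter in our workshop image and the volume mount from the `docker run` command above (so the repository is visible inside the container).

```python
# Minimal sanity-check paragraph for Zeppelin (default PySpark interpreter).
# The path mirrors the included notebooks and assumes the repo was mounted
# via the docker run command above.
df = spark.read.csv("code4lib2018/sample-data/small-sample.csv", header=True)

df.printSchema()   # column names are taken from the CSV header row
df.show(5)         # preview the first few rows
print(df.count())  # total number of rows
```

If the schema and a handful of rows print without errors, your environment is ready for the hands-on sessions.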
64 |
65 | We recommend re-pulling both this repository & the Docker image Monday evening, if possible, to make sure you have the latest version of the materials.
66 |
67 | The day of the workshop, we will also bring thumb drives with our workshop Docker image on them.
68 |
69 | ### Backup Option
70 |
71 | - If the wifi works, try a [free Databricks Community account online](https://databricks.com/try-databricks). This gives you a hosted notebook environment (similar to Zeppelin) with Spark infrastructure, for free.
72 | - If you’re feeling up to it, [get Zeppelin installed locally on your computer](https://zeppelin.apache.org/docs/0.7.3/install/install.html). Zeppelin comes with Spark.
73 |
74 | Either way, you'll need to pull the [data](sample-data/) & [notebooks](notebooks/) from this GitHub repository.
75 |
--------------------------------------------------------------------------------
/images/WatchingGHRepo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/spark4lib/code4lib2018/790537ca06807071cab0f51196a70edf5917b059/images/WatchingGHRepo.png
--------------------------------------------------------------------------------
/images/spark-dashboard.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/spark4lib/code4lib2018/790537ca06807071cab0f51196a70edf5917b059/images/spark-dashboard.png
--------------------------------------------------------------------------------
/notebooks/SparkWorkshopCHOData.json:
--------------------------------------------------------------------------------
1 | {"paragraphs":[{"title":"Spark & CHO Data - Spark Workshop, Code4Lib 2018","text":"%md\nSpark Workshop, code4lib 2018\nMaterials at [github.com/spark4lib/code4lib2018](https://github.com/spark4lib/code4lib2018)\n\nThe following steps should also be available, with other notes, here: [https://github.com/spark4lib/code4lib2018/tree/master/worksheets/](https://github.com/spark4lib/code4lib2018/tree/master/worksheets) .","dateUpdated":"2018-02-12T06:26:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n
Spark Workshop, code4lib 2018
Materials at github.com/spark4lib/code4lib2018
\n
The following steps should also be available, with other notes, here: https://github.com/spark4lib/code4lib2018/tree/master/worksheets/ .
\n
"}]},"apps":[],"jobName":"paragraph_1518416784089_991338542","id":"20180131-144721_1911570621","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:456"},{"title":"Introduction","text":"%md\nIn this session, we want you to apply what we learned in Spark Shell Introduction (Spark SQL, PySpark, & Zeppelin) for analyzing some provided Cultural Heritage Organization (CHO) data. \n\nYou will be working in small groups on this.","dateUpdated":"2018-02-12T06:26:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
In this session, we want you to apply what we learned in Spark Shell Introduction (Spark SQL, PySpark, & Zeppelin) for analyzing some provided Cultural Heritage Organization (CHO) data.
\n
You will be working in small groups on this.
\n
"}]},"apps":[],"jobName":"paragraph_1518416784097_975948586","id":"20180211-041642_2058564180","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:457"},{"title":"Environment Set-up Test","text":"%spark\n\nprintln(\"Spark Version: \" + sc.version)\n\nval penndata = sc.textFile(\"penn.csv\")\nval cmoadata = sc.textFile(\"cmoa.csv\")\nprintln(\"penn count: \" + penndata.count)\nprintln(\"cmoadata count: \" + cmoadata.count)","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784097_975948586","id":"20180131-144538_432654498","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:458"},{"title":"Preview CHO Data Options: CMOA","text":"%sh\n\ncat code4lib2018/sample-data/cmoa/README\nhead code4lib2018/sample-data/cmoa/cmoa.csv","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"tableHide":false,"editorSetting":{"language":"sh","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/sh","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784098_977102833","id":"20180210-010216_1168213332","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:459"},{"title":"Preview CHO Data Options: Penn","text":"%sh\n\ncat code4lib2018/sample-data/penn\\ museum/README\nunzip code4lib2018/sample-data/penn\\ museum/all-20180121.zip -d code4lib2018/sample-data/penn\\ museum/\nhead code4lib2018/sample-data/penn\\ museum/all-20180121.csv","dateUpdated":"2018-02-12T06:26:59+0000","config":{"lineNumbers":true,"tableHide":false,"editorSetting":{"language":"sh","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/sh","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784098_977102833","id":"20180212-044729_2120047837","dateCreated":"2018-02-12T06:26:24+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:460","user":"anonymous","dateFinished":"2018-02-12T06:27:01+0000","dateStarted":"2018-02-12T06:27:00+0000"},{"title":"","text":"%md\n# Part 1: Creating a Dataframe Instance from CHO CSV","dateUpdated":"2018-02-12T06:26:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":false,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Part 1: Creating a Dataframe Instance from CHO CSV
\n"}]},"apps":[],"jobName":"paragraph_1518416784098_977102833","id":"20180211-043612_1888524633","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:461"},{"title":"Read a CSV File into a Dataframe Instance with Headers","text":"pennDf = spark.read.csv(\"code4lib2018/sample-data/penn\\ museum/all-20180121.csv\", header=True, inferSchema=True)\npennDf.printSchema()","dateUpdated":"2018-02-12T06:27:03+0000","config":{"lineNumbers":true,"tableHide":false,"editorSetting":{"language":"scala","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/scala","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784099_976718084","id":"20180210-002750_1470779955","dateCreated":"2018-02-12T06:26:24+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:462","user":"anonymous","dateFinished":"2018-02-12T06:27:07+0000","dateStarted":"2018-02-12T06:27:04+0000"},{"title":"Read a CSV File into a Dataframe Instance with Headers","text":"pennDf.show(10, truncate=True)","dateUpdated":"2018-02-12T06:27:09+0000","config":{"lineNumbers":true,"tableHide":false,"editorSetting":{"language":"scala","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/scala","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784099_976718084","id":"20180211-233520_399031330","dateCreated":"2018-02-12T06:26:24+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:463","user":"anonymous","dateFinished":"2018-02-12T06:27:10+0000","dateStarted":"2018-02-12T06:27:10+0000"},{"title":"Show Columns","text":"pennDf.columns","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784099_976718084","id":"20180212-045044_481169447","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:464"},{"title":"Check out emuIRN","text":"pennDf.select('emuIRN').distinct().show(10)","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":4,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784099_976718084","id":"20180211-234002_1351508361","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:465"},{"title":"Check out Object Number","text":"pennDf.select('object_number').distinct().show(10)","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":4,"editorMode":"ace/mode/scala","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784100_974794339","id":"20180211-234030_834783507","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:466"},{"title":"Check out 
url","text":"pennDf.select('url').distinct().show(10)","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":4,"editorMode":"ace/mode/scala","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784100_974794339","id":"20180211-234038_1787924327","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:467"},{"title":"Create Our Own Schema & Read the CSV into a Dataframe Instance with our Schema","text":"from pyspark.sql.types import *\n\npennSchema = StructType([\n StructField(\"emuIRN\", IntegerType(), False),\n StructField(\"curatorial_section\", StringType(), True),\n StructField(\"object_number\", StringType(), True),\n StructField(\"object_name\", StringType(), True),\n StructField(\"native_name\", StringType(), True),\n StructField(\"culture\", StringType(), True),\n StructField(\"provenience\", StringType(), True),\n StructField(\"material\", StringType(), True),\n StructField(\"period\", StringType(), True),\n StructField(\"date_made\", StringType(), True),\n StructField(\"date_made_early\", StringType(), True),\n StructField(\"date_made_late\", StringType(), True),\n StructField(\"accession_credit_line\", StringType(), True),\n StructField(\"creator\", StringType(), True),\n StructField(\"description\", StringType(), True),\n StructField(\"manufacture_locationlocus\", StringType(), True),\n StructField(\"culture_area\", StringType(), True),\n StructField(\"technique\", StringType(), True),\n StructField(\"iconography\", StringType(), True),\n StructField(\"measurement_height\", StringType(), True),\n StructField(\"measurement_length\", StringType(), True),\n StructField(\"measurement_width\", StringType(), True),\n StructField(\"measurement_outside_diameter\", StringType(), True),\n StructField(\"measurement_tickness\", StringType(), True),\n StructField(\"measurement_unit\", StringType(), True),\n StructField(\"other_numbers\", StringType(), True),\n StructField(\"url\", StringType(), True)])\n\n\npennDf = spark.read.csv(\"code4lib2018/sample-data/penn\\ museum/all-20180121.csv\", header=True, schema=pennSchema)\npennDf.count()","dateUpdated":"2018-02-12T06:27:17+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784101_974409591","id":"20180205-224139_1405421836","dateCreated":"2018-02-12T06:26:24+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:468","user":"anonymous","dateFinished":"2018-02-12T06:27:19+0000","dateStarted":"2018-02-12T06:27:17+0000"},{"title":"Get Sample Response of our DataFrame","text":"pennDf.sample(False, 0.10, 42).show()","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784101_974409591","id":"20180212-045408_436964083","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:469"},{"title":"","text":"%md\n### Stopping Point: Now you need to walk through, quickly assess, select, and load a CHO CSV into a 
Dataframe.","dateUpdated":"2018-02-12T06:26:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":false,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Stopping Point: Now you need to walk through, quickly assess, select, and load a CHO CSV into a Dataframe.
\n"}]},"apps":[],"jobName":"paragraph_1518416784101_974409591","id":"20180211-204040_1741647870","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:470"},{"title":"Loading CSV into DataFrame Working Space","text":"\n","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784101_974409591","id":"20180212-045445_2022795863","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:471"},{"title":"","text":"%md\n# Part 2: Simple Analysis of our Dataset via Dataframe API","dateUpdated":"2018-02-12T06:26:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":false,"results":{},"enabled":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Part 2: Simple Analysis of our Dataset via Dataframe API
\n"}]},"apps":[],"jobName":"paragraph_1518416784102_975563837","id":"20180211-044238_542100528","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:472"},{"title":"Show Specific Columns","text":"pennDf.select(\"curatorial_section\", \"emuIRN\").show(10)","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":6,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784102_975563837","id":"20180211-044743_1705262796","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:473"},{"title":"Filter Specific Columns ","text":"americanDf = pennDf.filter(pennDf.curatorial_section == 'American')\namericanDf.select('emuIRN', 'url', 'object_name', 'creator').show(10)","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":6,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784102_975563837","id":"20180210-011244_1927422767","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:474"},{"title":"\"Earliest\" Dates for Objects?","text":"pennDf.orderBy(\"date_made_early\").select('emuIRN', 'date_made', 'object_name', 'object_number', 'culture_area').show(10)","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784102_975563837","id":"20180210-212920_1413184855","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:475"},{"title":"\"Earliest\" , Non-Null Dates for Objects?","text":"pennDf.filter(pennDf.date_made.isNotNull()).orderBy(\"date_made_early\").select('emuIRN', 'date_made', 'object_name', 'object_number', 'culture_area').show(10)","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784103_975179088","id":"20180212-050343_962594735","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:476"},{"title":"How Many Items per Curatorial Section?","text":"pennDf.groupBy(\"curatorial_section\").count().show()","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":6,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784103_975179088","id":"20180210-011435_1074589917","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:477"},{"title":"Average Height of Objects by Period?","text":"pennDf.groupBy(\"period\").agg({\"measurement_height\": 
'mean'}).show(10)","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":6,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784103_975179088","id":"20180211-053908_436020978","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:478"},{"title":"Distinct Creators","text":"pennDf.select(\"creator\").distinct().show()","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":4,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784103_975179088","id":"20180210-211959_911280658","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:479"},{"title":"Number of Distinct Creators","text":"pennDf.select(\"creator\").distinct().count()","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":4,"editorMode":"ace/mode/scala","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784103_975179088","id":"20180212-051332_1623401677","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:480"},{"title":"Number of Related Resources per Creator","text":"from pyspark.sql.functions import countDistinct\npennDf.select(\"creator\", \"emuIRN\")\\\n .groupBy(\"creator\")\\\n .agg(countDistinct(\"emuIRN\")\\\n .alias(\"objects\"))\\\n .orderBy(\"objects\", ascending=False).show(10)","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":4,"editorMode":"ace/mode/scala","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784103_975179088","id":"20180212-051343_572426650","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:481"},{"title":"","text":"%md\n### Stopping Point: Use Grouping, Distinct, Ordering, etc. To Get Analytics on Your CHO Dataset.","dateUpdated":"2018-02-12T06:26:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":false,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Stopping Point: Use Grouping, Distinct, Ordering, etc. To Get Analytics on Your CHO Dataset.
\n"}]},"apps":[],"jobName":"paragraph_1518416784104_973255344","id":"20180211-204432_1327378340","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:482"},{"title":"Your Working Space: Analysis of Data","text":"\n","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784104_973255344","id":"20180211-204542_862151313","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:483"},{"title":"","text":"%md\n# Part 3: Creating Derivative DataFrames for your CHO","dateUpdated":"2018-02-12T06:26:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":false,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Part 3: Creating Derivative DataFrames for your CHO
\n"}]},"apps":[],"jobName":"paragraph_1518416784104_973255344","id":"20180211-052933_847596982","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:484"},{"title":"Get DataFrame with Curatorial, ID, Culture, URL, & Period, Dropping Nulls","text":"curatorialOnlyDf = pennDf.select('curatorial_section', 'emuIRN', 'culture', 'url', 'period').dropna()\ncuratorialOnlyDf.sample(False, 0.01, 42).show()","dateUpdated":"2018-02-12T06:27:29+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784104_973255344","id":"20180210-212150_723012495","dateCreated":"2018-02-12T06:26:24+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:485","user":"anonymous","dateFinished":"2018-02-12T06:27:30+0000","dateStarted":"2018-02-12T06:27:29+0000"},{"title":"Get DataFrame with Records that Have no Title, Creator or Description","text":"noDescDf = pennDf.filter(pennDf.creator.isNull()).filter(pennDf.description.isNull()).filter(pennDf.object_name.isNull())\nprint(noDescDf.count())\nnoDescDf.select('description', 'creator', 'emuIRN', 'curatorial_section', 'object_name').show(10, truncate=True)","dateUpdated":"2018-02-12T06:27:32+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784104_973255344","id":"20180211-032722_1642010293","dateCreated":"2018-02-12T06:26:24+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:486","user":"anonymous","dateFinished":"2018-02-12T06:27:33+0000","dateStarted":"2018-02-12T06:27:32+0000"},{"title":"Explode Object Names based on '|' Delimiter","text":"# Note the requirement for the escape character for |\nfrom pyspark.sql.functions import explode\n\nexplodedNamesDf = pennDf.withColumn(\"object_name\", explode(split(\"object_name\", \"\\|\")))\nexplodedNamesDf.show(20)","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784105_972870595","id":"20180210-220422_1187559654","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:487"},{"title":"Create Columns for AD or for BC","text":"pennDf = pennDf.withColumn('AD', pennDf.date_made.like(\"%AD%\"))\npennDf = pennDf.withColumn('BC', pennDf.date_made.like(\"%BC%\"))\npennDf.select('emuIRN', 'AD', 'BC', 'date_made').show()","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{"0":{"graph":{"mode":"table","height":478,"optionOpen":false}}},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784105_972870595","id":"20180210-213550_325903017","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:488"},{"title":"Remove Whitespace around Dates (Bc Why Not)","text":"from pyspark.sql.functions import 
regexp_replace\n\ndateTestDf = pennDf.withColumn(\"date_made_cleaned\", regexp_replace(pennDf.date_made, \"\\s+\", \"\"))\ndateTestDf.select(\"date_made_cleaned\").distinct().show()","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784105_972870595","id":"20180211-035409_418875897","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:489"},{"title":"","text":"%md\n### Stopping Point: Create some Derivative Dataframes with your CHO dataset. Think of what would make a good SQL Query dataset.","dateUpdated":"2018-02-12T06:26:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":false,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Stopping Point: Create some Derivative Dataframes with your CHO dataset. Think of what would make a good SQL Query dataset.
\n"}]},"apps":[],"jobName":"paragraph_1518416784105_972870595","id":"20180211-205149_2050933534","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:490"},{"title":"Your Working Space: Deriving DataFrames","text":"\n","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/scala","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784105_972870595","id":"20180211-205238_1685626655","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:491"},{"title":"","text":"%md\n# Part 4: Using SQL to Analyze our Dataset","dateUpdated":"2018-02-12T06:26:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":false,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Part 4: Using SQL to Analyze our Dataset
\n"}]},"apps":[],"jobName":"paragraph_1518416784106_974024842","id":"20180211-070011_2131671710","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:492"},{"title":"Introduction to SQL Queries ","text":"%md\n[Here](http://cse.unl.edu/~sscott/ShowFiles/SQL/CheatSheet/SQLCheatSheet.html) is a SQL Cheatsheet in case you need it.","dateUpdated":"2018-02-12T06:26:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Here is a SQL Cheatsheet in case you need it.
\n
"}]},"apps":[],"jobName":"paragraph_1518416784106_974024842","id":"20180211-040724_834116241","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:493"},{"title":"Creating a Temporary SQL View from Penn DataFrames","text":"curatorialOnlyDf.createOrReplaceTempView(\"curatorialView\")\nnoDescDf.createOrReplaceTempView(\"noDescView\")","dateUpdated":"2018-02-12T06:27:43+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784106_974024842","id":"20180211-071952_1262010614","dateCreated":"2018-02-12T06:26:24+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:494","user":"anonymous","dateFinished":"2018-02-12T06:27:43+0000","dateStarted":"2018-02-12T06:27:43+0000"},{"title":"Query Penn DataFrame View for Number of Cultures per Curatorial Section","text":"sparkQuery = spark.sql(\"\"\"SELECT curatorial_section, COUNT(DISTINCT(culture)) AS NumCultures FROM curatorialView GROUP BY curatorial_section\"\"\")\nfor n in sparkQuery.collect():\n n","dateUpdated":"2018-02-12T06:30:38+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784106_974024842","id":"20180211-072058_265727285","dateCreated":"2018-02-12T06:26:24+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:495","user":"anonymous","dateFinished":"2018-02-12T06:29:53+0000","dateStarted":"2018-02-12T06:29:50+0000"},{"title":"Query Penn DataFrame View for Number of No Title-Creator-Description Records per Curatorial Section","text":"sparkQuery2 = spark.sql(\"\"\"SELECT curatorial_section, COUNT(DISTINCT(emuIRN)) AS NumResources FROM noDescView GROUP BY curatorial_section\"\"\")\nfor n in sparkQuery.collect():\n n","dateUpdated":"2018-02-12T06:30:27+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784106_974024842","id":"20180212-061229_887304088","dateCreated":"2018-02-12T06:26:24+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:496","user":"anonymous","dateFinished":"2018-02-12T06:30:23+0000","dateStarted":"2018-02-12T06:30:21+0000"},{"title":"","text":"%md\n### Stopping Point: Create a Temporary View & Run a SQL Query","dateUpdated":"2018-02-12T06:26:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":false,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Stopping Point: Create a Temporary View & Run a SQL Query
\n"}]},"apps":[],"jobName":"paragraph_1518416784106_974024842","id":"20180211-205631_117220376","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:497"},{"title":"Your Working Space: Create a Spark View from your Dataframe(s)","text":"\n","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/scala","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784106_974024842","id":"20180211-205653_111316135","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:498"},{"title":"Your Working Space: Run some Spark SQL Queries","text":"\n","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/scala","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784107_973640093","id":"20180211-211110_718731458","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:499"},{"title":"","text":"%md\n# Part 5: Simple Data Visualization","dateUpdated":"2018-02-12T06:26:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":false,"results":{},"enabled":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Part 5: Simple Data Visualization
\n"}]},"apps":[],"jobName":"paragraph_1518416784107_973640093","id":"20180211-072644_1534169002","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:500"},{"text":"%sql \nSELECT curatorial_section, \nCOUNT(DISTINCT(culture)) AS NumCultures \nFROM curatorialView \nGROUP BY curatorial_section","dateUpdated":"2018-02-12T06:31:56+0000","config":{"colWidth":12,"editorMode":"ace/mode/sql","results":{"0":{"graph":{"mode":"multiBarChart","height":300,"optionOpen":true},"helium":{}}},"enabled":true,"editorSetting":{"language":"sql"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784107_973640093","id":"20180211-210732_864456388","dateCreated":"2018-02-12T06:26:24+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:501","user":"anonymous","dateFinished":"2018-02-12T06:31:40+0000","dateStarted":"2018-02-12T06:31:38+0000"},{"title":"","text":"%md\n### Stopping Point: From your Temporary Views, Create a Simple Viz","dateUpdated":"2018-02-12T06:26:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":false,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Stopping Point: From your Temporary Views, Create a Simple Viz
\n"}]},"apps":[],"jobName":"paragraph_1518416784107_973640093","id":"20180211-211201_433698257","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:502"},{"title":"Your Working Space: Spark View => SQL Query & Viz","text":"%sql\n\n","dateUpdated":"2018-02-12T06:26:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"sql","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/sql","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784107_973640093","id":"20180211-211029_1012657635","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:503"},{"text":"%md\n","dateUpdated":"2018-02-12T06:26:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518416784107_973640093","id":"20180211-080404_990095708","dateCreated":"2018-02-12T06:26:24+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:504"}],"name":"Spark Workshop: CHO Data","id":"2D6RYGTKS","angularObjects":{"2D8F6TC39:shared_process":[],"2D7NXVY4B:shared_process":[],"2D8A6MQRC:shared_process":[],"2D5PJ3NSA:shared_process":[],"2D6EP7ZA2:shared_process":[],"2D5TJGRY4:shared_process":[],"2D5VZHKP9:shared_process":[],"2D89YMPRJ:shared_process":[],"2D72HVDPJ:shared_process":[],"2D6DB2GSE:shared_process":[],"2D7ZKW5Z8:shared_process":[],"2D6YB5W7T:shared_process":[],"2D6DKU8MK:shared_process":[],"2D4ZW7PCP:shared_process":[],"2D566ZBPK:shared_process":[],"2D866A3V3:shared_process":[],"2D7TTRTEG:shared_process":[],"2D71TQHNV:shared_process":[],"2D57SSSRY:shared_process":[]},"config":{"looknfeel":"default","personalizedMode":"false"},"info":{}}
--------------------------------------------------------------------------------
/notebooks/SparkWorkshopShellIntro.json:
--------------------------------------------------------------------------------
1 | {"paragraphs":[{"title":"Getting Started with Spark - Spark Workshop, Code4Lib 2018","text":"%md\nSpark Workshop, code4lib 2018\nMaterials at [github.com/spark4lib/code4lib2018](https://github.com/spark4lib/code4lib2018)\n\nThe following steps should also be available, with other notes, here: [https://github.com/spark4lib/code4lib2018/tree/master/worksheets/](https://github.com/spark4lib/code4lib2018/tree/master/worksheets) .","dateUpdated":"2018-02-12T01:59:09+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Spark Workshop, code4lib 2018
Materials at github.com/spark4lib/code4lib2018
\n
The following steps should also be available, with other notes, here: https://github.com/spark4lib/code4lib2018/tree/master/worksheets/ .
\n
"}]},"apps":[],"jobName":"paragraph_1518400749666_1428627749","id":"20180131-144721_1911570621","dateCreated":"2018-02-12T01:59:09+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:5397"},{"title":"Introduction","text":"%md\nIn this session, you will learn Zeppelin, Spark, & Spark SQL basics via primarily the DataFrames API. We are starting with a really simple dataset to focus on the tools. In the next session, we will use an expanded cultural heritage metadata set.\n\n#### Datasets? DataFrames?\n\nA **Dataset** is a distributed collection of data. Dataset provides the benefits of strong typing, ability to use powerful lambda functions with the benefits of (Spark SQL’s) optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java.\n\nA **DataFrame** is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, DataFrame is simply a type alias of Dataset[Row]. (Note that in Scala type parameters (generics) are enclosed in square brackets.)\n\n[[source](http://spark.apache.org/docs/2.0.0/sql-programming-guide.html#datasets-and-dataframes)]\n\n#### How to run a paragraph\n\nTo run a paragraph in a Zeppelin notebook, you can either click the `play` button (blue triangle) on the right-hand side or click on the paragraph simply press `Shift + Enter`.\n\n#### What are Zeppelin Interpreters?\n\nIn the following paragraphs we are going to execute Spark code, run shell commands to download and move files, run sql queries etc. Each paragraph will start with `%` followed by an interpreter name, e.g. `%spark2` for a Spark 2.x interpreter. Different interpreter names indicate what will be executed: code, markdown, html etc. This allows you to perform data ingestion, munging, wrangling, visualization, analysis, processing and more, all in one place!\n\nThroughout this notebook we will use the following interpreters:\n\n- `` - the default interpreter is set to be PySpark for our Workshop Docker Container.\n- `%spark` - Spark interpreter to run Spark code written in Scala\n- `%spark.sql` - Spark SQL interprter (to execute SQL queries against temporary tables in Spark)\n- `%sh` - Shell interpreter to run shell commands\n- `%angular` - Angular interpreter to run Angular and HTML code\n- `%md` - Markdown for displaying formatted text, links, and images\n\nTo learn more about Zeppelin interpreters check out this [link](https://zeppelin.apache.org/docs/0.5.6-incubating/manual/interpreters.html).","dateUpdated":"2018-02-12T01:59:09+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
In this session, you will learn Zeppelin, Spark, & Spark SQL basics via primarily the DataFrames API. We are starting with a really simple dataset to focus on the tools. In the next session, we will use an expanded cultural heritage metadata set.
\n
Datasets? DataFrames?
\n
A Dataset is a distributed collection of data. Dataset provides the benefits of strong typing, ability to use powerful lambda functions with the benefits of (Spark SQL’s) optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java.
\n
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, DataFrame is simply a type alias of Dataset[Row]. (Note that in Scala type parameters (generics) are enclosed in square brackets.)
\n
[source]
\n
How to run a paragraph
\n
To run a paragraph in a Zeppelin notebook, you can either click the play
button (blue triangle) on the right-hand side or click on the paragraph simply press Shift + Enter
.
\n
What are Zeppelin Interpreters?
\n
In the following paragraphs we are going to execute Spark code, run shell commands to download and move files, run sql queries etc. Each paragraph will start with %
followed by an interpreter name, e.g. %spark2
for a Spark 2.x interpreter. Different interpreter names indicate what will be executed: code, markdown, html etc. This allows you to perform data ingestion, munging, wrangling, visualization, analysis, processing and more, all in one place!
\n
Throughout this notebook we will use the following interpreters:
\n
\n - `` - the default interpreter is set to be PySpark for our Workshop Docker Container.
\n %spark
- Spark interpreter to run Spark code written in Scala \n %spark.sql
- Spark SQL interprter (to execute SQL queries against temporary tables in Spark) \n %sh
- Shell interpreter to run shell commands \n %angular
- Angular interpreter to run Angular and HTML code \n %md
- Markdown for displaying formatted text, links, and images \n
\n
To learn more about Zeppelin interpreters check out this link.
\n
"}]},"apps":[],"jobName":"paragraph_1518400749675_1425165009","id":"20180211-041642_2058564180","dateCreated":"2018-02-12T01:59:09+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:5398"},{"title":"Environment Set-up Test","text":"%spark\n\nprintln(\"Spark Version: \" + sc.version)\n\nval penndata = sc.textFile(\"penn.csv\")\nval cmoadata = sc.textFile(\"cmoa.csv\")\nprintln(\"penn count: \" + penndata.count)\nprintln(\"cmoadata count: \" + cmoadata.count)","user":"anonymous","dateUpdated":"2018-02-12T02:26:47+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749675_1425165009","id":"20180131-144538_432654498","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T02:26:47+0000","dateFinished":"2018-02-12T02:27:07+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5399","errorMessage":""},{"title":"Preview Sample Data","text":"%sh\ncat code4lib2018/sample-data/small-sample.csv","user":"anonymous","dateUpdated":"2018-02-12T02:29:11+0000","config":{"lineNumbers":true,"tableHide":false,"editorSetting":{"language":"sh","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/sh","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749676_1423241265","id":"20180210-010216_1168213332","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T02:29:11+0000","dateFinished":"2018-02-12T02:29:12+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5400","errorMessage":""},{"title":"Our Sample Data","text":"%md\n\nName | Institution Type | Format | URL | Description | Informal Score\n---- | ---------------- | ------ | --- | ----------- | --------------\nCMOA | Museum | \"JSON, CSV\" | https://github.com/cmoa/collection | | 10\nPenn Museum | Museum | \"JSON, CSV, XML\" | https://www.penn.museum/collections/data.php | JSON is poorly structured | 7\nMet Museum | Museum | CSV | | \"¯\\_(ツ)_/¯\n\" | 3\nDigitalNZ Te Puna Web Directory | Library | XML | https://natlib.govt.nz/files/data/tepunawebdirectory.xml | MARC XML | 3\nCanadian Subject Headings | Library | RDF/XML | http://www.collectionscanada.gc.ca/obj/900/f11/040004/csh.rdf | \"Ugh, rdf\" | 4\nDPLA | Aggregator | \"CSV,JSON,XML\" | dp.la | | 100\n","user":"anonymous","dateUpdated":"2018-02-12T02:29:18+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749676_1423241265","id":"20180211-043610_1714070353","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T02:29:18+0000","dateFinished":"2018-02-12T02:29:20+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5401","errorMessage":""},{"title":"","text":"%md\n# Part 1: Creating a Dataframe Instance from 
CSV","user":"anonymous","dateUpdated":"2018-02-12T02:29:27+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":false,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Part 1: Creating a Dataframe Instance from CSV
\n"}]},"apps":[],"jobName":"paragraph_1518400749676_1423241265","id":"20180211-043612_1888524633","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T02:29:27+0000","dateFinished":"2018-02-12T02:29:27+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5402"},{"title":"Read a CSV File into a Dataframe Instance & View It","text":"print(spark.read.csv(\"small-sample.csv\"))\nspark.read.csv(\"small-sample.csv\").show()","user":"anonymous","dateUpdated":"2018-02-12T02:30:53+0000","config":{"lineNumbers":true,"tableHide":false,"editorSetting":{"language":"scala","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/scala","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749677_1422856516","id":"20180205-215000_1411168291","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T02:30:53+0000","dateFinished":"2018-02-12T02:30:54+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5403","errorMessage":""},{"title":"Read a CSV File into a Dataframe Instance with Headers","text":"spark.read.csv(\"code4lib2018/sample-data/small-sample.csv\", header=True).show()","user":"anonymous","dateUpdated":"2018-02-12T03:02:57+0000","config":{"lineNumbers":true,"tableHide":false,"editorSetting":{"language":"scala","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/scala","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749677_1422856516","id":"20180210-002750_1470779955","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T03:02:57+0000","dateFinished":"2018-02-12T03:02:57+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5404","errorMessage":""},{"title":"Read a CSV File into a Dataframe Instance with DROPMALFORMED mode","text":"spark.read.csv(\"code4lib2018/sample-data/small-sample.csv\", header=True, mode=\"DROPMALFORMED\").show()","user":"anonymous","dateUpdated":"2018-02-12T03:04:01+0000","config":{"lineNumbers":true,"tableHide":false,"editorSetting":{"language":"scala","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/scala","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749677_1422856516","id":"20180210-002834_476482959","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T03:04:01+0000","dateFinished":"2018-02-12T03:04:01+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5405","errorMessage":""},{"title":"Read a CSV File into a Dataframe Instance with multiLine...?","text":"spark.read.csv(\"code4lib2018/sample-data/small-sample.csv\", header=True, multiLine=True).show()","user":"anonymous","dateUpdated":"2018-02-12T03:06:51+0000","config":{"lineNumbers":true,"tableHide":false,"editorSetting":{"language":"scala","editOnDblClick":false},"colWidth":6,"editorMode":"ace/mode/scala","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749678_1424010762","id":"20180210-002956_613592773","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T03:06:51+0000","dateFinished":"2018-02-12T03:06:51+0000","status":"ERROR","progressUpdateIntervalMs":500,"$$hashKey":"object:5406","errorMessage":""},{"title":"Temporary Python Repair of CSV Newlines within PySpark","text":"import csv\n\n# we 
will see a cleaner way using pyspark to handle this issue later\n# though really, use proper quoting with Spark 2.2.x & multifile=True\n\nwith open('code4lib2018/sample-data/small-sample.csv') as fh:\n test = csv.reader(fh)\n with open('code4lib2018/sample-data/small-sample-stripped.csv', 'w') as fout:\n test_write = csv.writer(fout, quoting=csv.QUOTE_ALL)\n for row in test:\n new_row = [val.replace(\"\\r\\n\", \"\") for val in row]\n test_write.writerow(new_row)","user":"anonymous","dateUpdated":"2018-02-12T03:17:59+0000","config":{"lineNumbers":true,"tableHide":false,"editorSetting":{"language":"scala","editOnDblClick":false},"colWidth":6,"editorMode":"ace/mode/scala","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749678_1424010762","id":"20180210-004851_951565379","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T03:17:59+0000","dateFinished":"2018-02-12T03:17:59+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5407","errorMessage":""},{"title":"Check that Our Temporary Repair Worked for our CSV","text":"%sh\ncat 'code4lib2018/sample-data/small-sample-stripped.csv'","user":"anonymous","dateUpdated":"2018-02-12T03:18:40+0000","config":{"lineNumbers":true,"editorSetting":{"language":"sh","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/sh","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749678_1424010762","id":"20180210-004958_2013545188","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T03:18:40+0000","dateFinished":"2018-02-12T03:18:40+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5408","errorMessage":""},{"title":"Revisit our CSV as a Dataframe & It's Inferred Schema","text":"spark.read.csv(\"code4lib2018/sample-data/small-sample-stripped.csv\", header=True, inferSchema=True).printSchema()","user":"anonymous","dateUpdated":"2018-02-12T03:21:08+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749679_1423626013","id":"20180210-005045_657129235","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T03:21:08+0000","dateFinished":"2018-02-12T03:21:08+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5409","errorMessage":""},{"title":"Create Our Own Schema & Read the CSV into a Dataframe Instance with our Schema","text":"from pyspark.sql.types import *\n\ncustomSchema = StructType([\n StructField(\"Name\", StringType(), True),\n StructField(\"Institution Type\", StringType(), True),\n StructField(\"Format\", StringType(), True),\n StructField(\"URL\", StringType(), True),\n StructField(\"Description\", StringType(), True),\n StructField(\"Informal Score\", DecimalType(), True)])\n\nsampleDf = spark.read.csv(\"code4lib2018/sample-data/small-sample-stripped.csv\", header=True, 
schema=customSchema)\nsampleDf.show()","user":"anonymous","dateUpdated":"2018-02-12T03:29:52+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749679_1423626013","id":"20180205-224139_1405421836","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T03:29:52+0000","dateFinished":"2018-02-12T03:29:53+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5410","errorMessage":""},{"title":"","text":"%md\n### Stopping Point: Have you been able to walk through and load your CSV?","user":"anonymous","dateUpdated":"2018-02-12T03:21:43+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":false,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Stopping Point: Have you been able to walk through and load your CSV?
\n"}]},"apps":[],"jobName":"paragraph_1518400749679_1423626013","id":"20180211-204040_1741647870","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T03:21:43+0000","dateFinished":"2018-02-12T03:21:43+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5411"},{"title":"","text":"%md\n# Part 2: Simple Analysis of our Dataset via Dataframe API","user":"anonymous","dateUpdated":"2018-02-12T03:36:43+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":false,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Part 2: Simple Analysis of our Dataset via Dataframe API
\n"}]},"apps":[],"jobName":"paragraph_1518400749680_1434014234","id":"20180211-044238_542100528","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T03:36:43+0000","dateFinished":"2018-02-12T03:36:43+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5412"},{"title":"Return First n Rows in your Dataframe ( .head(n) )","text":"sampleDf.head(2)","user":"anonymous","dateUpdated":"2018-02-12T03:43:59+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749681_1433629485","id":"20180210-203445_1454325544","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T03:37:41+0000","dateFinished":"2018-02-12T03:37:41+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5413","errorMessage":""},{"title":"Show Columns ( .columns() ) & Number of Columns ( .len() ) in your Dataframe","text":"print(sampleDf.columns)\nprint(len(sampleDf.columns))","user":"anonymous","dateUpdated":"2018-02-12T04:36:04+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749681_1433629485","id":"20180210-203724_1788119369","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T03:39:49+0000","dateFinished":"2018-02-12T03:39:49+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5414","errorMessage":""},{"title":"Show Specific Columns ( .select() )","text":"sampleDf.select(\"Name\").show()\nsampleDf.select(\"Name\", \"Informal Score\").show()","user":"anonymous","dateUpdated":"2018-02-12T03:46:48+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":6,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749681_1433629485","id":"20180211-044743_1705262796","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T03:46:48+0000","dateFinished":"2018-02-12T03:46:48+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5415","errorMessage":""},{"title":"Selecting Columns ( .select() ) versus Apply Filters ( .filter() )","text":"scoresBooleanDf = sampleDf.select('Name', 'URL', sampleDf['Informal Score'] > 5)\nscoresBooleanDf.show()\n\nhighScoresDf = sampleDf.filter(sampleDf['Informal Score'] > 5)\nhighScoresDf.select('Name', 'URL', 'Informal Score').show()","user":"anonymous","dateUpdated":"2018-02-12T03:47:38+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":6,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749682_1434783732","id":"20180210-011244_1927422767","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T03:47:38+0000","dateFinished":"2018-02-12T03:47:39+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5416","errorMessage":""},{"title":"How Many Rows? 
( .count() )","text":"numSampleDf = sampleDf.count()\nnumHighScoresDf = highScoresDf.count()\n\nprint(\"Percentage of Datasets with a Score at or above 5: \" + str(float(numHighScoresDf)/float(numSampleDf)*100) + \"%\")","user":"anonymous","dateUpdated":"2018-02-12T03:52:47+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749682_1434783732","id":"20180205-225126_643532571","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T03:52:47+0000","dateFinished":"2018-02-12T03:52:48+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5417","errorMessage":""},{"title":"Show First n Rows of Dataframe in Show View ( .show(n) )","text":"sampleDf.show(2,truncate=True)","user":"anonymous","dateUpdated":"2018-02-12T03:53:59+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749682_1434783732","id":"20180210-203520_832539356","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T03:53:59+0000","dateFinished":"2018-02-12T03:53:59+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5418","errorMessage":""},{"title":"Show Stats Summary of your Dataframe ( .describe() )","text":"sampleDf.describe().show()","user":"anonymous","dateUpdated":"2018-02-12T03:54:18+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749682_1434783732","id":"20180210-203742_684138715","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T03:54:18+0000","dateFinished":"2018-02-12T03:54:19+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5419","errorMessage":""},{"title":"Show Stats Summary of a Specific Column ( .describe(colName) , DecimalType)","text":"sampleDf.describe('Informal Score').show()","user":"anonymous","dateUpdated":"2018-02-12T03:55:20+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":6,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749683_1434398983","id":"20180210-211801_243937278","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T03:55:20+0000","dateFinished":"2018-02-12T03:55:21+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5420","errorMessage":""},{"title":"Show Stats Summary of a Specific Column ( .describe(colName) , StringType)","text":"sampleDf.describe('URL').show()","user":"anonymous","dateUpdated":"2018-02-12T03:55:23+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":6,"editorMode":"ace/mode/scala","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749683_1434398983","id":"20180211-073524_2028466772","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T03:55:23+0000","dateFinished":"2018-02-12T03:55:23+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5421","errorMessage":""},{"title":"","text":"%md\n### 
Stopping Point: Have you been able to walk through reviewing some Rows & Columns?","user":"anonymous","dateUpdated":"2018-02-12T03:34:43+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":false,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Stopping Point: Have you been able to walk through reviewing some Rows & Columns?
\n"}]},"apps":[],"jobName":"paragraph_1518400749683_1434398983","id":"20180211-204350_627919107","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T03:34:43+0000","dateFinished":"2018-02-12T03:34:43+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5422"},{"title":"Your Working Space: Analysis of Data I","text":"\n","user":"anonymous","dateUpdated":"2018-02-12T03:55:44+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/scala","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749683_1434398983","id":"20180211-204610_53073346","dateCreated":"2018-02-12T01:59:09+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:5423"},{"title":"Order Dataframe Views by a Column ( .orderBy(colName) | .orderBy(colName).desc() )","text":"sampleDf.orderBy(\"Informal Score\").show()\nsampleDf.orderBy(sampleDf[\"Informal Score\"].desc()).show()","user":"anonymous","dateUpdated":"2018-02-12T03:56:01+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749683_1434398983","id":"20180210-212920_1413184855","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T03:56:01+0000","dateFinished":"2018-02-12T03:56:01+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5424","errorMessage":""},{"title":"Group Values by a Column ( .groupBy(colName) ) ","text":"URLGroupDf = sampleDf.groupBy(\"Institution Type\").count()\n\nURLGroupDf.show()\nURLGroupDf.printSchema()","user":"anonymous","dateUpdated":"2018-02-12T03:59:43+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":6,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749683_1434398983","id":"20180210-011435_1074589917","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T03:59:43+0000","dateFinished":"2018-02-12T03:59:44+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5425","errorMessage":""},{"title":"Group Values by a Column then Compute an Aggregation ( .agg(expr) )","text":"# Note, this is setting us up for Splitting Values later on\n\nformatGroupDf = sampleDf.groupBy(\"Institution Type\").agg({\"Informal Score\": 'mean'})\nformatGroupDf.show()\nformatGroupDf.printSchema()","user":"anonymous","dateUpdated":"2018-02-12T04:00:39+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":6,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749684_1432475238","id":"20180211-053908_436020978","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T04:00:39+0000","dateFinished":"2018-02-12T04:00:40+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5426","errorMessage":""},{"title":"Show & Count Distinct Values in a Selected Column ( .distinct() )","text":"# Note, this is also setting us up for Splitting Values later on\n\ndistinctFormatDf = 
sampleDf.select(\"Format\").distinct()\ndistinctFormatDf.show()\ndistinctFormatDf.count()","user":"anonymous","dateUpdated":"2018-02-12T04:03:33+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749684_1432475238","id":"20180210-211959_911280658","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T04:03:33+0000","dateFinished":"2018-02-12T04:03:34+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5427","errorMessage":""},{"title":"Use Aggregates ( .agg() ) & Sql Functions for Number of Distinct Values in Format ( countDistinct(colName) )","text":"from pyspark.sql.functions import countDistinct\n\ncountDistinctDF = sampleDf.select(\"Name\", \"Institution Type\", \"Format\").groupBy(\"Institution Type\").agg(countDistinct(\"Format\"))\ncountDistinctDF.show()","user":"anonymous","dateUpdated":"2018-02-12T04:05:46+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749684_1432475238","id":"20180211-035942_1079861047","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T04:04:57+0000","dateFinished":"2018-02-12T04:04:59+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5428","errorMessage":""},{"title":"","text":"%md\n### Stopping Point: Have you been able to group, order, get distinct values by column(s)?","user":"anonymous","dateUpdated":"2018-02-12T04:04:27+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":false,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Stopping Point: Have you been able to group, order, get distinct values by column(s)?
\n"}]},"apps":[],"jobName":"paragraph_1518400749684_1432475238","id":"20180211-204432_1327378340","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T04:04:27+0000","dateFinished":"2018-02-12T04:04:27+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5429"},{"title":"Your Working Space: Analysis of Data II","text":"\n","user":"anonymous","dateUpdated":"2018-02-12T04:04:29+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749684_1432475238","id":"20180211-204542_862151313","dateCreated":"2018-02-12T01:59:09+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:5430"},{"title":"","text":"%md\n# Part 3: Creating New Dataframes to Expand or Rework our Original Dataset","user":"anonymous","dateUpdated":"2018-02-12T04:07:07+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":false,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Part 3: Creating New Dataframes to Expand or Rework our Original Dataset
\n"}]},"apps":[],"jobName":"paragraph_1518400749684_1432475238","id":"20180211-052933_847596982","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T04:07:07+0000","dateFinished":"2018-02-12T04:07:07+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5431"},{"title":"Create a Dataframe of DeDuplicated Values ( .dropDuplicates() )","text":"scoresDf = sampleDf.select('Informal Score').dropDuplicates()\nscoresDf.show()","user":"anonymous","dateUpdated":"2018-02-12T04:09:24+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749684_1432475238","id":"20180210-212128_106635954","dateCreated":"2018-02-12T01:59:09+0000","dateStarted":"2018-02-12T04:09:25+0000","dateFinished":"2018-02-12T04:09:25+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5432","errorMessage":""},{"title":"Create a Dataframe with Null Values Dropped ( .dropna() ) or Null Values Filled ( .fillna(fillerVal) )","text":"noNullDf = sampleDf.select('Name', 'URL').dropna()\nnoNullDf.show()\n\nnonNullDf = sampleDf.select('Name', 'URL').fillna('No URL')\nnonNullDf.show()","dateUpdated":"2018-02-12T04:09:31+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749684_1432475238","id":"20180210-212150_723012495","dateCreated":"2018-02-12T01:59:09+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:5433","user":"anonymous","dateFinished":"2018-02-12T04:09:32+0000","dateStarted":"2018-02-12T04:09:32+0000"},{"title":"Create Dataframe where URL is Null ( col.isNull() && .filter() )","text":"filterNonNullDF = sampleDf.filter(sampleDf.URL.isNull()).sort(\"Name\")\nfilterNonNullDF.show()","dateUpdated":"2018-02-12T04:10:55+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749685_1432090489","id":"20180211-032722_1642010293","dateCreated":"2018-02-12T01:59:09+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:5434","user":"anonymous","dateFinished":"2018-02-12T04:10:42+0000","dateStarted":"2018-02-12T04:10:42+0000"},{"title":"","text":"%md\n### Stopping Point: Create some Derivative Dataframes","dateUpdated":"2018-02-12T04:10:40+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":false,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749685_1432090489","id":"20180211-205119_924209624","dateCreated":"2018-02-12T01:59:09+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5435","user":"anonymous","dateFinished":"2018-02-12T04:10:40+0000","dateStarted":"2018-02-12T04:10:40+0000","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Stopping Point: Create some Derivative Dataframes
\n"}]}},{"title":"Your Working Space: Deriving Data I","text":"\n","dateUpdated":"2018-02-12T04:12:08+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/scala","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749685_1432090489","id":"20180211-204757_140350285","dateCreated":"2018-02-12T01:59:09+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:5436","user":"anonymous"},{"title":"Create a new DataFrame with 'Format Array' column ( .withColumn(colName, colValue) ) of Split Values ( split(col, delimiter) )","text":"from pyspark.sql.functions import split\n\nsampleDf = sampleDf.withColumn(\"Format Array\", split(sampleDf.Format, \",\"))\nsampleDf.show()\nsampleDf.printSchema()","dateUpdated":"2018-02-12T04:12:11+0000","config":{"lineNumbers":true,"tableHide":false,"editorSetting":{"language":"scala","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/scala","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749685_1432090489","id":"20180211-054754_1895175973","dateCreated":"2018-02-12T01:59:09+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:5437","user":"anonymous","dateFinished":"2018-02-12T04:12:11+0000","dateStarted":"2018-02-12T04:12:11+0000"},{"title":"Use pyspark.sql.functions Explode ( explode(col) ) & Split ( .split(colName, delimiter) ) to Create a Row for each unique Format Value","text":"from pyspark.sql.functions import explode\n\nsampleDf = sampleDf.withColumn(\"Format\", explode(split(\"Format\", \",\")))\nsampleDf.show()","dateUpdated":"2018-02-12T04:14:17+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749685_1432090489","id":"20180210-220422_1187559654","dateCreated":"2018-02-12T01:59:09+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:5438","user":"anonymous","dateFinished":"2018-02-12T04:14:17+0000","dateStarted":"2018-02-12T04:14:17+0000"},{"title":"Create New Columns ( withColumn(colName, colValue) ) with Boolean if Format Type Present ( .like(Regex) )","text":"sampleDf.select(\"Format\").distinct().show()\nsampleDf = sampleDf.withColumn('CSV', sampleDf.Format.like(\"%CSV%\"))\nsampleDf = sampleDf.withColumn('XML', sampleDf.Format.like(\"%XML%\"))\nsampleDf = sampleDf.withColumn('RDF', sampleDf.Format.like(\"%RDF%\"))\nsampleDf = sampleDf.withColumn('JSON', sampleDf.Format.like(\"%JSON%\"))\nsampleDf.show()","dateUpdated":"2018-02-12T04:15:22+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{"0":{"graph":{"mode":"table","height":478,"optionOpen":false}}},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749685_1432090489","id":"20180210-213550_325903017","dateCreated":"2018-02-12T01:59:09+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:5439","user":"anonymous","dateFinished":"2018-02-12T04:15:23+0000","dateStarted":"2018-02-12T04:15:22+0000"},{"title":"Transform our Dataframe to RDD ( .rdd ) & Map a function ( .map(lambda) 
)","text":"sampleRdd = sampleDf.select(\"Format\").rdd.map(lambda x: x[0].split(\",\"))\nsampleRdd.take(5)","dateUpdated":"2018-02-12T04:16:25+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749686_1433244736","id":"20180211-055309_116193468","dateCreated":"2018-02-12T01:59:09+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:5440","user":"anonymous","dateFinished":"2018-02-12T04:16:25+0000","dateStarted":"2018-02-12T04:16:25+0000"},{"title":"Drop the Previous Format Array Column","text":"sampleDf = sampleDf.drop('Format Array')\nsampleDf.show()","dateUpdated":"2018-02-12T04:16:53+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749686_1433244736","id":"20180210-213416_1825779467","dateCreated":"2018-02-12T01:59:09+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:5441","user":"anonymous","dateFinished":"2018-02-12T04:16:53+0000","dateStarted":"2018-02-12T04:16:53+0000"},{"title":"Check our Distinct Format Values Now","text":"sampleDf.select(\"Format\").distinct().show()","dateUpdated":"2018-02-12T04:17:20+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749686_1433244736","id":"20180211-060909_1786306955","dateCreated":"2018-02-12T01:59:09+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:5442","user":"anonymous","dateFinished":"2018-02-12T04:17:20+0000","dateStarted":"2018-02-12T04:17:20+0000"},{"title":"Remove Whitespace around Format using Regex ( reqexp_replace(colName, regex to replace, new value) )","text":"from pyspark.sql.functions import regexp_replace\n\nsampleDf = sampleDf.withColumn(\"Format\", regexp_replace(sampleDf.Format, \"\\s+\", \"\"))\nsampleDf.select(\"Format\").distinct().show()","dateUpdated":"2018-02-12T04:17:27+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749686_1433244736","id":"20180211-035409_418875897","dateCreated":"2018-02-12T01:59:09+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:5443","user":"anonymous","dateFinished":"2018-02-12T04:17:27+0000","dateStarted":"2018-02-12T04:17:27+0000"},{"title":"Create Dataframe where Format is CSV ( .where(conditional) )","text":"CSVsampleDf = sampleDf.where((sampleDf.Format == 
\"CSV\"))\nCSVsampleDf.show()","dateUpdated":"2018-02-12T04:17:59+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{"0":{"graph":{"mode":"table","height":181,"optionOpen":false}}},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749687_1432859987","id":"20180210-215449_392462777","dateCreated":"2018-02-12T01:59:09+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:5444","user":"anonymous","dateFinished":"2018-02-12T04:17:59+0000","dateStarted":"2018-02-12T04:17:59+0000"},{"title":"","text":"%md\n### Stopping Point: Create some Derivative Dataframes to Address Multivalue Format or related Issues","dateUpdated":"2018-02-12T04:17:58+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":false,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749687_1432859987","id":"20180211-205149_2050933534","dateCreated":"2018-02-12T01:59:09+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5445","user":"anonymous","dateFinished":"2018-02-12T04:17:58+0000","dateStarted":"2018-02-12T04:17:58+0000","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Stopping Point: Create some Derivative Dataframes to Address Multivalue Format or related Issues
\n"}]}},{"title":"Your Working Space: Deriving Data II","text":"\n","dateUpdated":"2018-02-12T01:59:09+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/scala","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749687_1432859987","id":"20180211-205238_1685626655","dateCreated":"2018-02-12T01:59:09+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:5446"},{"title":"","text":"%md\n# Part 4: Using SQL to Analyze our Dataset","dateUpdated":"2018-02-12T04:18:51+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":false,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749687_1432859987","id":"20180211-070011_2131671710","dateCreated":"2018-02-12T01:59:09+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5447","user":"anonymous","dateFinished":"2018-02-12T04:18:51+0000","dateStarted":"2018-02-12T04:18:51+0000","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Part 4: Using SQL to Analyze our Dataset
\n"}]}},{"title":"Introduction to SQL Queries ","text":"%md\nTo have a more dynamic experience, let’s create a temporary (in-memory) view that we can query against and interact with the resulting data in a table or graph format. The temporary view will allow us to execute SQL queries against it.\n\nNote that the temporary view will reside in memory as long as the Spark session is alive. [Here](http://cse.unl.edu/~sscott/ShowFiles/SQL/CheatSheet/SQLCheatSheet.html) is a SQL Cheatsheet in case you need it.","dateUpdated":"2018-02-12T04:19:26+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749687_1432859987","id":"20180211-040724_834116241","dateCreated":"2018-02-12T01:59:09+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5448","user":"anonymous","dateFinished":"2018-02-12T04:19:26+0000","dateStarted":"2018-02-12T04:19:26+0000","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
To have a more dynamic experience, let’s create a temporary (in-memory) view that we can query against and interact with the resulting data in a table or graph format. The temporary view will allow us to execute SQL queries against it.
\n
Note that the temporary view will reside in memory as long as the Spark session is alive. Here is a SQL Cheatsheet in case you need it.
\n
"}]}},{"title":"Creating a Temporary SQL View from our Sample Dataframe & Running a Simple SQL Query","text":"# Convert SampleDF DataFrame to a temporary view\nsampleDf.createOrReplaceTempView(\"sampleDataView\")\n\nsparkQuery = spark.sql(\"SELECT * FROM sampleDataView LIMIT 20\")\nsparkQuery.collect()","dateUpdated":"2018-02-12T04:19:57+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749687_1432859987","id":"20180211-071952_1262010614","dateCreated":"2018-02-12T01:59:09+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:5449","user":"anonymous","dateFinished":"2018-02-12T04:19:57+0000","dateStarted":"2018-02-12T04:19:57+0000"},{"title":"Querying our Temporary View for Number of Distinct Formats per Institution Type","text":"sparkQuery = spark.sql(\"\"\"SELECT `Institution Type`, COUNT(DISTINCT(Format)) AS NumFormats FROM sampleDataView GROUP BY `Institution Type`\"\"\")\nsparkQuery.collect()\nfor n in sparkQuery.collect():\n n","dateUpdated":"2018-02-12T04:21:06+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala"},"colWidth":12,"editorMode":"ace/mode/scala","title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749687_1432859987","id":"20180211-072058_265727285","dateCreated":"2018-02-12T01:59:09+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:5450","user":"anonymous","dateFinished":"2018-02-12T04:21:07+0000","dateStarted":"2018-02-12T04:21:06+0000"},{"title":"","text":"%md\n### Stopping Point: Create a Temporary View & Run a SQL Query","dateUpdated":"2018-02-12T04:21:04+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":false,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749688_1430936243","id":"20180211-205631_117220376","dateCreated":"2018-02-12T01:59:09+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5451","user":"anonymous","dateFinished":"2018-02-12T04:21:04+0000","dateStarted":"2018-02-12T04:21:04+0000","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Stopping Point: Create a Temporary View & Run a SQL Query
\n"}]}},{"title":"Your Working Space: Create a Spark View from your Dataframe(s)","text":"\n","dateUpdated":"2018-02-12T01:59:09+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/scala","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749688_1430936243","id":"20180211-205653_111316135","dateCreated":"2018-02-12T01:59:09+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:5452"},{"title":"Your Working Space: Run some Spark SQL Queries","text":"\n","dateUpdated":"2018-02-12T01:59:09+0000","config":{"lineNumbers":true,"editorSetting":{"language":"scala","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/scala","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749688_1430936243","id":"20180211-211110_718731458","dateCreated":"2018-02-12T01:59:09+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:5453"},{"title":"","text":"%md\n# Part 5: Simple Data Visualization","dateUpdated":"2018-02-12T04:22:00+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":false,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749689_1430551494","id":"20180211-072644_1534169002","dateCreated":"2018-02-12T01:59:09+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5454","user":"anonymous","dateFinished":"2018-02-12T04:22:00+0000","dateStarted":"2018-02-12T04:22:00+0000","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Part 5: Simple Data Visualization
\n"}]}},{"text":"%sql \nSELECT `Institution Type`, COUNT(DISTINCT(Format)) AS NumFormats\nFROM sampleDataView\nGROUP BY `Institution Type`","dateUpdated":"2018-02-12T04:23:41+0000","config":{"colWidth":12,"editorMode":"ace/mode/sql","results":{"0":{"graph":{"mode":"table","height":300,"optionOpen":true},"helium":{}}},"enabled":true,"editorSetting":{"language":"sql"}},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749689_1430551494","id":"20180211-210732_864456388","dateCreated":"2018-02-12T01:59:09+0000","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:5455","user":"anonymous","dateFinished":"2018-02-12T04:23:31+0000","dateStarted":"2018-02-12T04:23:30+0000"},{"title":"","text":"%md\n### Stopping Point: From your Temporary Views, Create a Simple Viz","dateUpdated":"2018-02-12T04:23:24+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":false,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749689_1430551494","id":"20180211-211201_433698257","dateCreated":"2018-02-12T01:59:09+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:5456","user":"anonymous","dateFinished":"2018-02-12T04:23:24+0000","dateStarted":"2018-02-12T04:23:24+0000","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"\n
Stopping Point: From your Temporary Views, Create a Simple Viz
\n"}]}},{"title":"Your Working Space: Spark View => SQL Query & Viz","text":"%sql\n\n","dateUpdated":"2018-02-12T01:59:09+0000","config":{"lineNumbers":true,"editorSetting":{"language":"sql","editOnDblClick":false},"colWidth":12,"editorMode":"ace/mode/sql","editorHide":false,"title":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749689_1430551494","id":"20180211-211029_1012657635","dateCreated":"2018-02-12T01:59:09+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:5457"},{"text":"%md\n","dateUpdated":"2018-02-12T01:59:09+0000","config":{"tableHide":false,"editorSetting":{"language":"markdown","editOnDblClick":true},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1518400749689_1430551494","id":"20180211-080404_990095708","dateCreated":"2018-02-12T01:59:09+0000","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:5458"}],"name":"Spark Workshop: Shell Intro","id":"2D6C6PTWP","angularObjects":{"2D8F6TC39:shared_process":[],"2D7NXVY4B:shared_process":[],"2D8A6MQRC:shared_process":[],"2D5PJ3NSA:shared_process":[],"2D6EP7ZA2:shared_process":[],"2D5TJGRY4:shared_process":[],"2D5VZHKP9:shared_process":[],"2D89YMPRJ:shared_process":[],"2D72HVDPJ:shared_process":[],"2D6DB2GSE:shared_process":[],"2D7ZKW5Z8:shared_process":[],"2D6YB5W7T:shared_process":[],"2D6DKU8MK:shared_process":[],"2D4ZW7PCP:shared_process":[],"2D566ZBPK:shared_process":[],"2D866A3V3:shared_process":[],"2D7TTRTEG:shared_process":[],"2D71TQHNV:shared_process":[],"2D57SSSRY:shared_process":[]},"config":{"looknfeel":"default","personalizedMode":"false"},"info":{}}
--------------------------------------------------------------------------------
/sample-data/cmoa/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/spark4lib/code4lib2018/790537ca06807071cab0f51196a70edf5917b059/sample-data/cmoa/.DS_Store
--------------------------------------------------------------------------------
/sample-data/cmoa/README:
--------------------------------------------------------------------------------
1 | Carnegie Museum of Art
2 |
3 | Github Repo
4 | ---------
5 | https://github.com/cmoa/collection
6 |
--------------------------------------------------------------------------------
/sample-data/penn museum/README:
--------------------------------------------------------------------------------
1 | Penn Museum
2 |
3 | README
4 | ------
5 | https://www.penn.museum/collections/objects/data.php
6 |
--------------------------------------------------------------------------------
/sample-data/penn museum/all-20180121.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/spark4lib/code4lib2018/790537ca06807071cab0f51196a70edf5917b059/sample-data/penn museum/all-20180121.zip
--------------------------------------------------------------------------------
/sample-data/small-sample.csv:
--------------------------------------------------------------------------------
1 | "Name","Institution Type","Format","URL","Description","Informal Score"
2 | "CMOA","Museum","JSON, CSV","https://github.com/cmoa/collection","","10"
3 | "Penn Museum","Museum","JSON, CSV, XML","https://www.penn.museum/collections/data.php","JSON is poorly structured","7"
4 | "Met Museum","Museum","CSV",,"¯\_(ツ)_/¯
5 | ","3"
6 | "DigitalNZ Te Puna Web Directory","Library","XML","https://natlib.govt.nz/files/data/tepunawebdirectory.xml","MARC XML","3"
7 | "Canadian Subject Headings","Library","RDF/XML","http://www.collectionscanada.gc.ca/obj/900/f11/040004/csh.rdf","Ugh, rdf","4"
8 | "DPLA","Aggregator","CSV,JSON,XML","dp.la","","100"
9 |
--------------------------------------------------------------------------------
/slides/SparkInTheDark101.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/spark4lib/code4lib2018/790537ca06807071cab0f51196a70edf5917b059/slides/SparkInTheDark101.pdf
--------------------------------------------------------------------------------
/survey-responses.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/spark4lib/code4lib2018/790537ca06807071cab0f51196a70edf5917b059/survey-responses.csv
--------------------------------------------------------------------------------
/worksheets/functions-list.md:
--------------------------------------------------------------------------------
1 | # Spark in the Dark Workshop: PySpark Functions Cheatsheet
2 |
3 | ## Dataframe Functions:
4 |
5 | ### Shown in Workshop
6 |
7 | * **agg:** Aggregate on the entire DataFrame without groups (shorthand for df.groupBy.agg()). Runs the given operation on the aggregate.
8 | * **collect():** Returns all the records as a list of Row. *Note: we only see this in the context of a SQL call, but it can be used outside of that.*
9 | * **columns:** Returns all column names as a list.
10 | * **count():** Returns the number of rows in this DataFrame.
11 | * **distinct():** Returns a new DataFrame containing the distinct rows in this DataFrame.
12 | * **drop(colName):** Returns a new DataFrame that drops the specified column. This is a no-op if schema doesn’t contain the given column name(s).
13 | * **dropna():** Returns a new DataFrame omitting rows with null values.
14 | * **dropDuplicates():** Return a new DataFrame with duplicate rows removed, optionally only considering certain columns.
15 | * **fillna(value, subset=None):** Replace null values, alias for na.fill(). DataFrame.fillna() and DataFrameNaFunctions.fill() are aliases of each other.
16 | * **filter(condition) / where(condition):** Filters rows using the given condition. `where()` is an alias for filter().
17 | * **groupBy(cols):** Groups the DataFrame using the specified columns, so we can run aggregation on them. `groupby()` is an alias for `groupBy()`.
18 | * **head(n):** Returns the first `n` rows.
19 | * **orderBy(cols):** Return a new DataFrame sorted by the specified column(s).
20 | * **printSchema():** Prints out the schema in the tree format.
21 | * **rdd:** Returns the content as a `pyspark.RDD` of `Row`.
22 | * **show(n, truncate=True):** Prints the first n rows to the console.
23 | * **sort(cols):** Returns a new DataFrame sorted by the specified column(s).
24 | * **withColumn(colName, column):** Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
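
A minimal sketch chaining several of the functions above. It assumes `sampleDf` is the small-sample DataFrame built in the workshop notebook and that `spark` is the SparkSession Zeppelin / pyspark provides:

```python
# Sketch only: sampleDf is assumed to be the DataFrame loaded from
# small-sample.csv earlier in the workshop notebook.
highScores = (sampleDf
    .filter(sampleDf["Informal Score"] > 5)       # filter(condition)
    .select("Name", "Format", "Informal Score")   # select specific columns
    .orderBy(sampleDf["Informal Score"].desc()))  # orderBy a column, descending

highScores.show(5, truncate=True)                 # show(n, truncate=True)
print(highScores.count())                         # count() the rows
highScores.printSchema()                          # printSchema()
```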
25 |
26 | ### Not Shown, but Possibly Helpful
27 |
28 | * **alias:** Returns a new DataFrame with an alias set.
29 | * **corr(col1, col2, method=None):** Calculates the correlation of two columns of a DataFrame as a double value. Currently only supports the Pearson Correlation Coefficient. DataFrame.corr() and DataFrameStatFunctions.corr() are aliases of each other.
30 | * **dtypes:** Returns all column names and their data types as a list.
31 | * **explain:** Prints the (logical and physical) plans to the console for debugging purpose.
32 | * **first():** Returns the first row as a Row.
33 | * **foreach(f):** Applies the function f to each Row of this DataFrame.
34 | * **hint():** Specifies some hint on the current DataFrame.
35 | * **limit():** Limits the result count to the number specified.
36 | * **replace(to_replace, value=None, subset=None):** Returns a new DataFrame replacing a value with another value.
37 | * **toDF(cols):** Returns a new DataFrame with the specified column names.
38 | * **toJSON():** Converts a DataFrame into a RDD of string. Each row is turned into a JSON document as one element in the returned RDD.
39 | * **withColumnRenamed(existing, new):** Returns a new DataFrame by renaming an existing column. This is a no-op if schema doesn’t contain the given column name.
40 | * **write:** Interface for saving the content of the non-streaming DataFrame out into external storage.
41 |
42 | ## Column Functions:
43 |
44 | * **asc():** Returns a sort expression based on the ascending order of the given column name.
45 | * **between(lowerBound, upperBound):** A boolean expression that is evaluated to true if the value of this expression is between the given columns.
46 | * **cast(dataType):** Convert the column into type dataType.
47 | * **contains(val):** Return True/False if rows in Column contains this value.
48 | * **endswith(val) / startswith(val):** Return a Boolean Column based on matching end / start of string.
49 | * **getField():** An expression that gets a field by name in a StructField.
50 | * **getItem():** An expression that gets an item at position ordinal out of a list, or gets an item by key out of a dict.
51 | * **isNotNull() / isNull():** True if the current expression is not null / null.
52 | * **otherwise() / when():** Evaluates a list of conditions and returns one of multiple possible result expressions. If Column.otherwise() is not invoked, None is returned for unmatched conditions.
53 | * **substr():** Return a Column which is a substring of the column.
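
A small sketch of a few of these Column methods, again assuming the workshop's `sampleDf`:

```python
# Sketch only: sampleDf and its columns come from the workshop's small-sample.csv.
from pyspark.sql.functions import col

sampleDf.select(
    col("Informal Score").cast("int").alias("score_int"),    # cast(dataType)
    col("Informal Score").between(3, 7).alias("mid_score"),  # between(lower, upper)
    col("Format").contains("CSV").alias("has_csv"),          # contains(val)
    col("URL").isNull().alias("missing_url")                  # isNull()
).show()
```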
54 |
55 | ## Row functions:
56 |
57 | * **asDict():** Returns the Row as a dict.
58 |
--------------------------------------------------------------------------------
/worksheets/working-with-cho-data.md:
--------------------------------------------------------------------------------
1 | # Spark Practice: Working with CHO Data in Zeppelin
2 |
3 | In this session, we want you to apply what we learned in the Spark Shell Introduction (Spark SQL, PySpark, & Zeppelin) to analyze some provided Cultural Heritage Organization (CHO) data.
4 |
5 | You will be working in small groups on this.
6 |
7 | Table of Contents:
8 |
9 | 1. [Getting Started](#getting-started)
10 | 1. [Part 1: Creating a Dataframe Instance from a CHO CSV](#part-1-creating-a-dataframe-instance-from-a-cho-csv)
11 | 1. [Part 2: Simple Analysis of your CHO Dataset](#part-2-simple-analysis-of-your-cho-dataset)
12 | 1. [Part 3: Creating Derivative DataFrames for your CHO ](#part-3-creating-derivative-dataframes-for-your-cho)
13 | 1. [Part 4: Using SQL to Analyze Your Data](#part-4-using-sql-to-analyze-your-data)
14 | 1. [Part 5: Simple Data Visualization](#part-5-simple-data-visualization)
15 |
16 | ## Getting Started
17 |
18 | ### Check Your Environment (Again)
19 |
20 | This should all be the same as before, but double-check that:
21 |
22 | 1. PySpark / Spark is working;
23 | 2. you can load the sample data provided on the Docker container;
24 | 3. you know the Spark version you're working with.
25 |
26 | The following paragraph (running Scala code via the `%spark` interpreter) should produce something like the following:
27 |
28 | ```scala
29 | > println("Spark Version: " + sc.version)
30 | > val penndata = sc.textFile("penn.csv")
31 | > val cmoadata = sc.textFile("cmoa.csv")
32 | > println("penn count: " + penndata.count)
33 | > println("cmoadata count: " + cmoadata.count)
34 | Spark Version: 2.1.0
35 | penndata: org.apache.spark.rdd.RDD[String] = penn.csv MapPartitionsRDD[2538] at textFile at <console>:28
36 | cmoadata: org.apache.spark.rdd.RDD[String] = cmoa.csv MapPartitionsRDD[2540] at textFile at <console>:27
37 | penn count: 379317
38 | cmoadata count: 34596
39 | ```
40 |
41 | ### CHO Datasets to Choose From
42 |
43 | See [the CHO dataset files here](../sample-data/). FYI, you'll need to unzip the Penn dataset using the `%sh` interpreter in Zeppelin (so that it unzips in your Docker container instance).
44 |
45 | | Name | Institution Type | Format | URL |
46 | | ---- | ---------------- | ------ | --- |
47 | | CMOA | Museum | CSV | https://github.com/cmoa/collection |
48 | | Penn Museum | Museum | CSV | https://www.penn.museum/collections/data.php |
49 |
50 | ## Part 1: Creating a Dataframe Instance from a CHO CSV
51 |
52 | Using what we learned from the previous session, load one of the CHO datasets into a DataFrame. Start by having the schema inferred. (A rough sketch follows the checklist below.)
53 |
54 | Do some basic exploration of the Dataframe to figure out if it is adequately loaded:
55 | 1. `.printSchema()`
56 | 2. `.show(10)` (you'll want to use truncate this time)
57 | 3. `.columns` to view the loaded columns
58 | 4. Check out some values for specific columns using `.select(colName)` & `.distinct()`
59 | 5. Consider making your own Schema & loading the CSV into a DataFrame with that Schema
60 | 6. Perform a `count()` to know how many rows are loaded
61 | 7. Get a sampling of the DataFrame using `.sample(withReplacement, fraction, seed=None)`
62 |
63 | Make sure you feel confident in your DataFrame and the data loaded.
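
As a rough starting point, a loading-and-exploration sketch might look like the following. The CSV path and the `title` column are assumptions about the CMOA dataset and the workshop Docker layout; `spark` is the SparkSession Zeppelin provides. Adjust names and paths for the dataset you chose:

```python
# Sketch only: path, options, and column names assume the workshop setup.
choDf = spark.read.csv(
    "code4lib2018/sample-data/cmoa/cmoa.csv",  # assumed path to the CMOA CSV
    header=True,
    inferSchema=True)

choDf.printSchema()                        # 1. inspect the inferred schema
choDf.show(10, truncate=True)              # 2. peek at the first rows
print(choDf.columns)                       # 3. list the loaded columns
choDf.select("title").distinct().show()    # 4. values of one (assumed) column
print(choDf.count())                       # 6. how many rows were loaded
choDf.sample(False, 0.01).show()           # 7. a 1% sample without replacement
```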
64 |
65 | ## Part 2: Simple Analysis of your CHO Dataset
66 |
67 | Now use the [functions](functions-list.md) we've learned to start answering some questions about your CHO data (a rough sketch follows this list). For example:
68 |
69 | 1. Select (`.select(colName, colName, ...)`) & Filter (`.filter(colName conditional)`) a couple subsets of columns to get a sense of the data.
70 | 2. Try ordering by a couple of fields (`.orderBy(colName)`) and see if that works out.
71 | 3. Apply a filter for finding rows where a column is or is not null (`.filter(pennDf.date_made.isNotNull())` or `.filter(pennDf.date_made.isNull())`). Sometimes it is just as important to get identifiers of records where a field is Null as it is to ignore empty fields in computations.
72 | 4. Use `groupBy(colName)` and aggregate functions (`count()` or `agg({"colName": "mean"})`) to get numbers related to selected fields (e.g. what is the average height of resources by period, or the number of resources by curatorial section?)
73 | 5. Use `distinct()`, `count()`, AND `countDistinct` along with ordering (`orderBy()`) to get rankings of values.
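
For instance, a hedged sketch that reuses `pennDf` from Part 1 (column names other than `date_made` are assumptions about the Penn CSV):

```py
from pyspark.sql.functions import desc

# select a few columns, drop null dates, and order the result
pennDf.select("object_name", "date_made", "period") \
      .filter(pennDf.date_made.isNotNull()) \
      .orderBy("date_made") \
      .show(10)

# group + aggregate: average height by curatorial section (both names assumed)
pennDf.groupBy("curatorial_section").agg({"height": "mean"}).show()

# ranked counts of object_name values
pennDf.groupBy("object_name").count().orderBy(desc("count")).show(10)
```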
74 |
75 | ## Part 3: Creating Derivative DataFrames for your CHO
76 |
77 | Think of some DataFrames that would be useful to feed into SQL queries or visualizations, then consider how to derive them using what we learned before. For example (a sketch follows the list):
78 |
79 | 1. Get DataFrame with Curatorial, ID, Culture, URL, & Period, Dropping Nulls (using `dropna()` & `select()`)
80 | 2. Get DataFrame with little description (no Creator, Title, or Description) using `filter()` & `isNull()`
81 | 3. Use `explode()` to break out new rows for each `object_name` breaking on the `|` delimiter.
82 | 4. Make new columns (with `withColumn()` and `col.like(regex)`) to capture whether a date is AD or BC (or other criteria).
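
A sketch of those derivatives, again reusing `pennDf` (the column names are assumptions about the Penn CSV):

```py
from pyspark.sql.functions import col, explode, split

# 1. core columns with null rows dropped
coreDf = pennDf.select("curatorial_section", "object_number", "culture", "url", "period").dropna()

# 2. records with little descriptive metadata
sparseDf = pennDf.filter(col("creator").isNull() & col("title").isNull() & col("description").isNull())

# 3. one row per object_name value, splitting on the | delimiter
namesDf = pennDf.withColumn("object_name", explode(split("object_name", "\\|")))

# 4. new columns flagging AD / BC dates
datedDf = pennDf.withColumn("is_bc", col("date_made").like("%BC%")) \
                .withColumn("is_ad", col("date_made").like("%AD%"))
```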
83 |
84 | ## Part 4: Using SQL to Analyze Your Data
85 |
86 | [Here](http://cse.unl.edu/~sscott/ShowFiles/SQL/CheatSheet/SQLCheatSheet.html) is a SQL Cheatsheet in case you need it.
87 |
88 | First, use `df.createOrReplaceTempView("ViewName")` to create the temporary views.
89 |
90 | Then use SQL (via `spark.sql("""SQL query""")`) to analyze and query your data.
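
A sketch, using the Penn DataFrame from Part 1 (`curatorial_section` is still an assumed column name):

```py
pennDf.createOrReplaceTempView("penn")

bySection = spark.sql("""
    SELECT curatorial_section, COUNT(*) AS num_objects
    FROM penn
    GROUP BY curatorial_section
    ORDER BY num_objects DESC
""")
bySection.show(10)
```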
91 |
92 | ## Part 5: Simple Data Visualization
93 |
94 | Use the `%sql` interpreter in Zeppelin to run SQL queries against our previously defined temporary SQL view (just use the view name where a table name normally goes in a SQL query).
95 |
96 | This gives us immediate access to the SQL-driven visualizations available in Zeppelin.
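
For example, a paragraph using the SQL interpreter against the `penn` view created in the Part 4 sketch might look like this (the column name is still an assumption):

```sql
%sql
SELECT curatorial_section, COUNT(*) AS num_objects
FROM penn
GROUP BY curatorial_section
ORDER BY num_objects DESC
```

Zeppelin then lets you toggle the result between a table and its built-in charts.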
97 |
--------------------------------------------------------------------------------
/worksheets/working-with-spark-shell.md:
--------------------------------------------------------------------------------
1 | # Spark Practice: Working with `spark-shell` Steps
2 |
3 | The presenter is running this with a local, standalone installation of Spark 2.2.1. If you want to try this but don't have Spark installed, see the Docker option for getting a container with the Spark shell (via PySpark) at [the bottom of this worksheet](#run-the-spark-shell-walk-through-yourself). Slides (captured mainly as a back-up in case of computer issues): https://docs.google.com/presentation/d/1W3xGitwB_nM36kTvVRsAQjRtw763HI7S00-an0nhhSI/edit#slide=id.g2e2dc1ae6b_0_9
4 |
5 | ## Run a PySpark Interpreter Shell
6 |
7 | ```shell
8 | $ pyspark
9 | ```
10 |
11 | You should see something like the following:
12 |
13 | ```
14 | $ pyspark
15 | Python 3.5.1 (default, May 24 2017, 21:07:43)
16 | [GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)] on darwin
17 | Type "help", "copyright", "credits" or "license" for more information.
18 | Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19 | Setting default log level to "WARN".
20 | To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21 | 18/02/11 10:50:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22 | 18/02/11 10:50:07 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
23 | Welcome to
24 | ____ __
25 | / __/__ ___ _____/ /__
26 | _\ \/ _ \/ _ `/ __/ '_/
27 | /__ / .__/\_,_/_/ /_/\_\ version 2.2.1
28 | /_/
29 |
30 | Using Python version 3.5.1 (default, May 24 2017 21:07:43)
31 | SparkSession available as 'spark'.
32 | >>>
33 | ```
34 |
35 | Notes:
36 | - We are running the PySpark interpreter to demo the Spark shell; one could also run the Scala interpreter (`spark-shell`).
37 | - This automatically creates a SparkSession for us; when writing standalone code, you'll need to start one yourself (see the sketch below).
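
A minimal sketch of starting a session yourself in a standalone script (the app name is just a placeholder):

```py
from pyspark.sql import SparkSession

# build (or reuse) a session; the shell does this for you automatically
spark = SparkSession.builder \
    .appName("spark-workshop") \
    .getOrCreate()

sc = spark.sparkContext  # the SparkContext, which the shell also exposes as `sc`
```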
38 |
39 | ## Check out the Spark Standalone Dashboard
40 |
41 | If you followed the setup instructions below (i.e. your Docker container maps port 3030), your Spark dashboard should be at http://localhost:3030.
42 |
43 | ![Spark Standalone Dashboard](../images/spark-dashboard.png)
44 |
45 | ## Our Dataset for this Simple Walk-Through
46 |
47 | See [the CSV file here](../sample-data/small-sample.csv).
48 |
49 | | Name | Institution Type | Format | URL | Description |
50 | | ---- | ---------------- | ------ | --- | ----------- |
51 | | CMOA | Museum | JSON, CSV | https://github.com/cmoa/collection | |
52 | | Penn Museum | Museum | JSON, CSV, XML | https://www.penn.museum/collections/data.php | JSON is poorly structured |
53 | | Met Museum | Museum | CSV | | ¯\_(ツ)_/¯\n |
54 | | DigitalNZ Te Puna Web Directory | Library | XML | https://natlib.govt.nz/files/data/tepunawebdirectory.xml | MARC XML |
55 | | Canadian Subject Headings | Library | RDF/XML | http://www.collectionscanada.gc.ca/obj/900/f11/040004/csh.rdf | "Ugh, rdf" |
56 | | DPLA | Aggregator | CSV,JSON,XML | dp.la | |
57 |
58 | ## Read a CSV File
59 |
60 | Read the CSV in our interpreter shell using PySpark's `DataFrameReader` and its `csv` method:
61 |
62 | ```
63 | > spark.read.csv("sample-data/small-sample.csv")
64 | DataFrame[_c0: string, _c1: string, _c2: string, _c3: string, _c4: string, _c5: string]
65 | ```
66 |
67 | We see it makes a DataFrame. But what if we want to see the data? Run `.show()` on that data to see what was read in.
68 |
69 | ```
70 | > spark.read.csv("sample-data/small-sample.csv").show()
71 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
72 | | _c0| _c1| _c2| _c3| _c4| _c5|
73 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
74 | | Name|Institution Type| Format| URL| Description|Informal Score|
75 | | CMOA| Museum| JSON, CSV|https://github.co...| null| 10|
76 | | Penn Museum| Museum|JSON, CSV, XML|https://www.penn....|JSON is poorly st...| 7|
77 | | Met Museum| Museum| CSV| null| ¯_(ツ)_/¯| null|
78 | | ,3| null| null| null| null| null|
79 | |DigitalNZ Te Puna...| Library| XML|https://natlib.go...| MARC XML| 3|
80 | |Canadian Subject ...| Library| RDF/XML|http://www.collec...| Ugh, rdf| 4|
81 | | DPLA| Aggregator | CSV,JSON,XML| dp.la| null| 100|
82 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
83 | ```
84 |
85 | Great! But... not entirely. What about headers?
86 |
87 |
88 | ```
89 | > spark.read.csv("sample-data/small-sample.csv", header=True).show()
90 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
91 | | Name|Institution Type| Format| URL| Description|Informal Score|
92 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
93 | | CMOA| Museum| JSON, CSV|https://github.co...| null| 10|
94 | | Penn Museum| Museum|JSON, CSV, XML|https://www.penn....|JSON is poorly st...| 7|
95 | | Met Museum| Museum| CSV| null| ¯_(ツ)_/¯| null|
96 | | ,3| null| null| null| null| null|
97 | |DigitalNZ Te Puna...| Library| XML|https://natlib.go...| MARC XML| 3|
98 | |Canadian Subject ...| Library| RDF/XML|http://www.collec...| Ugh, rdf| 4|
99 | | DPLA| Aggregator | CSV,JSON,XML| dp.la| null| 100|
100 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
101 | ```
102 |
103 | That's better.
104 |
105 | But hold up, we seem to have trouble with a pesky newline in one of the cells. Let's try `DROPMALFORMED` mode and see if that helps.
106 |
107 | ```
108 | > spark.read.csv("sample-data/small-sample.csv", header=True, mode="DROPMALFORMED").show()
109 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
110 | | Name|Institution Type| Format| URL| Description|Informal Score|
111 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
112 | | CMOA| Museum| JSON, CSV|https://github.co...| null| 10|
113 | | Penn Museum| Museum|JSON, CSV, XML|https://www.penn....|JSON is poorly st...| 7|
114 | |DigitalNZ Te Puna...| Library| XML|https://natlib.go...| MARC XML| 3|
115 | |Canadian Subject ...| Library| RDF/XML|http://www.collec...| Ugh, rdf| 4|
116 | | DPLA| Aggregator | CSV,JSON,XML| dp.la| null| 100|
117 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
118 | ```
119 |
120 | Well, that's better, but we don't want to lose the Met's row. Let's try the `multiLine` option instead (this option was added in Spark 2.2.x, so it isn't available in earlier versions).
121 |
122 | ```
123 | > spark.read.csv("sample-data/small-sample.csv", header=True, multiLine=True).show()
124 | +--------------------+----------------+--------------+--------------------+--------------------+---------------+
125 | | Name|Institution Type| Format| URL| Description|Informal Score
126 | +--------------------+----------------+--------------+--------------------+--------------------+---------------+
127 | | CMOA| Museum| JSON, CSV|https://github.co...| null| 10
128 | | Penn Museum| Museum|JSON, CSV, XML|https://www.penn....|JSON is poorly st...| 7
129 | | Met Museum| Museum| CSV| null| ¯_(ツ)_/¯
130 | | 3
131 | |DigitalNZ Te Puna...| Library| XML|https://natlib.go...| MARC XML| 3
132 | |Canadian Subject ...| Library| RDF/XML|http://www.collec...| Ugh, rdf| 4
133 | | DPLA| Aggregator | CSV,JSON,XML| dp.la| null| 100
134 | +--------------------+----------------+--------------+--------------------+--------------------+---------------+
135 | ```
136 |
137 | That's... better? That column still looks wonky, but that may just be `show()`'s awkward presentation of the embedded newline. Let's inspect the inferred schema of this DataFrame:
138 |
139 | ```
140 | > spark.read.csv("sample-data/small-sample.csv", header=True, multiLine=True).printSchema()
141 | root
142 | |-- Name: string (nullable = true)
143 | |-- Institution Type: string (nullable = true)
144 | |-- Format: string (nullable = true)
145 | |-- URL: string (nullable = true)
146 | |-- Description: string (nullable = true)
147 | |-- Informal Score: string (nullable = true)
148 | ```
149 |
150 | Looks good. But let's go ahead and pass in a specified schema so `Informal Score` is not a string. First we need to create a schema with the specific data types declared:
151 |
152 | ```
153 | > from pyspark.sql.types import *
154 | > customSchema = StructType([
155 | StructField("Name", StringType(), True),
156 | StructField("Institution Type", StringType(), True),
157 | StructField("Format", StringType(), True),
158 | StructField("URL", StringType(), True),
159 | StructField("Description", StringType(), True),
160 | StructField("Informal Score", DecimalType(), True)])
161 | > sampleDf = spark.read.csv("sample-data/small-sample.csv", header=True, multiLine=True, schema=customSchema)
162 | > sampleDf.show()
163 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
164 | | Name|Institution Type| Format| URL| Description|Informal Score|
165 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
166 | | CMOA| Museum| JSON, CSV|https://github.co...| 10| null|
167 | | Penn Museum| Museum|JSON, CSV, XML|https://www.penn....|JSON is poorly st...| 7|
168 | | Met Museum| Museum| CSV| null| ¯_(ツ)_/¯
169 | | 3|
170 | |DigitalNZ Te Puna...| Library| XML|https://natlib.go...| MARC XML| 3|
171 | |Canadian Subject ...| Library| RDF/XML|http://www.collec...| Ugh, rdf| 4|
172 | | DPLA| Aggregator| CSV,JSON,XML| dp.la| 100| null|
173 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
174 | ```
175 |
176 | # Simple Data Analysis
177 |
178 | Our CSV has been read into a DataFrame pretty well, and is now stored as `sampleDf`. We're still a bit worried about that newline, so let's check the number of rows:
179 |
180 | ```sh
181 | > sampleDf.count()
182 | 6
183 | ```
184 | Do we have the correct number of non-header rows? 6? Yep. Now let's inspect some of the rows using `head`, which takes the number of rows we want to see as a parameter:
185 |
186 | ```sh
187 | > sampleDf.head(4)
188 | [Row(Name='CMOA', Institution Type='Museum', Format='JSON, CSV', URL='https://github.com/cmoa/collection', Description='10', Informal Score=None), Row(Name='Penn Museum', Institution Type='Museum', Format='JSON, CSV, XML', URL='https://www.penn.museum/collections/data.php', Description='JSON is poorly structured', Informal Score=Decimal('7')), Row(Name='Met Museum', Institution Type='Museum', Format='CSV', URL=None, Description='¯_(ツ)_/¯\n', Informal Score=Decimal('3')), Row(Name='DigitalNZ Te Puna Web Directory', Institution Type='Library', Format='XML', URL='https://natlib.govt.nz/files/data/tepunawebdirectory.xml', Description='MARC XML', Informal Score=Decimal('3'))]
189 | ```
190 |
191 | This returns a list of Spark Rows, so that looks good. Let's check the number of columns:
192 |
193 | ```sh
194 | > sampleDf.columns
195 | ['Name', 'Institution Type', 'Format', 'URL', 'Description', 'Informal Score']
196 | > len(sampleDf.columns)
197 | 6
198 | ```
199 |
200 | Then inspect specific columns using `select(colName)`:
201 |
202 | ```sh
203 | > sampleDf.select("Name").show()
204 | +--------------------+
205 | | Name|
206 | +--------------------+
207 | | CMOA|
208 | | Penn Museum|
209 | | Met Museum|
210 | |DigitalNZ Te Puna...|
211 | |Canadian Subject ...|
212 | | DPLA|
213 | +--------------------+
214 | ```
215 |
216 | Looks good! And the `Informal Score` column?
217 |
218 | ```sh
219 | > sampleDf.select("Name", "Informal Score").show()
220 | +--------------------+--------------+
221 | | Name|Informal Score|
222 | +--------------------+--------------+
223 | | CMOA| null|
224 | | Penn Museum| 7|
225 | | Met Museum| 3|
226 | |DigitalNZ Te Puna...| 3|
227 | |Canadian Subject ...| 4|
228 | | DPLA| null|
229 | +--------------------+--------------+
230 | ```
231 |
232 | Now that we feel comfortable with the loaded data, we can do some more analysis. Let's apply a `.filter()` based on the `Informal Score` and compare it with `select()`:
233 |
234 | ```sh
235 | > scoresBooleanDf = sampleDf.select('Name', 'URL', sampleDf['Informal Score'] > 5)
236 | > scoresBooleanDf.show()
237 | +--------------------+--------------------+--------------------+
238 | | Name| URL|(Informal Score > 5)|
239 | +--------------------+--------------------+--------------------+
240 | | CMOA|https://github.co...| null|
241 | | Penn Museum|https://www.penn....| true|
242 | | Met Museum| null| false|
243 | |DigitalNZ Te Puna...|https://natlib.go...| false|
244 | |Canadian Subject ...|http://www.collec...| false|
245 | | DPLA| dp.la| null|
246 | +--------------------+--------------------+--------------------+
247 | > highScoresDf = sampleDf.filter(sampleDf['Informal Score'] > 5)
248 | > highScoresDf.select('Name', 'URL', 'Informal Score').show()
249 | +-----------+--------------------+--------------+
250 | | Name| URL|Informal Score|
251 | +-----------+--------------------+--------------+
252 | |Penn Museum|https://www.penn....| 7|
253 | +-----------+--------------------+--------------+
254 | ```
255 |
256 | The first gives us a `True`/`False` column indicating whether the condition is met; the second returns only the rows where the condition is met (i.e. evaluates to `True`).
257 |
258 | And then to order results from these methods:
259 |
260 | ```sh
261 | > sampleDf.orderBy("Informal Score").show()
262 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
263 | | Name|Institution Type| Format| URL| Description|Informal Score|
264 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
265 | | DPLA| Aggregator| CSV,JSON,XML| dp.la| 100| null|
266 | | CMOA| Museum| JSON, CSV|https://github.co...| 10| null|
267 | | Met Museum| Museum| CSV| null| ¯_(ツ)_/¯
268 | | 3|
269 | |DigitalNZ Te Puna...| Library| XML|https://natlib.go...| MARC XML| 3|
270 | |Canadian Subject ...| Library| RDF/XML|http://www.collec...| Ugh, rdf| 4|
271 | | Penn Museum| Museum|JSON, CSV, XML|https://www.penn....|JSON is poorly st...| 7|
272 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
273 |
274 | > sampleDf.orderBy(sampleDf["Informal Score"].desc()).show()
275 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
276 | | Name|Institution Type| Format| URL| Description|Informal Score|
277 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
278 | | Penn Museum| Museum|JSON, CSV, XML|https://www.penn....|JSON is poorly st...| 7|
279 | |Canadian Subject ...| Library| RDF/XML|http://www.collec...| Ugh, rdf| 4|
280 | | Met Museum| Museum| CSV| null| ¯_(ツ)_/¯
281 | | 3|
282 | |DigitalNZ Te Puna...| Library| XML|https://natlib.go...| MARC XML| 3|
283 | | CMOA| Museum| JSON, CSV|https://github.co...| 10| null|
284 | | DPLA| Aggregator| CSV,JSON,XML| dp.la| 100| null|
285 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
286 | ```
287 |
288 | Note that ordering defaults to ascending; add `.desc()` to the ordering column for descending order.
289 |
290 | Another helpful, quick analysis function is Grouping.
291 |
292 | ```sh
293 | > URLGroupDf = sampleDf.groupBy("Format").count()
294 | > URLGroupDf.show()
295 | +--------------+-----+
296 | | Format|count|
297 | +--------------+-----+
298 | | CSV| 1|
299 | | XML| 1|
300 | | JSON, CSV| 1|
301 | |JSON, CSV, XML| 1|
302 | | RDF/XML| 1|
303 | | CSV,JSON,XML| 1|
304 | +--------------+-----+
305 |
306 | > URLGroupDf.printSchema()
307 | root
308 | |-- Format: string (nullable = true)
309 | |-- count: long (nullable = false)
310 | ```
311 |
312 | And there is the ability to ask for `distinct()` values only:
313 |
314 | ```sh
315 | > sampleDf.select("Format").distinct().show()
316 | +--------------+
317 | | Format|
318 | +--------------+
319 | | CSV|
320 | | XML|
321 | | JSON, CSV|
322 | |JSON, CSV, XML|
323 | | RDF/XML|
324 | | CSV,JSON,XML|
325 | +--------------+
326 | > sampleDf.select("Format").distinct().count()
327 | 6
328 | ```
329 |
330 | This leads us to another problem we need to address here: multi-value cells.
331 |
332 | ## Simple Changes to our DataFrame
333 |
334 | The `explode` function allows us to split multi-value cells across rows, with the rest of the row's data staying the same:
335 |
336 | ```sh
337 | > from pyspark.sql.functions import *
338 | > sampleDf = sampleDf.withColumn("Format", explode(split("Format", ",")))
339 | > sampleDf.show()
340 | +--------------------+----------------+-------+--------------------+--------------------+--------------+
341 | | Name|Institution Type| Format| URL| Description|Informal Score|
342 | +--------------------+----------------+-------+--------------------+--------------------+--------------+
343 | | CMOA| Museum| JSON|https://github.co...| 10| null|
344 | | CMOA| Museum| CSV|https://github.co...| 10| null|
345 | | Penn Museum| Museum| JSON|https://www.penn....|JSON is poorly st...| 7|
346 | | Penn Museum| Museum| CSV|https://www.penn....|JSON is poorly st...| 7|
347 | | Penn Museum| Museum| XML|https://www.penn....|JSON is poorly st...| 7|
348 | | Met Museum| Museum| CSV| null| ¯_(ツ)_/¯
349 | | 3|
350 | |DigitalNZ Te Puna...| Library| XML|https://natlib.go...| MARC XML| 3|
351 | |Canadian Subject ...| Library|RDF/XML|http://www.collec...| Ugh, rdf| 4|
352 | | DPLA| Aggregator| CSV| dp.la| 100| null|
353 | | DPLA| Aggregator| JSON| dp.la| 100| null|
354 | | DPLA| Aggregator| XML| dp.la| 100| null|
355 | +--------------------+----------------+-------+--------------------+--------------------+--------------+
356 | ```
357 |
358 | Alternatively, depending on what you want to do, you can create new columns based on analysis of that multivalue field:
359 |
360 | ```sh
361 | > sampleDf = sampleDf.withColumn('CSV', sampleDf.Format.like("%CSV%"))
362 | > sampleDf = sampleDf.withColumn('XML', sampleDf.Format.like("%XML%"))
363 | > sampleDf = sampleDf.withColumn('RDF', sampleDf.Format.like("%RDF%"))
364 | > sampleDf = sampleDf.withColumn('JSON', sampleDf.Format.like("%JSON%"))
365 | > sampleDf.show()
366 | +--------------------+----------------+-------+--------------------+--------------------+--------------+-----+-----+-----+-----+
367 | | Name|Institution Type| Format| URL| Description|Informal Score| CSV| XML| RDF| JSON|
368 | +--------------------+----------------+-------+--------------------+--------------------+--------------+-----+-----+-----+-----+
369 | | CMOA| Museum| JSON|https://github.co...| 10| null|false|false|false| true|
370 | | CMOA| Museum| CSV|https://github.co...| 10| null| true|false|false|false|
371 | | Penn Museum| Museum| JSON|https://www.penn....|JSON is poorly st...| 7|false|false|false| true|
372 | | Penn Museum| Museum| CSV|https://www.penn....|JSON is poorly st...| 7| true|false|false|false|
373 | | Penn Museum| Museum| XML|https://www.penn....|JSON is poorly st...| 7|false| true|false|false|
374 | | Met Museum| Museum| CSV| null| ¯_(ツ)_/¯
375 | | 3| true|false|false|false|
376 | |DigitalNZ Te Puna...| Library| XML|https://natlib.go...| MARC XML| 3|false| true|false|false|
377 | |Canadian Subject ...| Library|RDF/XML|http://www.collec...| Ugh, rdf| 4|false| true| true|false|
378 | | DPLA| Aggregator| CSV| dp.la| 100| null| true|false|false|false|
379 | | DPLA| Aggregator| JSON| dp.la| 100| null|false|false|false| true|
380 | | DPLA| Aggregator| XML| dp.la| 100| null|false| true|false|false|
381 | +--------------------+----------------+-------+--------------------+--------------------+--------------+-----+-----+-----+-----+
382 | ```
383 |
384 | With these simple, connected functions, we should have what we need to handle checking unique `Format` values (note: there are a million other ways to handle this as well!):
385 |
386 | ```sh
387 | > sampleDf.select("Format").distinct().show()
388 | +-------+
389 | | Format|
390 | +-------+
391 | | CSV|
392 | | XML|
393 | | CSV|
394 | | XML|
395 | |RDF/XML|
396 | | JSON|
397 | +-------+
398 | ```
399 |
400 | Hold on: why is this not fixed yet? Because of something well known to data mungers: whitespace. Let's remove that whitespace using regex replacement:
401 |
402 | ```sh
403 | > sampleDf = sampleDf.withColumn("Format", regexp_replace(sampleDf.Format, "\s+", ""))
404 | > sampleDf.select("Format").distinct().show()
405 | +-------+
406 | | Format|
407 | +-------+
408 | | CSV|
409 | | XML|
410 | |RDF/XML|
411 | | JSON|
412 | +-------+
413 | ```
414 |
415 | This should repair our `Format` column. Let's now run a function to get a subset of rows based on a `where` conditional:
416 |
417 | ```sh
418 | > sampleDf.where(col('Format') == 'XML').show()
419 | +--------------------+----------------+------+--------------------+--------------------+--------------+-----+----+-----+-----+
420 | | Name|Institution Type|Format| URL| Description|Informal Score| CSV| XML| RDF| JSON|
421 | +--------------------+----------------+------+--------------------+--------------------+--------------+-----+----+-----+-----+
422 | | Penn Museum| Museum| XML|https://www.penn....|JSON is poorly st...| 7|false|true|false|false|
423 | |DigitalNZ Te Puna...| Library| XML|https://natlib.go...| MARC XML| 3|false|true|false|false|
424 | | DPLA| Aggregator| XML| dp.la| 100| null|false|true|false|false|
425 | +--------------------+----------------+------+--------------------+--------------------+--------------+-----+----+-----+-----+
426 |
427 | > sampleDf.where(col('Format') == 'XML').count()
428 | 3
429 | ```
430 |
431 | ## Running SQL Queries
432 |
433 | We can also use Spark SQL to run SQL queries on our DataFrame. We first need to create a view (ours will be temporary; consider it an in-memory table), then write our SQL query and `collect()` the results:
434 |
435 | ```py
436 | > sampleDf.createOrReplaceTempView("sampleDataView")
437 | > query = spark.sql("""SELECT `Institution Type`, COUNT(DISTINCT(Format)) as NumFormats
438 | ... FROM sampleDataView
439 | ... GROUP BY `Institution Type`""")
440 | > query.collect()
441 | [Row(Institution Type='Library', NumFormats=2), Row(Institution Type='Aggregator', NumFormats=3), Row(Institution Type='Museum', NumFormats=3)]
442 | > for n in query.collect():
443 | ...     print(n)
444 | ...
445 | Row(Institution Type='Library', NumFormats=2)
446 | Row(Institution Type='Aggregator', NumFormats=3)
447 | Row(Institution Type='Museum', NumFormats=3)
448 | ```
449 |
450 | Note that this returns a list of PySpark Rows.
451 |
452 | At this point, we're going to move to the [Zeppelin Notebook portion of this workshop](working-with-zeppelin.md), where you'll be running these commands (and more) yourself.
453 |
454 | # Run the Spark-Shell Walk-through Yourself
455 |
456 | 1. I am using the Getty Images Docker container for this section (thanks, Getty!) for the sake of consistency:
457 | ```bash
458 | $ docker pull gettyimages/spark
459 | ```
460 | 2. Start the Docker container:
461 | ```bash
462 | $ docker run --name start-shell -p 3030:8080 gettyimages/spark
463 | ```
464 | 3. Open a bash shell inside the running container with `docker exec`:
465 | ```bash
466 | $ docker exec -it start-shell /bin/bash
467 | ```
468 | 4. Grab our small sample data and check that it transferred okay:
469 | ```bash
470 | $ mkdir sample-data
471 | $ curl https://raw.githubusercontent.com/spark4lib/code4lib2018/master/sample-data/small-sample.csv > sample-data/small-sample.csv
472 | $ head sample-data/small-sample.csv
473 | ```
474 | 5. Start pyspark
475 | ```bash
476 | $ pyspark
477 | ```
478 | 6. Run the steps above yourself!
479 |
--------------------------------------------------------------------------------
/worksheets/working-with-zeppelin.md:
--------------------------------------------------------------------------------
1 | # Spark Practice: Working with `spark-shell` in Zeppelin
2 |
3 | In this session, you will learn Zeppelin, Spark, & Spark SQL basics, primarily via the DataFrame API. We are starting with a really simple dataset so we can focus on the tools. In the next session, we will use an expanded cultural heritage metadata set.
4 |
5 | Table of Contents:
6 |
7 | 1. [Getting Started](#getting-started)
8 | 1. [Part 1: Creating a Dataframe Instance from a CSV](#part-1-creating-a-dataframe-instance-from-a-csv)
9 | 1. [Part 2: Simple Analysis of our Dataset via Dataframe API](#part-2-simple-analysis-of-our-dataset-via-dataframe-api)
10 | 1. [Part 3: Creating New Dataframes to Expand or Rework our Original Dataset](#part-3-creating-new-dataframes-to-expand-or-rework-our-original-dataset)
11 | 1. [Part 4: Using SQL to Analyze our Dataset](#part-4-using-sql-to-analyze-our-dataset)
12 | 1. [Part 5: Simple Data Visualization](#part-5-simple-data-visualization)
13 |
14 | ## Getting Started
15 |
16 | ### Datasets? DataFrames?
17 |
18 | A **Dataset** is a distributed collection of data. A Dataset provides the benefits of RDDs (strong typing, the ability to use powerful lambda functions) with the benefits of Spark SQL's optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java.
19 |
20 | A **DataFrame** is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, DataFrame is simply a type alias of Dataset[Row]. (Note that in Scala type parameters (generics) are enclosed in square brackets.)
21 |
22 | [[source](http://spark.apache.org/docs/2.0.0/sql-programming-guide.html#datasets-and-dataframes)]
23 |
24 | ### How to Run a Paragraph in a Zeppelin Note(book)
25 |
26 | To run a paragraph in a Zeppelin notebook, you can either click the `play` button (blue triangle) on the right-hand side of the paragraph, or click into the paragraph and press `Shift + Enter`.
27 |
28 | ### What are Zeppelin Interpreters?
29 |
30 | In the following paragraphs we are going to execute Spark code, run shell commands to download and move files, run SQL queries, etc. Each paragraph will start with `%` followed by an interpreter name, e.g. `%spark2` for a Spark 2.x interpreter. Different interpreter names indicate what will be executed: code, markdown, HTML, etc. This allows you to perform data ingestion, munging, wrangling, visualization, analysis, processing and more, all in one place!
31 |
32 | Throughout this notebook we will use the following interpreters:
33 |
34 | - The default interpreter is set to be PySpark for our Workshop Docker Container.
35 | - `%spark` - Spark interpreter to run Spark code written in Scala
36 | - `%spark.sql` - Spark SQL interpreter (to execute SQL queries against temporary tables in Spark)
37 | - `%sh` - Shell interpreter to run shell commands
38 | - `%angular` - Angular interpreter to run Angular and HTML code
39 | - `%md` - Markdown for displaying formatted text, links, and images
40 |
41 | To learn more about Zeppelin interpreters, check out this [link](https://zeppelin.apache.org/docs/0.5.6-incubating/manual/interpreters.html).
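
For example, a paragraph that starts with `%sh` runs its body as shell commands. The sketch below (using the raw GitHub URL for this repo's small sample CSV) would download the file into the notebook's working directory:

```sh
%sh
curl https://raw.githubusercontent.com/spark4lib/code4lib2018/master/sample-data/small-sample.csv > small-sample.csv
head small-sample.csv
```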
42 |
43 | ### Check Your Environment
44 |
45 | We want to make sure that:
46 | 
47 | 1. PySpark / Spark is working;
48 | 2. you can load the sample data provided on the Docker container;
49 | 3. you know the Spark version you're working with.
50 |
51 | The following paragraph (running Scala Spark code) should produce something like the following:
52 | 
53 | ```scala
54 | > println("Spark Version: " + sc.version)
55 | > val penndata = sc.textFile("penn.csv")
56 | > val cmoadata = sc.textFile("cmoa.csv")
57 | > println("penn count: " + penndata.count)
58 | > println("cmoadata count: " + cmoadata.count)
59 | Spark Version: 2.1.0
60 | penndata: org.apache.spark.rdd.RDD[String] = penn.csv MapPartitionsRDD[2538] at textFile at <console>:28
61 | cmoadata: org.apache.spark.rdd.RDD[String] = cmoa.csv MapPartitionsRDD[2540] at textFile at <console>:27
62 | penn count: 379317
63 | cmoadata count: 34596
64 | ```
65 |
66 | ### Our Dataset for this Simple Walk-Through
67 |
68 | See [the CSV file here](../sample-data/small-sample.csv).
69 |
70 | | Name | Institution Type | Format | URL | Description |
71 | | ---- | ---------------- | ------ | --- | ----------- |
72 | | CMOA | Museum | JSON, CSV | https://github.com/cmoa/collection | |
73 | | Penn Museum | Museum | JSON, CSV, XML | https://www.penn.museum/collections/data.php | JSON is poorly structured |
74 | | Met Museum | Museum | CSV | | ¯\_(ツ)_/¯\n |
75 | | DigitalNZ Te Puna Web Directory | Library | XML | https://natlib.govt.nz/files/data/tepunawebdirectory.xml | MARC XML |
76 | | Canadian Subject Headings | Library | RDF/XML | http://www.collectionscanada.gc.ca/obj/900/f11/040004/csh.rdf | "Ugh, rdf" |
77 | | DPLA | Aggregator | CSV,JSON,XML | dp.la | |
78 |
79 | ## Part 1: Creating a Dataframe Instance from a CSV
80 |
81 | Spark can read a CSV file directly into a DataFrame via `spark.read.csv`:
82 |
83 | ```
84 | > spark.read.csv("small-sample.csv")
85 | DataFrame[_c0: string, _c1: string, _c2: string, _c3: string, _c4: string]
86 | ```
87 |
88 | This uses `pyspark.sql.DataFrameReader` to load data from external storage systems into a DataFrame. The `csv` method in particular loads a CSV file and returns the result as a DataFrame. The above command returns the DataFrame instance.
89 |
90 | Using the `.show()` method on that DataFrame instance, we can see a pretty-printed version of the data:
91 |
92 | ```
93 | > spark.read.csv("small-sample.csv").show()
94 | +--------------------+----------------+--------------+--------------------+--------------------+
95 | | _c0| _c1| _c2| _c3| _c4|
96 | +--------------------+----------------+--------------+--------------------+--------------------+
97 | | Name|Institution Type| Format| URL| Description|
98 | | CMOA| Museum| JSON, CSV|https://github.co...| null|
99 | | Penn Museum| Museum|JSON, CSV, XML|https://www.penn....|JSON is poorly st...|
100 | | Met Museum| Museum| CSV| null| ¯_(ツ)_/¯|
101 | | "| null| null| null| null|
102 | |DigitalNZ Te Puna...| Library| XML|https://natlib.go...| MARC XML|
103 | |Canadian Subject ...| Library| RDF/XML|http://www.collec...| Ugh, rdf|
104 | | DPLA| Aggregator | CSV,JSON,XML| dp.la| null|
105 | +--------------------+----------------+--------------+--------------------+--------------------+
106 | ```
107 |
108 | We want to fix a few things about this data being read in, using the options available on the `csv` reader method:
109 |
110 | ### Adding a Header
111 |
112 | We want to take the first row as a header.
113 |
114 | ```
115 | > spark.read.csv("code4lib2018/sample-data/small-sample.csv", header=True).show()
116 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
117 | | Name|Institution Type| Format| URL| Description|Informal Score|
118 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
119 | | CMOA| Museum| JSON, CSV|https://github.co...| null| 10|
120 | | Penn Museum| Museum|JSON, CSV, XML|https://www.penn....|JSON is poorly st...| 7|
121 | | Met Museum| Museum| CSV| null| ¯_(ツ)_/¯| null|
122 | | ,3| null| null| null| null| null|
123 | |DigitalNZ Te Puna...| Library| XML|https://natlib.go...| MARC XML| 3|
124 | |Canadian Subject ...| Library| RDF/XML|http://www.collec...| Ugh, rdf| 4|
125 | | DPLA| Aggregator | CSV,JSON,XML| dp.la| null| 100|
126 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
127 | ```
128 |
129 | ### Adding (& removing) DROPMALFORMED mode
130 |
131 | We see there is a newline causing us some issues. We don't want to drop those rows for good, but it is an opportunity to learn how `DROPMALFORMED` mode can help us.
132 |
133 | ```
134 | > spark.read.csv("code4lib2018/sample-data/small-sample.csv", header=True, mode="DROPMALFORMED").show()
135 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
136 | | Name|Institution Type| Format| URL| Description|Informal Score|
137 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
138 | | CMOA| Museum| JSON, CSV|https://github.co...| null| 10|
139 | | Penn Museum| Museum|JSON, CSV, XML|https://www.penn....|JSON is poorly st...| 7|
140 | |DigitalNZ Te Puna...| Library| XML|https://natlib.go...| MARC XML| 3|
141 | |Canadian Subject ...| Library| RDF/XML|http://www.collec...| Ugh, rdf| 4|
142 | | DPLA| Aggregator | CSV,JSON,XML| dp.la| null| 100|
143 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
144 | ```
145 |
146 | ### Working Around the New Lines / Multiline CSV
147 |
148 | In the Spark shell (CLI) walk-through, we could use the `multiLine` option. However, that causes an error in our Zeppelin notebook, since we're running version 2.1.x and that option was added in Spark 2.2.x.
149 |
150 | For the sake of the rest of this example, we run a simple Python script within our PySpark notebook context to remove the newlines:
151 |
152 | ```py
153 | import csv
154 |
155 | # we will see a cleaner way using pyspark to handle this issue later
156 | # though really, use proper quoting with Spark 2.2.x & multiLine=True
157 |
158 | with open('code4lib2018/sample-data/small-sample.csv') as fh:
159 | test = csv.reader(fh)
160 | with open('code4lib2018/sample-data/small-sample-stripped.csv', 'w') as fout:
161 | test_write = csv.writer(fout, quoting=csv.QUOTE_ALL)
162 | for row in test:
163 | new_row = [val.replace("\r\n", "") for val in row]
164 | test_write.writerow(new_row)
165 | ```
166 |
167 | This saves a newline-free CSV for the sake of this notebook at `code4lib2018/sample-data/small-sample-stripped.csv`.
168 |
169 | ### Handling Schemas
170 |
171 | Right now, we're asking `spark.read.csv` to infer the schema by passing `inferSchema=True` (it defaults to `False`, in which case every column is read in as a string). We can see what it guesses by using `printSchema()` on the DataFrame instance:
172 |
173 | ```
174 | > spark.read.csv("code4lib2018/sample-data/small-sample-stripped.csv", header=True, inferSchema=True).printSchema()
175 | root
176 | |-- Name: string (nullable = true)
177 | |-- Institution Type: string (nullable = true)
178 | |-- Format: string (nullable = true)
179 | |-- URL: string (nullable = true)
180 | |-- Description: string (nullable = true)
181 | |-- Informal Score: integer (nullable = true)
182 | ```
183 |
184 | We want to update one thing though: make the score a `DecimalType` instead of an `IntegerType`. (We could also make `Name` required by setting its nullable flag to `False`, but the schema below leaves it nullable.) So we first define our own Schema, importing the data types we need:
185 |
186 | ```py
187 | > from pyspark.sql.types import *
188 | > customSchema = StructType([
189 | StructField("Name", StringType(), True),
190 | StructField("Institution Type", StringType(), True),
191 | StructField("Format", StringType(), True),
192 | StructField("URL", StringType(), True),
193 | StructField("Description", StringType(), True),
194 | StructField("Informal Score", DecimalType(), True)])
195 | > sampleDf = spark.read.csv("code4lib2018/sample-data/small-sample-stripped.csv", header=True, schema=customSchema)
196 | > sampleDf.show()
197 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
198 | | Name|Institution Type| Format| URL| Description|Informal Score|
199 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
200 | | CMOA| Museum| JSON, CSV|https://github.co...| null| 10|
201 | | Penn Museum| Museum|JSON, CSV, XML|https://www.penn....|JSON is poorly st...| 7|
202 | | Met Museum| Museum| CSV| null| ¯_(ツ)_/¯| 3|
203 | |DigitalNZ Te Puna...| Library| XML|https://natlib.go...| MARC XML| 3|
204 | |Canadian Subject ...| Library| RDF/XML|http://www.collec...| Ugh, rdf| 4|
205 | | DPLA| Aggregator | CSV,JSON,XML| dp.la| null| 100|
206 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
207 | ```
208 |
209 | At this point, we've saved our DataFrame instance of the CSV as `sampleDf`.
210 |
211 |
212 | ## Part 2: Simple Analysis of our Dataset via Dataframe API
213 |
214 | Here are a few ways to explore our DataFrame instance:
215 |
216 | ### View Subsets of DataFrame Records
217 |
218 | ```
219 | > sampleDf.head(2)
220 | [Row(Name=u'CMOA', Institution Type=u'Museum', Format=u'JSON, CSV', URL=u'https://github.com/cmoa/collection', Description=None, Informal Score=Decimal('10')), Row(Name=u'Penn Museum', Institution Type=u'Museum', Format=u'JSON, CSV, XML', URL=u'https://www.penn.museum/collections/data.php', Description=u'JSON is poorly structured', Informal Score=Decimal('7'))]
221 | ```
222 |
223 | ### View Number & List of Columns in a DataFrame
224 |
225 | ```
226 | > sampleDf.columns
227 | ['Name', 'Institution Type', 'Format', 'URL', 'Description', 'Informal Score']
228 | > len(sampleDf.columns)
229 | 6
230 | ```
231 |
232 | ### Show Specific Columns
233 |
234 | `.select(colName)` lets you select one or more columns in a new DataFrame, which you can then `show()` (or process otherwise):
235 |
236 | ```
237 | > sampleDf.select("Name").show()
238 | +--------------------+
239 | | Name|
240 | +--------------------+
241 | | CMOA|
242 | | Penn Museum|
243 | | Met Museum|
244 | |DigitalNZ Te Puna...|
245 | |Canadian Subject ...|
246 | | DPLA|
247 | +--------------------+
248 | > sampleDf.select("Name", "Informal Score").show()
249 | +--------------------+--------------+
250 | | Name|Informal Score|
251 | +--------------------+--------------+
252 | | CMOA| 10|
253 | | Penn Museum| 7|
254 | | Met Museum| 3|
255 | |DigitalNZ Te Puna...| 3|
256 | |Canadian Subject ...| 4|
257 | | DPLA| 100|
258 | +--------------------+--------------+
259 | ```
260 |
261 | ### Column Filter versus Select
262 |
263 | `select` lets you get a subset of columns in a new DataFrame. You can also pass an expression (like `sampleDf['Informal Score'] > 5`) to `select`, and it will return a column containing the result of that expression for each row.
264 | 
265 | `filter`, however, takes a conditional statement and returns a new DataFrame containing only the rows for which that condition evaluates to `True`.
266 |
267 | ```
268 | > scoresBooleanDf = sampleDf.select('Name', 'URL', sampleDf['Informal Score'] > 5)
269 | > scoresBooleanDf.show()
270 | +--------------------+--------------------+--------------------+
271 | | Name| URL|(Informal Score > 5)|
272 | +--------------------+--------------------+--------------------+
273 | | CMOA|https://github.co...| true|
274 | | Penn Museum|https://www.penn....| true|
275 | | Met Museum| null| false|
276 | |DigitalNZ Te Puna...|https://natlib.go...| false|
277 | |Canadian Subject ...|http://www.collec...| false|
278 | | DPLA| dp.la| true|
279 | +--------------------+--------------------+--------------------+
280 | > highScoresDf = sampleDf.filter(sampleDf['Informal Score'] > 5)
281 | > highScoresDf.select('Name', 'URL', 'Informal Score').show()
282 | +-----------+--------------------+--------------+
283 | | Name| URL|Informal Score|
284 | +-----------+--------------------+--------------+
285 | | CMOA|https://github.co...| 10|
286 | |Penn Museum|https://www.penn....| 7|
287 | | DPLA| dp.la| 100|
288 | +-----------+--------------------+--------------+
289 | ```
290 |
291 | ### Number of Rows in a DataFrame
292 |
293 | `count()` returns the number of records in a DataFrame.
294 |
295 | ```
296 | > numSampleDf = sampleDf.count()
297 | > numHighScoresDf = highScoresDf.count()
298 | > print("Percentage of Datasets with a Score at or above 5: " + str(float(numHighScoresDf)/float(numSampleDf)*100) + "%")
299 | Percentage of Datasets with a Score at or above 5: 50.0%
300 | ```
301 |
302 | ### Out of the Box Stats on your Dataframe
303 |
304 | `.describe()` lets you get out of the box (so to speak) statistics on your DataFrame. You can run it on your entire DataFrame, or on selected subsets:
305 |
306 | ```
307 | > sampleDf.describe().show()
308 | +-------+-----------+----------------+------+--------------------+--------------------+-----------------+
309 | |summary| Name|Institution Type|Format| URL| Description| Informal Score|
310 | +-------+-----------+----------------+------+--------------------+--------------------+-----------------+
311 | | count| 6| 6| 6| 5| 4| 6|
312 | | mean| null| null| null| null| null| 21.1667|
313 | | stddev| null| null| null| null| null|38.71649088782022|
314 | | min| CMOA| Aggregator | CSV| dp.la|JSON is poorly st...| 3|
315 | | max|Penn Museum| Museum| XML|https://www.penn....| ¯_(ツ)_/¯| 100|
316 | +-------+-----------+----------------+------+--------------------+--------------------+-----------------+
317 |
318 | > sampleDf.describe('Informal Score').show()
319 | +-------+-----------------+
320 | |summary| Informal Score|
321 | +-------+-----------------+
322 | | count| 6|
323 | | mean| 21.1667|
324 | | stddev|38.71649088782022|
325 | | min| 3|
326 | | max| 100|
327 | +-------+-----------------+
328 | ```
329 |
330 | ### Order Responses
331 |
332 | `orderBy()` lets you order your DataFrame according to a given column. It defaults to ascending order, and `.desc()` applied to the column used for ordering can reverse that:
333 |
334 | ```
335 | > sampleDf.orderBy("Informal Score").show()
336 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
337 | | Name|Institution Type| Format| URL| Description|Informal Score|
338 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
339 | |DigitalNZ Te Puna...| Library| XML|https://natlib.go...| MARC XML| 3|
340 | | Met Museum| Museum| CSV| null| ¯_(ツ)_/¯| 3|
341 | |Canadian Subject ...| Library| RDF/XML|http://www.collec...| Ugh, rdf| 4|
342 | | Penn Museum| Museum|JSON, CSV, XML|https://www.penn....|JSON is poorly st...| 7|
343 | | CMOA| Museum| JSON, CSV|https://github.co...| null| 10|
344 | | DPLA| Aggregator | CSV,JSON,XML| dp.la| null| 100|
345 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
346 | > sampleDf.orderBy(sampleDf["Informal Score"].desc()).show()
347 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
348 | | Name|Institution Type| Format| URL| Description|Informal Score|
349 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
350 | | DPLA| Aggregator | CSV,JSON,XML| dp.la| null| 100|
351 | | CMOA| Museum| JSON, CSV|https://github.co...| null| 10|
352 | | Penn Museum| Museum|JSON, CSV, XML|https://www.penn....|JSON is poorly st...| 7|
353 | |Canadian Subject ...| Library| RDF/XML|http://www.collec...| Ugh, rdf| 4|
354 | | Met Museum| Museum| CSV| null| ¯_(ツ)_/¯| 3|
355 | |DigitalNZ Te Puna...| Library| XML|https://natlib.go...| MARC XML| 3|
356 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+
357 | ```
358 |
359 | ### Group Responses
360 |
361 | As in SQL, you can group your DataFrame as well. Here, we make a new DataFrame from `sampleDf` that groups rows by `Institution Type` and shows the `count()` for each group:
362 |
363 | ```
364 | > URLGroupDf = sampleDf.groupBy("Institution Type").count()
365 | > URLGroupDf.show()
366 | +----------------+-----+
367 | |Institution Type|count|
368 | +----------------+-----+
369 | | Library| 2|
370 | | Aggregator | 1|
371 | | Museum| 3|
372 | +----------------+-----+
373 | > URLGroupDf.printSchema()
374 | root
375 | |-- Institution Type: string (nullable = true)
376 | |-- count: long (nullable = false)
377 | ```
378 |
379 | ### Group Values then Compute an Aggregate
380 |
381 | `count` is already a type of aggregate function. However, we can use `agg` to apply other aggregate functions (such as `mean`) to the grouped data:
382 |
383 | ```
384 | > formatGroupDf = sampleDf.groupBy("Institution Type").agg({"Informal Score": 'mean'})
385 | > formatGroupDf.show()
386 | +----------------+-------------------+
387 | |Institution Type|avg(Informal Score)|
388 | +----------------+-------------------+
389 | | Library| 3.5000|
390 | | Aggregator | 100.0000|
391 | | Museum| 6.6667|
392 | +----------------+-------------------+
393 | > formatGroupDf.printSchema()
394 | root
395 | |-- Institution Type: string (nullable = true)
396 | |-- avg(Informal Score): decimal(14,4) (nullable = true)
397 | ```
398 |
399 | ### Get Distinct Values in a Column of our DataFrame
400 |
401 | `distinct()` returns only the unique values in a Column of a DataFrame. Note: the distinct values returned here are messy (multi-value cells), which we will deal with in the next part.
402 |
403 | ```
404 | > distinctFormatDf = sampleDf.select("Format").distinct()
405 | > distinctFormatDf.show()
406 | +--------------+
407 | | Format|
408 | +--------------+
409 | | CSV|
410 | | XML|
411 | | JSON, CSV|
412 | |JSON, CSV, XML|
413 | | RDF/XML|
414 | | CSV,JSON,XML|
415 | +--------------+
416 | > distinctFormatDf.count()
417 | 6
418 | ```
419 |
420 | ### Combining Aggregates, SQL Functions, & Distinct Values
421 |
422 | ```
423 | > from pyspark.sql.functions import countDistinct
423 | > countDistinctDF = sampleDf.select("Name", "Institution Type", "Format").groupBy("Institution Type").agg(countDistinct("Format"))
424 | > countDistinctDF.show()
425 | +----------------+----------------------+
426 | |Institution Type|count(DISTINCT Format)|
427 | +----------------+----------------------+
428 | | Library| 2|
429 | | Aggregator | 1|
430 | | Museum| 3|
431 | +----------------+----------------------+
432 | ```
433 |
434 | ## Part 3: Creating New Dataframes to Expand or Rework our Original Dataset
435 |
436 | Now we want to focus on deriving new DataFrames from our original `sampleDf`.
437 |
438 | ### Create a new DataFrame without Duplicate Values
439 |
440 | `dropDuplicates()` will remove Rows with duplicate values in the selected Column(s).
441 |
442 | ```
443 | > scoresDf = sampleDf.select('Informal Score').dropDuplicates()
444 | > scoresDf.show()
445 | +--------------+
446 | |Informal Score|
447 | +--------------+
448 | | 7|
449 | | 10|
450 | | 100|
451 | | 3|
452 | | 4|
453 | +--------------+
454 | ```
455 |
456 | ### DataFrame with Dropped Null Values
457 |
458 | ```
459 | > noNullDf = sampleDf.select('Name', 'URL').dropna()
460 | > noNullDf.show()
461 | +--------------------+--------------------+
462 | | Name| URL|
463 | +--------------------+--------------------+
464 | | CMOA|https://github.co...|
465 | | Penn Museum|https://www.penn....|
466 | |DigitalNZ Te Puna...|https://natlib.go...|
467 | |Canadian Subject ...|http://www.collec...|
468 | | DPLA| dp.la|
469 | +--------------------+--------------------+
470 | ```
471 |
472 | ### DataFrame with Filled Null Values
473 |
474 | ```
475 | > nonNullDf = sampleDf.select('Name', 'URL').fillna('No URL')
476 | > nonNullDf.show()
477 | +--------------------+--------------------+
478 | | Name| URL|
479 | +--------------------+--------------------+
480 | | CMOA|https://github.co...|
481 | | Penn Museum|https://www.penn....|
482 | | Met Museum| No URL|
483 | |DigitalNZ Te Puna...|https://natlib.go...|
484 | |Canadian Subject ...|http://www.collec...|
485 | | DPLA| dp.la|
486 | +--------------------+--------------------+
487 | ```
488 |
489 | ### DataFrame with Rows where Given Column *is* Null
490 |
491 | ```
492 | > filterNonNullDF = sampleDf.filter(sampleDf.URL.isNull()).sort("Name")
493 | > filterNonNullDF.show()
494 | +----------+----------------+------+----+-----------+--------------+
495 | | Name|Institution Type|Format| URL|Description|Informal Score|
496 | +----------+----------------+------+----+-----------+--------------+
497 | |Met Museum| Museum| CSV|null| ¯_(ツ)_/¯| 3|
498 | +----------+----------------+------+----+-----------+--------------+
499 | ```
500 |
501 | ### Handling `Format` (Multivalue fields)
502 |
503 | #### Split `Format` into Arrays of Values
504 |
505 | ```
506 | > from pyspark.sql.functions import split
507 | > sampleDf = sampleDf.withColumn("Format Array", split(sampleDf.Format, ","))
508 | > sampleDf.show()
509 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+------------------+
510 | | Name|Institution Type| Format| URL| Description|Informal Score| Format Array|
511 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+------------------+
512 | | CMOA| Museum| JSON, CSV|https://github.co...| null| 10| [JSON, CSV]|
513 | | Penn Museum| Museum|JSON, CSV, XML|https://www.penn....|JSON is poorly st...| 7|[JSON, CSV, XML]|
514 | | Met Museum| Museum| CSV| null| ¯_(ツ)_/¯| 3| [CSV]|
515 | |DigitalNZ Te Puna...| Library| XML|https://natlib.go...| MARC XML| 3| [XML]|
516 | |Canadian Subject ...| Library| RDF/XML|http://www.collec...| Ugh, rdf| 4| [RDF/XML]|
517 | | DPLA| Aggregator | CSV,JSON,XML| dp.la| null| 100| [CSV, JSON, XML]|
518 | +--------------------+----------------+--------------+--------------------+--------------------+--------------+------------------+
519 | > sampleDf.printSchema()
520 | root
521 | |-- Name: string (nullable = true)
522 | |-- Institution Type: string (nullable = true)
523 | |-- Format: string (nullable = true)
524 | |-- URL: string (nullable = true)
525 | |-- Description: string (nullable = true)
526 | |-- Informal Score: decimal(10,0) (nullable = true)
527 | |-- Format Array: array (nullable = true)
528 | | |-- element: string (containsNull = true)
529 | ```
530 |
531 | #### Explode `Format` into Multiple Rows of Values
532 |
533 | ```
534 | > from pyspark.sql.functions import explode
535 | > sampleDf = sampleDf.withColumn("Format", explode(split("Format", ",")))
536 | > sampleDf.show()
537 | +--------------------+----------------+-------+--------------------+--------------------+--------------+------------------+
538 | | Name|Institution Type| Format| URL| Description|Informal Score| Format Array|
539 | +--------------------+----------------+-------+--------------------+--------------------+--------------+------------------+
540 | | CMOA| Museum| JSON|https://github.co...| null| 10| [JSON, CSV]|
541 | | CMOA| Museum| CSV|https://github.co...| null| 10| [JSON, CSV]|
542 | | Penn Museum| Museum| JSON|https://www.penn....|JSON is poorly st...| 7|[JSON, CSV, XML]|
543 | | Penn Museum| Museum| CSV|https://www.penn....|JSON is poorly st...| 7|[JSON, CSV, XML]|
544 | | Penn Museum| Museum| XML|https://www.penn....|JSON is poorly st...| 7|[JSON, CSV, XML]|
545 | | Met Museum| Museum| CSV| null| ¯_(ツ)_/¯| 3| [CSV]|
546 | |DigitalNZ Te Puna...| Library| XML|https://natlib.go...| MARC XML| 3| [XML]|
547 | |Canadian Subject ...| Library|RDF/XML|http://www.collec...| Ugh, rdf| 4| [RDF/XML]|
548 | | DPLA| Aggregator | CSV| dp.la| null| 100| [CSV, JSON, XML]|
549 | | DPLA| Aggregator | JSON| dp.la| null| 100| [CSV, JSON, XML]|
550 | | DPLA| Aggregator | XML| dp.la| null| 100| [CSV, JSON, XML]|
551 | +--------------------+----------------+-------+--------------------+--------------------+--------------+------------------+
552 | ```
553 |
554 | #### Create New Columns With Booleans if Format among Format Values
555 |
556 | ```
557 | > sampleDf.select("Format").distinct().show()
558 | +-------+
559 | | Format|
560 | +-------+
561 | | CSV|
562 | | XML|
563 | | CSV|
564 | | XML|
565 | |RDF/XML|
566 | | JSON|
567 | +-------+
568 | > sampleDf = sampleDf.withColumn('CSV', sampleDf.Format.like("%CSV%"))
569 | > sampleDf = sampleDf.withColumn('XML', sampleDf.Format.like("%XML%"))
570 | > sampleDf = sampleDf.withColumn('RDF', sampleDf.Format.like("%RDF%"))
571 | > sampleDf = sampleDf.withColumn('JSON', sampleDf.Format.like("%JSON%"))
572 | > sampleDf.show()
573 | +--------------------+----------------+-------+--------------------+--------------------+--------------+------------------+-----+-----+-----+-----+
574 | | Name|Institution Type| Format| URL| Description|Informal Score| Format Array| CSV| XML| RDF| JSON|
575 | +--------------------+----------------+-------+--------------------+--------------------+--------------+------------------+-----+-----+-----+-----+
576 | | CMOA| Museum| JSON|https://github.co...| null| 10| [JSON, CSV]|false|false|false| true|
577 | | CMOA| Museum| CSV|https://github.co...| null| 10| [JSON, CSV]| true|false|false|false|
578 | | Penn Museum| Museum| JSON|https://www.penn....|JSON is poorly st...| 7|[JSON, CSV, XML]|false|false|false| true|
579 | | Penn Museum| Museum| CSV|https://www.penn....|JSON is poorly st...| 7|[JSON, CSV, XML]| true|false|false|false|
580 | | Penn Museum| Museum| XML|https://www.penn....|JSON is poorly st...| 7|[JSON, CSV, XML]|false| true|false|false|
581 | | Met Museum| Museum| CSV| null| ¯_(ツ)_/¯| 3| [CSV]| true|false|false|false|
582 | |DigitalNZ Te Puna...| Library| XML|https://natlib.go...| MARC XML| 3| [XML]|false| true|false|false|
583 | |Canadian Subject ...| Library|RDF/XML|http://www.collec...| Ugh, rdf| 4| [RDF/XML]|false| true| true|false|
584 | | DPLA| Aggregator | CSV| dp.la| null| 100| [CSV, JSON, XML]| true|false|false|false|
585 | | DPLA| Aggregator | JSON| dp.la| null| 100| [CSV, JSON, XML]|false|false|false| true|
586 | | DPLA| Aggregator | XML| dp.la| null| 100| [CSV, JSON, XML]|false| true|false|false|
587 | +--------------------+----------------+-------+--------------------+--------------------+--------------+------------------+-----+-----+-----+-----+
588 | ```
589 |
590 | #### Transform DataFrame to RDD & Use `map`
591 |
592 | ```
593 | > sampleRdd = sampleDf.select("Format").rdd.map(lambda x: x[0].split(","))
594 | > sampleRdd.take(5)
595 | [[u'JSON'], [u' CSV'], [u'JSON'], [u' CSV'], [u' XML']]
596 | ```
597 |
598 | ### Drop a Column
599 |
600 | ```
601 | > sampleDf = sampleDf.drop('Format Array')
602 | > sampleDf.show()
603 | +--------------------+----------------+-------+--------------------+--------------------+--------------+-----+-----+-----+-----+
604 | | Name|Institution Type| Format| URL| Description|Informal Score| CSV| XML| RDF| JSON|
605 | +--------------------+----------------+-------+--------------------+--------------------+--------------+-----+-----+-----+-----+
606 | | CMOA| Museum| JSON|https://github.co...| null| 10|false|false|false| true|
607 | | CMOA| Museum| CSV|https://github.co...| null| 10| true|false|false|false|
608 | | Penn Museum| Museum| JSON|https://www.penn....|JSON is poorly st...| 7|false|false|false| true|
609 | | Penn Museum| Museum| CSV|https://www.penn....|JSON is poorly st...| 7| true|false|false|false|
610 | | Penn Museum| Museum| XML|https://www.penn....|JSON is poorly st...| 7|false| true|false|false|
611 | | Met Museum| Museum| CSV| null| ¯_(ツ)_/¯| 3| true|false|false|false|
612 | |DigitalNZ Te Puna...| Library| XML|https://natlib.go...| MARC XML| 3|false| true|false|false|
613 | |Canadian Subject ...| Library|RDF/XML|http://www.collec...| Ugh, rdf| 4|false| true| true|false|
614 | | DPLA| Aggregator | CSV| dp.la| null| 100| true|false|false|false|
615 | | DPLA| Aggregator | JSON| dp.la| null| 100|false|false|false| true|
616 | | DPLA| Aggregator | XML| dp.la| null| 100|false| true|false|false|
617 | +--------------------+----------------+-------+--------------------+--------------------+--------------+-----+-----+-----+-----+
618 | ```
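
In Spark 2.x, `drop()` also accepts several column names at once, so a group of columns can be removed in a single call. A sketch (only run this if you actually want to discard the boolean flags created earlier):

```
> # drop all four boolean columns in one call
> slimDf = sampleDf.drop('CSV', 'XML', 'RDF', 'JSON')
```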
619 |
620 | ### Remove Whitespace Around Values in a Column
621 |
622 | ```
623 | > from pyspark.sql.functions import regexp_replace
624 | > sampleDf = sampleDf.withColumn("Format", regexp_replace(sampleDf.Format, r"\s+", ""))
625 | > sampleDf.select("Format").distinct().show()
626 | +-------+
627 | | Format|
628 | +-------+
629 | | CSV|
630 | | XML|
631 | |RDF/XML|
632 | | JSON|
633 | +-------+
634 | ```
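
If you only want to strip leading and trailing whitespace (rather than removing every whitespace character), the built-in `trim` function is an alternative that avoids the regular expression:

```
> from pyspark.sql.functions import trim
> # trims only leading/trailing whitespace from each Format value
> sampleDf = sampleDf.withColumn("Format", trim(sampleDf.Format))
```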
635 |
636 | ### Create a Derivative DataFrame Based on a Conditional Using `where()`
637 |
638 | ```
639 | > CSVsampleDf = sampleDf.where((sampleDf.Format == "CSV"))
640 | > CSVsampleDf.show()
641 | +-----------+----------------+------+--------------------+--------------------+--------------+----+-----+-----+-----+
642 | | Name|Institution Type|Format| URL| Description|Informal Score| CSV| XML| RDF| JSON|
643 | +-----------+----------------+------+--------------------+--------------------+--------------+----+-----+-----+-----+
644 | | CMOA| Museum| CSV|https://github.co...| null| 10|true|false|false|false|
645 | |Penn Museum| Museum| CSV|https://www.penn....|JSON is poorly st...| 7|true|false|false|false|
646 | | Met Museum| Museum| CSV| null| ¯_(ツ)_/¯| 3|true|false|false|false|
647 | | DPLA| Aggregator | CSV| dp.la| null| 100|true|false|false|false|
648 | +-----------+----------------+------+--------------------+--------------------+--------------+----+-----+-----+-----+
649 | ```
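
`where()` conditions can also be combined with `&` (and) and `|` (or). Note the parentheses around each comparison, and the bracket syntax for column names that contain spaces; a sketch using the columns above (output not shown):

```
> # keep only CSV sources with an Informal Score of 7 or higher
> highScoreCsvDf = sampleDf.where((sampleDf.Format == "CSV") & (sampleDf["Informal Score"] >= 7))
> highScoreCsvDf.show()
```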
650 |
651 | ## Part 4: Using SQL to Analyze our Dataset
652 |
653 | To have a more dynamic experience, let's create a temporary (in-memory) view that we can run SQL queries against and then explore the results in table or graph form.
654 |
655 | Note that the temporary view will reside in memory as long as the Spark session is alive. [Here](http://cse.unl.edu/~sscott/ShowFiles/SQL/CheatSheet/SQLCheatSheet.html) is a SQL Cheatsheet in case you need it.
656 |
657 | ### Create Temporary SQL View from a DataFrame
658 |
659 | ```
660 | > sampleDf.createOrReplaceTempView("sampleDataView")
661 | ```
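
You can confirm the view was registered, and remove it once you are finished, through the Spark catalog. A small sketch of two calls available in Spark 2.x (hold off on the drop until you are done with the queries below):

```
> spark.catalog.listTables()
> # later, when you no longer need the view:
> spark.catalog.dropTempView("sampleDataView")
```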
662 |
663 | ### Write & Run a Spark SQL Query
664 |
665 | This runs a Spark SQL query against the temporary view, then uses `collect()` to return the Rows that match the query:
666 |
667 | ```
668 | > sparkQuery = spark.sql("SELECT * FROM sampleDataView LIMIT 20")
669 | > sparkQuery.collect()
670 | [Row(Name=u'CMOA', Institution Type=u'Museum', Format=u'JSON', URL=u'https://github.com/cmoa/collection', Description=None, Informal Score=Decimal('10'), CSV=False, XML=False, RDF=False, JSON=True), Row(Name=u'CMOA', Institution Type=u'Museum', Format=u'CSV', URL=u'https://github.com/cmoa/collection', Description=None, Informal Score=Decimal('10'), CSV=True, XML=False, RDF=False, JSON=False), Row(Name=u'Penn Museum', Institution Type=u'Museum', Format=u'JSON', URL=u'https://www.penn.museum/collections/data.php', Description=u'JSON is poorly structured', Informal Score=Decimal('7'), CSV=False, XML=False, RDF=False, JSON=True), Row(Name=u'Penn Museum', Institution Type=u'Museum', Format=u'CSV', URL=u'https://www.penn.museum/collections/data.php', Description=u'JSON is poorly structured', Informal Score=Decimal('7'), CSV=True, XML=False, RDF=False, JSON=False), Row(Name=u'Penn Museum', Institution Type=u'Museum', Format=u'XML', URL=u'https://www.penn.museum/collections/data.php', Description=u'JSON is poorly structured', Informal Score=Decimal('7'), CSV=False, XML=True, RDF=False, JSON=False), Row(Name=u'Met Museum', Institution Type=u'Museum', Format=u'CSV', URL=None, Description=u'\xaf_(\u30c4)_/\xaf', Informal Score=Decimal('3'), CSV=True, XML=False, RDF=False, JSON=False), Row(Name=u'DigitalNZ Te Puna Web Directory', Institution Type=u'Library', Format=u'XML', URL=u'https://natlib.govt.nz/files/data/tepunawebdirectory.xml', Description=u'MARC XML', Informal Score=Decimal('3'), CSV=False, XML=True, RDF=False, JSON=False), Row(Name=u'Canadian Subject Headings', Institution Type=u'Library', Format=u'RDF/XML', URL=u'http://www.collectionscanada.gc.ca/obj/900/f11/040004/csh.rdf', Description=u'Ugh, rdf', Informal Score=Decimal('4'), CSV=False, XML=True, RDF=True, JSON=False), Row(Name=u'DPLA', Institution Type=u'Aggregator ', Format=u'CSV', URL=u'dp.la', Description=None, Informal Score=Decimal('100'), CSV=True, XML=False, RDF=False, JSON=False), Row(Name=u'DPLA', Institution Type=u'Aggregator ', Format=u'JSON', URL=u'dp.la', Description=None, Informal Score=Decimal('100'), CSV=False, XML=False, RDF=False, JSON=True), Row(Name=u'DPLA', Institution Type=u'Aggregator ', Format=u'XML', URL=u'dp.la', Description=None, Informal Score=Decimal('100'), CSV=False, XML=True, RDF=False, JSON=False)]
671 | ```
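
Each element returned by `collect()` is a `Row`, and its fields can be read by attribute or by key (brackets are needed when the field name contains a space). An illustrative sketch using the first row of the result above:

```
> firstRow = sparkQuery.collect()[0]
> firstRow.Name
u'CMOA'
> firstRow['Informal Score']
Decimal('10')
```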
672 |
673 | Here's another Spark SQL Example:
674 |
675 | ```
676 | > sparkQuery = spark.sql("""SELECT `Institution Type`, COUNT(DISTINCT(Format)) AS NumFormats FROM sampleDataView GROUP BY `Institution Type`""")
677 | > results = sparkQuery.collect()
678 | > for n in results:
679 | >     print(n)
680 | Row(Institution Type=u'Library', NumFormats=2)
681 | Row(Institution Type=u'Aggregator ', NumFormats=3)
682 | Row(Institution Type=u'Museum', NumFormats=3)
683 | ```
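
The same aggregation can also be expressed with the DataFrame API rather than SQL; a rough equivalent using `countDistinct`, with `show()` to get a table instead of raw Rows:

```
> from pyspark.sql.functions import countDistinct
> # count the distinct formats offered by each institution type
> sampleDf.groupBy("Institution Type").agg(countDistinct("Format").alias("NumFormats")).show()
```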
684 |
685 | ## Part 5: Simple Data Visualization
686 |
687 | Use the `%sql` interpreter in Zeppelin to run SQL queries against our previously defined temporary SQL view (just enter the view name where a table name normally goes in a SQL query).
688 |
689 | This gives us immediate access to the SQL-driven visualizations available in Zeppelin.
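
For example, a Zeppelin paragraph that begins with `%sql` can query the view directly, and the result can then be switched to a bar, pie, or area chart from the toolbar. A minimal sketch:

```
%sql
SELECT `Institution Type`, COUNT(*) AS NumSources
FROM sampleDataView
GROUP BY `Institution Type`
```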
690 |
691 |
692 | This is the end of this session. In the [next session, you'll work in small groups to apply some of what you learned above to real, current CHO (cultural heritage organization) data](working-with-cho-data.md).
693 |
694 |
695 |
696 |
697 | ----
698 |
--------------------------------------------------------------------------------