├── CODE_OF_CONDUCT.md
├── LICENSE
├── README.md
├── examples
│   ├── jupyter-keyboard-shortcuts.ipynb
│   ├── multi-lang-support-in-spark.ipynb
│   ├── delta-emr-serverless.ipynb
│   ├── dynamodb-with-spark.ipynb
│   ├── hudi-emr-serverless.ipynb
│   ├── word-count-with-pyspark.ipynb
│   ├── plot-graph-using-bokeh.ipynb
│   ├── iceberg-emr-serverless.ipynb
│   ├── word-count-emr-serverless.ipynb
│   ├── redshift-with-spark.ipynb
│   ├── jdbc-rdbms-with-spark.ipynb
│   ├── jupyter-magic-commands.ipynb
│   ├── documentdb-with-spark.ipynb
│   ├── Getting-started-emr-serverless.ipynb
│   ├── udf-with-spark-sql.ipynb
│   ├── machine-learning-with-pyspark-linear-regression.ipynb
│   ├── visualize-data-with-pandas-matplotlib.ipynb
│   ├── redshift-connect-from-spark-using-username-password.ipynb
│   ├── hive-presto-with-python.ipynb
│   ├── table-with-sql-hivecontext-presto.ipynb
│   ├── redshift-connect-from-spark-using-iam-role.ipynb
│   ├── query-hudi-dataset-with-spark-sql.ipynb
│   ├── table-with-hiveql-from-data-in-s3.ipynb
│   └── install-notebook-scoped-libraries-at-runtime.ipynb
└── CONTRIBUTING.md
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ## Code of Conduct
2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
5 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 |
3 | Permission is hereby granted, free of charge, to any person obtaining a copy of
4 | this software and associated documentation files (the "Software"), to deal in
5 | the Software without restriction, including without limitation the rights to
6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
7 | the Software, and to permit persons to whom the Software is furnished to do so.
8 |
9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15 |
16 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # EMR Studio Notebook Examples
2 |
3 | This repository contains ready-to-use notebook examples for a wide variety of use cases in [Amazon EMR Studio](https://aws.amazon.com/emr/features/studio/).
4 |
5 | For more information about using Amazon EMR Studio, see [Use EMR Studio](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-studio.html) in the *Amazon EMR Management Guide*.
6 |
7 | You can submit feedback and requests for changes by opening an issue in this repository or by making proposed changes and submitting a pull request.
8 | We welcome contributions to this repository in the form of fixes to existing examples or addition of new examples. For more information on contributing, please see the [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) guide.
9 |
10 | ## Table of Contents
11 | 1. [Security](#security)
12 | 2. [License](#license)
13 | 3. [FAQ](#faq)
14 |
15 | ## Security
16 |
17 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
18 |
19 | ## License
20 |
21 | This library is licensed under the MIT-0 License. See the LICENSE file.
22 |
23 | ## FAQ
24 |
25 | *Will these examples work outside of Amazon EMR Studio?*
26 |
27 | - Most of the examples should work outside of EMR Studio with little or no modification.
28 |
29 | *How do I contribute my own example notebook?*
30 |
31 | - Although we're extremely excited to receive contributions from the community, we're still working on the best mechanism to take in examples from external sources. Please bear with us in the short-term if pull requests take longer than expected or are closed.
--------------------------------------------------------------------------------
/examples/jupyter-keyboard-shortcuts.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "c1d6be25",
6 | "metadata": {},
7 | "source": [
8 | "# Jupyter Keyboard Shortcuts\n",
9 | "\n",
10 | "This notebook covers common keyboard shortcuts that you can use to increase your productivity when working with Jupyter Notebooks.\n",
11 | "\n",
12 | "
\n",
13 | "NOTE : If you are on a Mac, substitute \"command\" for \"control\". Dont type the _+_ (it means press both keys at once).
\n",
14 | "\n",
15 | "***\n",
16 | "\n",
17 | "\n",
18 | "Shortcuts when in either _command mode_ (outside the cells) or _edit mode_ (inside a cell):\n",
19 | "---\n",
20 | "- `Shift` + `Enter` run selected cell or cells - if no cells below, insert a code cell below\n",
21 | "\n",
22 | "- `Ctrl` + `B` toggle hide/show left sidebar\n",
23 | "\n",
24 | "- `Ctrl` + `S` save and checkpoint\n",
25 | "- `Ctrl` + `Shift` + `S` save as\n",
26 | "- `Ctrl` + `F` find \n",
27 | "\n",
28 | "***\n",
29 | "\n",
30 | "\n",
31 | "Shortcuts when in _command mode_ (outside the cells, no blinking cursor):\n",
32 | "---\n",
33 | "- `Enter` enter _edit mode_ in the active cell\n",
34 | "\n",
35 | "- Scroll up with the up arrow \n",
36 | "- Scroll down with the down arrow\n",
37 | "\n",
38 | "- `A` insert a new cell above the active cell\n",
39 | "- `B` insert a new cell below the active cell\n",
40 | "\n",
41 | "- `M` make the active cell a Markdown cell\n",
42 | "- `Y` make the active cell a code cell\n",
43 | "\n",
44 | "- `Shift` + `Up Arrow` select the current cell and the cell above\n",
45 | "- `Shift` + `Down Arrow` select the current cell and the cell below\n",
46 | "- `Ctrl` + `A` select all cells\n",
47 | "\n",
48 | "- `X` cut the selected cell or cells\n",
49 | "- `C` copy the selected cell or cells\n",
50 | "- `V` paste the cell(s) which were copied or cut most recently\n",
51 | "\n",
52 | "- `Shift + M` merge multiple selected cells into one cell\n",
53 | "\n",
54 | "- `DD` (`D` twice) delete the active cell\n",
55 | "- `00` (Zero twice) restart the kernel\n",
56 | "\n",
57 | "- `Z` undo most recent command mode action\n",
58 | "\n",
59 | "Shortcuts when in _edit mode_ (inside a cell with a blinking cursor):\n",
60 | "---\n",
61 | "\n",
62 | "- `Esc` enter _command mode_\n",
63 | "\n",
64 | "- `Tab` code completion (or indent if at start of line)\n",
65 | "- `Shift` + `Tab` tooltip help\n",
66 | "- `Ctrl` + `Shift` + `-` split the active cell at the cursor\n",
67 | "\n",
68 | "The usual commands for code editors:\n",
69 | "\n",
70 | "- `Ctrl` + `]` indent\n",
71 | "- `Ctrl` + `[` dedent\n",
72 | "\n",
73 | "- `Ctrl` + `/` toggle comment\n",
74 | "\n",
75 | "Plus the usual shortcuts for select all, cut, copy, paste, undo, etc.\n",
76 | "\n",
77 | "***\n"
78 | ]
79 | },
80 | {
81 | "cell_type": "code",
82 | "execution_count": null,
83 | "id": "88f53ece",
84 | "metadata": {},
85 | "outputs": [],
86 | "source": []
87 | }
88 | ],
89 | "metadata": {
90 | "kernelspec": {
91 | "display_name": "Python 3",
92 | "language": "python",
93 | "name": "python3"
94 | },
95 | "language_info": {
96 | "codemirror_mode": {
97 | "name": "ipython",
98 | "version": 3
99 | },
100 | "file_extension": ".py",
101 | "mimetype": "text/x-python",
102 | "name": "python",
103 | "nbconvert_exporter": "python",
104 | "pygments_lexer": "ipython3",
105 | "version": "3.7.4"
106 | }
107 | },
108 | "nbformat": 4,
109 | "nbformat_minor": 5
110 | }
111 |
--------------------------------------------------------------------------------
/examples/multi-lang-support-in-spark.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "a0177726",
6 | "metadata": {},
7 | "source": [
8 | "# Multi-Language Support in Spark Kernels\n",
9 | "\n",
10 | "Topics covered in this example:\n",
11 | "\n",
12 | "* Using multi-language (Python, Scala, R and SQL) from within Spark Notebooks.\n",
13 | "* Sharing data across language using temp tables/views.\n",
14 | "\n",
15 | "***\n",
16 | "\n",
17 | "## Prerequisites\n",
18 | "\n",
19 | "NOTE : In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.
\n",
20 | "\n",
21 | "* The EMR cluster attached to this notebook should have the `Spark` application installed.\n",
22 | "* The EMR cluster attached to this notebook should be version 6.4.0 or later.\n",
23 | "* This notebook uses the `PySpark` kernel.\n",
24 | "***\n",
25 | "\n",
26 | "## Introduction\n",
27 | "\n",
28 | "This example shows how to use multiple languages within Spark notebooks. You can mix and match Python, Scala, R and SQL from within Spark notebooks. Supported kernels are PySpark, Spark and SparkR kernels. \n",
29 | "\n",
30 | "***\n",
31 | "\n",
32 | "The `%%pyspark` cellmagic allows users to write pyspark code in all Spark kernels"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": null,
38 | "id": "831c3a58",
39 | "metadata": {},
40 | "outputs": [],
41 | "source": [
42 | "%%pyspark\n",
43 | "a = 1 "
44 | ]
45 | },
46 | {
47 | "cell_type": "markdown",
48 | "id": "ca9be37c",
49 | "metadata": {},
50 | "source": [
51 | "The `%%sql` cellmagic allows users to execute Spark-SQL code. Here I am querying the tables in the default database."
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": null,
57 | "id": "301ceb18",
58 | "metadata": {},
59 | "outputs": [],
60 | "source": [
61 | "%%sql\n",
62 | "SHOW TABLES "
63 | ]
64 | },
65 | {
66 | "cell_type": "markdown",
67 | "id": "0d6b263c",
68 | "metadata": {},
69 | "source": [
70 | "The `%%rspark` cell magic allows users to execute sparkr code. "
71 | ]
72 | },
73 | {
74 | "cell_type": "code",
75 | "execution_count": null,
76 | "id": "ddf003b1",
77 | "metadata": {},
78 | "outputs": [],
79 | "source": [
80 | "%%rspark\n",
81 | "a <- 1"
82 | ]
83 | },
84 | {
85 | "cell_type": "markdown",
86 | "id": "0cd9d9e1",
87 | "metadata": {},
88 | "source": [
89 | "The `%%scalaspark` cell magic allows users to execute spark scala code. Note that here I am reading data from the temp table previously creating using Python."
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": null,
95 | "id": "faf303a4",
96 | "metadata": {},
97 | "outputs": [],
98 | "source": [
99 | "%%scalaspark\n",
100 | "val a = 1"
101 | ]
102 | },
103 | {
104 | "cell_type": "markdown",
105 | "id": "2f8e1dcf",
106 | "metadata": {},
107 | "source": [
108 | "### Sharing Data using temp tables/views.\n",
109 | "\n",
110 | "You can share data between languages using temp tables. Lets create a temp table using python in Spark:"
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": null,
116 | "id": "9a81e181",
117 | "metadata": {},
118 | "outputs": [],
119 | "source": [
120 | "%%pyspark\n",
121 | "df=spark.sql(\"SELECT count(1) from nyc_top_trips_report LIMIT 20\")\n",
122 | "df.createOrReplaceTempView(\"nyc_top_trips_report_v\")"
123 | ]
124 | },
125 | {
126 | "cell_type": "markdown",
127 | "id": "154b5757",
128 | "metadata": {},
129 | "source": [
130 | "And now lets read the temp table using Scala:"
131 | ]
132 | },
133 | {
134 | "cell_type": "code",
135 | "execution_count": null,
136 | "id": "3e26a9b6",
137 | "metadata": {},
138 | "outputs": [],
139 | "source": [
140 | "%%scalaspark\n",
141 | "val df=spark.sql(\"SELECT * from nyc_top_trips_report_v\")\n",
142 | "df.show(5)"
143 | ]
144 | }
145 | ],
146 | "metadata": {
147 | "kernelspec": {
148 | "display_name": "PySpark",
149 | "language": "",
150 | "name": "pysparkkernel"
151 | },
152 | "language_info": {
153 | "codemirror_mode": {
154 | "name": "python",
155 | "version": 3
156 | },
157 | "mimetype": "text/x-python",
158 | "name": "pyspark",
159 | "pygments_lexer": "python3"
160 | }
161 | },
162 | "nbformat": 4,
163 | "nbformat_minor": 5
164 | }
165 |
--------------------------------------------------------------------------------
/examples/delta-emr-serverless.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "072f6bb8",
6 | "metadata": {},
7 | "source": [
8 | "# Linux Foundation Delta Lake example using EMR Serverless on EMR Studio"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "id": "2b32fb4d",
14 | "metadata": {},
15 | "source": [
16 | "#### Topics covered in this example\n",
17 | "\n",
18 | " Configure a Spark session \n",
19 | " Create a Delta lake table \n",
20 | " Query the table \n",
21 | " "
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "id": "cc083781",
27 | "metadata": {},
28 | "source": [
29 | "***\n",
30 | "\n",
31 | "## Prerequisites\n",
32 | "\n",
33 | "NOTE : In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.
\n",
34 | "\n",
35 | "* EMR Serverless should be chosen as the Compute. The Application version should be 6.14 or higher.\n",
36 | "* Make sure the Studio user role has permission to attach the Workspace to the Application and to pass the runtime role to it.\n",
37 | "* This notebook uses the `PySpark` kernel.\n",
38 | "***"
39 | ]
40 | },
41 | {
42 | "cell_type": "markdown",
43 | "id": "aad41746",
44 | "metadata": {},
45 | "source": [
46 | "## 1. Configure your Spark session.\n",
47 | "Configure the Spark Session. Set up Spark SQL extensions to use Delta lake. "
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": null,
53 | "id": "d9a9cc4d",
54 | "metadata": {
55 | "tags": []
56 | },
57 | "outputs": [],
58 | "source": [
59 | "%%configure -f\n",
60 | "{\n",
61 | " \"conf\": {\n",
62 | " \"spark.sql.extensions\" : \"io.delta.sql.DeltaSparkSessionExtension\",\n",
63 | " \"spark.sql.catalog.spark_catalog\": \"org.apache.spark.sql.delta.catalog.DeltaCatalog\",\n",
64 | " \"spark.jars\": \"/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar,/usr/share/aws/delta/lib/delta-storage-s3-dynamodb.jar\",\n",
65 | " \"spark.hadoop.hive.metastore.client.factory.class\": \"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory\"\n",
66 | " }\n",
67 | "}"
68 | ]
69 | },
70 | {
71 | "cell_type": "markdown",
72 | "id": "30ff484d",
73 | "metadata": {},
74 | "source": [
75 | "---\n",
76 | "## 2. Create a Delta lake Table\n",
77 | "We will create a Spark Dataframe with sample data and write this into a Delta lake table. "
78 | ]
79 | },
80 | {
81 | "cell_type": "markdown",
82 | "id": "b27a97e9",
83 | "metadata": {},
84 | "source": [
85 | "\n",
86 | " NOTE : You will need to update my_bucket in the Spark SQL statement below to your own bucket. Please make sure you have read and write permissions for this bucket.
"
87 | ]
88 | },
89 | {
90 | "cell_type": "code",
91 | "execution_count": null,
92 | "id": "7ae499d1",
93 | "metadata": {
94 | "tags": []
95 | },
96 | "outputs": [],
97 | "source": [
98 | "tableName = \"delta_table\"\n",
99 | "basePath = \"s3://my_bucket/aws_workshop/delta_data_location/\" + tableName\n"
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "execution_count": null,
105 | "id": "4212f6be",
106 | "metadata": {
107 | "tags": []
108 | },
109 | "outputs": [],
110 | "source": [
111 | "data = spark.createDataFrame([\n",
112 | " (\"100\", \"2015-01-01\", \"2015-01-01T13:51:39.340396Z\"),\n",
113 | " (\"101\", \"2015-01-01\", \"2015-01-01T12:14:58.597216Z\"),\n",
114 | " (\"102\", \"2015-01-01\", \"2015-01-01T13:51:40.417052Z\"),\n",
115 | " (\"103\", \"2015-01-01\", \"2015-01-01T13:51:40.519832Z\")\n",
116 | "],[\"id\", \"creation_date\", \"last_update_time\"])"
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "execution_count": null,
122 | "id": "abf2dbc6",
123 | "metadata": {
124 | "tags": []
125 | },
126 | "outputs": [],
127 | "source": [
128 | "data.write.format(\"delta\"). \\\n",
129 | " save(basePath)\n"
130 | ]
131 | },
132 | {
133 | "cell_type": "markdown",
134 | "id": "a73d4a2b",
135 | "metadata": {},
136 | "source": [
137 | "---\n",
138 | "## 3. Query the table\n",
139 | "We will read the table using spark.read into a Spark dataframe"
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": null,
145 | "id": "7972630e",
146 | "metadata": {},
147 | "outputs": [],
148 | "source": [
149 | "df = spark.read.format(\"delta\").load(basePath)\n",
150 | "df.show()"
151 | ]
152 | },
153 | {
154 | "cell_type": "markdown",
155 | "id": "3dbfb8c8",
156 | "metadata": {},
157 | "source": [
158 | "### You have made it to the end of this notebook!!"
159 | ]
160 | }
161 | ],
162 | "metadata": {
163 | "kernelspec": {
164 | "display_name": "PySpark",
165 | "language": "python",
166 | "name": "spark_magic_pyspark"
167 | },
168 | "language_info": {
169 | "codemirror_mode": {
170 | "name": "python",
171 | "version": 3
172 | },
173 | "file_extension": ".py",
174 | "mimetype": "text/x-python",
175 | "name": "pyspark",
176 | "pygments_lexer": "python3"
177 | }
178 | },
179 | "nbformat": 4,
180 | "nbformat_minor": 5
181 | }
182 |
--------------------------------------------------------------------------------
/examples/dynamodb-with-spark.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "c0c8c8f9",
6 | "metadata": {},
7 | "source": [
8 | "## Connect DynamoDB using spark connector from EMR Studio Notebook using Spark Scala\n",
9 | "\n",
10 | "#### Topics covered in this example\n",
11 | "\n",
12 | "* Configuring the EMR-DynamoDB Connector\n",
13 | "* Connecting to AWS DynamoDB using EMR-DynamoDB Connector to read data into Spark DF and get the count"
14 | ]
15 | },
16 | {
17 | "cell_type": "markdown",
18 | "id": "66cbe6a8",
19 | "metadata": {},
20 | "source": [
21 | "## Table of Contents:\n",
22 | "\n",
23 | "1. [Prerequisites](#Prerequisites)\n",
24 | "2. [Introduction](#Introduction)\n",
25 | "3. [Load the configuration in memory](#Load-the-configuration-in-memory)\n",
26 | "4. [Read data using Scala](#Read-data-using-Scala)"
27 | ]
28 | },
29 | {
30 | "cell_type": "markdown",
31 | "id": "f74c4358",
32 | "metadata": {},
33 | "source": [
34 | "## Prerequisites"
35 | ]
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "id": "d35bce95",
40 | "metadata": {},
41 | "source": [
42 | " 1. EMR Version - emr-6.4.0\n",
43 | " 2. The notebook is using Amazon EMR-DynamoDB Connector(emr-ddb-hadoop.jar) which is used to access data in Amazon DynamoDB using Apache Hadoop, Apache Hive, and Apache Spark in Amazon EMR. (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/EMRforDynamoDB.html)\n",
44 | " 3. The Amazon EMR-DynamoDB Connector is locally available within EMR at location below: \n",
45 | " \n",
46 | " `/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar`\n",
47 | " 4. To locally use the spark.jars in the configuration using the `file://` protocol, Livy does not allow local files to be uploaded to the user session by default. So we need to update the livy configuration to allow the local directory and restart livy server to have these take effects.\n",
48 | " \n",
49 | " ##### The following commands should be run from the EMR cluster attached to the notebook\n",
50 | " \n",
51 | " a. Open the Livy configuration file\n",
52 | " \n",
53 | " `vim /etc/livy/conf/livy.conf`\n",
54 | " \n",
55 | " b. Add the ddb jars:\n",
56 | " \n",
57 | " `livy.file.local-dir-whitelist /usr/share/aws/emr/ddb/lib/`\n",
58 | " \n",
59 | " c. Restart Livy\n",
60 | " \n",
61 | " `systemctl restart livy-server.service`\n"
62 | ]
63 | },
64 | {
65 | "cell_type": "markdown",
66 | "id": "9e0bfc9d",
67 | "metadata": {},
68 | "source": [
69 | "## Introduction"
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "id": "fc8fb425",
75 | "metadata": {},
76 | "source": [
77 | "This notebook shows how to connect to DynamoDB using DynamoDB Spark connector(emr-ddb-hadoop) from Amazon EMR Studio Notebook using Scala"
78 | ]
79 | },
80 | {
81 | "cell_type": "markdown",
82 | "id": "e705dd8e",
83 | "metadata": {},
84 | "source": [
85 | "## Load the configuration in memory"
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": null,
91 | "id": "d2b1831e",
92 | "metadata": {},
93 | "outputs": [],
94 | "source": [
95 | "%%configure -f\n",
96 | "{\n",
97 | " \"conf\": {\n",
98 | " \"spark.jars\":\"file:///usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar\" \n",
99 | " }\n",
100 | "}"
101 | ]
102 | },
103 | {
104 | "cell_type": "markdown",
105 | "id": "664d194a",
106 | "metadata": {},
107 | "source": [
108 | "## Read data using Scala"
109 | ]
110 | },
111 | {
112 | "cell_type": "code",
113 | "execution_count": null,
114 | "id": "3b102e29",
115 | "metadata": {},
116 | "outputs": [],
117 | "source": [
118 | "import org.apache.hadoop.io.Text;\n",
119 | "import org.apache.hadoop.dynamodb.DynamoDBItemWritable\n",
120 | "import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat\n",
121 | "import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat\n",
122 | "import org.apache.hadoop.mapred.JobConf\n",
123 | "import org.apache.hadoop.io.LongWritable\n",
124 | "\n",
125 | "var jobConf = new JobConf(sc.hadoopConfiguration)\n",
126 | "jobConf.set(\"dynamodb.servicename\", \"dynamodb\")\n",
127 | "jobConf.set(\"dynamodb.input.tableName\", \"\") // Pointing to DynamoDB table\n",
128 | "jobConf.set(\"dynamodb.endpoint\", \"\")\n",
129 | "jobConf.set(\"dynamodb.regionid\", \"\")\n",
130 | "jobConf.set(\"dynamodb.throughput.read\", \"1\")\n",
131 | "jobConf.set(\"dynamodb.throughput.read.percent\", \"1\")\n",
132 | "jobConf.set(\"dynamodb.version\", \"2011-12-05\")\n",
133 | " \n",
134 | "jobConf.set(\"mapred.output.format.class\", \"org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat\")\n",
135 | "jobConf.set(\"mapred.input.format.class\", \"org.apache.hadoop.dynamodb.read.DynamoDBInputFormat\")\n",
136 | " \n",
137 | "var orders = sc.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])\n",
138 | " \n",
139 | "// Doing a count of items on Orders table\n",
140 | "orders.count()"
141 | ]
142 | }
143 | ],
144 | "metadata": {
145 | "kernelspec": {
146 | "display_name": "Spark",
147 | "language": "",
148 | "name": "sparkkernel"
149 | },
150 | "language_info": {
151 | "codemirror_mode": "text/x-scala",
152 | "mimetype": "text/x-scala",
153 | "name": "scala",
154 | "pygments_lexer": "scala"
155 | }
156 | },
157 | "nbformat": 4,
158 | "nbformat_minor": 5
159 | }
160 |
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing Guidelines
2 |
3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
4 | documentation, we greatly value feedback and contributions from our community.
5 |
6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
7 | information to effectively respond to your bug report or contribution.
8 |
9 |
10 | ## Reporting Bugs/Feature Requests
11 |
12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 |
14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16 |
17 | * A reproducible test case or series of steps
18 | * The version of our code being used
19 | * Any modifications you've made relevant to the bug
20 | * Anything unusual about your environment or deployment
21 |
22 |
23 | ## Contributing via Pull Requests
24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25 |
26 | 1. You are working against the latest source on the *main* branch.
27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29 |
30 | To send us a pull request, please:
31 |
32 | 1. Fork the repository.
33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
34 | 3. Ensure local tests pass.
35 | 4. Commit to your fork using clear commit messages.
36 | 5. Send us a pull request, answering any default questions in the pull request interface.
37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
38 |
39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
41 |
42 | ## Adding an Example
43 |
44 | Any new example must be added in the `examples` directory.
45 | When adding a new example, there are several things to consider as you implement:
46 |
47 | ### Added Value
48 | As an official learning resource, it is important that any new examples add value to our learning resources. This means that it should not duplicate an existing example.
49 |
50 | ### Introduction Structure
51 |
52 | * The example must have a descriptive `file name` and `title` that clearly define its usage and the `topics` that it covers. Follow a file naming pattern - feature/function`-`app/lib.ipynb
53 | * The example must have a `Prerequisites` markdown that lists the installations required before executing the notebook as is when it is attached to a cluster, for example required EMR applications such as Hive, Presto, etc.
54 | * The example must have an `Introduction` markdown that explains the basics of the topics covered in the notebook. The introduction should include links to relevant aws public docs and data sets used in the notebook.
55 | * Every `code cell` must be preceded by a `markdown cell` that clearly explains the following code cell in detail and provides any additional usage or information.
56 | * Text portions of notebooks should follow the [AWS Style Guide](https://alpha-docs-aws.amazon.com/awsstyleguide/latest/styleguide/Welcome.html) and guidelines for [service names](https://w.amazon.com/bin/view/AWSDocs/editing/service-names/).
57 | * Assume only the knowledge that a beginner data scientist would have. Don’t assume that the reader is an experienced coder or has a rigorous technical background.
58 |
59 | ### Code Cells Structure
60 |
61 | * The example should not contain any sensitive information like security groups/subnets, passwords, etc in any of the code or markdown cells.
62 | * The kernel that you use to write your notebook must be named exactly the same as one of the default kernels that ship with EMR notebooks. The best way to avoid this problem is to author your notebook file within an EMR Notebook (see the example after this list).
63 | * It's preferred that the cell outputs be left empty. Use `Edit` and `Clear All` before submitting a PR.
64 | * DO NOT include single quotes (```'```). This is due to a limitation with nbconvert.
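
For reference, this is the kernelspec metadata carried by the PySpark notebooks already in this repository (for example, examples/word-count-with-pyspark.ipynb). Reusing it as-is is one way to keep your kernel name aligned with a default EMR notebook kernel:

    "kernelspec": {
      "display_name": "PySpark",
      "language": "",
      "name": "pysparkkernel"
    }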
65 |
66 | ### Testing
67 | The example must be tested using an EMR Notebook and should have successfully run when attached to a cluster.
68 |
69 | ### Style and Formatting
70 | We strive to keep all the examples consistent in style and formatting and as idiomatic as possible. This hopefully makes navigating examples easier for users.
71 |
72 |
73 | ## Finding contributions to work on
74 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.
75 |
76 |
77 | ## Code of Conduct
78 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
79 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
80 | opensource-codeofconduct@amazon.com with any additional questions or comments.
81 |
82 |
83 | ## Security issue notifications
84 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue.
85 |
86 |
87 | ## Licensing
88 |
89 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
90 |
--------------------------------------------------------------------------------
/examples/hudi-emr-serverless.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "e8107dbc",
6 | "metadata": {},
7 | "source": [
8 | "# Apache Hudi example using EMR Serverless on EMR Studio"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "id": "4c1167c9",
14 | "metadata": {},
15 | "source": [
16 | "#### Topics covered in this example\n",
17 | "\n",
18 | " Configure a Spark session \n",
19 | " Create an Apache Hudi table \n",
20 | " Query the table \n",
21 | " "
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "id": "286bcc03",
27 | "metadata": {},
28 | "source": [
29 | "***\n",
30 | "\n",
31 | "## Prerequisites\n",
32 | "\n",
33 | "NOTE : In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.
\n",
34 | "\n",
35 | "* EMR Serverless should be chosen as the Compute. The Application version should be 6.14 or higher.\n",
36 | "* Make sure the Studio user role has permission to attach the Workspace to the Application and to pass the runtime role to it.\n",
37 | "* This notebook uses the `PySpark` kernel.\n",
38 | "***"
39 | ]
40 | },
41 | {
42 | "cell_type": "markdown",
43 | "id": "26017ee4",
44 | "metadata": {},
45 | "source": [
46 | "## 1. Configure your Spark session.\n",
47 | "Configure the Spark Session. Set up Spark SQL extensions to use Apache Hudi. Set up the options for the Hudi table."
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": null,
53 | "id": "87ee73d7",
54 | "metadata": {
55 | "tags": []
56 | },
57 | "outputs": [],
58 | "source": [
59 | "%%configure -f\n",
60 | "{\n",
61 | " \"conf\": {\n",
62 | " \"spark.jars\": \"/usr/lib/hudi/hudi-spark-bundle.jar\",\n",
63 | " \"spark.serializer\": \"org.apache.spark.serializer.KryoSerializer\",\n",
64 | " \"spark.hadoop.hive.metastore.client.factory.class\": \"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory\"\n",
65 | " }\n",
66 | "}"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "id": "378216bd",
72 | "metadata": {},
73 | "source": [
74 | "\n",
75 | " NOTE : You will need to update my_bucket in the Spark SQL statement below to your own bucket. Please make sure you have read and write permissions for this bucket.
"
76 | ]
77 | },
78 | {
79 | "cell_type": "code",
80 | "execution_count": null,
81 | "id": "7aec16c9",
82 | "metadata": {
83 | "tags": []
84 | },
85 | "outputs": [],
86 | "source": [
87 | "tableName = \"hudi_table\"\n",
88 | "basePath = \"s3://my_bucket/aws_workshop/hudi_data_location/\" + tableName\n",
89 | "\n",
90 | "hudi_options = {\n",
91 | " 'hoodie.table.name': tableName,\n",
92 | " 'hoodie.datasource.write.recordkey.field': 'id',\n",
93 | " 'hoodie.datasource.write.table.name': tableName,\n",
94 | " 'hoodie.datasource.write.operation': 'insert',\n",
95 | " 'hoodie.datasource.write.precombine.field': 'creation_date'\n",
96 | "}"
97 | ]
98 | },
99 | {
100 | "cell_type": "markdown",
101 | "id": "2b66f7de",
102 | "metadata": {},
103 | "source": [
104 | "---\n",
105 | "## 2. Create an Apache Hudi Table\n",
106 | "We will create a Spark Dataframe with sample data and write this into a Hudi table. "
107 | ]
108 | },
109 | {
110 | "cell_type": "code",
111 | "execution_count": null,
112 | "id": "0639ede8",
113 | "metadata": {
114 | "tags": []
115 | },
116 | "outputs": [],
117 | "source": [
118 | "data = spark.createDataFrame([\n",
119 | " (\"100\", \"2015-01-01\", \"2015-01-01T13:51:39.340396Z\"),\n",
120 | " (\"101\", \"2015-01-01\", \"2015-01-01T12:14:58.597216Z\"),\n",
121 | " (\"102\", \"2015-01-01\", \"2015-01-01T13:51:40.417052Z\"),\n",
122 | " (\"103\", \"2015-01-01\", \"2015-01-01T13:51:40.519832Z\")\n",
123 | "],[\"id\", \"creation_date\", \"last_update_time\"])"
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": null,
129 | "id": "347cb264",
130 | "metadata": {
131 | "tags": []
132 | },
133 | "outputs": [],
134 | "source": [
135 | "data.write.format(\"hudi\"). \\\n",
136 | " options(**hudi_options). \\\n",
137 | " mode(\"overwrite\"). \\\n",
138 | " save(basePath)\n"
139 | ]
140 | },
141 | {
142 | "cell_type": "markdown",
143 | "id": "1b6e4210",
144 | "metadata": {},
145 | "source": [
146 | "---\n",
147 | "## 3. Query the table\n",
148 | "We will read the table using spark.read into a Spark dataframe"
149 | ]
150 | },
151 | {
152 | "cell_type": "code",
153 | "execution_count": null,
154 | "id": "03f4d11a",
155 | "metadata": {
156 | "tags": []
157 | },
158 | "outputs": [],
159 | "source": [
160 | "df = spark.read.format(\"hudi\").load(basePath)\n",
161 | "df.show()"
162 | ]
163 | },
164 | {
165 | "cell_type": "markdown",
166 | "id": "deb14d09",
167 | "metadata": {},
168 | "source": [
169 | "### You have made it to the end of this notebook!!"
170 | ]
171 | }
172 | ],
173 | "metadata": {
174 | "kernelspec": {
175 | "display_name": "PySpark",
176 | "language": "python",
177 | "name": "spark_magic_pyspark"
178 | },
179 | "language_info": {
180 | "codemirror_mode": {
181 | "name": "python",
182 | "version": 3
183 | },
184 | "file_extension": ".py",
185 | "mimetype": "text/x-python",
186 | "name": "pyspark",
187 | "pygments_lexer": "python3"
188 | }
189 | },
190 | "nbformat": 4,
191 | "nbformat_minor": 5
192 | }
193 |
--------------------------------------------------------------------------------
/examples/word-count-with-pyspark.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Word count with PySpark\n",
8 | "\n",
9 | "#### Topics covered in this example\n",
10 | "* Write a file to HDFS, read the file and perform word count on the data."
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {},
16 | "source": [
17 | "***\n",
18 | "\n",
19 | "## Prerequisites\n",
20 | "\n",
21 | "NOTE : In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.
\n",
22 | "\n",
23 | "* The EMR cluster attached to this notebook should have the `Spark` application installed.\n",
24 | "* This notebook uses the `PySpark` kernel.\n",
25 | "***"
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "## Introduction\n",
33 | "In this example we write a file to hdfs, use pyspark to count the occurrence of each word in the file stored in hdfs and store the results to s3.\n",
34 | "***"
35 | ]
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "metadata": {},
40 | "source": [
41 | "## Setup\n",
42 | "1. Create a S3 bucket to save your results or use an existing s3 bucket. For example: `s3://EXAMPLE-BUCKET/word-count/`\n",
43 | "***"
44 | ]
45 | },
46 | {
47 | "cell_type": "markdown",
48 | "metadata": {},
49 | "source": [
50 | "## Example\n",
51 | "\n",
52 | "Create a test data frame with some sample records.\n",
53 | "We will use the `createDataFrame()` method to create and `printSchema()` method to print out the schema."
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": null,
59 | "metadata": {},
60 | "outputs": [],
61 | "source": [
62 | "wordsDF = sqlContext.createDataFrame([(\"emr\",), (\"spark\",), (\"example\",), (\"spark\",), (\"pyspark\",), (\"python\",),\n",
63 | " (\"example\",), (\"emr\",), (\"example\",), (\"spark\",), (\"pyspark\",), (\"python\",)], [\"words\"])\n",
64 | "wordsDF.show()\n",
65 | "wordsDF.printSchema()"
66 | ]
67 | },
68 | {
69 | "cell_type": "markdown",
70 | "metadata": {},
71 | "source": [
72 | "Print out the number of unique words so that we can verify this number with the end result count."
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": null,
78 | "metadata": {},
79 | "outputs": [],
80 | "source": [
81 | "uniqueWordsCount = wordsDF.distinct().groupBy().count().head()[0]\n",
82 | "print(uniqueWordsCount)"
83 | ]
84 | },
85 | {
86 | "cell_type": "markdown",
87 | "metadata": {},
88 | "source": [
89 | "This step only shows an example on how to write to hdfs.\n",
90 | "You can use an existing file stored in hdfs and read it as shown in the next steps."
91 | ]
92 | },
93 | {
94 | "cell_type": "code",
95 | "execution_count": null,
96 | "metadata": {},
97 | "outputs": [],
98 | "source": [
99 | "wordsDF.write.csv(\"hdfs:///user/hadoop/test-data.csv\")"
100 | ]
101 | },
102 | {
103 | "cell_type": "markdown",
104 | "metadata": {},
105 | "source": [
106 | "Read the csv file from hdfs and store in RDD."
107 | ]
108 | },
109 | {
110 | "cell_type": "code",
111 | "execution_count": null,
112 | "metadata": {},
113 | "outputs": [],
114 | "source": [
115 | "wordsData = sc.textFile(\"hdfs:///user/hadoop/test-data.csv\")\n",
116 | "wordsData.count()"
117 | ]
118 | },
119 | {
120 | "cell_type": "markdown",
121 | "metadata": {},
122 | "source": [
123 | "Display the contents of the file."
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": null,
129 | "metadata": {},
130 | "outputs": [],
131 | "source": [
132 | "wordsData.collect()"
133 | ]
134 | },
135 | {
136 | "cell_type": "markdown",
137 | "metadata": {},
138 | "source": [
139 | "Count the occurance of each word and print the count of the result. This should be equal to the number of unique words we found earlier."
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": null,
145 | "metadata": {},
146 | "outputs": [],
147 | "source": [
148 | "wordsCounts = wordsData.flatMap(lambda line: line.split(\" \")) \\\n",
149 | " .map(lambda word: (word, 1)) \\\n",
150 | " .reduceByKey(lambda a, b: a+b)\n",
151 | "wordsCounts.count()"
152 | ]
153 | },
154 | {
155 | "cell_type": "markdown",
156 | "metadata": {},
157 | "source": [
158 | "Display the count for each word."
159 | ]
160 | },
161 | {
162 | "cell_type": "code",
163 | "execution_count": null,
164 | "metadata": {},
165 | "outputs": [],
166 | "source": [
167 | "wordsCounts.collect()"
168 | ]
169 | },
170 | {
171 | "cell_type": "markdown",
172 | "metadata": {},
173 | "source": [
174 | "Save the results to your s3 bucket. The results are stored in the key `word-count` and split based on paritions."
175 | ]
176 | },
177 | {
178 | "cell_type": "code",
179 | "execution_count": null,
180 | "metadata": {},
181 | "outputs": [],
182 | "source": [
183 | "wordsCounts.saveAsTextFile(\"s3://EXAMPLE-BUCKET/word-count\") # Change this to the S3 location that you created in Setup step 1."
184 | ]
185 | }
186 | ],
187 | "metadata": {
188 | "kernelspec": {
189 | "display_name": "PySpark",
190 | "language": "",
191 | "name": "pysparkkernel"
192 | },
193 | "language_info": {
194 | "codemirror_mode": {
195 | "name": "python",
196 | "version": 3
197 | },
198 | "mimetype": "text/x-python",
199 | "name": "pyspark",
200 | "pygments_lexer": "python3"
201 | }
202 | },
203 | "nbformat": 4,
204 | "nbformat_minor": 4
205 | }
206 |
--------------------------------------------------------------------------------
/examples/plot-graph-using-bokeh.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "35cfb208",
6 | "metadata": {},
7 | "source": [
8 | "# Plot graph using `bokeh`\n",
9 | "\n",
10 | "#### Topics covered in this example\n",
11 | "* Installing python libraries on the EMR cluster.\n",
12 | "* Use Bokeh plotting library to plot trignometric functions: sin, cos and tan functions. "
13 | ]
14 | },
15 | {
16 | "cell_type": "markdown",
17 | "id": "26237079",
18 | "metadata": {},
19 | "source": [
20 | "***\n",
21 | "\n",
22 | "## Introduction\n",
23 | "This is an example from bokeh tutorial . In this example, we are going to use Bokeh to plot an interactive graph. We will plot different trignometric functions, namely sin, cos and tan. To interactive with the graph, we will use a dropdown that offers a choice to select a function and the sliders to control the frequency, amplitude, and phase.\n",
24 | "\n",
25 | "***"
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "id": "81f6b2ab",
31 | "metadata": {},
32 | "source": [
33 | "Install dependency `bokeh` using `%pip` cell magic.\n",
34 | "\n",
35 | "`%pip install` is same as `!/emr/notebook-env/bin/pip install` and are installed in `/home/emr-notebook/`.\n",
36 | "\n",
37 | "After installation, these libraries are available to any user running an EMR notebook attached to the cluster. Python libraries installed this way are available only to processes running on the master node. The libraries are not installed on core or task nodes and are not available to executors running on those nodes."
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": null,
43 | "id": "c13d9f55",
44 | "metadata": {},
45 | "outputs": [],
46 | "source": [
47 | "%pip install matplotlib"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "id": "d38052f4",
53 | "metadata": {},
54 | "source": [
55 | "import dependencies"
56 | ]
57 | },
58 | {
59 | "cell_type": "code",
60 | "execution_count": null,
61 | "id": "976fa391",
62 | "metadata": {},
63 | "outputs": [],
64 | "source": [
65 | "from ipywidgets import interact\n",
66 | "import numpy as np\n",
67 | "\n",
68 | "from bokeh.io import push_notebook, show, output_notebook\n",
69 | "from bokeh.plotting import figure\n",
70 | "output_notebook()"
71 | ]
72 | },
73 | {
74 | "cell_type": "markdown",
75 | "id": "281a2bb0",
76 | "metadata": {},
77 | "source": [
78 | "Generate x and y axes co-ordinates for sin function.\n",
79 | "For x axis, take an array of evenly spaced numbers and calculate corresponding y axis co-ordinate."
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": null,
85 | "id": "0624f7b4",
86 | "metadata": {},
87 | "outputs": [],
88 | "source": [
89 | "x = np.linspace(0, 2*np.pi, 2000)\n",
90 | "y = np.sin(x)"
91 | ]
92 | },
93 | {
94 | "cell_type": "markdown",
95 | "id": "eb01a95c",
96 | "metadata": {},
97 | "source": [
98 | "Define a graph and draw a line with sin as default function"
99 | ]
100 | },
101 | {
102 | "cell_type": "code",
103 | "execution_count": null,
104 | "id": "201d45b4",
105 | "metadata": {},
106 | "outputs": [],
107 | "source": [
108 | "p = figure(title=\"simple line example\", plot_height=300, plot_width=600, y_range=(-5,5),\n",
109 | " background_fill_color=\"#efefef\")\n",
110 | "r = p.line(x, y, color=\"#8888cc\", line_width=1.5, alpha=0.8)"
111 | ]
112 | },
113 | {
114 | "cell_type": "markdown",
115 | "id": "52fa47ba",
116 | "metadata": {},
117 | "source": [
118 | "Define an update function that will be callable to interactors to update the graph with different trignometic functions."
119 | ]
120 | },
121 | {
122 | "cell_type": "code",
123 | "execution_count": null,
124 | "id": "76215a3b",
125 | "metadata": {},
126 | "outputs": [],
127 | "source": [
128 | "def update(f, w=1, A=1, phi=0):\n",
129 | " if f == \"sin\": func = np.sin\n",
130 | " elif f == \"cos\": func = np.cos\n",
131 | " elif f == \"tan\": func = np.tan \n",
132 | " r.data_source.data[\"y\"] = A * func(w * x + phi)\n",
133 | " push_notebook()"
134 | ]
135 | },
136 | {
137 | "cell_type": "markdown",
138 | "id": "7fbe402e",
139 | "metadata": {},
140 | "source": [
141 | "Render the graph"
142 | ]
143 | },
144 | {
145 | "cell_type": "code",
146 | "execution_count": null,
147 | "id": "ccf184ad",
148 | "metadata": {},
149 | "outputs": [],
150 | "source": [
151 | "show(p, notebook_handle=True)"
152 | ]
153 | },
154 | {
155 | "cell_type": "markdown",
156 | "id": "75216e25",
157 | "metadata": {},
158 | "source": [
159 | "Interact with the graph: dropdown to select a trignometric functions and sliders to change the value of frequency, amplitude, and phase"
160 | ]
161 | },
162 | {
163 | "cell_type": "code",
164 | "execution_count": null,
165 | "id": "acf00693",
166 | "metadata": {},
167 | "outputs": [],
168 | "source": [
169 | "interact(update, f=[\"sin\", \"cos\"], w=(0,50), A=(1,10), phi=(0, 20, 0.1))"
170 | ]
171 | }
172 | ],
173 | "metadata": {
174 | "kernelspec": {
175 | "display_name": "Python 3",
176 | "language": "python",
177 | "name": "python3"
178 | },
179 | "language_info": {
180 | "codemirror_mode": {
181 | "name": "ipython",
182 | "version": 3
183 | },
184 | "file_extension": ".py",
185 | "mimetype": "text/x-python",
186 | "name": "python",
187 | "nbconvert_exporter": "python",
188 | "pygments_lexer": "ipython3",
189 | "version": "3.7.4"
190 | }
191 | },
192 | "nbformat": 4,
193 | "nbformat_minor": 5
194 | }
195 |
--------------------------------------------------------------------------------
/examples/iceberg-emr-serverless.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "9405ab68",
6 | "metadata": {},
7 | "source": [
8 | "# Apache Iceberg example using EMR Serverless on EMR Studio"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "id": "7993faa6",
14 | "metadata": {},
15 | "source": [
16 | "#### Topics covered in this example\n",
17 | "\n",
18 | " Configure a Spark session \n",
19 | " Create an Apache Iceberg table \n",
20 | " Query the table \n",
21 | " "
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "id": "d44db83f",
27 | "metadata": {},
28 | "source": [
29 | "***\n",
30 | "\n",
31 | "## Prerequisites\n",
32 | "\n",
33 | "NOTE : In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.
\n",
34 | "\n",
35 | "* EMR Serverless should be chosen as the Compute. The Application version should be 6.14 or higher.\n",
36 | "* Make sure the Studio user role has permission to attach the Workspace to the Application and to pass the runtime role to it.\n",
37 | "* You must have a database in AWS Glue named \"default\".\n",
38 | "* This notebook uses the `PySpark` kernel.\n",
39 | "***"
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "id": "a99c60bc",
45 | "metadata": {},
46 | "source": [
47 | "## 1. Configure your Spark session.\n",
48 | "Configure the Spark Session. Set up Spark SQL extensions to use Apache Iceberg. \n",
49 | "\n",
50 | " NOTE : You will need to update my_bucket in the Spark SQL statement below to your own bucket. Please make sure you have read and write permissions for this bucket.
"
51 | ]
52 | },
53 | {
54 | "cell_type": "code",
55 | "execution_count": null,
56 | "id": "e6337b6d",
57 | "metadata": {
58 | "tags": []
59 | },
60 | "outputs": [],
61 | "source": [
62 | "%%configure -f\n",
63 | "{\n",
64 | " \"conf\": {\n",
65 | " \"spark.sql.extensions\":\"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions\",\n",
66 | " \"spark.sql.catalog.glue_catalog\": \"org.apache.iceberg.spark.SparkCatalog\",\n",
67 | " \"spark.sql.catalog.glue_catalog.warehouse\": \"s3://my_bucket/aws_workshop\",\n",
68 | " \"spark.sql.catalog.glue_catalog.catalog-impl\": \"org.apache.iceberg.aws.glue.GlueCatalog\",\n",
69 | " \"spark.sql.catalog.glue_catalog.io-impl\": \"org.apache.iceberg.aws.s3.S3FileIO\",\n",
70 | " \"spark.jars\": \"/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar\"\n",
71 | " }\n",
72 | "}"
73 | ]
74 | },
75 | {
76 | "cell_type": "markdown",
77 | "id": "d6e7ee48",
78 | "metadata": {},
79 | "source": [
80 | "---\n",
81 | "## 2. Create an Apache Iceberg Table\n",
82 | "We will create a Spark Dataframe with sample data and write this into an Iceberg table. \n",
83 | "\n",
84 | "\n",
85 | " NOTE : You will need to update my_bucket in the Spark SQL statement below to your own bucket. Please make sure you have read and write permissions for this bucket.
"
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": null,
91 | "id": "b938063c",
92 | "metadata": {
93 | "tags": []
94 | },
95 | "outputs": [],
96 | "source": [
97 | "data = spark.createDataFrame([\n",
98 | " (\"100\", \"2015-01-01\", \"2015-01-01T13:51:39.340396Z\"),\n",
99 | " (\"101\", \"2015-01-01\", \"2015-01-01T12:14:58.597216Z\"),\n",
100 | " (\"102\", \"2015-01-01\", \"2015-01-01T13:51:40.417052Z\"),\n",
101 | " (\"103\", \"2015-01-01\", \"2015-01-01T13:51:40.519832Z\")\n",
102 | "],[\"id\", \"creation_date\", \"last_update_time\"])\n",
103 | "\n",
104 | "## Write a DataFrame as a Iceberg dataset to the Amazon S3 location.\n",
105 | "spark.sql(\"\"\"CREATE TABLE IF NOT EXISTS glue_catalog.default.iceberg_table (id string,\n",
106 | "creation_date string,\n",
107 | "last_update_time string)\n",
108 | "USING iceberg\n",
109 | "location \"\"\" + \"\\\"s3://my_bucket/aws_workshop/iceberg_table\\\"\")\n",
110 | "\n",
111 | "data.writeTo(\"glue_catalog.default.iceberg_table\").append()"
112 | ]
113 | },
114 | {
115 | "cell_type": "markdown",
116 | "id": "fecffb52",
117 | "metadata": {},
118 | "source": [
119 | "---\n",
120 | "## 3. Query the table\n",
121 | "We will query the table using %% sql magic and Spark SQL statement"
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": null,
127 | "id": "ad26952e",
128 | "metadata": {
129 | "tags": []
130 | },
131 | "outputs": [],
132 | "source": [
133 | "%%sql\n",
134 | "\n",
135 | "SELECT * from glue_catalog.default.iceberg_table LIMIT 10\n"
136 | ]
137 | },
138 | {
139 | "cell_type": "markdown",
140 | "id": "07a3d787",
141 | "metadata": {},
142 | "source": [
143 | "We will read the table using spark.read into a Spark dataframe"
144 | ]
145 | },
146 | {
147 | "cell_type": "code",
148 | "execution_count": null,
149 | "id": "9a5136e9",
150 | "metadata": {
151 | "tags": []
152 | },
153 | "outputs": [],
154 | "source": [
155 | "df = spark.read.format(\"iceberg\").load(\"glue_catalog.default.iceberg_table\")\n",
156 | "df.show()"
157 | ]
158 | },
159 | {
160 | "cell_type": "markdown",
161 | "id": "7a126d4b",
162 | "metadata": {},
163 | "source": [
164 | "### You have made it to the end of this notebook!!"
165 | ]
166 | }
167 | ],
168 | "metadata": {
169 | "kernelspec": {
170 | "display_name": "PySpark",
171 | "language": "python",
172 | "name": "spark_magic_pyspark"
173 | },
174 | "language_info": {
175 | "codemirror_mode": {
176 | "name": "python",
177 | "version": 3
178 | },
179 | "file_extension": ".py",
180 | "mimetype": "text/x-python",
181 | "name": "pyspark",
182 | "pygments_lexer": "python3"
183 | }
184 | },
185 | "nbformat": 4,
186 | "nbformat_minor": 5
187 | }
188 |
--------------------------------------------------------------------------------
/examples/word-count-emr-serverless.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Word count with with EMR Serverless on EMR Studio\n",
8 | "\n",
9 | "#### Topics covered in this example\n",
10 | "* Write a file to S3, read the file and perform word count on the data."
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {
16 | "tags": []
17 | },
18 | "source": [
19 | "***\n",
20 | "\n",
21 | "## Prerequisites\n",
22 | "\n",
23 | "NOTE : In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.
\n",
24 | "\n",
25 | "* Create an S3 bucket to save your results or use an existing s3 bucket. For example: `s3://EXAMPLE-BUCKET/word-count/`\n",
26 | "* The Interactive runtime role selected when attaching to an application should have S3 read and write permission to the above bucket.\n",
27 | "* The EMR Serverless application attached to this notebook should be of type `SPARK`.\n",
28 | "* This notebook uses the `PySpark` kernel.\n",
29 | "***"
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {},
35 | "source": [
36 | "## Introduction\n",
37 | "In this example we write a file to S3, use pyspark to count the occurrence of each word in the file and store the results to s3.\n",
38 | "***"
39 | ]
40 | },
41 | {
42 | "cell_type": "markdown",
43 | "metadata": {},
44 | "source": [
45 | "## Example\n",
46 | "\n",
47 | "Create a test data frame with some sample records.\n",
48 | "We will use the `createDataFrame()` method to create and `printSchema()` method to print out the schema.\n",
49 | "\n",
50 | "\n",
51 | " NOTE : You will need to update EXAMPLE-BUCKET in the statement below to your own bucket. Please make sure you have read and write permissions for this bucket.
"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": null,
57 | "metadata": {
58 | "tags": []
59 | },
60 | "outputs": [],
61 | "source": [
62 | "BUCKET = \"s3://EXAMPLE-BUCKET/word-count/\" # Change this to the S3 location that you created in prerequisites.\n",
63 | "\n",
64 | "wordsDF = sqlContext.createDataFrame([(\"emr\",), (\"spark\",), (\"example\",), (\"spark\",), (\"pyspark\",), (\"python\",),\n",
65 | " (\"example\",), (\"emr\",), (\"example\",), (\"spark\",), (\"pyspark\",), (\"python\",)], [\"words\"])\n",
66 | "wordsDF.show()\n",
67 | "wordsDF.printSchema()"
68 | ]
69 | },
70 | {
71 | "cell_type": "markdown",
72 | "metadata": {},
73 | "source": [
74 | "Print out the number of unique words so that we can verify this number with the end result count."
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": null,
80 | "metadata": {
81 | "tags": []
82 | },
83 | "outputs": [],
84 | "source": [
85 | "uniqueWordsCount = wordsDF.distinct().groupBy().count().head()[0]\n",
86 | "print(uniqueWordsCount)"
87 | ]
88 | },
89 | {
90 | "cell_type": "markdown",
91 | "metadata": {},
92 | "source": [
93 | "This step only shows an example on how to write to s3.\n",
94 | "You can use an existing file stored in S3 and read it as shown in the next steps."
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "execution_count": null,
100 | "metadata": {
101 | "tags": []
102 | },
103 | "outputs": [],
104 | "source": [
105 | "wordsDF.write.csv(BUCKET + \"test-data.csv\")"
106 | ]
107 | },
108 | {
109 | "cell_type": "markdown",
110 | "metadata": {},
111 | "source": [
112 | "Read the csv file from S3 and store in RDD."
113 | ]
114 | },
115 | {
116 | "cell_type": "code",
117 | "execution_count": null,
118 | "metadata": {},
119 | "outputs": [],
120 | "source": [
121 | "wordsData = sc.textFile(BUCKET + \"test-data.csv\")\n",
122 | "wordsData.count()"
123 | ]
124 | },
125 | {
126 | "cell_type": "markdown",
127 | "metadata": {},
128 | "source": [
129 | "Display the contents of the file."
130 | ]
131 | },
132 | {
133 | "cell_type": "code",
134 | "execution_count": null,
135 | "metadata": {},
136 | "outputs": [],
137 | "source": [
138 | "wordsData.collect()"
139 | ]
140 | },
141 | {
142 | "cell_type": "markdown",
143 | "metadata": {},
144 | "source": [
145 | "Count the occurence of each word and print the count of the result. This should be equal to the number of unique words we found earlier."
146 | ]
147 | },
148 | {
149 | "cell_type": "code",
150 | "execution_count": null,
151 | "metadata": {},
152 | "outputs": [],
153 | "source": [
154 | "wordsCounts = wordsData.flatMap(lambda line: line.split(\" \")) \\\n",
155 | " .map(lambda word: (word, 1)) \\\n",
156 | " .reduceByKey(lambda a, b: a+b)\n",
157 | "wordsCounts.count()"
158 | ]
159 | },
160 | {
161 | "cell_type": "markdown",
162 | "metadata": {},
163 | "source": [
164 | "Display the count for each word."
165 | ]
166 | },
167 | {
168 | "cell_type": "code",
169 | "execution_count": null,
170 | "metadata": {},
171 | "outputs": [],
172 | "source": [
173 | "wordsCounts.collect()"
174 | ]
175 | },
176 | {
177 | "cell_type": "markdown",
178 | "metadata": {},
179 | "source": [
180 | "Save the results to your s3 bucket. The results are stored in the key `word-count` and split based on paritions."
181 | ]
182 | },
183 | {
184 | "cell_type": "code",
185 | "execution_count": null,
186 | "metadata": {},
187 | "outputs": [],
188 | "source": [
189 | "wordsCounts.saveAsTextFile(BUCKET + \"word-count\")"
190 | ]
191 | }
192 | ],
193 | "metadata": {
194 | "kernelspec": {
195 | "display_name": "PySpark",
196 | "language": "python",
197 | "name": "spark_magic_pyspark"
198 | },
199 | "language_info": {
200 | "codemirror_mode": {
201 | "name": "python",
202 | "version": 3
203 | },
204 | "file_extension": ".py",
205 | "mimetype": "text/x-python",
206 | "name": "pyspark",
207 | "pygments_lexer": "python3"
208 | }
209 | },
210 | "nbformat": 4,
211 | "nbformat_minor": 4
212 | }
213 |
--------------------------------------------------------------------------------
/examples/redshift-with-spark.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "a28da63f",
6 | "metadata": {},
7 | "source": [
8 | "# Connect to Amazon Redshift with Pyspark, Spark Scala, and SparkR\n",
9 | "\n",
10 | "\n",
11 | "## Table of Contents:\n",
12 | "\n",
13 | "1. [Prerequisites](#Prerequisites)\n",
14 | "2. [Introduction](#Introduction)\n",
15 | "3. [Setup](#Setup)\n",
16 | "4. [Connect to Amazon Redshift using Pyspark](#Connect-to-Amazon-Redshift-using-Pyspark)\n",
17 | "5. [Connect to Amazon Redshift using Scala](#Connect-to-Amazon-Redshift-using-Scala)\n",
18 | "6. [Connect to Amazon Redshift using SparkR](#Connect-to-Amazon-Redshift-using-SparkR)\n",
19 | "\n",
20 | "\n",
21 | "## Prerequisites\n",
22 | "\n",
23 | "In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.\n",
24 | "* This example we connect to Amazon Redshift cluster, hence the EMR cluster attached to this notebook must have the connectivity (VPC) and appropriate rules (Security Group).\n",
25 | "\n",
26 | "\n",
27 | "## Introduction\n",
28 | "In this example we use Pyspark, Spark Scala, and Spark R to connect to a table in Amazon Redshift using spark-redshift connector.\n",
29 | "\n",
30 | "[spark-redshift](#https://github.com/spark-redshift-community/spark-redshift) is a performant Amazon Redshift data source for Apache Spark\n",
31 | "\n",
32 | "## Setup\n",
33 | "\n",
34 | "* Create an S3 bucket location to be used as a temporary location for Redshift dataset. For example: s3://EXAMPLE-BUCKET/temporary-redshift-dataset/\n",
35 | "\n",
36 | "* Create an AWS IAM role which will be associated to the Amazon Redshift cluster. Make sure that this IAM role has access to read and write to the above mentioned S3 bucket location with the appropriate IAM policy. More details:\n",
37 | "\n",
38 | " * [Create AWS IAM role for Amazon Redshift](#https://docs.aws.amazon.com/redshift/latest/gsg/rs-gsg-create-an-iam-role.html)\n",
39 | " * [Associate IAM role with Amazon Redshift cluster](#https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-add-role.html)\n"
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "execution_count": null,
45 | "id": "618db6ef",
46 | "metadata": {},
47 | "outputs": [],
48 | "source": [
49 | "%%configure -f\n",
50 | "{ \n",
51 | " \"conf\": \n",
52 | " {\n",
53 | " \"spark.jars.packages\": \"org.apache.spark:spark-avro_2.11:2.4.2,io.github.spark-redshift-community:spark-redshift_2.11:4.0.1\"\n",
54 | " }\n",
55 | "}"
56 | ]
57 | },
58 | {
59 | "cell_type": "markdown",
60 | "id": "c3433ed7",
61 | "metadata": {},
62 | "source": [
63 | "## Connect to Amazon Redshift using Pyspark"
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": null,
69 | "id": "8220a165",
70 | "metadata": {},
71 | "outputs": [],
72 | "source": [
73 | "%%pyspark\n",
74 | "\n",
75 | "#Declare the variables and replace the variables values as appropiate\n",
76 | "\n",
77 | "str_jdbc_url=\"jdbc:redshift://:5439/dev?user=&password=\"\n",
78 | "str_dbname=\"\"\n",
79 | "str_tgt_table=\"\"\n",
80 | "str_s3_path=\"s3://\"\n",
81 | "str_iam_role=\"\"\n",
82 | "\n",
83 | "# Read data from source table\n",
84 | "\n",
85 | "jdbcDF = spark.read \\\n",
86 | " .format(\"io.github.spark_redshift_community.spark.redshift\") \\\n",
87 | " .option(\"url\", str_jdbc_url) \\\n",
88 | " .option(\"dbtable\", str_dbname) \\\n",
89 | " .option(\"tempdir\", str_s3_path) \\\n",
90 | " .option(\"aws_iam_role\",str_iam_role) \\\n",
91 | " .load()\n",
92 | "\n",
93 | "jdbcDF.limit(5).show()\n",
94 | "\n",
95 | "# Write data to target table\n",
96 | "\n",
97 | "jdbcDF.write \\\n",
98 | " .format(\"io.github.spark_redshift_community.spark.redshift\") \\\n",
99 | " .option(\"url\", str_jdbc_url) \\\n",
100 | " .option(\"dbtable\", str_tgt_table) \\\n",
101 | " .option(\"tempdir\", str_s3_path) \\\n",
102 | " .option(\"aws_iam_role\",str_iam_role).mode(\"append\").save()"
103 | ]
104 | },
105 | {
106 | "cell_type": "markdown",
107 | "id": "bd9f2fad",
108 | "metadata": {},
109 | "source": [
110 | "## Connect to Amazon Redshift using Scala"
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": null,
116 | "id": "c7f17fdd",
117 | "metadata": {},
118 | "outputs": [],
119 | "source": [
120 | "%%scalaspark\n",
121 | "\n",
122 | "#Declare the variables and replace the variables values as appropiate\n",
123 | "\n",
124 | "val str_jdbc_url=\"jdbc:redshift://:5439/dev?user=&password=\"\n",
125 | "val str_dbname=\"\"\n",
126 | "val str_tgt_table=\"\"\n",
127 | "val str_s3_path=\"s3://\"\n",
128 | "val str_iam_role=\"\"\n",
129 | "val str_username=\"\"\n",
130 | "val str_password=\"\"\n",
131 | "\n",
132 | "# Read data from source table\n",
133 | "val jdbcDF = (spark.read.format(\"io.github.spark_redshift_community.spark.redshift\")\n",
134 | " .option(\"url\", str_jdbc_url)\n",
135 | " .option(\"dbtable\", str_dbname)\n",
136 | " .option(\"tempdir\", str_s3_path)\n",
137 | " .option(\"aws_iam_role\", str_iam_role)\n",
138 | " .load())\n",
139 | "\n",
140 | "# Write data to target table\n",
141 | "\n",
142 | "jdbcDF.limit(5).show()\n",
143 | "\n",
144 | "jdbcDF.write.mode(\"append\").\n",
145 | " format(\"io.github.spark_redshift_community.spark.redshift\").option(\"url\", str_jdbc_url).option(\"dbtable\", str_tgt_table).option(\"aws_iam_role\", str_iam_role).option(\"tempdir\", str_s3_path).save()\n",
146 | " \n"
147 | ]
148 | },
149 | {
150 | "cell_type": "markdown",
151 | "id": "d20e8bcd",
152 | "metadata": {},
153 | "source": [
154 | "## Connect to Amazon Redshift using SparkR"
155 | ]
156 | },
157 | {
158 | "cell_type": "code",
159 | "execution_count": null,
160 | "id": "eecde2e3",
161 | "metadata": {},
162 | "outputs": [],
163 | "source": [
164 | "%%rspark\n",
165 | "\n",
166 | "#Declare the variables and replace the variables values as appropiate\n",
167 | "\n",
168 | "str_jdbc_url=\"jdbc:redshift://:5439/dev?user=&password=\"\n",
169 | "str_dbname=\"\"\n",
170 | "str_tgt_table=\"\"\n",
171 | "str_s3_path=\"s3://\"\n",
172 | "str_iam_role=\"\"\n",
173 | "\n",
174 | "# Read data from source table\n",
175 | "\n",
176 | "df <- read.df(\n",
177 | " NULL,\n",
178 | " \"io.github.spark_redshift_community.spark.redshift\",\n",
179 | " aws_iam_role = str_iam_role,\n",
180 | " tempdir = str_s3_path,\n",
181 | " dbtable = str_src_table,\n",
182 | " url = str_jdbc_url)\n",
183 | "\n",
184 | "showDF(df)"
185 | ]
186 | }
187 | ],
188 | "metadata": {
189 | "kernelspec": {
190 | "display_name": "PySpark",
191 | "language": "",
192 | "name": "pysparkkernel"
193 | },
194 | "language_info": {
195 | "codemirror_mode": {
196 | "name": "python",
197 | "version": 3
198 | },
199 | "mimetype": "text/x-python",
200 | "name": "pyspark",
201 | "pygments_lexer": "python3"
202 | }
203 | },
204 | "nbformat": 4,
205 | "nbformat_minor": 5
206 | }
207 |
--------------------------------------------------------------------------------
/examples/jdbc-rdbms-with-spark.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "44cde1cc",
6 | "metadata": {},
7 | "source": [
8 | "# Connect RDBMS using jdbc connector from EMR Studio Notebook using Pyspark, Spark Scala, and SparkR\n",
9 | "\n",
10 | "#### Topics covered in this example\n",
11 | "\n",
12 | "* Configuring jdbc driver\n",
13 | "* Connecting to database using jdbc to read data\n",
14 | "* Connecting to database using jdbc to write data"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "id": "e63a1147",
20 | "metadata": {},
21 | "source": [
22 | "## Table of Contents:\n",
23 | "\n",
24 | "1. [Prerequisites](#Prerequisites)\n",
25 | "2. [Introduction](#Introduction)\n",
26 | "3. [Upload the MySQL jdbc driver in S3 and declare the path](#Upload-the-MySQL-jdbc-driver-in-S3-and-declare-the-path)\n",
27 | "4. [Read and write data using Pyspark](#Read-and-write-data-using-Pyspark)\n",
28 | "5. [Read and write data using Scala](#Read-and-write-data-using-Scala)\n",
29 | "6. [Read and write data using SparkR](#Read-and-write-data-using-SparkR)\n"
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "id": "1138c207",
35 | "metadata": {},
36 | "source": [
37 | "## Prerequisites"
38 | ]
39 | },
40 | {
41 | "cell_type": "markdown",
42 | "id": "7a9f2d7b",
43 | "metadata": {},
44 | "source": [
45 | "Download jdbc driver and upload it on S3 which is accessible from the Amazon EMR cluster attached to the Amazon EMR Studio. "
46 | ]
47 | },
48 | {
49 | "cell_type": "markdown",
50 | "id": "34d0aa33",
51 | "metadata": {},
52 | "source": [
53 | "## Introduction"
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "id": "f0706f71",
59 | "metadata": {},
60 | "source": [
61 | "This notebooks shows how to connect RDBS using jdbc connector from Amazon EMR Studio Notebook. "
62 | ]
63 | },
64 | {
65 | "cell_type": "markdown",
66 | "id": "980ad752",
67 | "metadata": {},
68 | "source": [
69 | "## Upload the MySQL jdbc driver in S3 and declare the path"
70 | ]
71 | },
72 | {
73 | "cell_type": "code",
74 | "execution_count": null,
75 | "id": "fd99a56a",
76 | "metadata": {},
77 | "outputs": [],
78 | "source": [
79 | "%%configure -f\n",
80 | "{\n",
81 | " \"conf\": {\n",
82 | " \"spark.jars\": \"s3:///jars/mysql-connector-java-8.0.19.jar\" \n",
83 | " }\n",
84 | "}"
85 | ]
86 | },
87 | {
88 | "cell_type": "markdown",
89 | "id": "1c41e375",
90 | "metadata": {},
91 | "source": [
92 | "## Read and write data using Pyspark"
93 | ]
94 | },
95 | {
96 | "cell_type": "code",
97 | "execution_count": null,
98 | "id": "1c954008",
99 | "metadata": {},
100 | "outputs": [],
101 | "source": [
102 | "%%pyspark\n",
103 | "\n",
104 | "#Declare the variables and replace the variables values as appropiate\n",
105 | "\n",
106 | "str_jdbc_url=\"jdbc:mysql://:3306/\"\n",
107 | "str_Query= \"\"\n",
108 | "str_dbname=\"\"\n",
109 | "str_username=\"\"\n",
110 | "str_password=\"\"\n",
111 | "str_tgt_table=\"\"\n",
112 | "\n",
113 | "# Read data from source table\n",
114 | "\n",
115 | "jdbcDF = spark.read \\\n",
116 | " .format(\"jdbc\") \\\n",
117 | " .option(\"url\", str_jdbc_url) \\\n",
118 | " .option(\"query\", str_Query) \\\n",
119 | " .option(\"user\", str_username) \\\n",
120 | " .option(\"password\", str_password) \\\n",
121 | " .load()\n",
122 | "\n",
123 | "jdbcDF.limit(5).show()\n",
124 | "\n",
125 | "# Write data to the target database\n",
126 | "\n",
127 | "jdbcDF.write \\\n",
128 | " .format(\"jdbc\") \\\n",
129 | " .option(\"url\", str_jdbc_url) \\\n",
130 | " .option(\"dbtable\", str_tgt_table) \\\n",
131 | " .option(\"user\", str_username) \\\n",
132 | " .option(\"password\", str_password).mode(\"append\").save()\n"
133 | ]
134 | },
135 | {
136 | "cell_type": "markdown",
137 | "id": "5c26a471",
138 | "metadata": {},
139 | "source": [
140 | "## Read and write data using Scala"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": null,
146 | "id": "2666bd59",
147 | "metadata": {},
148 | "outputs": [],
149 | "source": [
150 | "%%scalaspark\n",
151 | "\n",
152 | "#Declare the variables and replace the variables values as appropiate\n",
153 | "\n",
154 | "val str_jdbc_url=\"jdbc:mysql://:3306/\"\n",
155 | "val str_Query= \"\"\n",
156 | "val str_dbname=\"\"\n",
157 | "val str_username=\"\"\n",
158 | "val str_password=\"\"\n",
159 | "val str_tgt_table=\"\"\n",
160 | "\n",
161 | "# Read data from source table\n",
162 | "\n",
163 | "val jdbcDF = (spark.read.format(\"jdbc\")\n",
164 | " .option(\"url\", str_jdbc_url)\n",
165 | " .option(\"query\", str_Query)\n",
166 | " .option(\"user\", str_username)\n",
167 | " .option(\"password\", str_password)\n",
168 | " .load())\n",
169 | "\n",
170 | "jdbcDF.limit(5).show()\n",
171 | "\n",
172 | "# Write data to the target database\n",
173 | "\n",
174 | "val connectionProperties = new java.util.Properties\n",
175 | "connectionProperties.put(\"user\", str_username)\n",
176 | "connectionProperties.put(\"password\", str_password)\n",
177 | "\n",
178 | "jdbcDF.write.mode(\"append\").jdbc(str_jdbc_url, str_tgt_table, connectionProperties)\n"
179 | ]
180 | },
181 | {
182 | "cell_type": "markdown",
183 | "id": "066b6a73",
184 | "metadata": {},
185 | "source": [
186 | "## Read and write data using SparkR"
187 | ]
188 | },
189 | {
190 | "cell_type": "code",
191 | "execution_count": null,
192 | "id": "898d69a1",
193 | "metadata": {},
194 | "outputs": [],
195 | "source": [
196 | "%%rspark\n",
197 | "\n",
198 | "#Declare the variables and replace the variables values as appropiate\n",
199 | "\n",
200 | "str_jdbc_url=\"jdbc:mysql://:3306/\"\n",
201 | "str_dbname=\"\"\n",
202 | "str_username=\"\"\n",
203 | "str_password=\"\"\n",
204 | "str_tgt_table=\"\"\n",
205 | "\n",
206 | "# Read data from source database\n",
207 | "\n",
208 | "df <- read.jdbc(str_jdbc_url, \n",
209 | " \"(select employee_id, first_name, last_name, email, dept_name from notebook.employee e, notebook.dept d where e.department_id = d.department_id) AS tmp\", \n",
210 | " user = str_username, \n",
211 | " password = str_password)\n",
212 | "\n",
213 | "showDF(df)\n",
214 | "\n",
215 | "jdbcDF.limit(5).show()\n",
216 | "\n",
217 | "# Write data to the target database\n",
218 | "\n",
219 | "write.jdbc(df, \n",
220 | " str_jdbc_url, \n",
221 | " str_tgt_table,\n",
222 | " user = str_username,\n",
223 | " password = str_password,\n",
224 | " mode = \"append\")\n"
225 | ]
226 | }
227 | ],
228 | "metadata": {
229 | "kernelspec": {
230 | "display_name": "PySpark",
231 | "language": "",
232 | "name": "pysparkkernel"
233 | },
234 | "language_info": {
235 | "codemirror_mode": {
236 | "name": "python",
237 | "version": 3
238 | },
239 | "mimetype": "text/x-python",
240 | "name": "pyspark",
241 | "pygments_lexer": "python3"
242 | }
243 | },
244 | "nbformat": 4,
245 | "nbformat_minor": 5
246 | }
247 |
--------------------------------------------------------------------------------
/examples/jupyter-magic-commands.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "8b1bc4f8",
6 | "metadata": {},
7 | "source": [
8 | "# Using Jupyter Magics on EMR Studio\n",
9 | "\n",
10 | "\n",
11 | "#### Topics covered in this example:\n",
12 | "\n",
13 | "* Built-in Magics\n",
14 | "* EMR Magics\n",
15 | " * Mounting a Workspace Directory\n",
16 | " * Executing local python files or package\n",
17 | " * Downloading a file\n"
18 | ]
19 | },
20 | {
21 | "cell_type": "markdown",
22 | "id": "4602d4e6",
23 | "metadata": {},
24 | "source": [
25 | "***\n",
26 | "\n",
27 | "## Prerequisites\n",
28 | "\n",
29 | "NOTE : In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.
\n",
30 | "\n",
31 | "* The EMR cluster attached to this notebook should have the `Spark` application installed.\n",
32 | "* This notebook uses the `PySpark` kernel.\n",
33 | "***\n",
34 | "\n"
35 | ]
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "id": "4ed4a9c0",
40 | "metadata": {},
41 | "source": [
42 | "## Built-in Magics\n",
43 | "\n",
44 | "Jupyter magics act as convenient functions that accomplish something useful and saves the effort of writing Python code instead. There are useful buiilt-in magic functions and some are unique to EMR Studio, we document them here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-studio-magics.html\n",
45 | "\n",
46 | "The most important magic commands are:\n",
47 | "\n",
48 | "1. `%%configure` - allows you to change the session properties of a Spark session in Spark Kernels:\n",
49 | "\n",
50 | "```\n",
51 | "%%configure -f\n",
52 | "{ \"conf\": {\n",
53 | " spark.submit.deployMode\":\"cluster\"\n",
54 | " }\n",
55 | "}\n",
56 | "```\n",
57 | "\n",
58 | "2. `%%display` - is only available in Spark Kernels and allows you to display the rows of a Spark dataframe in a tabular format in addition to providing the ability to visualize the rows in a chart.\n",
59 | "\n",
60 | "```\n",
61 | "%%display df\n",
62 | "```\n",
63 | "\n",
64 | "Let\"s see the `%%display` magic in action:"
65 | ]
66 | },
67 | {
68 | "cell_type": "code",
69 | "execution_count": null,
70 | "id": "96714577",
71 | "metadata": {},
72 | "outputs": [],
73 | "source": [
74 | "data = [{\n",
75 | " \"Category\": \"A\",\n",
76 | " \"ID\": 1,\n",
77 | " \"Value\": 121.44,\n",
78 | " \"Truth\": True\n",
79 | "}, {\n",
80 | " \"Category\": \"B\",\n",
81 | " \"ID\": 2,\n",
82 | " \"Value\": 300.01,\n",
83 | " \"Truth\": False\n",
84 | "}, {\n",
85 | " \"Category\": \"C\",\n",
86 | " \"ID\": 3,\n",
87 | " \"Value\": 10.99,\n",
88 | " \"Truth\": None\n",
89 | "}, {\n",
90 | " \"Category\": \"E\",\n",
91 | " \"ID\": 4,\n",
92 | " \"Value\": 33.87,\n",
93 | " \"Truth\": True\n",
94 | "}]\n",
95 | "\n",
96 | "df = spark.createDataFrame(data)"
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": null,
102 | "id": "1e29448c",
103 | "metadata": {},
104 | "outputs": [],
105 | "source": [
106 | "%%display\n",
107 | "df"
108 | ]
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "id": "5ea11a01",
113 | "metadata": {},
114 | "source": [
115 | "## EMR Magics\n",
116 | "\n",
117 | "The EMR magics package available here (https://pypi.org/simple emr-notebooks-magics) offers the following magics that can be used on Python3 kernels as well as Spark Kernels on EMR Studio. The two magics we discuss in this notebook are:\n",
118 | "\n",
119 | "* mount_workspace_dir - allows you to mount an EMR Studio Workspace directory to an EMR Cluster.\n",
120 | "* generate_s3_download_url - alows you to generate a temporary signed download URL for an S3 object.\n",
121 | "\n",
122 | "Lets install the EMR-notebooks-magics package on your EMR Cluster:\n",
123 | "\n",
124 | "```\n",
125 | "%pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple emr-notebooks-magics\n",
126 | "```"
127 | ]
128 | },
129 | {
130 | "cell_type": "markdown",
131 | "id": "569f2868",
132 | "metadata": {},
133 | "source": [
134 | "### Mounting a Workspace Directory\n",
135 | "\n",
136 | "Lets mount a Workpsace directory on to the EMR cluster:"
137 | ]
138 | },
139 | {
140 | "cell_type": "code",
141 | "execution_count": null,
142 | "id": "aa127024",
143 | "metadata": {},
144 | "outputs": [],
145 | "source": [
146 | "%mount_workspace_dir "
147 | ]
148 | },
149 | {
150 | "cell_type": "markdown",
151 | "id": "7c36a413",
152 | "metadata": {},
153 | "source": [
154 | "Note that your current directory changes to the mounted Workspace directory and you can list the contents in it."
155 | ]
156 | },
157 | {
158 | "cell_type": "code",
159 | "execution_count": null,
160 | "id": "cb3ee468",
161 | "metadata": {},
162 | "outputs": [],
163 | "source": [
164 | "%%sh\n",
165 | "pwd"
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": null,
171 | "id": "9f868793",
172 | "metadata": {},
173 | "outputs": [],
174 | "source": [
175 | "%%sh\n",
176 | "ls"
177 | ]
178 | },
179 | {
180 | "cell_type": "markdown",
181 | "id": "750782e7",
182 | "metadata": {},
183 | "source": [
184 | "### Executing Local Python Files or Packages.\n",
185 | "\n",
186 | "\n"
187 | ]
188 | },
189 | {
190 | "cell_type": "markdown",
191 | "id": "983b4594",
192 | "metadata": {},
193 | "source": [
194 | "You can now execute local python files and packages."
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "execution_count": null,
200 | "id": "5d6d05ba",
201 | "metadata": {},
202 | "outputs": [],
203 | "source": [
204 | "%run -i \"\"\""
205 | ]
206 | },
207 | {
208 | "cell_type": "markdown",
209 | "id": "84148202",
210 | "metadata": {},
211 | "source": [
212 | "### Downloading a File.\n",
213 | "\n",
214 | "Sometimes we need to download a file to our local desktop for e.g. to further analyze some data in Excel. Lets now see the `generate_s3_download_url` magic in action that allows us to do just that. We save the dataframe as a Parquet file in an S3 bucket."
215 | ]
216 | },
217 | {
218 | "cell_type": "code",
219 | "execution_count": null,
220 | "id": "bec9d8dd",
221 | "metadata": {},
222 | "outputs": [],
223 | "source": [
224 | "df = _\n",
225 | "s3_url = \"s3:////.parquet.gzip\"\n",
226 | "df.to_parquet(s3_url, compression=\"gzip\")"
227 | ]
228 | },
229 | {
230 | "cell_type": "code",
231 | "execution_count": null,
232 | "id": "fcc17350",
233 | "metadata": {},
234 | "outputs": [],
235 | "source": [
236 | "%%sh\n",
237 | "aws s3 ls s3:////.parquet.gzip"
238 | ]
239 | },
240 | {
241 | "cell_type": "markdown",
242 | "id": "c671cc85",
243 | "metadata": {},
244 | "source": [
245 | "We can now generate a download URL for this file. Note that the url is a temporary one and the command provides options on how long the url should be available."
246 | ]
247 | },
248 | {
249 | "cell_type": "code",
250 | "execution_count": null,
251 | "id": "dbda72bf",
252 | "metadata": {},
253 | "outputs": [],
254 | "source": [
255 | "%generate_s3_download_url s3:////.parquet.gzip"
256 | ]
257 | },
258 | {
259 | "cell_type": "markdown",
260 | "id": "2f853f3a",
261 | "metadata": {},
262 | "source": [
263 | "and then view the link for the output."
264 | ]
265 | }
266 | ],
267 | "metadata": {
268 | "kernelspec": {
269 | "display_name": "PySpark",
270 | "language": "",
271 | "name": "pysparkkernel"
272 | },
273 | "language_info": {
274 | "codemirror_mode": {
275 | "name": "python",
276 | "version": 3
277 | },
278 | "mimetype": "text/x-python",
279 | "name": "pyspark",
280 | "pygments_lexer": "python3"
281 | }
282 | },
283 | "nbformat": 4,
284 | "nbformat_minor": 5
285 | }
286 |
--------------------------------------------------------------------------------
/examples/documentdb-with-spark.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "4c7d5e9f",
6 | "metadata": {},
7 | "source": [
8 | "## Connect DocumentDb using spark connector from EMR Studio Notebook using Pyspark, Spark Scala, and SparkR\n",
9 | "\n",
10 | "#### Topics covered in this example\n",
11 | "\n",
12 | "* Configuring mongodb spark connector\n",
13 | "* Configuring mongodb input database URI\n",
14 | "* Configuring mongodb output database URI\n",
15 | "* Connecting to AWS DocumentDB using mongodb spark connector to read data into Spark DF\n",
16 | "* Connecting to AWS DocumentDB using mongodb spark connector to write data from Spark DF to DocumentDB"
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "id": "1d779439",
22 | "metadata": {},
23 | "source": [
24 | "## Table of Contents:\n",
25 | "\n",
26 | "1. [Prerequisites](#Prerequisites)\n",
27 | "2. [Introduction](#Introduction)\n",
28 | "3. [Load the configuration in memory](#Load-the-configuration-in-memory)\n",
29 | "4. [Read data using Pyspark](#Read-data-using-Pyspark)\n",
30 | "5. [Write data using Pyspark](#Write-data-using-Pyspark)\n",
31 | "6. [Read data using Scala](#Read-data-using-Scala)\n",
32 | "7. [Write data using Scala](#Write-data-using-Scala)\n",
33 | "8. [Read data using SparkR](#Read-data-using-SparkR)\n",
34 | "9. [Write data using SparkR](#Write-data-using-SparkR)"
35 | ]
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "id": "99a57308",
40 | "metadata": {},
41 | "source": [
42 | "## Prerequisites"
43 | ]
44 | },
45 | {
46 | "cell_type": "markdown",
47 | "id": "2f8bec46",
48 | "metadata": {},
49 | "source": [
50 | " 1. This notebook support Multi-language support for Spark kernels\n",
51 | " 2. Mongo Spark Connector Version - mongo-spark-connector_2.12:3.0.1\n",
52 | " 3. EMR Version - emr-6.4.0\n",
53 | " 4. DocumentDB Engine Version - docdb 4.0.0"
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "id": "40d6cbd2",
59 | "metadata": {},
60 | "source": [
61 | "## Introduction"
62 | ]
63 | },
64 | {
65 | "cell_type": "markdown",
66 | "id": "71f3992d",
67 | "metadata": {},
68 | "source": [
69 | "This notebooks shows how to connect to DocumentDB using mongo spark connector(mongo-spark-connector_2.12:3.0.1) from Amazon EMR Studio Notebook using Pyspark, Scala, SparkR"
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "id": "832f5897",
75 | "metadata": {},
76 | "source": [
77 | "## Load the configuration in memory"
78 | ]
79 | },
80 | {
81 | "cell_type": "code",
82 | "execution_count": null,
83 | "id": "3a453ec7",
84 | "metadata": {},
85 | "outputs": [],
86 | "source": [
87 | "%%configure -f\n",
88 | "{\n",
89 | " \"conf\": {\n",
90 | " \"spark.mongodb.input.uri\": \"mongodb://:@:/.?readPreference=secondaryPreferred\",\n",
91 | " \"spark.mongodb.output.uri\": \"mongodb://:@:/.\",\n",
92 | " \"spark.jars.packages\": \"org.mongodb.spark:mongo-spark-connector_2.12:3.0.1\"\n",
93 | " }\n",
94 | "}"
95 | ]
96 | },
97 | {
98 | "cell_type": "markdown",
99 | "id": "eb52648d",
100 | "metadata": {},
101 | "source": [
102 | "## Read data using Pyspark"
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": null,
108 | "id": "bbc8b684",
109 | "metadata": {},
110 | "outputs": [],
111 | "source": [
112 | "%%pyspark\n",
113 | "df = spark.read.format(\"mongo\").option(\"database\", \"\").option(\"collection\", \"\").load()\n",
114 | "df.show()"
115 | ]
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "id": "7eccd39c",
120 | "metadata": {},
121 | "source": [
122 | "## Write data using Pyspark"
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": null,
128 | "id": "fef2a682",
129 | "metadata": {},
130 | "outputs": [],
131 | "source": [
132 | "%%pyspark\n",
133 | "people = spark.createDataFrame([(\"Bilbo Baggins\", 50), (\"Gandalf\", 1000), (\"Thorin\", 195), (\"Balin\", 178), (\"Kili\", 77),\n",
134 | " (\"Dwalin\", 169), (\"Oin\", 167), (\"Gloin\", 158), (\"Fili\", 82), (\"Bombur\", None)], [\"name\", \"age\"])\n",
135 | "people.show()\n",
136 | "people.write.format(\"mongo\").mode(\"append\").option(\"database\",\n",
137 | "\"\").option(\"collection\", \"\").save()\n",
138 | "df_people = spark.read.format(\"mongo\").option(\"database\", \"\").option(\"collection\", \"\").load()\n",
139 | "df_people.show()"
140 | ]
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "id": "9b059706",
145 | "metadata": {},
146 | "source": [
147 | "## Read data using Scala"
148 | ]
149 | },
150 | {
151 | "cell_type": "code",
152 | "execution_count": null,
153 | "id": "8b2563e1",
154 | "metadata": {},
155 | "outputs": [],
156 | "source": [
157 | "%%scalaspark\n",
158 | "val df = spark.read.format(\"mongo\").option(\"database\", \"\").option(\"collection\", \"\").load()\n",
159 | "df.show()"
160 | ]
161 | },
162 | {
163 | "cell_type": "markdown",
164 | "id": "75d2a02b",
165 | "metadata": {},
166 | "source": [
167 | "## Write data using Scala"
168 | ]
169 | },
170 | {
171 | "cell_type": "code",
172 | "execution_count": null,
173 | "id": "63966e6d",
174 | "metadata": {},
175 | "outputs": [],
176 | "source": [
177 | "%%scalaspark\n",
178 | "import com.mongodb.spark._\n",
179 | "import com.mongodb.spark.config._\n",
180 | "val writeConfig = WriteConfig(Map(\"collection\" -> \"\", \"writeConcern.w\" -> \"majority\"), Some(WriteConfig(sc)))\n",
181 | "val sparkDocuments = sc.parallelize((1 to 10).map(i => Document.parse(s\"{spark: $i}\")))\n",
182 | "MongoSpark.save(sparkDocuments, writeConfig)\n",
183 | "val numbers_df = spark.read.format(\"mongo\").option(\"database\", \"\").option(\"collection\", \"\").load()\n",
184 | "numbers_df.show()"
185 | ]
186 | },
187 | {
188 | "cell_type": "markdown",
189 | "id": "8b2a8692",
190 | "metadata": {},
191 | "source": [
192 | "## Read data using SparkR"
193 | ]
194 | },
195 | {
196 | "cell_type": "code",
197 | "execution_count": null,
198 | "id": "11d58b43",
199 | "metadata": {},
200 | "outputs": [],
201 | "source": [
202 | "%%rspark\n",
203 | "df <- read.df(\"\", source = \"com.mongodb.spark.sql.DefaultSource\", database = \"\", collection = \"\")\n",
204 | "showDF(df)"
205 | ]
206 | },
207 | {
208 | "cell_type": "markdown",
209 | "id": "3609e158",
210 | "metadata": {},
211 | "source": [
212 | "## Write data using SparkR"
213 | ]
214 | },
215 | {
216 | "cell_type": "code",
217 | "execution_count": null,
218 | "id": "ffddd791",
219 | "metadata": {},
220 | "outputs": [],
221 | "source": [
222 | "%%rspark\n",
223 | "charactersRdf <- data.frame(list(name=c(\"Bilbo Baggins\", \"Gandalf\", \"Thorin\",\n",
224 | " \"Balin\", \"Kili\", \"Dwalin\", \"Oin\", \"Gloin\", \"Fili\", \"Bombur\"),\n",
225 | " age=c(50, 1000, 195, 178, 77, 169, 167, 158, 82, NA)))\n",
226 | "charactersSparkdf <- createDataFrame(charactersRdf)\n",
227 | "write.df(charactersSparkdf, \"\", source = \"com.mongodb.spark.sql.DefaultSource\",\n",
228 | " mode = \"overwrite\", database = \"\", collection = \"\")\n",
229 | "characters_df <- read.df(\"\", source = \"com.mongodb.spark.sql.DefaultSource\",\n",
230 | " database = \"\", collection = \"\")\n",
231 | "showDF(characters_df)"
232 | ]
233 | }
234 | ],
235 | "metadata": {
236 | "kernelspec": {
237 | "display_name": "PySpark",
238 | "language": "",
239 | "name": "pysparkkernel"
240 | },
241 | "language_info": {
242 | "codemirror_mode": {
243 | "name": "python",
244 | "version": 3
245 | },
246 | "mimetype": "text/x-python",
247 | "name": "pyspark",
248 | "pygments_lexer": "python3"
249 | }
250 | },
251 | "nbformat": 4,
252 | "nbformat_minor": 5
253 | }
254 |
--------------------------------------------------------------------------------
/examples/Getting-started-emr-serverless.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "1f7439a3",
6 | "metadata": {},
7 | "source": [
8 | "# Get started with EMR Serverless on EMR Studio"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "id": "e283e844",
14 | "metadata": {},
15 | "source": [
16 | "#### Topics covered in this example\n",
17 | "\n",
18 | " Configure a Spark session \n",
19 | " Import a library to help with plot \n",
20 | " Spark DataFrames: reading a public dataset, selecting data and writing to a S3 location \n",
21 | " Spark SQL: creating a new view and selecting data \n",
22 | " Visualize your data \n",
23 | " "
24 | ]
25 | },
26 | {
27 | "cell_type": "markdown",
28 | "id": "d16c0e10",
29 | "metadata": {
30 | "execution": {
31 | "iopub.execute_input": "2023-10-16T17:21:25.407818Z",
32 | "iopub.status.busy": "2023-10-16T17:21:25.407393Z",
33 | "iopub.status.idle": "2023-10-16T17:21:39.912554Z",
34 | "shell.execute_reply": "2023-10-16T17:21:39.911928Z",
35 | "shell.execute_reply.started": "2023-10-16T17:21:25.407789Z"
36 | }
37 | },
38 | "source": [
39 | "***\n",
40 | "\n",
41 | "## Prerequisites\n",
42 | "\n",
43 | "NOTE : In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.
\n",
44 | "\n",
45 | "* EMR Serverless should be chosen as the Compute.\n",
46 | "* Make sure the Studio user role has permission to attach the Workspace to the Application and to pass the runtime role to it.\n",
47 | "* This notebook uses the `PySpark` kernel.\n",
48 | "* Your Serverless Application must be configured with a VPC that has internet connectivity. [Learn more](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/vpc-access.html)\n",
49 | "***"
50 | ]
51 | },
52 | {
53 | "cell_type": "markdown",
54 | "id": "8af6027b",
55 | "metadata": {},
56 | "source": [
57 | "## 1. Configure your Spark session.\n",
58 | "Configure the Spark Session to use Virtualenv. Virtualenv is needed to install other Python packages."
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": null,
64 | "id": "24ce8423",
65 | "metadata": {
66 | "tags": []
67 | },
68 | "outputs": [],
69 | "source": [
70 | "%%configure -f\n",
71 | "{\n",
72 | " \"conf\": {\n",
73 | " \"spark.pyspark.virtualenv.enabled\": \"true\",\n",
74 | " \"spark.pyspark.virtualenv.bin.path\": \"/usr/bin/virtualenv\",\n",
75 | " \"spark.pyspark.virtualenv.type\": \"native\",\n",
76 | " \"spark.pyspark.python\": \"/usr/bin/python3\",\n",
77 | " \"spark.executorEnv.PYSPARK_PYTHON\": \"/usr/bin/python3\"\n",
78 | " }\n",
79 | "}"
80 | ]
81 | },
82 | {
83 | "cell_type": "markdown",
84 | "id": "b2165194",
85 | "metadata": {},
86 | "source": [
87 | "Start a Spark session:"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": null,
93 | "id": "d14c84e6",
94 | "metadata": {
95 | "tags": []
96 | },
97 | "outputs": [],
98 | "source": [
99 | "spark"
100 | ]
101 | },
102 | {
103 | "cell_type": "markdown",
104 | "id": "1ea0659c",
105 | "metadata": {},
106 | "source": [
107 | "Run the `%%info` magic command which shows the Spark configuration for the current session as well as provides links to navigate to the live Spark UI for the session:"
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": null,
113 | "id": "4148b249",
114 | "metadata": {
115 | "tags": []
116 | },
117 | "outputs": [],
118 | "source": [
119 | "%%info"
120 | ]
121 | },
122 | {
123 | "cell_type": "markdown",
124 | "id": "02facc47",
125 | "metadata": {},
126 | "source": [
127 | "---\n",
128 | "## 2. Install packages from PyPI\n",
129 | "We will install matplotlib Python package. \n",
130 | "\n",
131 | "NOTE : You will need internet access to do this step.
"
132 | ]
133 | },
134 | {
135 | "cell_type": "code",
136 | "execution_count": null,
137 | "id": "bd12d484",
138 | "metadata": {
139 | "tags": []
140 | },
141 | "outputs": [],
142 | "source": [
143 | "sc.install_pypi_package(\"matplotlib\")"
144 | ]
145 | },
146 | {
147 | "cell_type": "markdown",
148 | "id": "26a5516b",
149 | "metadata": {},
150 | "source": [
151 | "---\n",
152 | "## 3. Read data from S3\n",
153 | "We will use a public data set on NYC yellow taxis. Read the Parquet file from S3. The file has headers and we want Spark to infer the schema. \n",
154 | "\n",
155 | "NOTE : You will need to update your runtime role to allow Get access to the s3://athena-examples-us-east-1/notebooks/ folder and its sub-folders.
"
156 | ]
157 | },
158 | {
159 | "cell_type": "code",
160 | "execution_count": null,
161 | "id": "34e4291d",
162 | "metadata": {},
163 | "outputs": [],
164 | "source": [
165 | "file_name = \"s3://athena-examples-us-east-1/notebooks/yellow_tripdata_2016-01.parquet\"\n",
166 | "\n",
167 | "taxi_df = (spark.read.format(\"parquet\").option(\"header\", \"true\") \\\n",
168 | " .option(\"inferSchema\", \"true\").load(file_name))"
169 | ]
170 | },
171 | {
172 | "cell_type": "markdown",
173 | "id": "8f910a35",
174 | "metadata": {},
175 | "source": [
176 | "#### Use Spark Dataframe to group and count specific column from taxi_df"
177 | ]
178 | },
179 | {
180 | "cell_type": "code",
181 | "execution_count": null,
182 | "id": "6c66389d",
183 | "metadata": {},
184 | "outputs": [],
185 | "source": [
186 | "taxi1_df = taxi_df.groupBy(\"VendorID\", \"passenger_count\").count()\n",
187 | "taxi1_df.show()"
188 | ]
189 | },
190 | {
191 | "cell_type": "markdown",
192 | "id": "afe654d5",
193 | "metadata": {},
194 | "source": [
195 | "### Use the %%display magic to quickly visualize a dataframe\n",
196 | "\n",
197 | " You can choose to view the results in a table format. \n",
198 | " You can also choose to visualize your data with five types of charts. You can select the display type below and the chart will change accordingly. \n",
199 | " "
200 | ]
201 | },
202 | {
203 | "cell_type": "code",
204 | "execution_count": null,
205 | "id": "a1649eed",
206 | "metadata": {
207 | "tags": []
208 | },
209 | "outputs": [],
210 | "source": [
211 | "%%display\n",
212 | "taxi1_df"
213 | ]
214 | },
215 | {
216 | "cell_type": "markdown",
217 | "id": "6f8a3889",
218 | "metadata": {},
219 | "source": [
220 | "---\n",
221 | "## 4. Run Spark SQL commands\n",
222 | "#### Create a new temporary view taxis. Use Spark SQL to select data from this view. Create a taxi dataframe for further processing"
223 | ]
224 | },
225 | {
226 | "cell_type": "code",
227 | "execution_count": null,
228 | "id": "d34e2a59",
229 | "metadata": {},
230 | "outputs": [],
231 | "source": [
232 | "taxi_df.createOrReplaceTempView(\"taxis\")\n",
233 | "\n",
234 | "sqlDF = spark.sql(\n",
235 | " \"SELECT DOLocationID, sum(total_amount) as sum_total_amount \\\n",
236 | " FROM taxis where DOLocationID < 25 Group by DOLocationID ORDER BY DOLocationID\"\n",
237 | ")\n",
238 | "sqlDF.show(50)"
239 | ]
240 | },
241 | {
242 | "cell_type": "markdown",
243 | "id": "ea77d28f",
244 | "metadata": {},
245 | "source": [
246 | "Use %%sql magic"
247 | ]
248 | },
249 | {
250 | "cell_type": "code",
251 | "execution_count": null,
252 | "id": "ecbeea32",
253 | "metadata": {},
254 | "outputs": [],
255 | "source": [
256 | "%%sql\n",
257 | "SHOW DATABASES"
258 | ]
259 | },
260 | {
261 | "cell_type": "markdown",
262 | "id": "08a44bb0",
263 | "metadata": {},
264 | "source": [
265 | "---\n",
266 | "## 5. Visualize your data using Python \n",
267 | "#### Use matplotlib to plot the drop off location and the total amount as a bar chart"
268 | ]
269 | },
270 | {
271 | "cell_type": "code",
272 | "execution_count": null,
273 | "id": "fef525f5",
274 | "metadata": {},
275 | "outputs": [],
276 | "source": [
277 | "import matplotlib.pyplot as plt\n",
278 | "import numpy as np\n",
279 | "import pandas as pd\n",
280 | "\n",
281 | "plt.clf()\n",
282 | "df = sqlDF.toPandas()\n",
283 | "plt.bar(df.DOLocationID, df.sum_total_amount)\n",
284 | "%matplot plt"
285 | ]
286 | },
287 | {
288 | "cell_type": "markdown",
289 | "id": "0ec35ea5",
290 | "metadata": {},
291 | "source": [
292 | "### You have made it to the end of the demo notebook!!"
293 | ]
294 | }
295 | ],
296 | "metadata": {
297 | "kernelspec": {
298 | "display_name": "PySpark",
299 | "language": "python",
300 | "name": "spark_magic_pyspark"
301 | },
302 | "language_info": {
303 | "codemirror_mode": {
304 | "name": "python",
305 | "version": 3
306 | },
307 | "file_extension": ".py",
308 | "mimetype": "text/x-python",
309 | "name": "pyspark",
310 | "pygments_lexer": "python3"
311 | }
312 | },
313 | "nbformat": 4,
314 | "nbformat_minor": 5
315 | }
316 |
--------------------------------------------------------------------------------
/examples/udf-with-spark-sql.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Use Spark UDF with Spark SQL and Spark DataFrame\n",
8 | "\n",
9 | "#### Topics covered in this example\n",
10 | "* Creating and registering UDF.\n",
11 | "* Special handling and best practices for UDF."
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "***\n",
19 | "\n",
20 | "## Prerequisites\n",
21 | "\n",
22 | "NOTE : In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.
\n",
23 | "\n",
24 | "* The EMR cluster attached to this notebook should have the `Spark` application installed.\n",
25 | "* This notebook uses the `Spark` kernel.\n",
26 | "***"
27 | ]
28 | },
29 | {
30 | "cell_type": "markdown",
31 | "metadata": {},
32 | "source": [
33 | "## Introduction\n",
34 | "User-Defined Functions (UDFs) are user-programmable routines that act on one row. This example demonstrate how to define and register UDFs and invoke them in SQL using the `%%sql` magic.\n",
35 | "\n",
36 | "#### Why do you need UDFs ?\n",
37 | "Spark stores data in dataframes or RDDs—resilient distributed datasets. As with a traditional SQL database, you cannot create your own custom function and run that against the database directly unless you register the function first. That is, save it to the database as if it were one of the built-in database functions, like sum(), average(), count(), etc.\n",
38 | "\n",
39 | "The document: Scala UDF provides detailed information.\n",
40 | "***"
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {},
46 | "source": [
47 | "## Example"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
54 | "Create a spark dataFrame."
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": null,
60 | "metadata": {},
61 | "outputs": [],
62 | "source": [
63 | "import spark.implicits._\n",
64 | "\n",
65 | "val columns = Seq(\"No\", \"Name\")\n",
66 | "\n",
67 | "val data = Seq((\"1\", \"john jones\"),\n",
68 | " (\"2\", \"tracey smith\"),\n",
69 | " (\"3\", \"amy sanders\"))\n",
70 | "\n",
71 | "val df = data.toDF(columns:_*)"
72 | ]
73 | },
74 | {
75 | "cell_type": "markdown",
76 | "metadata": {},
77 | "source": [
78 | "Print the table dataFrame."
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": null,
84 | "metadata": {},
85 | "outputs": [],
86 | "source": [
87 | "df.show(false)"
88 | ]
89 | },
90 | {
91 | "cell_type": "markdown",
92 | "metadata": {},
93 | "source": [
94 | "#### Create Spark UDF to use it on DataFrame\n",
95 | "\n",
96 | "Create a function to convert a string to camel case. The function takes a string parameter and converts the first letter of every word to upper case letter."
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": null,
102 | "metadata": {},
103 | "outputs": [],
104 | "source": [
105 | "val convertToCamelCase = (str:String) => {\n",
106 | " val arr = str.split(\" \")\n",
107 | " arr.map(f => f.substring(0,1).toUpperCase + f.substring(1,f.length)).mkString(\" \")\n",
108 | "}"
109 | ]
110 | },
111 | {
112 | "cell_type": "markdown",
113 | "metadata": {},
114 | "source": [
115 | "Now convert this function `convertToCamelCase()` to a UDF by passing the function to Spark SQL `udf()`."
116 | ]
117 | },
118 | {
119 | "cell_type": "code",
120 | "execution_count": null,
121 | "metadata": {},
122 | "outputs": [],
123 | "source": [
124 | "val convertToCamelCaseUDF = udf(convertToCamelCase)"
125 | ]
126 | },
127 | {
128 | "cell_type": "markdown",
129 | "metadata": {},
130 | "source": [
131 | "Now you can use the `convertToCamelCaseUDF()` on a DataFrame column."
132 | ]
133 | },
134 | {
135 | "cell_type": "code",
136 | "execution_count": null,
137 | "metadata": {},
138 | "outputs": [],
139 | "source": [
140 | "df.select(col(\"No\"), convertToCamelCaseUDF(col(\"Name\")).as(\"Name\") ).show(false)"
141 | ]
142 | },
143 | {
144 | "cell_type": "markdown",
145 | "metadata": {},
146 | "source": [
147 | "#### Registering Spark UDF to use it on SQL\n",
148 | "\n",
149 | "In order to use a function on Spark SQL, you need to register the function with Spark using `spark.udf.register()`.\n",
150 | "\n",
151 | "Create another function convertToUpperCase to convert the whole string to upper case and register as udf."
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": null,
157 | "metadata": {},
158 | "outputs": [],
159 | "source": [
160 | "val convertToUpperCase = (str:String) => {\n",
161 | " str.toUpperCase()\n",
162 | "}\n",
163 | "\n",
164 | "spark.udf.register(\"convertToUpperCaseUDF\", convertToUpperCase)"
165 | ]
166 | },
167 | {
168 | "cell_type": "markdown",
169 | "metadata": {},
170 | "source": [
171 | "Create a temp view named `NAME_TABLE`. `createOrReplaceTempView` creates (or replaces if that view name already exists) a lazily evaluated view that you can then use like a hive table in Spark SQL. It is not persistent at this moment but you can run SQL queries on top of the view."
172 | ]
173 | },
174 | {
175 | "cell_type": "code",
176 | "execution_count": null,
177 | "metadata": {},
178 | "outputs": [],
179 | "source": [
180 | "df.createOrReplaceTempView(\"NAME_TABLE\")\n",
181 | "spark.sql(\"select No, convertToUpperCaseUDF(Name) from NAME_TABLE\").show(false)"
182 | ]
183 | },
184 | {
185 | "cell_type": "markdown",
186 | "metadata": {},
187 | "source": [
188 | "You can also use the `%%sql` magic to query. Use the `convertToUpperCaseUDF` udf and display a new column `UpperCaseName`."
189 | ]
190 | },
191 | {
192 | "cell_type": "code",
193 | "execution_count": 17,
194 | "metadata": {},
195 | "outputs": [],
196 | "source": [
197 | "%%sql\n",
198 | "select No, Name, convertToUpperCaseUDF(Name) as UpperCaseName from NAME_TABLE"
199 | ]
200 | },
201 | {
202 | "cell_type": "markdown",
203 | "metadata": {},
204 | "source": [
205 | "### Special Handling"
206 | ]
207 | },
208 | {
209 | "cell_type": "markdown",
210 | "metadata": {},
211 | "source": [
212 | "#### Execution order\n",
213 | "Spark SQL (including SQL and the DataFrame and Dataset API) does not guarantee the order of evaluation of subexpressions. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order. For example, logical AND and OR expressions do not have left-to-right “short-circuiting” semantics.\n",
214 | "\n",
215 | "Therefore, it is dangerous to rely on the side effects or order of evaluation of Boolean expressions, and the order of WHERE and HAVING clauses, since such expressions and clauses can be reordered during query optimization and planning. Specifically, if a UDF relies on short-circuiting semantics in SQL for null checking, there’s no guarantee that the null check will happen before invoking the UDF. For example,\n",
216 | "\n",
217 | "```\n",
218 | "%%sql\n",
219 | "SELECT No, convertToUpperCaseUDF(Name) as Name from NAME_TABLE WHERE Name is not null and convertToUpperCaseUDF(Name) like \"%John%\"\n",
220 | "```\n",
221 | "\n",
222 | "This WHERE clause does not guarantee the `convertToUpperCaseUDF` to be invoked after filtering out nulls."
223 | ]
224 | },
225 | {
226 | "cell_type": "markdown",
227 | "metadata": {},
228 | "source": [
229 | "#### Handling null check\n",
230 | "UDF’s are error-prone when not designed carefully. for example, when you have a column that contains the value null on some records.\n",
231 | "\n",
232 | "```\n",
233 | "columns = [\"No\",\"Name\"]\n",
234 | "\n",
235 | "data = [(\"1\", \"john jones\"),\n",
236 | " (\"2\", \"tracey smith\"),\n",
237 | " (\"3\", \"amy sanders\"),\n",
238 | " (\"4\", null)]\n",
239 | "\n",
240 | "%%sql\n",
241 | "select No, convertToUpperCaseUDF(Name) as Name from NAME_TABLE\n",
242 | "```\n",
243 | "\n",
244 | "Record with `No 4` has value `null` for the `Name` column. Since we are not handling null with UDF function, using this on DataFrame returns an error."
245 | ]
246 | },
247 | {
248 | "cell_type": "markdown",
249 | "metadata": {},
250 | "source": [
251 | "#### Performance\n",
252 | "\n",
253 | "UDF’s are a black box and hence optimizations can’t be applied on the Dataframe/Dataset. When possible you should use the Spark SQL built-in functions as these functions provide optimizations."
254 | ]
255 | },
256 | {
257 | "cell_type": "markdown",
258 | "metadata": {},
259 | "source": [
260 | "***\n",
261 | "### Best practices while using UDFs.\n",
262 | "\n",
263 | "* Make the UDF null-aware and do null checking inside the UDF itself.\n",
264 | "* Use `IF` or `CASE WHEN` expressions to do the null check and invoke the UDF in a conditional branch.\n",
265 | "* Create UDF only when existing built-in SQL function are insufficient."
266 | ]
267 | }
268 | ],
269 | "metadata": {
270 | "kernelspec": {
271 | "display_name": "Spark",
272 | "language": "",
273 | "name": "sparkkernel"
274 | },
275 | "language_info": {
276 | "codemirror_mode": "text/x-scala",
277 | "mimetype": "text/x-scala",
278 | "name": "scala",
279 | "pygments_lexer": "scala"
280 | }
281 | },
282 | "nbformat": 4,
283 | "nbformat_minor": 4
284 | }
285 |
--------------------------------------------------------------------------------
/examples/machine-learning-with-pyspark-linear-regression.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "italic-invalid",
6 | "metadata": {},
7 | "source": [
8 | "# Spark Machine Learning using linear regression\n",
9 | "\n",
10 | "\n",
11 | "#### Topics covered in this example\n",
12 | "* `VectorAssembler`, `LinearRegression` and `RegressionEvaluator` from `pyspark.ml`."
13 | ]
14 | },
15 | {
16 | "cell_type": "markdown",
17 | "id": "moderate-dominant",
18 | "metadata": {},
19 | "source": [
20 | "***\n",
21 | "\n",
22 | "## Prerequisites\n",
23 | "\n",
24 | "NOTE : In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.
\n",
25 | "\n",
26 | "* The EMR cluster attached to this notebook should have the `Spark` application installed.\n",
27 | "* This example uses a public dataset, hence the EMR cluster attached to this notebook must have internet connectivity.\n",
28 | "* This notebook uses the `PySpark` kernel.\n",
29 | "***"
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "id": "northern-violence",
35 | "metadata": {},
36 | "source": [
37 | "## Introduction\n",
38 | "In this example we use pyspark to predict the total cost of a trip using New York City Taxi and Limousine Commission (TLC) Trip Record Data from Registry of Open Data on AWS .\n",
39 | "\n",
40 | "***"
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "id": "difficult-enough",
46 | "metadata": {},
47 | "source": [
48 | "## Example\n",
49 | "Load the data set for trips into a Spark DataFrame."
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": null,
55 | "id": "human-probe",
56 | "metadata": {
57 | "tags": []
58 | },
59 | "outputs": [],
60 | "source": [
61 | "df = spark.read.format(\"parquet\") \\\n",
62 | ".load(\"s3://nyc-tlc/trip data/yellow_tripdata_2022-12.parquet\", \n",
63 | " inferSchema = True, \n",
64 | " header = True)"
65 | ]
66 | },
67 | {
68 | "cell_type": "markdown",
69 | "id": "professional-panama",
70 | "metadata": {},
71 | "source": [
72 | "Mark the dataFrame for caching in memory and display the schema to check the data-types using the `printSchema` method."
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": null,
78 | "id": "distinct-dollar",
79 | "metadata": {
80 | "tags": []
81 | },
82 | "outputs": [],
83 | "source": [
84 | "# Mark the dataFrame for caching in memory\n",
85 | "df.cache()\n",
86 | "\n",
87 | "# Print the scehma\n",
88 | "df.printSchema()\n",
89 | "\n",
90 | "# Get the dimensions of the data\n",
91 | "df.count() , len(df.columns)"
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "execution_count": null,
97 | "id": "descending-messaging",
98 | "metadata": {
99 | "tags": []
100 | },
101 | "outputs": [],
102 | "source": [
103 | "# Get the summary of the columns\n",
104 | "df.select(\"total_amount\", \"tip_amount\")\\\n",
105 | ".describe()\\\n",
106 | ".show()\n",
107 | "\n",
108 | "# Value counts of VendorID column\n",
109 | "df.groupBy(\"VendorID\")\\\n",
110 | ".count()\\\n",
111 | ".show()"
112 | ]
113 | },
114 | {
115 | "cell_type": "markdown",
116 | "id": "compressed-maintenance",
117 | "metadata": {},
118 | "source": [
119 | "### Use VectorAssembler to transform input columns into vectors\n",
120 | "pyspark.ml provides dataFrame-based machine learning APIs to let users quickly assemble and configure practical machine learning pipelines. \n",
121 | "A `VectorAssembler` combines a given list of columns into a single vector column. In the below cell we combine the columns to a single vector cloumn `features`."
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": null,
127 | "id": "magnetic-anaheim",
128 | "metadata": {
129 | "tags": []
130 | },
131 | "outputs": [],
132 | "source": [
133 | "from pyspark.ml.feature import VectorAssembler\n",
134 | "\n",
135 | "# Specify the input and output columns of the vector assembler\n",
136 | "vectorAssembler = VectorAssembler(\n",
137 | " inputCols = [\n",
138 | " \"trip_distance\",\n",
139 | " \"PULocationID\",\n",
140 | " \"DOLocationID\",\n",
141 | " \"fare_amount\",\n",
142 | " \"mta_tax\",\n",
143 | " \"tip_amount\", \n",
144 | " \"tolls_amount\",\n",
145 | " \"improvement_surcharge\", \n",
146 | " \"congestion_surcharge\"\n",
147 | " ], \n",
148 | " outputCol = \"features\")\n",
149 | "\n",
150 | "# Transform the data\n",
151 | "v_df = vectorAssembler.setHandleInvalid(\"skip\").transform(df)\n",
152 | "\n",
153 | "# View the transformed data\n",
154 | "v_df = v_df.select([\"features\", \"total_amount\"])\n",
155 | "v_df.show(3)"
156 | ]
157 | },
158 | {
159 | "cell_type": "markdown",
160 | "id": "baking-surge",
161 | "metadata": {},
162 | "source": [
163 | "Divide input dataset into training set and test set"
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": null,
169 | "id": "monthly-ranch",
170 | "metadata": {
171 | "tags": []
172 | },
173 | "outputs": [],
174 | "source": [
175 | "splits = v_df.randomSplit([0.7, 0.3])\n",
176 | "train_df = splits[0]\n",
177 | "test_df = splits[1]"
178 | ]
179 | },
180 | {
181 | "cell_type": "markdown",
182 | "id": "august-terror",
183 | "metadata": {},
184 | "source": [
185 | "### Train the model using LinearRegression against training set"
186 | ]
187 | },
188 | {
189 | "cell_type": "code",
190 | "execution_count": null,
191 | "id": "golden-industry",
192 | "metadata": {
193 | "tags": []
194 | },
195 | "outputs": [],
196 | "source": [
197 | "from pyspark.ml.regression import LinearRegression\n",
198 | "\n",
199 | "lr = LinearRegression(featuresCol = \"features\", \\\n",
200 | " labelCol = \"total_amount\", \\\n",
201 | " maxIter = 100, \\\n",
202 | " regParam = 0.3, \\\n",
203 | " elasticNetParam = 0.8)\n",
204 | "lr_model = lr.fit(train_df)\n",
205 | "print(\"Coefficients: \" + str(lr_model.coefficients))\n",
206 | "print(\"Intercept: \" + str(lr_model.intercept))"
207 | ]
208 | },
209 | {
210 | "cell_type": "markdown",
211 | "id": "congressional-buddy",
212 | "metadata": {},
213 | "source": [
214 | "Report the trained model performance on the training set"
215 | ]
216 | },
217 | {
218 | "cell_type": "code",
219 | "execution_count": null,
220 | "id": "waiting-healthcare",
221 | "metadata": {
222 | "tags": []
223 | },
224 | "outputs": [],
225 | "source": [
226 | "training_summary = lr_model.summary\n",
227 | "print(\"RMSE: %f\" % training_summary.rootMeanSquaredError)\n",
228 | "print(\"R squred (R2): %f\" % training_summary.r2)"
229 | ]
230 | },
231 | {
232 | "cell_type": "markdown",
233 | "id": "personal-serve",
234 | "metadata": {},
235 | "source": [
236 | "Predict the result using test set and report accuracy"
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": null,
242 | "id": "arabic-corpus",
243 | "metadata": {
244 | "tags": []
245 | },
246 | "outputs": [],
247 | "source": [
248 | "predictions = lr_model.transform(test_df)\n",
249 | "\n",
250 | "from pyspark.sql.functions import col\n",
251 | "predictions.filter(predictions.total_amount > 10.0)\\\n",
252 | ".select(\"prediction\", \"total_amount\")\\\n",
253 | ".withColumn(\"diff\", col(\"prediction\") - col(\"total_amount\"))\\\n",
254 | ".withColumn(\"diff%\", (col(\"diff\") / col(\"total_amount\")) * 100)\\\n",
255 | ".show()"
256 | ]
257 | },
258 | {
259 | "cell_type": "markdown",
260 | "id": "copyrighted-kazakhstan",
261 | "metadata": {},
262 | "source": [
263 | "### Report performance on the test set using RegressionEvaluator "
264 | ]
265 | },
266 | {
267 | "cell_type": "code",
268 | "execution_count": null,
269 | "id": "thirty-beaver",
270 | "metadata": {
271 | "tags": []
272 | },
273 | "outputs": [],
274 | "source": [
275 | "from pyspark.ml.evaluation import RegressionEvaluator\n",
276 | "\n",
277 | "lr_evaluator = RegressionEvaluator(predictionCol = \"prediction\", \\\n",
278 | " labelCol = \"total_amount\", \\\n",
279 | " metricName = \"r2\")\n",
280 | "print(\"R Squared (R2) on test data = %g\" % lr_evaluator.evaluate(predictions))"
281 | ]
282 | },
283 | {
284 | "cell_type": "code",
285 | "execution_count": null,
286 | "id": "26f8ecab-a88c-4840-9550-b3a99c177226",
287 | "metadata": {},
288 | "outputs": [],
289 | "source": []
290 | }
291 | ],
292 | "metadata": {
293 | "kernelspec": {
294 | "display_name": "PySpark",
295 | "language": "python",
296 | "name": "pysparkkernel"
297 | },
298 | "language_info": {
299 | "codemirror_mode": {
300 | "name": "python",
301 | "version": 3
302 | },
303 | "file_extension": ".py",
304 | "mimetype": "text/x-python",
305 | "name": "pyspark",
306 | "pygments_lexer": "python3"
307 | }
308 | },
309 | "nbformat": 4,
310 | "nbformat_minor": 5
311 | }
312 |
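A short companion sketch for the notebook above: the test-set report only covers R2; RMSE can be reported the same way by switching `metricName` on `RegressionEvaluator`, reusing the `predictions` DataFrame produced by `lr_model.transform(test_df)`.

```python
# Sketch: report RMSE on the test set alongside R2, reusing the `predictions`
# DataFrame created earlier in the notebook.
from pyspark.ml.evaluation import RegressionEvaluator

rmse_evaluator = RegressionEvaluator(predictionCol="prediction",
                                     labelCol="total_amount",
                                     metricName="rmse")
print("RMSE on test data = %g" % rmse_evaluator.evaluate(predictions))
```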
--------------------------------------------------------------------------------
/examples/visualize-data-with-pandas-matplotlib.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Visualize data with `pandas` and `matplotlib`\n",
8 | "\n",
9 | "#### Topics covered in this example\n",
10 | "* Installing python libraries on the EMR cluster.\n",
11 | "* Reading data as `pandas` DataFrame and converting to `NumPy` array.\n",
12 | "* Pie, bar, histogram and scatter plots using `matplotlib`."
13 | ]
14 | },
15 | {
16 | "cell_type": "markdown",
17 | "metadata": {},
18 | "source": [
19 | "***\n",
20 | "\n",
21 | "## Prerequisites\n",
22 | "\n",
23 | "NOTE : In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.
\n",
24 | "\n",
25 | "* This example uses a public dataset from stanford, hence the EMR cluster attached to this notebook must have internet connectivity.\n",
26 | "* You may have to restart the kernel after installing the libraries.\n",
27 | "* This notebook uses the `Python3` kernel.\n",
28 | "***"
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "metadata": {},
34 | "source": [
35 | "## Introduction\n",
36 | "In this example, we are going to visualize data using the `matplotlib` library. \n",
37 | "Matplotlib is one of the most popular Python packages used for data visualization.\n",
38 | "Gallery covers a variety of ways to use Matplotlib.\n",
39 | "\n",
40 | "We use a public dataset (csv) from web.stanford.edu \n",
41 | "This dataset is a collection of information on coastal status, EU membership, and population (in millions) of different countries.\n",
42 | "***"
43 | ]
44 | },
45 | {
46 | "cell_type": "markdown",
47 | "metadata": {},
48 | "source": [
49 | "Install dependency `matplotlib`.\n",
50 | "\n",
51 | "`%pip install` is the same as `!/emr/notebook-env/bin/pip install` and are installed in `/home/emr-notebook/`.\n",
52 | "\n",
53 | "After installation, these libraries are available to any user running an EMR notebook attached to the cluster. Python libraries installed this way are available only to processes running on the master node. The libraries are not installed on core or task nodes and are not available to executors running on those nodes."
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": null,
59 | "metadata": {},
60 | "outputs": [],
61 | "source": [
62 | "%pip install matplotlib"
63 | ]
64 | },
65 | {
66 | "cell_type": "markdown",
67 | "metadata": {},
68 | "source": [
69 | "Import dependencies"
70 | ]
71 | },
72 | {
73 | "cell_type": "code",
74 | "execution_count": null,
75 | "metadata": {},
76 | "outputs": [],
77 | "source": [
78 | "import pandas as pd\n",
79 | "import numpy as np\n",
80 | "import math\n",
81 | "import matplotlib"
82 | ]
83 | },
84 | {
85 | "cell_type": "markdown",
86 | "metadata": {},
87 | "source": [
88 | "Load the dataset into a pandas DataFrame."
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": null,
94 | "metadata": {},
95 | "outputs": [],
96 | "source": [
97 | "df = pd.read_csv(\"https://web.stanford.edu/class/cs102/datasets/Countries.csv\")"
98 | ]
99 | },
100 | {
101 | "cell_type": "markdown",
102 | "metadata": {},
103 | "source": [
104 | "Print some sample records. \n",
105 | "`Population` is in millions."
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "execution_count": null,
111 | "metadata": {},
112 | "outputs": [],
113 | "source": [
114 | "print(df.head(5))"
115 | ]
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "Add `% to total` in the dataset. The total of the `% to total` of all the rows is 100."
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": null,
127 | "metadata": {},
128 | "outputs": [],
129 | "source": [
130 | "df[\"% to total\"] = df[\"population\"].apply(lambda x: float(x/df[\"population\"].sum()*100))"
131 | ]
132 | },
133 | {
134 | "cell_type": "markdown",
135 | "metadata": {},
136 | "source": [
137 | "Print sample records."
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": null,
143 | "metadata": {},
144 | "outputs": [],
145 | "source": [
146 | "print(df.head(5))"
147 | ]
148 | },
149 | {
150 | "cell_type": "markdown",
151 | "metadata": {},
152 | "source": [
153 | "Display top 10 populated countries for the analysis."
154 | ]
155 | },
156 | {
157 | "cell_type": "code",
158 | "execution_count": null,
159 | "metadata": {},
160 | "outputs": [],
161 | "source": [
162 | "top_ten_df = df.nlargest(10, [\"% to total\"])"
163 | ]
164 | },
165 | {
166 | "cell_type": "markdown",
167 | "metadata": {},
168 | "source": [
169 | "### Pie chart for the population of top 10 countries"
170 | ]
171 | },
172 | {
173 | "cell_type": "code",
174 | "execution_count": null,
175 | "metadata": {},
176 | "outputs": [],
177 | "source": [
178 | "import matplotlib.pyplot as plt\n",
179 | "\n",
180 | "plt.pie(top_ten_df[\"% to total\"], labels=top_ten_df[\"country\"], shadow=False, startangle=0, frame=False, autopct=\"%1.1f%%\")\n",
181 | "plt.axis(\"equal\")\n",
182 | "plt.title(\"Country Population %\", loc=\"left\")\n",
183 | "plt.legend(labels = top_ten_df[\"country\"], bbox_to_anchor=(1.7,0.5), loc=\"right\")\n",
184 | "fig = plt.gcf()\n",
185 | "fig.set_size_inches(6,6) \n",
186 | "plt.show()"
187 | ]
188 | },
189 | {
190 | "cell_type": "markdown",
191 | "metadata": {},
192 | "source": [
193 | "### Bar chart for the population of top 10 countries\n",
194 | "\n",
195 | "`to_numpy()` converts the pandas DataFrame to a `NumPy` array."
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": null,
201 | "metadata": {},
202 | "outputs": [],
203 | "source": [
204 | "x = top_ten_df[\"country\"].to_numpy()\n",
205 | "y = top_ten_df[\"population\"].to_numpy()\n",
206 | "\n",
207 | "x_pos = [i for i, _ in enumerate(x)]\n",
208 | "\n",
209 | "plt.bar(x_pos, y, color=\"green\")\n",
210 | "plt.xlabel(\"Country\")\n",
211 | "plt.ylabel(\"Population\")\n",
212 | "plt.title(\"Country Population\")\n",
213 | "fig = plt.gcf()\n",
214 | "fig.set_size_inches(13,6) \n",
215 | "plt.xticks(x_pos, x)\n",
216 | "\n",
217 | "plt.show()"
218 | ]
219 | },
220 | {
221 | "cell_type": "markdown",
222 | "metadata": {},
223 | "source": [
224 | "### Histogram for population of all countries"
225 | ]
226 | },
227 | {
228 | "cell_type": "code",
229 | "execution_count": null,
230 | "metadata": {},
231 | "outputs": [],
232 | "source": [
233 | "population_dataset = df[\"population\"]"
234 | ]
235 | },
236 | {
237 | "cell_type": "code",
238 | "execution_count": null,
239 | "metadata": {},
240 | "outputs": [],
241 | "source": [
242 | "# Find bin size for the histogram\n",
243 | "n_bin = math.ceil((2*(np.percentile(population_dataset, 75) - np.percentile(population_dataset, 25))/population_dataset.count()**(1/3)))"
244 | ]
245 | },
246 | {
247 | "cell_type": "code",
248 | "execution_count": null,
249 | "metadata": {},
250 | "outputs": [],
251 | "source": [
252 | "plt.hist(population_dataset, bins=n_bin, color=\"blue\") \n",
253 | "plt.xlabel(\"Country Population\")\n",
254 | "plt.ylabel(\"Number of Countries\")\n",
255 | "plt.title(\"Country Population Histogram\")\n",
256 | "plt.grid(True)\n",
257 | "plt.show()"
258 | ]
259 | },
260 | {
261 | "cell_type": "markdown",
262 | "metadata": {},
263 | "source": [
264 | "### Scatter plot for Boston Houses"
265 | ]
266 | },
267 | {
268 | "cell_type": "markdown",
269 | "metadata": {},
270 | "source": [
271 | "Tabular data: Boston Housing Data.\n",
272 | "The Boston House contains information collected by the U.S Census Service concerning housing in the area of Boston Mass.\n",
273 | "\n",
274 | "`CRIM` - per capita crime rate by town. \n",
275 | "`LSTAT` - % lower status of the population. \n",
276 | "\n",
277 | "Install dependancy `sklearn`."
278 | ]
279 | },
280 | {
281 | "cell_type": "code",
282 | "execution_count": null,
283 | "metadata": {},
284 | "outputs": [],
285 | "source": [
286 | "%pip install scikit-learn"
287 | ]
288 | },
289 | {
290 | "cell_type": "code",
291 | "execution_count": null,
292 | "metadata": {},
293 | "outputs": [],
294 | "source": [
295 | "from sklearn.datasets import *\n",
296 | "tabular_data = load_boston()\n",
297 | "tabular_data_df = pd.DataFrame(tabular_data.data, columns=tabular_data.feature_names)"
298 | ]
299 | },
300 | {
301 | "cell_type": "markdown",
302 | "metadata": {},
303 | "source": [
304 | "Print sample records."
305 | ]
306 | },
307 | {
308 | "cell_type": "code",
309 | "execution_count": null,
310 | "metadata": {},
311 | "outputs": [],
312 | "source": [
313 | "tabular_data_df.head(5)"
314 | ]
315 | },
316 | {
317 | "cell_type": "code",
318 | "execution_count": null,
319 | "metadata": {},
320 | "outputs": [],
321 | "source": [
322 | "x = tabular_data_df[\"CRIM\"]\n",
323 | "y = tabular_data_df[\"LSTAT\"]\n",
324 | "\n",
325 | "plt.scatter(x, y, color=\"blue\", marker=\"o\")\n",
326 | "plt.xlabel(\"Crime\")\n",
327 | "plt.ylabel(\"Lower Status of Population\")\n",
328 | "plt.title(\"Scatterplot for Crime and lower status of the population\")\n",
329 | "plt.grid(True)\n",
330 | "plt.show()"
331 | ]
332 | }
333 | ],
334 | "metadata": {
335 | "kernelspec": {
336 | "display_name": "Python 3",
337 | "language": "python",
338 | "name": "python3"
339 | },
340 | "language_info": {
341 | "codemirror_mode": {
342 | "name": "ipython",
343 | "version": 3
344 | },
345 | "file_extension": ".py",
346 | "mimetype": "text/x-python",
347 | "name": "python",
348 | "nbconvert_exporter": "python",
349 | "pygments_lexer": "ipython3",
350 | "version": "3.7.4"
351 | }
352 | },
353 | "nbformat": 4,
354 | "nbformat_minor": 4
355 | }
356 |
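A hedged note on the scatter-plot section above: `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so that cell fails on clusters with a newer scikit-learn. Assuming scikit-learn 0.23 or later is installed, the California housing dataset gives a similar scatter plot; the column names below come from that dataset's documented schema.

```python
# Sketch: scatter plot using the California housing dataset, for environments
# where sklearn.datasets.load_boston is no longer available (scikit-learn >= 1.2).
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)  # Bunch with a pandas DataFrame
df = housing.frame

plt.scatter(df["MedInc"], df["MedHouseVal"], color="blue", marker="o", s=5)
plt.xlabel("Median Income")
plt.ylabel("Median House Value")
plt.title("Scatterplot for median income and median house value")
plt.grid(True)
plt.show()
```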
--------------------------------------------------------------------------------
/examples/redshift-connect-from-spark-using-username-password.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "5cc7fcb2-41bb-4c4f-9b9a-b01798ff622f",
6 | "metadata": {
7 | "execution": {
8 | "iopub.execute_input": "2022-11-20T06:27:55.989263Z",
9 | "iopub.status.busy": "2022-11-20T06:27:55.988861Z",
10 | "iopub.status.idle": "2022-11-20T06:27:56.010302Z",
11 | "shell.execute_reply": "2022-11-20T06:27:56.009491Z",
12 | "shell.execute_reply.started": "2022-11-20T06:27:55.989217Z"
13 | },
14 | "tags": []
15 | },
16 | "source": [
17 | "# Connect to Amazon Redshift with Pyspark using EMR-RedShift connector from EMR Studio using Username and Password "
18 | ]
19 | },
20 | {
21 | "cell_type": "markdown",
22 | "id": "6ae9b1b8-14d5-4164-baee-ca2f4d2ff0fc",
23 | "metadata": {},
24 | "source": [
25 | "## Prerequisites\n",
26 | "In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.\n",
27 | "\n",
28 | "- EMR EC2 cluster with Release 6.9.0 on higher\n",
29 | "- This example we connect to Amazon Redshift cluster, hence the EMR cluster attached to this notebook must have the connectivity (VPC) and appropriate rules (Security Group).\n",
30 | " - EMR 6.9.0 cluster should be attached to this notebook and should have the Spark, JupyterEnterpriseGateway, and Livy applications installed. \n",
31 | "- Source table exists in RedShift with sample data\n",
32 | "- Target table exists in RedShift with or without data\n",
33 | "- To use EMR-RedShift connector with Amazon EMR Studio Notebooks, you must first copy the jar files from the local file system to HDFS, present on the master node of the EMR cluster, follow setup steps."
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "id": "637de15c-b1d6-4c43-beb3-79ef8a1dad36",
39 | "metadata": {},
40 | "source": [
41 | "## Introduction\n",
42 | "\n",
43 | "In this example we use Pyspark to connect to a table in Amazon Redshift using spark-redshift connector.\n",
44 | "\n"
45 | ]
46 | },
47 | {
48 | "cell_type": "markdown",
49 | "id": "fcd9ed3f-a2a1-47d8-96ed-30ac570310d6",
50 | "metadata": {},
51 | "source": [
52 | "## Setup\n",
53 | "Create an S3 bucket location to be used as a temporary location for Redshift dataset. For example: s3://EXAMPLE-BUCKET/temporary-redshift-dataset/\n",
54 | "\n",
55 | "- Create an AWS IAM role which will be associated to the Amazon Redshift cluster. Make sure that this IAM role has access to read and write to the above mentioned S3 bucket location with the appropriate IAM policy. More details:\n",
56 | "\n",
57 | " [Create AWS IAM role for Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-create-role.html)\n",
58 | "\n",
59 | " [Associate IAM role with Amazon Redshift cluster](https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-add-role.html)\n",
60 | "\n",
61 | "- Connect to the master node of the cluster using SSH and then copy the jar files from the local filesystem to HDFS as shown in the following examples. In the example, we create a directory in HDFS for clarity of file management. You can choose your own destination in HDFS, if desired.\n",
62 | "\n",
63 | " `hdfs dfs -mkdir -p /apps/emr_rs_connector/lib`\n",
64 | "\n",
65 | " `hdfs dfs -copyFromLocal /usr/share/aws/redshift/jdbc/RedshiftJDBC.jar /apps/emr_rs_connector/lib/RedshiftJDBC.jar`\n",
66 | "\n",
67 | " `hdfs dfs -copyFromLocal /usr/share/aws/redshift/spark-redshift/lib/spark-redshift.jar /apps/emr_rs_connector/lib/spark-redshift.jar`\n",
68 | "\n",
69 | " `hdfs dfs -copyFromLocal /usr/share/aws/redshift/spark-redshift/lib/spark-avro.jar /apps/emr_rs_connector/lib/spark-avro.jar`\n",
70 | "\n",
71 | " `hdfs dfs -copyFromLocal /usr/share/aws/redshift/spark-redshift/lib/minimal-json.jar /apps/emr_rs_connector/lib/minimal-json.jar`\n",
72 | "\n",
73 | " `hdfs dfs -ls /apps/emr_rs_connector/lib`\n",
74 | "\n"
75 | ]
76 | },
77 | {
78 | "cell_type": "markdown",
79 | "id": "f09b3ad6-eb00-401b-8834-87546fd02b90",
80 | "metadata": {},
81 | "source": [
82 | "## Configure to use jar file in studio notebook"
83 | ]
84 | },
85 | {
86 | "cell_type": "code",
87 | "execution_count": null,
88 | "id": "2cf647d9-831d-4276-b8bb-d7b2a4cd7d01",
89 | "metadata": {
90 | "tags": []
91 | },
92 | "outputs": [],
93 | "source": [
94 | "%%configure -f\n",
95 | "{\n",
96 | " \"conf\" : {\n",
97 | " \"spark.jars\":\"hdfs:///apps/emr_rs_connector/lib/RedshiftJDBC.jar,hdfs:///apps/emr_rs_connector/lib/minimal-json.jar,hdfs:///apps/emr_rs_connector/lib/spark-avro.jar,hdfs:///apps/emr_rs_connector/lib/spark-redshift.jar\",\n",
98 | " \"spark.pyspark.python\" : \"python3\",\n",
99 | " \"spark.pyspark.virtualenv.enable\" : \"true\",\n",
100 | " \"spark.pyspark.virtualenv.type\" : \"native\",\n",
101 | " \"spark.pyspark.virtualenv.bin.path\" : \"/usr/bin/virtualenv\"\n",
102 | "\n",
103 | " }\n",
104 | "}"
105 | ]
106 | },
107 | {
108 | "cell_type": "markdown",
109 | "id": "f3fb38b0-e7e8-4503-906b-96d6730c8867",
110 | "metadata": {},
111 | "source": [
112 | "## Connect to Amazon Redshift using pyspark"
113 | ]
114 | },
115 | {
116 | "cell_type": "code",
117 | "execution_count": null,
118 | "id": "de481081-525c-4749-bcfd-97dcf6e69e4e",
119 | "metadata": {
120 | "tags": []
121 | },
122 | "outputs": [],
123 | "source": [
124 | "%%pyspark\n",
125 | "\n",
126 | "import pyspark\n",
127 | "from pyspark.sql import SparkSession\n",
128 | "from pyspark import SparkContext\n",
129 | "from pyspark.sql import SQLContext\n",
130 | "\n",
131 | "#jdbc:redshift://jdbc-url:5439/db?user=usr&password=pwd\n",
132 | "str_jdbc_url=\"jdbc:redshift://XXXXX:5439/replace_DB_name?user=XXX&password=YYYY&ApplicationName=EMRRedshiftSparkConnection\"\n",
133 | "str_src_table=\" \"\n",
134 | "str_tgt_table=\" \"\n",
135 | "str_s3_path=\" \"\n",
136 | "str_iam_role=\" \"\n",
137 | "\n",
138 | "#sc = SparkContext().getOrCreate() # Existing SC\n",
139 | "\n",
140 | "sql_context = SQLContext(sc)\n",
141 | "\n",
142 | "\n",
143 | "jdbcDF = sql_context.read\\\n",
144 | " .format(\"io.github.spark_redshift_community.spark.redshift\")\\\n",
145 | " .option(\"url\", str_jdbc_url)\\\n",
146 | " .option(\"dbtable\", str_src_table)\\\n",
147 | " .option(\"aws_iam_role\",str_iam_role)\\\n",
148 | " .option(\"tempdir\", str_s3_path)\\\n",
149 | " .load()\n",
150 | "\n",
151 | "jdbcDF.limit(5).show()\n",
152 | "\n",
153 | "\n",
154 | "jdbcDF.write \\\n",
155 | " .format(\"io.github.spark_redshift_community.spark.redshift\") \\\n",
156 | " .option(\"url\", str_jdbc_url) \\\n",
157 | " .option(\"dbtable\", str_tgt_table) \\\n",
158 | " .option(\"tempdir\", str_s3_path) \\\n",
159 | " .option(\"aws_iam_role\",str_iam_role) \\\n",
160 | " .mode(\"append\")\\\n",
161 | " .save()\n",
162 | "\n",
163 | "\n"
164 | ]
165 | },
166 | {
167 | "cell_type": "markdown",
168 | "id": "6d4b9657-7981-49b7-a1b0-7c624d6acc00",
169 | "metadata": {},
170 | "source": [
171 | "## Connect to Amazon Redshift using scalaspark"
172 | ]
173 | },
174 | {
175 | "cell_type": "code",
176 | "execution_count": null,
177 | "id": "e23f7b64-b555-4bf4-a715-acc4586593e4",
178 | "metadata": {
179 | "tags": []
180 | },
181 | "outputs": [],
182 | "source": [
183 | "%%scalaspark\n",
184 | "\n",
185 | "//Declare the variables and replace the variables values as appropiate\n",
186 | "\n",
187 | "//#jdbc:redshift://jdbc-url:5439/db?user=usr&password=pwd\n",
188 | "val str_jdbc_url=\"jdbc:redshift://XXXXX:5439/replace_DB_name?user=XXX&password=YYYY&ApplicationName=EMRRedshiftSparkConnection\"\n",
189 | "val str_src_table=\" \"\n",
190 | "val str_tgt_table=\" \"\n",
191 | "val str_s3_path=\" \"\n",
192 | "val str_iam_role=\" \"\n",
193 | "\n",
194 | "//Read data from source table\n",
195 | "val jdbcDF = (spark.read.format(\"io.github.spark_redshift_community.spark.redshift\")\n",
196 | " .option(\"url\", str_jdbc_url)\n",
197 | " .option(\"dbtable\", str_src_table)\n",
198 | " .option(\"tempdir\", str_s3_path)\n",
199 | " .option(\"aws_iam_role\", str_iam_role)\n",
200 | " .load())\n",
201 | "\n",
202 | "// Write data to target table\n",
203 | "\n",
204 | "jdbcDF.limit(5).show()\n",
205 | "\n",
206 | "\n",
207 | "jdbcDF.write.mode(\"append\").\n",
208 | " format(\"io.github.spark_redshift_community.spark.redshift\").option(\"url\", str_jdbc_url).option(\"dbtable\", str_tgt_table).option(\"aws_iam_role\", str_iam_role).option(\"tempdir\", str_s3_path).save()\n"
209 | ]
210 | },
211 | {
212 | "cell_type": "markdown",
213 | "id": "a5c9c010-13bf-4e63-a093-52b408c446c0",
214 | "metadata": {},
215 | "source": [
216 | "## Connect to Amazon Redshift using SparkR"
217 | ]
218 | },
219 | {
220 | "cell_type": "code",
221 | "execution_count": null,
222 | "id": "e309bfb9-b914-4b5f-a39b-1371122643b3",
223 | "metadata": {},
224 | "outputs": [],
225 | "source": [
226 | "%%rspark\n",
227 | "\n",
228 | "#Declare the variables and replace the variables values as appropiate\n",
229 | "\n",
230 | "#jdbc:redshift://:5439/db?user=usr&password=pwd\n",
231 | "str_jdbc_url=\"jdbc:redshift://XXXXX:5439/replace_DB_name?user=XXX&password=YYYY&ApplicationName=EMRRedshiftSparkConnection\"\n",
232 | "str_src_table=\" \"\n",
233 | "str_tgt_table=\" \"\n",
234 | "str_s3_path=\" \"\n",
235 | "str_iam_role=\" \"\n",
236 | "\n",
237 | "# Read data from source table\n",
238 | "\n",
239 | "df <- read.df(\n",
240 | " NULL,\n",
241 | " \"io.github.spark_redshift_community.spark.redshift\",\n",
242 | " aws_iam_role = str_iam_role,\n",
243 | " tempdir = str_s3_path,\n",
244 | " dbtable = str_src_table,\n",
245 | " url = str_jdbc_url)\n",
246 | "\n",
247 | "showDF(df)"
248 | ]
249 | }
250 | ],
251 | "metadata": {
252 | "kernelspec": {
253 | "display_name": "PySpark",
254 | "language": "python",
255 | "name": "pysparkkernel"
256 | },
257 | "language_info": {
258 | "codemirror_mode": {
259 | "name": "python",
260 | "version": 3
261 | },
262 | "file_extension": ".py",
263 | "mimetype": "text/x-python",
264 | "name": "pyspark",
265 | "pygments_lexer": "python3"
266 | }
267 | },
268 | "nbformat": 4,
269 | "nbformat_minor": 5
270 | }
271 |
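A hedged alternative sketch for the PySpark cell above: the spark-redshift community connector also documents separate `user` and `password` options, which keeps credentials out of the JDBC URL string (verify the option names against the connector version bundled with your EMR release; the placeholder values are not from this repository).

```python
# Sketch: pass Redshift credentials as separate read options instead of embedding
# them in the JDBC URL (assumes the connector's documented "user"/"password" options).
str_jdbc_url = "jdbc:redshift://REPLACE-ENDPOINT:5439/REPLACE_DB_NAME?ApplicationName=EMRRedshiftSparkConnection"

jdbcDF = (sql_context.read
    .format("io.github.spark_redshift_community.spark.redshift")
    .option("url", str_jdbc_url)
    .option("user", "REPLACE_USERNAME")
    .option("password", "REPLACE_PASSWORD")
    .option("dbtable", str_src_table)
    .option("aws_iam_role", str_iam_role)
    .option("tempdir", str_s3_path)
    .load())

jdbcDF.limit(5).show()
```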
--------------------------------------------------------------------------------
/examples/hive-presto-with-python.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "858522da",
6 | "metadata": {},
7 | "source": [
8 | "# Connect to Hive, Presto & Trino Engine using `Python3`\n",
9 | "\n",
10 | "\n",
11 | "#### Topics covered in this example\n",
12 | "* Installing `python3-devel` and `cyrus-sasl-devel` on the EMR master node\n",
13 | "* Installing python libraries on the Amazon EMR cluster\n",
14 | "* Connecting to Hive using Python3 `PyHive` library\n",
15 | "* Connecting to Presto using Python3 `PyHive` library\n",
16 | "* Connecting to Trino using Python3 `PyHive` library"
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "id": "f3e440f3",
22 | "metadata": {},
23 | "source": [
24 | "## Table of Contents:\n",
25 | "\n",
26 | "1. [Prerequisites](#Prerequisites)\n",
27 | "2. [Introduction](#Introduction)\n",
28 | "3. [Install dependency libraries](#Install-dependency-libraries)\n",
29 | "4. [Connect to Hive using `PyHive`](#Connect-to-Hive-using-PyHive)\n",
30 | "5. [Connect to Presto using `PyHive`](#Connect-to-Presto-using-PyHive)\n",
31 | "6. [Connect to Trino using `PyHive`](#Connect-to-Trino-using-PyHive)"
32 | ]
33 | },
34 | {
35 | "cell_type": "markdown",
36 | "id": "7a960666",
37 | "metadata": {},
38 | "source": [
39 | "***\n",
40 | "\n",
41 | "## Prerequisites\n",
42 | "\n",
43 | "NOTE : In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.
\n",
44 | "\n",
45 | "* This example installs python3 libraries, hence the EMR cluster attached to this notebook must have internet connectivity.\n",
46 | "* The EMR cluster attached to this notebook should have the following packages installed:\n",
47 | " * `python3-devel`\n",
48 | " * `cyrus-sasl-devel`\n",
49 | " \n",
50 | "* You can do this by running the following command on EMR mater node:\n",
51 | " sudo yum install -y python3-devel cyrus-sasl-devel
\n",
52 | "* This notebook uses the `Python3` kernel.\n",
53 | "***"
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "id": "d1039545",
59 | "metadata": {},
60 | "source": [
61 | "## Introduction\n",
62 | "In this example we use `Python3` to connect to a table in Hive, Presto and Trino using `PyHive`.\n",
63 | "\n",
64 | "PyHive is a collection of Python DB-API and SQLAlchemy interfaces for Presto and Hive.\n",
65 | "***"
66 | ]
67 | },
68 | {
69 | "cell_type": "markdown",
70 | "id": "91ae5583",
71 | "metadata": {},
72 | "source": [
73 | "## Install dependency libraries\n",
74 | "\n",
75 | "`%pip install` is the same as `!/emr/notebook-env/bin/pip install` and are installed in `/home/emr-notebook/`.\n",
76 | "\n",
77 | "After installation, these libraries are available to any user running an EMR notebook attached to the cluster. Python libraries installed this way are available only to processes running on the master node. The libraries are not installed on core or task nodes and are not available to executors running on those nodes."
78 | ]
79 | },
80 | {
81 | "cell_type": "code",
82 | "execution_count": null,
83 | "id": "b7eaed37",
84 | "metadata": {},
85 | "outputs": [],
86 | "source": [
87 | "%pip install pyhive thrift sasl thrift_sasl"
88 | ]
89 | },
90 | {
91 | "cell_type": "markdown",
92 | "id": "a7b2d044",
93 | "metadata": {},
94 | "source": [
95 | "## Connect to Hive using `PyHive`\n",
96 | "\n",
97 | "We will connect to Hive using `PyHive` library. Please make sure you replace the values for `hostName, userName, and databaseName` as applicable to your environment. The Hiveserver2 port is set to default 10000.\n",
98 | "\n",
99 | "In this example, we are connecting to Hive database `default` running locally on the EMR master node. We will query the table `hive_sample_table` to retrieve some values."
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "execution_count": null,
105 | "id": "4e55fcc3",
106 | "metadata": {},
107 | "outputs": [],
108 | "source": [
109 | "from pyhive import hive\n",
110 | "import getpass\n",
111 | "\n",
112 | "hostName = \"127.0.0.1\"\n",
113 | "userName = \"hadoop\"\n",
114 | "userPassword = getpass.getpass('Enter hive user password')\n",
115 | "databaseName = \"default\"\n",
116 | "hivePort = 10000\n",
117 | "\n",
118 | "def hiveConnection(hostName, hivePort, userName, userPassword, databaseName):\n",
119 | " conn = hive.connect(host=hostName,\n",
120 | " port=hivePort,\n",
121 | " username=userName,\n",
122 | " password=userPassword,\n",
123 | " database=databaseName,\n",
124 | " auth='CUSTOM')\n",
125 | " cur = conn.cursor()\n",
126 | " cur.execute('SELECT id FROM hive_sample_table LIMIT 3')\n",
127 | " result = cur.fetchall()\n",
128 | "\n",
129 | " return result\n",
130 | "\n",
131 | "# Call above function\n",
132 | "tableData = hiveConnection(hostName, hivePort, userName, userPassword, databaseName)\n",
133 | "\n",
134 | "# Print the results\n",
135 | "print(tableData)"
136 | ]
137 | },
138 | {
139 | "cell_type": "markdown",
140 | "id": "3123417c",
141 | "metadata": {},
142 | "source": [
143 | "## Connect to Presto using `PyHive`\n",
144 | "\n",
145 | "We will connect to Presto using `PyHive` library. Please make sure you replace the values for `hostName, userName, schemaName and catalogName` as applicable to your environment. The port is set to EMR default of 8889.\n",
146 | "\n",
147 | "In this example, we are connecting to `default` schema/database stored inside the `hive` catalog on the EMR master node. We will query the table `presto_sample_table` using within the `default` schema to retrieve some values.\n",
148 | "\n",
149 | "The connection uses HTTP protocol for Presto. You can enabled SSL/TLS and configure LDAPS for Presto on Amazon EMR by referring to the documentation [here](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/presto-ssl.html)"
150 | ]
151 | },
152 | {
153 | "cell_type": "markdown",
154 | "id": "d36c7db7",
155 | "metadata": {},
156 | "source": [
157 | "Presto authentication and without authentication and getPass #####"
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": null,
163 | "id": "8d28fec1",
164 | "metadata": {},
165 | "outputs": [],
166 | "source": [
167 | "from pyhive import presto\n",
168 | "import requests\n",
169 | "\n",
170 | "hostName = \"127.0.0.1\"\n",
171 | "userName = \"hadoop\"\n",
172 | "schemaName = \"default\"\n",
173 | "catalogName = \"hive\"\n",
174 | "prestoPort = 8889\n",
175 | "\n",
176 | "headers = {\n",
177 | " 'X-Presto-User': userName,\n",
178 | " 'X-Presto-Schema': schemaName,\n",
179 | " 'X-Presto-Catalog': catalogName\n",
180 | "}\n",
181 | "\n",
182 | "prestoSession = requests.Session()\n",
183 | "prestoSession.headers.update(headers)\n",
184 | "\n",
185 | "def prestoConnection(prestoSession, hostName, prestoPort):\n",
186 | " conn = presto.connect(requests_session=prestoSession,\n",
187 | " host=hostName,\n",
188 | " port=prestoPort\n",
189 | " )\n",
190 | "\n",
191 | " cur = conn.cursor()\n",
192 | " cur.execute('SELECT id FROM presto_sample_table LIMIT 3')\n",
193 | " result = cur.fetchall()\n",
194 | "\n",
195 | " return result\n",
196 | "\n",
197 | "# Call above function\n",
198 | "tableData = prestoConnection(prestoSession, hostName, prestoPort)\n",
199 | "\n",
200 | "# Print the results\n",
201 | "print(tableData)"
202 | ]
203 | },
204 | {
205 | "cell_type": "markdown",
206 | "id": "375cc99d",
207 | "metadata": {},
208 | "source": [
209 | "## Connect to Trino using `PyHive`\n",
210 | "\n",
211 | "We will now connect to Trino using `PyHive` library. Please make sure you replace the values for `hostName, userName, schemaName and catalogName` as applicable to your environment. The port is set to EMR default of 8889.\n",
212 | "\n",
213 | "In this example, we are connecting to `default` schema/database stored inside the `hive` catalog on the EMR master node. We will query the table `trino_sample_table` using within the `default` schema to retrieve some values.\n",
214 | "\n",
215 | "The connection uses HTTP protocol for Presto. You can enabled SSL/TLS and configure LDAPS for Presto on Amazon EMR by referring to the documentation [here](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/presto-ssl.html)"
216 | ]
217 | },
218 | {
219 | "cell_type": "code",
220 | "execution_count": null,
221 | "id": "54d94b87",
222 | "metadata": {},
223 | "outputs": [],
224 | "source": [
225 | "from pyhive import trino\n",
226 | "import requests\n",
227 | "\n",
228 | "hostName = \"127.0.0.1\"\n",
229 | "userName = \"hadoop\"\n",
230 | "schemaName = \"default\"\n",
231 | "catalogName = \"hive\"\n",
232 | "trinoPort = 8889\n",
233 | "\n",
234 | "## Starting with Trino release 0.351 rename client protocol headers to start with X-Trino- instead of X-Presto-\n",
235 | "headers = {\n",
236 | " 'X-Presto-User': userName,\n",
237 | " 'X-Presto-Schema': schemaName,\n",
238 | " 'X-Presto-Catalog': catalogName\n",
239 | "}\n",
240 | "\n",
241 | "trinoSession = requests.Session()\n",
242 | "trinoSession.headers.update(headers)\n",
243 | "\n",
244 | "def trinoConnection(trinoSession, hostName, trinoPort):\n",
245 | " conn = trino.connect(requests_session=trinoSession,\n",
246 | " host=hostName,\n",
247 | " port=trinoPort\n",
248 | " )\n",
249 | "\n",
250 | " cur = conn.cursor()\n",
251 | " cur.execute('SELECT id FROM trino_sample_table LIMIT 3')\n",
252 | " result = cur.fetchall()\n",
253 | "\n",
254 | " return result\n",
255 | "\n",
256 | "# Call above function\n",
257 | "tableData = trinoConnection(trinoSession, hostName, trinoPort)\n",
258 | "\n",
259 | "# Print the results\n",
260 | "print(tableData)"
261 | ]
262 | }
263 | ],
264 | "metadata": {
265 | "kernelspec": {
266 | "display_name": "Python 3",
267 | "language": "python",
268 | "name": "python3"
269 | },
270 | "language_info": {
271 | "codemirror_mode": {
272 | "name": "ipython",
273 | "version": 3
274 | },
275 | "file_extension": ".py",
276 | "mimetype": "text/x-python",
277 | "name": "python",
278 | "nbconvert_exporter": "python",
279 | "pygments_lexer": "ipython3",
280 | "version": "3.7.4"
281 | }
282 | },
283 | "nbformat": 4,
284 | "nbformat_minor": 5
285 | }
286 |
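A hedged companion sketch for the Trino section above: on Trino 351 or later the protocol headers are `X-Trino-*`, and PyHive releases that ship the `pyhive.trino` module set those headers themselves, so the connection can be made without overriding the session headers manually (assuming such a PyHive version is installed on the cluster).

```python
# Sketch: connect to Trino and let PyHive manage the X-Trino-* protocol headers
# (assumes a PyHive release that includes the pyhive.trino module).
from pyhive import trino

conn = trino.connect(host="127.0.0.1", port=8889,
                     username="hadoop", catalog="hive", schema="default")
cur = conn.cursor()
cur.execute("SELECT id FROM trino_sample_table LIMIT 3")
print(cur.fetchall())
```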
--------------------------------------------------------------------------------
/examples/table-with-sql-hivecontext-presto.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Table operations using `Spark SQL`, `Hive context`, and `Presto`\n",
8 | "\n",
9 | "#### Topics covered in this example\n",
10 | "* Creating an external table using `%%sql` magic and querying the table.\n",
11 | "* Querying the table using a hive context.\n",
12 | "* Connecting to the table using a Presto connector and querying the table."
13 | ]
14 | },
15 | {
16 | "cell_type": "markdown",
17 | "metadata": {},
18 | "source": [
19 | "***\n",
20 | "\n",
21 | "## Prerequisites\n",
22 | "\n",
23 | "NOTE : In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.
\n",
24 | "\n",
25 | "* The EMR cluster attached to this notebook should have the `Spark`, `Hive` and `Presto` applications installed.\n",
26 | "* This example uses a public dataset from s3, hence the EMR cluster attached to this notebook must have internet connectivity.\n",
27 | "* This notebook uses the `PySpark` kernel.\n",
28 | "***"
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "metadata": {},
34 | "source": [
35 | "## Introduction\n",
36 | "In this example, we are going to create an external Hive table and query the table using `spark sql magic`, `hive context` and `presto`.\n",
37 | "\n",
38 | "We use the Amazon customer review dataset that is publically accessible in s3. \n",
39 | "This dataset is a collection of reviews written in the Amazon.com marketplace and associated metadata from 1995 until 2015. This is intended to facilitate study into the properties (and the evolution) of customer reviews potentially including how people evaluate and express their experiences with respect to products at scale.\n",
40 | "***"
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {},
46 | "source": [
47 | "## Sql magic example\n",
48 | "\n",
49 | "Magic commands are pre-defined functions(`magics`) in Jupyter kernel that execute the supplied commands. \n",
50 | "Sql magic extension makes it possible to write SQL queries directly into code cells. \n",
51 | "For more information about these magic commands, see the GitHub repo .\n",
52 | "\n",
53 | "\n",
54 | "You can see all of the available magics with the help of `%lsmagic`."
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": null,
60 | "metadata": {},
61 | "outputs": [],
62 | "source": [
63 | "%lsmagic"
64 | ]
65 | },
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {},
69 | "source": [
70 | "Create a table `Books` from the Amazon customer reviews data for books using the sql magic `%%sql`.\n",
71 | "\n",
72 | "`%%sql` marks an entire cell as a SQL block which allows us to enter multi-line SQL statements."
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": null,
78 | "metadata": {},
79 | "outputs": [],
80 | "source": [
81 | "%%sql\n",
82 | "CREATE EXTERNAL TABLE IF NOT EXISTS Books(review_id STRING,product_title STRING,star_rating INT,verified_purchase STRING,review_date DATE,year INT)\n",
83 | "STORED AS PARQUET LOCATION \"s3://amazon-reviews-pds/parquet/product_category=Books\""
84 | ]
85 | },
86 | {
87 | "cell_type": "markdown",
88 | "metadata": {},
89 | "source": [
90 | "Show existing tables."
91 | ]
92 | },
93 | {
94 | "cell_type": "code",
95 | "execution_count": null,
96 | "metadata": {},
97 | "outputs": [],
98 | "source": [
99 | "%%sql\n",
100 | "show tables"
101 | ]
102 | },
103 | {
104 | "cell_type": "markdown",
105 | "metadata": {},
106 | "source": [
107 | "Show the details for the table `Books`."
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": null,
113 | "metadata": {},
114 | "outputs": [],
115 | "source": [
116 | "%%sql\n",
117 | "describe formatted Books"
118 | ]
119 | },
120 | {
121 | "cell_type": "markdown",
122 | "metadata": {},
123 | "source": [
124 | "Execute a query to find the top 20 best reviewed books ordered by descending `star_ratings` and limited to 20 records."
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": null,
130 | "metadata": {},
131 | "outputs": [],
132 | "source": [
133 | "%%sql\n",
134 | "SELECT product_title, AVG(star_rating), count(review_id) AS review_count FROM Books\n",
135 | "WHERE review_date >= \"2015-08-28\" AND review_date <= \"2015-08-30\" AND verified_purchase=\"Y\"\n",
136 | "GROUP BY product_title\n",
137 | "ORDER BY SUM(star_rating) DESC\n",
138 | "LIMIT 20"
139 | ]
140 | },
141 | {
142 | "cell_type": "markdown",
143 | "metadata": {},
144 | "source": [
145 | "***\n",
146 | "## Hive context example\n",
147 | "\n",
148 | "A `Hive context` is an instance of the Spark SQL execution engine that integrates with data stored in Hive. \n",
149 | "The following example shows how to query the table `Books` using the hive context."
150 | ]
151 | },
152 | {
153 | "cell_type": "markdown",
154 | "metadata": {},
155 | "source": [
156 | "Import dependencies."
157 | ]
158 | },
159 | {
160 | "cell_type": "code",
161 | "execution_count": null,
162 | "metadata": {},
163 | "outputs": [],
164 | "source": [
165 | "from pyspark.sql import HiveContext"
166 | ]
167 | },
168 | {
169 | "cell_type": "markdown",
170 | "metadata": {},
171 | "source": [
172 | "Initiate the hive context and display the list of tables in the default schema."
173 | ]
174 | },
175 | {
176 | "cell_type": "code",
177 | "execution_count": null,
178 | "metadata": {},
179 | "outputs": [],
180 | "source": [
181 | "sqlContext = HiveContext(sc)\n",
182 | "sqlContext.sql(\"use default\")\n",
183 | "sqlContext.sql(\"show tables\").show()"
184 | ]
185 | },
186 | {
187 | "cell_type": "markdown",
188 | "metadata": {},
189 | "source": [
190 | "Display the sample table records."
191 | ]
192 | },
193 | {
194 | "cell_type": "code",
195 | "execution_count": null,
196 | "metadata": {},
197 | "outputs": [],
198 | "source": [
199 | "books = sqlContext.table(\"default.books\")\n",
200 | "books.show()"
201 | ]
202 | },
203 | {
204 | "cell_type": "markdown",
205 | "metadata": {},
206 | "source": [
207 | "Execute a query to count the number of purchases with high customer ratings (ratings greater than or equal to 4)."
208 | ]
209 | },
210 | {
211 | "cell_type": "code",
212 | "execution_count": null,
213 | "metadata": {},
214 | "outputs": [],
215 | "source": [
216 | "sqlContext.sql(\"Select count(product_title) as count_of_purchases_with_high_rating from books where star_rating >=4\").show()"
217 | ]
218 | },
219 | {
220 | "cell_type": "markdown",
221 | "metadata": {},
222 | "source": [
223 | "***\n",
224 | "## Presto example\n",
225 | "\n",
226 | "Analyze data stored in a database via Presto with the PyHive Presto Python library."
227 | ]
228 | },
229 | {
230 | "cell_type": "markdown",
231 | "metadata": {},
232 | "source": [
233 | "Install `pyhive` and `requests` from the public PyPI repository."
234 | ]
235 | },
236 | {
237 | "cell_type": "code",
238 | "execution_count": null,
239 | "metadata": {},
240 | "outputs": [],
241 | "source": [
242 | "sc.install_pypi_package(\"pyhive\")\n",
243 | "sc.install_pypi_package(\"requests\")"
244 | ]
245 | },
246 | {
247 | "cell_type": "markdown",
248 | "metadata": {},
249 | "source": [
250 | "Import dependencies."
251 | ]
252 | },
253 | {
254 | "cell_type": "code",
255 | "execution_count": null,
256 | "metadata": {},
257 | "outputs": [],
258 | "source": [
259 | "from pyhive import presto"
260 | ]
261 | },
262 | {
263 | "cell_type": "markdown",
264 | "metadata": {},
265 | "source": [
266 | "Use the following configuration to connect to the database by using the Presto connector.\n",
267 | "\n",
268 | "`host` : Host name or ip address of the database server. \n",
269 | "`port` : Port of the database server. \n",
270 | "`catalog` : Name of the catalog. A Presto catalog contains schemas and references of a data source via a connector. \n",
271 | "`schema` : Name of the schema."
272 | ]
273 | },
274 | {
275 | "cell_type": "code",
276 | "execution_count": null,
277 | "metadata": {},
278 | "outputs": [],
279 | "source": [
280 | "cursor = presto.connect(host = \"localhost\", port = 8889, catalog = \"hive\", schema = \"default\").cursor()"
281 | ]
282 | },
283 | {
284 | "cell_type": "markdown",
285 | "metadata": {},
286 | "source": [
287 | "List the tables created in the `default` schema."
288 | ]
289 | },
290 | {
291 | "cell_type": "code",
292 | "execution_count": null,
293 | "metadata": {},
294 | "outputs": [],
295 | "source": [
296 | "cursor.execute(\"show tables\")\n",
297 | "results = cursor.fetchall()\n",
298 | "print(results)"
299 | ]
300 | },
301 | {
302 | "cell_type": "markdown",
303 | "metadata": {},
304 | "source": [
305 | "Query the books table using presto to get the count of `product_title`"
306 | ]
307 | },
308 | {
309 | "cell_type": "code",
310 | "execution_count": null,
311 | "metadata": {},
312 | "outputs": [],
313 | "source": [
314 | "cursor.execute(\"Select count(product_title) from Books\")\n",
315 | "results = cursor.fetchall()\n",
316 | "print(results)"
317 | ]
318 | },
319 | {
320 | "cell_type": "markdown",
321 | "metadata": {},
322 | "source": [
323 | "***\n",
324 | "## Cleanup"
325 | ]
326 | },
327 | {
328 | "cell_type": "markdown",
329 | "metadata": {},
330 | "source": [
331 | "Delete the table."
332 | ]
333 | },
334 | {
335 | "cell_type": "code",
336 | "execution_count": null,
337 | "metadata": {},
338 | "outputs": [],
339 | "source": [
340 | "%%sql\n",
341 | "DROP TABLE IF EXISTS Books"
342 | ]
343 | },
344 | {
345 | "cell_type": "markdown",
346 | "metadata": {},
347 | "source": [
348 | "Lastly, use the `uninstall_package` Pyspark API to uninstall the `pyhive` and `requests` libraries that were installed using the `install_package` API."
349 | ]
350 | },
351 | {
352 | "cell_type": "code",
353 | "execution_count": null,
354 | "metadata": {},
355 | "outputs": [],
356 | "source": [
357 | "sc.uninstall_package(\"pyhive\")\n",
358 | "sc.uninstall_package(\"requests\")"
359 | ]
360 | }
361 | ],
362 | "metadata": {
363 | "kernelspec": {
364 | "display_name": "PySpark",
365 | "language": "",
366 | "name": "pysparkkernel"
367 | },
368 | "language_info": {
369 | "codemirror_mode": {
370 | "name": "python",
371 | "version": 3
372 | },
373 | "mimetype": "text/x-python",
374 | "name": "pyspark",
375 | "pygments_lexer": "python3"
376 | }
377 | },
378 | "nbformat": 4,
379 | "nbformat_minor": 4
380 | }
381 |
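For comparison with the `%%sql` and Hive-context cells above, the same top-20 query can also be written with the PySpark DataFrame API; a sketch that reuses the `Books` table created in the notebook:

```python
# Sketch: DataFrame API equivalent of the %%sql top-20 query above.
from pyspark.sql import functions as F

books = spark.table("default.books")
top_books = (books
    .where((F.col("review_date") >= "2015-08-28")
           & (F.col("review_date") <= "2015-08-30")
           & (F.col("verified_purchase") == "Y"))
    .groupBy("product_title")
    .agg(F.avg("star_rating").alias("avg_star_rating"),
         F.count("review_id").alias("review_count"),
         F.sum("star_rating").alias("total_stars"))
    .orderBy(F.col("total_stars").desc())
    .limit(20))
top_books.show(truncate=False)
```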
--------------------------------------------------------------------------------
/examples/redshift-connect-from-spark-using-iam-role.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "5cc7fcb2-41bb-4c4f-9b9a-b01798ff622f",
6 | "metadata": {
7 | "execution": {
8 | "iopub.execute_input": "2022-11-20T06:27:55.989263Z",
9 | "iopub.status.busy": "2022-11-20T06:27:55.988861Z",
10 | "iopub.status.idle": "2022-11-20T06:27:56.010302Z",
11 | "shell.execute_reply": "2022-11-20T06:27:56.009491Z",
12 | "shell.execute_reply.started": "2022-11-20T06:27:55.989217Z"
13 | },
14 | "tags": []
15 | },
16 | "source": [
17 | "# Connect to Amazon Redshift with Pyspark using EMR-RedShift connector from EMR Studio using Role "
18 | ]
19 | },
20 | {
21 | "cell_type": "markdown",
22 | "id": "6ae9b1b8-14d5-4164-baee-ca2f4d2ff0fc",
23 | "metadata": {},
24 | "source": [
25 | "## Prerequisites\n",
26 | "In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.\n",
27 | "\n",
28 | "- EMR EC2 cluster with release 6.9.0 on higher\n",
29 | "- This example we connect to Amazon Redshift cluster, hence the EMR cluster attached to this notebook must have the connectivity (VPC) and appropriate rules (Security Group).\n",
30 | " - EMR 6.9.0 cluster should be attached to this notebook and should have the Spark, JupyterEnterpriseGateway, and Livy applications installed. \n",
31 | "- Source table exists in RedShift with sample data\n",
32 | "- Target table exists in RedShift with or without data\n",
33 | "- To use EMR-RedShift connector with Amazon EMR Studio Notebooks, you must first copy the jar files from the local file system to HDFS, present on the master node of the EMR cluster, follow setup steps."
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "id": "637de15c-b1d6-4c43-beb3-79ef8a1dad36",
39 | "metadata": {},
40 | "source": [
41 | "## Introduction\n",
42 | "\n",
43 | "In this example we use Pyspark to connect to a table in Amazon Redshift using spark-redshift connector.\n",
44 | "\n",
45 | "Starting from EMR release 6.9.0, Redshift JDBC driver >= 2.1 is packaged into the environment. With the new version of JDBC driver, you can specify the JDBC URL without including the raw username and password. Instead, you can specify jdbc:redshift:iam:// scheme, which will make JDBC driver to use your EMR Serverless job execution role to fetch the credentials automatically. \n",
46 | "\n",
47 | "See [Here](https://docs.aws.amazon.com/redshift/latest/mgmt/generating-iam-credentials-configure-jdbc-odbc.html) for more information on configuring JDBC connection to use IAM credentials.\n"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "id": "fcd9ed3f-a2a1-47d8-96ed-30ac570310d6",
53 | "metadata": {},
54 | "source": [
55 | "## Setup\n",
56 | "Create an S3 bucket location to be used as a temporary location for Redshift dataset. For example: s3://EXAMPLE-BUCKET/temporary-redshift-dataset/\n",
57 | "\n",
58 | "- Create an AWS IAM role which will be associated to the Amazon Redshift cluster. Make sure that this IAM role has access to read and write to the above mentioned S3 bucket location with the appropriate IAM policy. More details:\n",
59 | "\n",
60 | " [Create AWS IAM role for Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-create-role.html)\n",
61 | "\n",
62 | " [Associate IAM role with Amazon Redshift cluster](https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-add-role.html)\n",
63 | "\n",
64 | "- Connect to the master node of the cluster using SSH and then copy the jar files from the local filesystem to HDFS as shown in the following examples. In the example, we create a directory in HDFS for clarity of file management. You can choose your own destination in HDFS, if desired.\n",
65 | "\n",
66 | " `hdfs dfs -mkdir -p /apps/emr_rs_connector/lib`\n",
67 | "\n",
68 | " `hdfs dfs -copyFromLocal /usr/share/aws/redshift/jdbc/RedshiftJDBC.jar /apps/emr_rs_connector/lib/RedshiftJDBC.jar`\n",
69 | "\n",
70 | " `hdfs dfs -copyFromLocal /usr/share/aws/redshift/spark-redshift/lib/spark-redshift.jar /apps/emr_rs_connector/lib/spark-redshift.jar`\n",
71 | "\n",
72 | " `hdfs dfs -copyFromLocal /usr/share/aws/redshift/spark-redshift/lib/spark-avro.jar /apps/emr_rs_connector/lib/spark-avro.jar`\n",
73 | "\n",
74 | " `hdfs dfs -copyFromLocal /usr/share/aws/redshift/spark-redshift/lib/minimal-json.jar /apps/emr_rs_connector/lib/minimal-json.jar`\n",
75 | "\n",
76 | " `hdfs dfs -ls /apps/emr_rs_connector/lib`\n",
77 | "\n"
78 | ]
79 | },
80 | {
81 | "cell_type": "markdown",
82 | "id": "f09b3ad6-eb00-401b-8834-87546fd02b90",
83 | "metadata": {},
84 | "source": [
85 | "## Configure to use jar file in studio notebook"
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": null,
91 | "id": "2cf647d9-831d-4276-b8bb-d7b2a4cd7d01",
92 | "metadata": {
93 | "tags": []
94 | },
95 | "outputs": [],
96 | "source": [
97 | "%%configure -f\n",
98 | "{\n",
99 | " \"conf\" : {\n",
100 | " \"spark.jars\":\"hdfs:///apps/emr_rs_connector/lib/RedshiftJDBC.jar,hdfs:///apps/emr_rs_connector/lib/minimal-json.jar,hdfs:///apps/emr_rs_connector/lib/spark-avro.jar,hdfs:///apps/emr_rs_connector/lib/spark-redshift.jar\",\n",
101 | " \"spark.pyspark.python\" : \"python3\",\n",
102 | " \"spark.pyspark.virtualenv.enable\" : \"true\",\n",
103 | " \"spark.pyspark.virtualenv.type\" : \"native\",\n",
104 | " \"spark.pyspark.virtualenv.bin.path\" : \"/usr/bin/virtualenv\"\n",
105 | "\n",
106 | " }\n",
107 | "}"
108 | ]
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "id": "f3fb38b0-e7e8-4503-906b-96d6730c8867",
113 | "metadata": {},
114 | "source": [
115 | "## Connect to Amazon Redshift using pyspark"
116 | ]
117 | },
118 | {
119 | "cell_type": "code",
120 | "execution_count": null,
121 | "id": "de481081-525c-4749-bcfd-97dcf6e69e4e",
122 | "metadata": {
123 | "tags": []
124 | },
125 | "outputs": [],
126 | "source": [
127 | "%%pyspark\n",
128 | "\n",
129 | "import pyspark\n",
130 | "from pyspark.sql import SparkSession\n",
131 | "from pyspark import SparkContext\n",
132 | "from pyspark.sql import SQLContext\n",
133 | "\n",
134 | "\n",
135 | "#jdbc:redshift:iam://examplecluster...redshift.amazonaws.com:5439/DB\n",
136 | "str_jdbc_url=\"jdbc:redshift:iam://...redshift.amazonaws.com:5439/replace_DB_name?ApplicationName=EMRRedshiftSparkConnection\"\n",
137 | "str_src_table=\" \"\n",
138 | "str_tgt_table=\" \"\n",
139 | "str_s3_path=\" \"\n",
140 | "str_iam_role=\" \"\n",
141 | "\n",
142 | "#sc = SparkContext().getOrCreate() # Existing SC\n",
143 | "\n",
144 | "sql_context = SQLContext(sc)\n",
145 | "\n",
146 | "\n",
147 | "jdbcDF = sql_context.read\\\n",
148 | " .format(\"io.github.spark_redshift_community.spark.redshift\")\\\n",
149 | " .option(\"url\", str_jdbc_url)\\\n",
150 | " .option(\"dbtable\", str_src_table)\\\n",
151 | " .option(\"aws_iam_role\",str_iam_role)\\\n",
152 | " .option(\"tempdir\", str_s3_path)\\\n",
153 | " .load()\n",
154 | "\n",
155 | "jdbcDF.limit(5).show()\n",
156 | "\n",
157 | "\n",
158 | "jdbcDF.write \\\n",
159 | " .format(\"io.github.spark_redshift_community.spark.redshift\") \\\n",
160 | " .option(\"url\", str_jdbc_url) \\\n",
161 | " .option(\"dbtable\", str_tgt_table) \\\n",
162 | " .option(\"tempdir\", str_s3_path) \\\n",
163 | " .option(\"aws_iam_role\",str_iam_role) \\\n",
164 | " .mode(\"append\")\\\n",
165 | " .save()\n",
166 | "\n",
167 | "\n"
168 | ]
169 | },
170 | {
171 | "cell_type": "markdown",
172 | "id": "0ae2ea22-2709-45f5-bfeb-793d19a10500",
173 | "metadata": {},
174 | "source": [
175 | "## Connect to Amazon Redshift using scalaspark"
176 | ]
177 | },
178 | {
179 | "cell_type": "code",
180 | "execution_count": null,
181 | "id": "e23f7b64-b555-4bf4-a715-acc4586593e4",
182 | "metadata": {
183 | "tags": []
184 | },
185 | "outputs": [],
186 | "source": [
187 | "%%scalaspark\n",
188 | "\n",
189 | "//Declare the variables and replace the variables values as appropiate\n",
190 | "\n",
191 | "//jdbc:redshift:iam://examplecluster...redshift.amazonaws.com:5439/DB\n",
192 | "val str_jdbc_url=str_jdbc_url=\"jdbc:redshift:iam://...redshift.amazonaws.com:5439/replace_DB_name?ApplicationName=EMRRedshiftSparkConnection\"\n",
193 | "val str_src_table=\" \"\n",
194 | "val str_tgt_table=\" \"\n",
195 | "val str_s3_path=\" \"\n",
196 | "val str_iam_role=\" \"\n",
197 | "\n",
198 | "//Read data from source table\n",
199 | "val jdbcDF = (spark.read.format(\"io.github.spark_redshift_community.spark.redshift\")\n",
200 | " .option(\"url\", str_jdbc_url)\n",
201 | " .option(\"dbtable\", str_src_table)\n",
202 | " .option(\"tempdir\", str_s3_path)\n",
203 | " .option(\"aws_iam_role\", str_iam_role)\n",
204 | " .load())\n",
205 | "\n",
206 | "// Write data to target table\n",
207 | "\n",
208 | "jdbcDF.limit(5).show()\n",
209 | "\n",
210 | "\n",
211 | "jdbcDF.write.mode(\"append\").\n",
212 | " format(\"io.github.spark_redshift_community.spark.redshift\").option(\"url\", str_jdbc_url).option(\"dbtable\", str_tgt_table).option(\"aws_iam_role\", str_iam_role).option(\"tempdir\", str_s3_path).save()\n"
213 | ]
214 | },
215 | {
216 | "cell_type": "markdown",
217 | "id": "cf56be3c-38ab-4013-9e22-c739f04e6ca9",
218 | "metadata": {},
219 | "source": [
220 | "## Connect to Amazon Redshift using SparkR"
221 | ]
222 | },
223 | {
224 | "cell_type": "code",
225 | "execution_count": null,
226 | "id": "e309bfb9-b914-4b5f-a39b-1371122643b3",
227 | "metadata": {},
228 | "outputs": [],
229 | "source": [
230 | "%%rspark\n",
231 | "#Declare the variables and replace the variables values as appropiate\n",
232 | "\n",
233 | "#jdbc:redshift:iam://examplecluster...redshift.amazonaws.com:5439/DB\n",
234 | "str_jdbc_url=\"jdbc:redshift:iam://...redshift.amazonaws.com:5439/replace_DB_name?ApplicationName=EMRRedshiftSparkConnection\"\n",
235 | "str_src_table=\" \"\n",
236 | "str_tgt_table=\" \"\n",
237 | "str_s3_path=\" \"\n",
238 | "str_iam_role=\" \"\n",
239 | "\n",
240 | "# Read data from source table\n",
241 | "\n",
242 | "df <- read.df(\n",
243 | " NULL,\n",
244 | " \"io.github.spark_redshift_community.spark.redshift\",\n",
245 | " aws_iam_role = str_iam_role,\n",
246 | " tempdir = str_s3_path,\n",
247 | " dbtable = str_src_table,\n",
248 | " url = str_jdbc_url)\n",
249 | "\n",
250 | "showDF(df)"
251 | ]
252 | }
253 | ],
254 | "metadata": {
255 | "kernelspec": {
256 | "display_name": "PySpark",
257 | "language": "python",
258 | "name": "pysparkkernel"
259 | },
260 | "language_info": {
261 | "codemirror_mode": {
262 | "name": "python",
263 | "version": 3
264 | },
265 | "file_extension": ".py",
266 | "mimetype": "text/x-python",
267 | "name": "pyspark",
268 | "pygments_lexer": "python3"
269 | }
270 | },
271 | "nbformat": 4,
272 | "nbformat_minor": 5
273 | }
274 |
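A small sketch for the cells above, showing how the `jdbc:redshift:iam://` URL is typically assembled; the endpoint and database name are placeholders, not values from this repository.

```python
# Sketch: assemble the IAM-based JDBC URL. No username or password is embedded;
# the driver fetches temporary credentials through the associated IAM role.
redshift_endpoint = "examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com"  # placeholder
database_name = "dev"  # placeholder

str_jdbc_url = ("jdbc:redshift:iam://" + redshift_endpoint + ":5439/" + database_name
                + "?ApplicationName=EMRRedshiftSparkConnection")
print(str_jdbc_url)
```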
--------------------------------------------------------------------------------
/examples/query-hudi-dataset-with-spark-sql.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Query `Hudi` dataset using Spark SQL\n",
8 | "\n",
9 | "#### Topics covered in this example\n",
10 | "* Hudi operations like Insert, Upsert, Delete, Read and Incremental querying."
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {},
16 | "source": [
17 | "***\n",
18 | "\n",
19 | "## Prerequisites\n",
20 | "\n",
21 | "NOTE : In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.
\n",
22 | "\n",
23 | "* To use Hudi with Amazon EMR Notebooks, you must first copy the Hudi jar files from the local file system to HDFS on the master node of the EMR cluster. You then use the notebook to configure your EMR notebook to use Hudi. Follow the `Setup` steps.\n",
24 | "* With Amazon EMR release version 5.28.0 and later, Amazon EMR installs Hudi components by default when Spark, Hive, or Presto is installed. The EMR cluster attached to this notebook should have the `Spark` and `Hive` applications installed.\n",
25 | "* This example uses a public dataset, hence the EMR cluster attached to this notebook must have internet connectivity.\n",
26 | "* This notebook uses the `Spark` kernel.\n",
27 | "***"
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "## Introduction\n",
35 | "Hudi is a data management framework used to simplify incremental data processing and data pipeline development by providing record-level insert, update, upsert, and delete capabilities. By efficiently managing how data is laid out in Amazon S3, Hudi allows data to be ingested and updated in near real time. Hudi carefully maintains metadata of the actions performed on the dataset to help ensure that the actions are atomic and consistent.\n",
36 | "\n",
37 | "You can use Hive, Spark, or Presto to query a Hudi dataset interactively or build data processing pipelines using incremental pull. Incremental pull refers to the ability to pull only the data that changed between two actions.\n",
38 | "\n",
39 | "The post: Work with a Hudi Dataset provides detailed information.\n",
40 | "\n",
41 | "***"
42 | ]
43 | },
44 | {
45 | "cell_type": "markdown",
46 | "metadata": {},
47 | "source": [
48 | "## Setup\n",
49 | "1. Create an S3 bucket location to save your hudi dataset. For example: s3://EXAMPLE-BUCKET/my-hudi-dataset/\n",
50 | "\n",
51 | "2. Connect to the master node of the cluster using SSH and then copy the jar files from the local filesystem to HDFS as shown in the following examples. In the example, we create a directory in HDFS for clarity of file management. You can choose your own destination in HDFS, if desired.\n",
52 | "\n",
53 | "```\n",
54 | "hdfs dfs -mkdir -p /apps/hudi/lib hdfs dfs -copyFromLocal /usr/lib/hudi/hudi-spark-bundle.jar /apps/hudi/lib/hudi-spark-bundle.jar hdfs dfs -copyFromLocal /usr/lib/spark/external/lib/spark-avro.jar /apps/hudi/lib/spark-avro.jar\n",
55 | "```\n",
56 | "***"
57 | ]
58 | },
59 | {
60 | "cell_type": "markdown",
61 | "metadata": {},
62 | "source": [
63 | "## Example"
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": null,
69 | "metadata": {},
70 | "outputs": [],
71 | "source": [
72 | "%%configure\n",
73 | "{ \n",
74 | " \"conf\": \n",
75 | " {\n",
76 | " \"spark.jars\":\"hdfs:///apps/hudi/lib/hudi-spark-bundle.jar,hdfs:///apps/hudi/lib/spark-avro.jar\",\n",
77 | " \"spark.serializer\":\"org.apache.spark.serializer.KryoSerializer\",\n",
78 | " \"spark.sql.hive.convertMetastoreParquet\":\"false\"\n",
79 | " }\n",
80 | "}"
81 | ]
82 | },
83 | {
84 | "cell_type": "markdown",
85 | "metadata": {},
86 | "source": [
87 | "Initialize a Spark Session for Hudi. \n",
88 | "When using Scala, make sure you import the following classes in your Spark session. This needs to be done once per Spark session."
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": null,
94 | "metadata": {},
95 | "outputs": [],
96 | "source": [
97 | "import org.apache.spark.sql.SaveMode\n",
98 | "import org.apache.spark.sql.functions._\n",
99 | "import org.apache.hudi.DataSourceWriteOptions\n",
100 | "import org.apache.hudi.config.HoodieWriteConfig\n",
101 | "import org.apache.hudi.hive.MultiPartKeysValueExtractor\n",
102 | "\n",
103 | "// Create a DataFrame\n",
104 | "val inputDF = spark.createDataFrame(\n",
105 | "  Seq(\n",
106 | "    (\"100\", \"2015-01-01\", \"2015-01-01T13:51:39.340396Z\"),\n",
107 | "    (\"101\", \"2015-01-01\", \"2015-01-01T12:14:58.597216Z\"),\n",
108 | "    (\"102\", \"2015-01-01\", \"2015-01-01T13:51:40.417052Z\"),\n",
109 | "    (\"103\", \"2015-01-01\", \"2015-01-01T13:51:40.519832Z\"),\n",
110 | "    (\"104\", \"2015-01-02\", \"2015-01-01T12:15:00.512679Z\"),\n",
111 | "    (\"105\", \"2015-01-02\", \"2015-01-01T13:51:42.248818Z\")\n",
112 | "  )\n",
113 | ")\n",
114 | ".toDF(\"id\", \"creation_date\", \"last_update_time\")\n",
115 | "\n",
116 | "// Specify common Hudi write options in a single hudiOptions Map\n",
117 | "val hudiOptions = Map[String, String](\n",
118 | "  \"hoodie.table.name\" -> \"my_hudi_table\",\n",
119 | "  \"hoodie.datasource.write.recordkey.field\" -> \"id\",\n",
120 | "  \"hoodie.datasource.write.partitionpath.field\" -> \"creation_date\",\n",
121 | "  \"hoodie.datasource.write.precombine.field\" -> \"last_update_time\",\n",
122 | "  \"hoodie.datasource.hive_sync.enable\" -> \"true\",\n",
123 | "  \"hoodie.datasource.hive_sync.table\" -> \"my_hudi_table\",\n",
124 | "  \"hoodie.datasource.hive_sync.partition_fields\" -> \"creation_date\",\n",
125 | "  \"hoodie.datasource.hive_sync.partition_extractor_class\" -> \"org.apache.hudi.hive.MultiPartKeysValueExtractor\"\n",
126 | ")\n",
127 | "\n",
128 | "// Write the DataFrame as a Hudi dataset to the S3 location that you created in Step 1.\n",
129 | "inputDF.write\n",
130 | "  .format(\"org.apache.hudi\")\n",
131 | "  .option(\"hoodie.datasource.write.operation\", \"insert\")\n",
132 | "  .options(hudiOptions)\n",
133 | "  .mode(SaveMode.Overwrite)\n",
134 | "  .save(\"s3://EXAMPLE-BUCKET/my-hudi-dataset/\") // Change this to the S3 location that you created in Step 1."
135 | ]
136 | },
137 | {
138 | "cell_type": "markdown",
139 | "metadata": {},
140 | "source": [
141 | "#### Upsert Data \n",
142 | "Upsert refers to the ability to insert records into an existing dataset if they do not already exist or to update them if they do. \n",
143 | "The following example demonstrates how to upsert data by writing a DataFrame. \n",
144 | "Unlike the previous insert example, the write operation (`hoodie.datasource.write.operation`) is set to `upsert`. \n",
145 | "In addition, `.mode(SaveMode.Append)` is specified so that the records are appended to the existing dataset rather than overwriting it."
146 | ]
147 | },
148 | {
149 | "cell_type": "code",
150 | "execution_count": null,
151 | "metadata": {},
152 | "outputs": [],
153 | "source": [
154 | "// Create a new DataFrame from the first row of inputDF with a different creation_date value\n",
155 | "val updateDF = inputDF.limit(1).withColumn(\"creation_date\", lit(\"new_value\"))\n",
156 | "\n",
157 | "updateDF.write\n",
158 | "  .format(\"org.apache.hudi\")\n",
159 | "  .option(\"hoodie.datasource.write.operation\", \"upsert\")\n",
160 | "  .options(hudiOptions)\n",
161 | "  .mode(SaveMode.Append)\n",
162 | "  .save(\"s3://EXAMPLE-BUCKET/my-hudi-dataset/\") // Change this to the S3 location that you created in Step 1."
163 | ]
164 | },
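165 | {
166 | "cell_type": "markdown",
167 | "metadata": {},
168 | "source": [
169 | "As a quick, optional check (reading Hudi datasets is covered in more detail below), you can read the dataset back and filter on the new `creation_date` value to confirm that the upserted record landed in the new partition. This cell is an added illustration and assumes the same `s3://EXAMPLE-BUCKET/my-hudi-dataset/` location used above."
170 | ]
171 | },
172 | {
173 | "cell_type": "code",
174 | "execution_count": null,
175 | "metadata": {},
176 | "outputs": [],
177 | "source": [
178 | "// Optional check: read the dataset back and show the record that was upserted into the new partition.\n",
179 | "// Assumes the S3 location used in the write cells above.\n",
180 | "spark.read\n",
181 | "  .format(\"org.apache.hudi\")\n",
182 | "  .load(\"s3://EXAMPLE-BUCKET/my-hudi-dataset\" + \"/*/*\")\n",
183 | "  .filter(col(\"creation_date\") === \"new_value\")\n",
184 | "  .show()"
185 | ]
186 | },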
165 | {
166 | "cell_type": "markdown",
167 | "metadata": {},
168 | "source": [
169 | "#### Delete a Record\n",
170 | "To hard delete a record, you can upsert an empty payload. In this case, the `hoodie.datasource.write.payload.class` option specifies the `EmptyHoodieRecordPayload` class. \n",
171 | "The example reuses the `updateDF` DataFrame from the upsert example, so the same record is deleted."
172 | ]
173 | },
174 | {
175 | "cell_type": "code",
176 | "execution_count": null,
177 | "metadata": {},
178 | "outputs": [],
179 | "source": [
180 | "updateDF.write\n",
181 | "  .format(\"org.apache.hudi\")\n",
182 | "  .option(\"hoodie.datasource.write.operation\", \"upsert\")\n",
183 | "  .option(\"hoodie.datasource.write.payload.class\", \"org.apache.hudi.common.model.EmptyHoodieRecordPayload\")\n",
184 | "  .options(hudiOptions)\n",
185 | "  .mode(SaveMode.Append)\n",
186 | "  .save(\"s3://EXAMPLE-BUCKET/my-hudi-dataset/\") // Change this to the S3 location that you created in Step 1."
187 | ]
188 | },
189 | {
190 | "cell_type": "markdown",
191 | "metadata": {},
192 | "source": [
193 | "#### Read from a Hudi Dataset\n",
194 | "To retrieve data at the present point in time, Hudi performs snapshot queries by default. \n",
195 | "The following example queries the dataset that was written to Amazon S3 in the write step above. \n",
196 | "Replace `s3://EXAMPLE-BUCKET/my-hudi-dataset` with your table path, and add wildcard asterisks for each partition level, plus one additional asterisk. \n",
197 | "In this example, there is one partition level, so we’ve added two wildcard symbols."
198 | ]
199 | },
200 | {
201 | "cell_type": "code",
202 | "execution_count": null,
203 | "metadata": {},
204 | "outputs": [],
205 | "source": [
206 | "val snapshotQueryDF = spark.read\n",
207 | "  .format(\"org.apache.hudi\")\n",
208 | "  .load(\"s3://EXAMPLE-BUCKET/my-hudi-dataset\" + \"/*/*\") // Change this to the S3 location that you created in Step 1.\n",
209 | "\n",
210 | "snapshotQueryDF.show()"
211 | ]
212 | },
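213 | {
214 | "cell_type": "markdown",
215 | "metadata": {},
216 | "source": [
217 | "Hudi adds metadata columns such as `_hoodie_commit_time` to every record. Listing the distinct commit times of the snapshot is a convenient way to pick a begin instant time for the incremental query below. This cell is a small illustrative addition that reuses `snapshotQueryDF` from the previous cell."
218 | ]
219 | },
220 | {
221 | "cell_type": "code",
222 | "execution_count": null,
223 | "metadata": {},
224 | "outputs": [],
225 | "source": [
226 | "// List the commit timestamps present in the dataset; use one of them as the begin instant time below.\n",
227 | "snapshotQueryDF.select(\"_hoodie_commit_time\").distinct().orderBy(\"_hoodie_commit_time\").show(false)"
228 | ]
229 | },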
213 | {
214 | "cell_type": "markdown",
215 | "metadata": {},
216 | "source": [
217 | "#### Incremental Queries\n",
218 | "You can also perform incremental queries with Hudi to get a stream of records that have changed since a given commit timestamp. \n",
219 | "To do so, set the `hoodie.datasource.query.type` option to `incremental`. Then, set `hoodie.datasource.read.begin.instanttime` to a commit timestamp to obtain all records written after that instant. \n",
220 | "Because they process only the records that changed, incremental queries are typically far more efficient than re-reading the entire dataset.\n",
221 | "\n",
222 | "When you perform incremental queries, use the root (base) table path without the wildcard asterisks used for Snapshot queries."
223 | ]
224 | },
225 | {
226 | "cell_type": "code",
227 | "execution_count": null,
228 | "metadata": {},
229 | "outputs": [],
230 | "source": [
231 | "val readOptions = Map[String, String](\n",
232 | "  \"hoodie.datasource.query.type\" -> \"incremental\",\n",
233 | "  \"hoodie.datasource.read.begin.instanttime\" -> \"<begin_instant_time>\" // Replace with a commit timestamp from your table (see the commit times listed above)\n",
234 | ")\n",
235 | "\n",
236 | "val incQueryDF = spark.read\n",
237 | "  .format(\"org.apache.hudi\")\n",
238 | "  .options(readOptions)\n",
239 | "  .load(\"s3://EXAMPLE-BUCKET/my-hudi-dataset\") // Change this to the S3 location that you created in Step 1.\n",
240 | "\n",
241 | "incQueryDF.show()"
242 | ]
243 | }
244 | ],
245 | "metadata": {
246 | "kernelspec": {
247 | "display_name": "Spark",
248 | "language": "",
249 | "name": "sparkkernel"
250 | },
251 | "language_info": {
252 | "codemirror_mode": "text/x-scala",
253 | "mimetype": "text/x-scala",
254 | "name": "scala",
255 | "pygments_lexer": "scala"
256 | }
257 | },
258 | "nbformat": 4,
259 | "nbformat_minor": 4
260 | }
261 |
--------------------------------------------------------------------------------
/examples/table-with-hiveql-from-data-in-s3.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Declare a table with HiveQL from data stored in Amazon S3\n",
8 | "\n",
9 | "#### Topics covered in this example\n",
10 | "* Installation of a custom jar from Amazon S3."
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {},
16 | "source": [
17 | "***\n",
18 | "\n",
19 | "## Prerequisites\n",
20 | "\n",
21 | "NOTE: In order to execute this notebook successfully as is, please ensure the following prerequisites are completed. \n",
22 | "\n",
23 | "* The EMR cluster attached to this notebook should have the `Spark` and `Hive` applications installed.\n",
24 | "* This example uses a public dataset, hence the EMR cluster attached to this notebook must have internet connectivity.\n",
25 | "* This notebook uses the `Spark` kernel.\n",
26 | "***"
27 | ]
28 | },
29 | {
30 | "cell_type": "markdown",
31 | "metadata": {},
32 | "source": [
33 | "## Introduction\n",
34 | "Ad serving machines produce two types of log files: `impression logs` and `click logs`. Every time we display an advertisement to a customer, we add an entry to the impression log. Every time a customer clicks on an advertisement, we add an entry to the click log.\n",
35 | "This example demonstrates how to combine the click and impression logs into a single table that specifies if there was a click for a specific ad and information about that click.\n",
36 | "\n",
37 | "The log data is stored on s3 in the elasticmapreduce bucket `s3://elasticmapreduce/samples/hive-ads/` and includes subdirectories called `tables/impressions` and `tables/clicks`. The directories contain additional directories named such that we can access the data as a partitioned table within Hive. The naming syntax is `[Partition column]` = `[Partition value]`. For example: dt=2009-04-13-05.\n",
38 | "\n",
39 | "The post: Contextual Advertising using Apache Hive and Amazon EMR provides detailed information.\n",
40 | "****"
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {},
46 | "source": [
47 | "## Example\n",
48 | "\n",
49 | "We need to use a custom `SerDe` (Serializer/Deserializer) to read the impressions and clicks data, which is stored in JSON format. A SerDe enables Hive to read data stored in a custom format. Our SerDe is packaged in a JAR file located in Amazon S3, and we tell Spark about it with the following statement."
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": null,
55 | "metadata": {},
56 | "outputs": [],
57 | "source": [
58 | "%%configure -f\n",
59 | "{\n",
60 | " \"conf\": {\n",
61 | " \"spark.jars\": \"s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar\"\n",
62 | " }\n",
63 | "}"
64 | ]
65 | },
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {},
69 | "source": [
70 | "Now that our SerDe is defined, we can tell Hive about our clicks and impressions data by creating an external table.\n",
71 | "\n",
72 | "The data for this table resides in Amazon S3. Creating the table is a quick operation because we’re just telling Hive about the existence of the data, not copying it. When we query this table, Hive will read it using Hadoop."
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": null,
78 | "metadata": {},
79 | "outputs": [],
80 | "source": [
81 | "%%sql\n",
82 | "CREATE EXTERNAL TABLE impressions (\n",
83 | " requestBeginTime string, adId string, impressionId string, referrer string, \n",
84 | " userAgent string, userCookie string, ip string\n",
85 | " )\n",
86 | " PARTITIONED BY (dt string)\n",
87 | " ROW FORMAT \n",
88 | " serde \"com.amazon.elasticmapreduce.JsonSerde\"\n",
89 | " with serdeproperties ( \"paths\"=\"requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip\" )\n",
90 | " LOCATION \"s3://elasticmapreduce/samples/hive-ads/tables/impressions\""
91 | ]
92 | },
93 | {
94 | "cell_type": "markdown",
95 | "metadata": {},
96 | "source": [
97 | "The table is partitioned based on time. As yet, Hive doesn’t know which partitions exist in the table. We can tell Hive about the existence of a single partition using the following statement."
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": null,
103 | "metadata": {},
104 | "outputs": [],
105 | "source": [
106 | "%%sql\n",
107 | "ALTER TABLE impressions ADD PARTITION (dt=\"2009-04-13-08-05\")"
108 | ]
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {},
113 | "source": [
114 | "If we were to query the table at this point the results would contain data from just this partition. We can instruct Hive to recover all partitions by inspecting the data stored in Amazon S3 using the `RECOVER PARTITIONS` statement."
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": null,
120 | "metadata": {},
121 | "outputs": [],
122 | "source": [
123 | "%%sql\n",
124 | "ALTER TABLE impressions RECOVER PARTITIONS"
125 | ]
126 | },
127 | {
128 | "cell_type": "markdown",
129 | "metadata": {},
130 | "source": [
131 | "We follow the same process to recover clicks."
132 | ]
133 | },
134 | {
135 | "cell_type": "code",
136 | "execution_count": null,
137 | "metadata": {},
138 | "outputs": [],
139 | "source": [
140 | "%%sql\n",
141 | "CREATE EXTERNAL TABLE clicks (\n",
142 | " impressionId string\n",
143 | " )\n",
144 | " PARTITIONED BY (dt string)\n",
145 | " ROW FORMAT \n",
146 | " SERDE \"com.amazon.elasticmapreduce.JsonSerde\"\n",
147 | " WITH SERDEPROPERTIES ( \"paths\"=\"impressionId\" )\n",
148 | " LOCATION \"s3://elasticmapreduce/samples/hive-ads/tables/clicks\""
149 | ]
150 | },
151 | {
152 | "cell_type": "code",
153 | "execution_count": null,
154 | "metadata": {},
155 | "outputs": [],
156 | "source": [
157 | "%%sql\n",
158 | "ALTER TABLE clicks RECOVER PARTITIONS"
159 | ]
160 | },
161 | {
162 | "cell_type": "markdown",
163 | "metadata": {},
164 | "source": [
165 | "### Combining the Clicks and Impressions Tables\n",
166 | "We want to combine the clicks and impressions tables so that we have a record of whether or not each impression resulted in a click."
167 | ]
168 | },
169 | {
170 | "cell_type": "markdown",
171 | "metadata": {},
172 | "source": [
173 | "First, we create a table called `joined_impressions`."
174 | ]
175 | },
176 | {
177 | "cell_type": "code",
178 | "execution_count": null,
179 | "metadata": {},
180 | "outputs": [],
181 | "source": [
182 | "%%sql\n",
183 | "CREATE TABLE joined_impressions (\n",
184 | " requestBeginTime string, adId string, impressionId string, referrer string, \n",
185 | " userAgent string, userCookie string, ip string, clicked Boolean\n",
186 | " )\n",
187 | " PARTITIONED BY (day string, hour string)\n",
188 | " STORED AS SEQUENCEFILE"
189 | ]
190 | },
191 | {
192 | "cell_type": "markdown",
193 | "metadata": {},
194 | "source": [
195 | "This table is partitioned as well. An advantage of partitioning tables stored in Amazon S3 is that if Hive needs only some of the partitions to answer the query, then only the data from those partitions is downloaded from Amazon S3.\n",
196 | "\n",
197 | "The joined_impressions table is stored in `SEQUENCEFILE` format, a native Hadoop file format that is more compact and performs better than raw JSON files.\n",
198 | "\n",
199 | "Next, we create some temporary tables in the job flow’s local HDFS partition to store intermediate impression and click data."
200 | ]
201 | },
202 | {
203 | "cell_type": "code",
204 | "execution_count": null,
205 | "metadata": {},
206 | "outputs": [],
207 | "source": [
208 | "%%sql\n",
209 | "CREATE TABLE tmp_impressions (\n",
210 | " requestBeginTime string, adId string, impressionId string, referrer string, \n",
211 | " userAgent string, userCookie string, ip string\n",
212 | " )\n",
213 | " STORED AS SEQUENCEFILE"
214 | ]
215 | },
216 | {
217 | "cell_type": "markdown",
218 | "metadata": {},
219 | "source": [
220 | "We insert data from the impressions table for the time duration we’re interested in. Note that because the impressions table is partitioned, only the relevant partitions will be read."
221 | ]
222 | },
223 | {
224 | "cell_type": "code",
225 | "execution_count": null,
226 | "metadata": {},
227 | "outputs": [],
228 | "source": [
229 | "%%sql\n",
230 | "INSERT OVERWRITE TABLE tmp_impressions \n",
231 | " SELECT \n",
232 | " from_unixtime(cast((cast(i.requestBeginTime as bigint) / 1000) as int)) requestBeginTime, \n",
233 | " i.adId, i.impressionId, i.referrer, i.userAgent, i.userCookie, i.ip\n",
234 | " FROM \n",
235 | " impressions i\n",
236 | " WHERE \n",
237 | " i.dt >= \"2009-04-13-08-00\" and i.dt < \"2009-04-13-09-00\""
238 | ]
239 | },
240 | {
241 | "cell_type": "markdown",
242 | "metadata": {},
243 | "source": [
244 | "For clicks, we extend the period of time over which we join by 20 minutes, meaning we accept a click that occurred up to 20 minutes after the impression."
245 | ]
246 | },
247 | {
248 | "cell_type": "code",
249 | "execution_count": null,
250 | "metadata": {},
251 | "outputs": [],
252 | "source": [
253 | "%%sql\n",
254 | "CREATE TABLE tmp_clicks (\n",
255 | " impressionId string\n",
256 | " ) STORED AS SEQUENCEFILE"
257 | ]
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": null,
262 | "metadata": {},
263 | "outputs": [],
264 | "source": [
265 | "%%sql\n",
266 | "INSERT OVERWRITE TABLE tmp_clicks \n",
267 | " SELECT \n",
268 | " impressionId\n",
269 | " FROM \n",
270 | " clicks c \n",
271 | " WHERE \n",
272 | " c.dt >= \"2009-04-13-08-00\" AND c.dt < \"2009-04-13-09-20\""
273 | ]
274 | },
275 | {
276 | "cell_type": "markdown",
277 | "metadata": {},
278 | "source": [
279 | "Now we combine the impressions and clicks tables using a left outer join. This way, any impressions that did not result in a click are preserved, and clicks that occurred shortly after the impression time window are still matched. The query also excludes any clicks that did not originate from an impression in the selected time period."
280 | ]
281 | },
282 | {
283 | "cell_type": "code",
284 | "execution_count": null,
285 | "metadata": {},
286 | "outputs": [],
287 | "source": [
288 | "%%sql\n",
289 | "INSERT OVERWRITE TABLE joined_impressions PARTITION (day=\"2009-04-13\", hour=\"08\")\n",
290 | " SELECT \n",
291 | " i.requestBeginTime, i.adId, i.impressionId, i.referrer, i.userAgent, i.userCookie, \n",
292 | " i.ip, (c.impressionId is not null) clicked\n",
293 | " FROM \n",
294 | " tmp_impressions i LEFT OUTER JOIN tmp_clicks c ON i.impressionId = c.impressionId"
295 | ]
296 | },
297 | {
298 | "cell_type": "markdown",
299 | "metadata": {},
300 | "source": [
301 | "Because the joined_impressions table is located in Amazon S3, this data is now available for other job flows to use.\n",
302 | "\n",
303 | "Check the results."
304 | ]
305 | },
306 | {
307 | "cell_type": "code",
308 | "execution_count": null,
309 | "metadata": {},
310 | "outputs": [],
311 | "source": [
312 | "%%sql\n",
313 | "select requestBeginTime, adId, impressionId, referrer from joined_impressions limit 5"
314 | ]
315 | },
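316 | {
317 | "cell_type": "markdown",
318 | "metadata": {},
319 | "source": [
320 | "As an additional sanity check on the join, we can group `joined_impressions` by the `clicked` flag to see how many impressions led to a click. This is a small illustrative query on the tables created above; it could equally be written with the `%%sql` magic."
321 | ]
322 | },
323 | {
324 | "cell_type": "code",
325 | "execution_count": null,
326 | "metadata": {},
327 | "outputs": [],
328 | "source": [
329 | "// Count impressions with and without a click to gauge the click-through rate.\n",
330 | "spark.sql(\"SELECT clicked, count(*) AS impressions FROM joined_impressions GROUP BY clicked\").show()"
331 | ]
332 | },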
316 | {
317 | "cell_type": "markdown",
318 | "metadata": {},
319 | "source": [
320 | "***\n",
321 | "## Cleanup"
322 | ]
323 | },
324 | {
325 | "cell_type": "markdown",
326 | "metadata": {},
327 | "source": [
328 | "Delete the tables."
329 | ]
330 | },
331 | {
332 | "cell_type": "code",
333 | "execution_count": null,
334 | "metadata": {},
335 | "outputs": [],
336 | "source": [
337 | "%%sql\n",
338 | "DROP TABLE IF EXISTS impressions"
339 | ]
340 | },
341 | {
342 | "cell_type": "code",
343 | "execution_count": null,
344 | "metadata": {},
345 | "outputs": [],
346 | "source": [
347 | "%%sql\n",
348 | "DROP TABLE IF EXISTS clicks"
349 | ]
350 | },
351 | {
352 | "cell_type": "code",
353 | "execution_count": null,
354 | "metadata": {},
355 | "outputs": [],
356 | "source": [
357 | "%%sql\n",
358 | "DROP TABLE IF EXISTS tmp_impressions"
359 | ]
360 | },
361 | {
362 | "cell_type": "code",
363 | "execution_count": null,
364 | "metadata": {},
365 | "outputs": [],
366 | "source": [
367 | "%%sql\n",
368 | "DROP TABLE IF EXISTS tmp_clicks"
369 | ]
370 | },
371 | {
372 | "cell_type": "code",
373 | "execution_count": null,
374 | "metadata": {},
375 | "outputs": [],
376 | "source": [
377 | "%%sql\n",
378 | "DROP TABLE IF EXISTS joined_impressions"
379 | ]
380 | }
381 | ],
382 | "metadata": {
383 | "kernelspec": {
384 | "display_name": "Spark",
385 | "language": "",
386 | "name": "sparkkernel"
387 | },
388 | "language_info": {
389 | "codemirror_mode": "text/x-scala",
390 | "mimetype": "text/x-scala",
391 | "name": "scala",
392 | "pygments_lexer": "scala"
393 | }
394 | },
395 | "nbformat": 4,
396 | "nbformat_minor": 4
397 | }
398 |
--------------------------------------------------------------------------------
/examples/install-notebook-scoped-libraries-at-runtime.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Install notebook scoped libraries at runtime\n",
8 | "\n",
9 | "\n",
10 | "#### Topics covered in this example\n",
11 | "* Installing notebook scoped libraries at runtime using `install_pypi_package`.\n",
12 | "* Converting spark dataFrame to pandas dataFrame.\n",
13 | "* `%%display` magic example."
14 | ]
15 | },
16 | {
17 | "cell_type": "markdown",
18 | "metadata": {},
19 | "source": [
20 | "***\n",
21 | "\n",
22 | "## Prerequisites\n",
23 | "\n",
24 | "NOTE: In order to execute this notebook successfully as is, please ensure the following prerequisites are completed. \n",
25 | "\n",
26 | "* The EMR cluster attached to this notebook should have the `Spark` application installed.\n",
27 | "* This example uses a public dataset, hence the EMR cluster attached to this notebook must have internet connectivity.\n",
28 | "* This notebook uses the `PySpark` kernel.\n",
29 | "***"
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {},
35 | "source": [
36 | "## Introduction\n",
37 | "This example shows how to use the notebook-scoped libraries feature of EMR Notebooks to import and install your favorite Python libraries at runtime on your EMR cluster, and use these libraries to enhance your data analysis and visualize your results in rich graphical plots.\n",
38 | "\n",
39 | "The example also shows how to use the `%%display` magic command and how to convert a spark dataFrame to a pandas dataFrame.\n",
40 | "\n",
41 | "The blogpost: Install Python libraries on a running cluster with EMR Notebooks and document: Installing and Using Kernels and Libraries provide detailed information.\n",
42 | "\n",
43 | "We use the Amazon customer review dataset that is publicly accessible in Amazon S3. \n",
44 | "This dataset is a collection of reviews written in the Amazon.com marketplace and associated metadata from 1995 until 2015. This is intended to facilitate study into the properties (and the evolution) of customer reviews potentially including how people evaluate and express their experiences with respect to products at scale.\n",
45 | "\n",
46 | "****"
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 | "## Example"
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "metadata": {},
59 | "source": [
60 | "Examine the current notebook session configuration."
61 | ]
62 | },
63 | {
64 | "cell_type": "code",
65 | "execution_count": null,
66 | "metadata": {},
67 | "outputs": [],
68 | "source": [
69 | "%%info"
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "Before starting the analysis, check the libraries that are already available on the cluster. \n",
77 | "Use `list_packages()` PySpark API, which lists all the Python libraries on the cluster."
78 | ]
79 | },
80 | {
81 | "cell_type": "code",
82 | "execution_count": null,
83 | "metadata": {},
84 | "outputs": [],
85 | "source": [
86 | "sc.list_packages()"
87 | ]
88 | },
89 | {
90 | "cell_type": "markdown",
91 | "metadata": {
92 | "papermill": {
93 | "duration": 0.011228,
94 | "end_time": "2020-10-22T17:26:11.518665",
95 | "exception": false,
96 | "start_time": "2020-10-22T17:26:11.507437",
97 | "status": "completed"
98 | },
99 | "tags": []
100 | },
101 | "source": [
102 | "Load the Amazon customer reviews data for books into a Spark DataFrame."
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": null,
108 | "metadata": {
109 | "papermill": {
110 | "duration": 29.402308,
111 | "end_time": "2020-10-22T17:26:40.932705",
112 | "exception": false,
113 | "start_time": "2020-10-22T17:26:11.530397",
114 | "status": "completed"
115 | },
116 | "tags": []
117 | },
118 | "outputs": [],
119 | "source": [
120 | "df = spark.read.parquet(\"s3://amazon-reviews-pds/parquet/product_category=Books/*.parquet\")"
121 | ]
122 | },
123 | {
124 | "cell_type": "markdown",
125 | "metadata": {
126 | "papermill": {
127 | "duration": 0.008858,
128 | "end_time": "2020-10-22T17:26:40.952269",
129 | "exception": false,
130 | "start_time": "2020-10-22T17:26:40.943411",
131 | "status": "completed"
132 | },
133 | "tags": []
134 | },
135 | "source": [
136 | "Determine the schema and number of available columns in the dataset."
137 | ]
138 | },
139 | {
140 | "cell_type": "code",
141 | "execution_count": null,
142 | "metadata": {
143 | "papermill": {
144 | "duration": 0.775197,
145 | "end_time": "2020-10-22T17:26:41.738952",
146 | "exception": false,
147 | "start_time": "2020-10-22T17:26:40.963755",
148 | "status": "completed"
149 | },
150 | "tags": []
151 | },
152 | "outputs": [],
153 | "source": [
154 | "# Total columns\n",
155 | "print(f\"Total Columns: {len(df.dtypes)}\")\n",
156 | "df.printSchema()"
157 | ]
158 | },
159 | {
160 | "cell_type": "markdown",
161 | "metadata": {
162 | "papermill": {
163 | "duration": 0.016013,
164 | "end_time": "2020-10-22T17:26:41.771853",
165 | "exception": false,
166 | "start_time": "2020-10-22T17:26:41.755840",
167 | "status": "completed"
168 | },
169 | "tags": []
170 | },
171 | "source": [
172 | "Determine the number of available rows in the dataset."
173 | ]
174 | },
175 | {
176 | "cell_type": "code",
177 | "execution_count": null,
178 | "metadata": {
179 | "papermill": {
180 | "duration": 0.080088,
181 | "end_time": "2020-10-22T17:26:41.867894",
182 | "exception": false,
183 | "start_time": "2020-10-22T17:26:41.787806",
184 | "status": "completed"
185 | },
186 | "tags": []
187 | },
188 | "outputs": [],
189 | "source": [
190 | "# Total row\n",
191 | "print(f\"Total Rows: {df.count():,}\")"
192 | ]
193 | },
194 | {
195 | "cell_type": "markdown",
196 | "metadata": {
197 | "papermill": {
198 | "duration": 0.016419,
199 | "end_time": "2020-10-22T17:26:41.899107",
200 | "exception": false,
201 | "start_time": "2020-10-22T17:26:41.882688",
202 | "status": "completed"
203 | },
204 | "tags": []
205 | },
206 | "source": [
207 | "Check the total number of books."
208 | ]
209 | },
210 | {
211 | "cell_type": "code",
212 | "execution_count": null,
213 | "metadata": {
214 | "papermill": {
215 | "duration": 0.815754,
216 | "end_time": "2020-10-22T17:26:42.728573",
217 | "exception": false,
218 | "start_time": "2020-10-22T17:26:41.912819",
219 | "status": "completed"
220 | },
221 | "tags": []
222 | },
223 | "outputs": [],
224 | "source": [
225 | "# Total number of books\n",
226 | "num_of_books = df.select(\"product_id\").distinct().count()\n",
227 | "print(f\"Number of Books: {num_of_books:,}\")"
228 | ]
229 | },
230 | {
231 | "cell_type": "markdown",
232 | "metadata": {
233 | "papermill": {
234 | "duration": 0.023551,
235 | "end_time": "2020-10-22T17:26:42.773267",
236 | "exception": false,
237 | "start_time": "2020-10-22T17:26:42.749716",
238 | "status": "completed"
239 | },
240 | "tags": []
241 | },
242 | "source": [
243 | "Next, we use Pandas version 0.25.1 and the latest Matplotlib library from the public PyPI repository. \n",
244 | "Install them on the cluster attached to the notebook using the `install_pypi_package` API.\n",
245 | "\n",
246 | "The `install_pypi_package` PySpark API installs the libraries along with any associated dependencies. \n",
247 | "By default, it installs the latest version of the library that is compatible with the Python version you are using. \n",
248 | "A specific version of a library can be installed by specifying the version, as shown for Pandas in the example below."
249 | ]
250 | },
251 | {
252 | "cell_type": "code",
253 | "execution_count": null,
254 | "metadata": {
255 | "papermill": {
256 | "duration": 0.296075,
257 | "end_time": "2020-10-22T17:26:43.125913",
258 | "exception": false,
259 | "start_time": "2020-10-22T17:26:42.829838",
260 | "status": "completed"
261 | },
262 | "tags": []
263 | },
264 | "outputs": [],
265 | "source": [
266 | "sc.install_pypi_package(\"pandas==0.25.1\") #Install pandas version 0.25.1 \n",
267 | "sc.install_pypi_package(\"matplotlib\", \"https://pypi.org/simple\") #Install matplotlib from given PyPI repository"
268 | ]
269 | },
270 | {
271 | "cell_type": "markdown",
272 | "metadata": {},
273 | "source": [
274 | "Verify that the imported packages are successfully installed."
275 | ]
276 | },
277 | {
278 | "cell_type": "code",
279 | "execution_count": null,
280 | "metadata": {},
281 | "outputs": [],
282 | "source": [
283 | "sc.list_packages()"
284 | ]
285 | },
286 | {
287 | "cell_type": "markdown",
288 | "metadata": {},
289 | "source": [
290 | "The `%%display` magic command renders a Spark dataFrame as an HTML table with horizontal and vertical scroll bars. \n",
291 | "Use the `%%display` magic command to show the number of reviews provided across multiple years."
292 | ]
293 | },
294 | {
295 | "cell_type": "code",
296 | "execution_count": null,
297 | "metadata": {},
298 | "outputs": [],
299 | "source": [
300 | "%%display\n",
301 | "df.groupBy(\"year\").count().orderBy(\"year\")"
302 | ]
303 | },
304 | {
305 | "cell_type": "markdown",
306 | "metadata": {},
307 | "source": [
308 | "Analyze the trend for the number of reviews provided across multiple years. \n",
309 | "Use `toPandas()` to convert the Spark dataFrame to a Pandas dataFrame and visualize with Matplotlib."
310 | ]
311 | },
312 | {
313 | "cell_type": "code",
314 | "execution_count": null,
315 | "metadata": {},
316 | "outputs": [],
317 | "source": [
318 | "# Number of reviews across years\n",
319 | "num_of_reviews_by_year = df.groupBy(\"year\").count().orderBy(\"year\").toPandas()"
320 | ]
321 | },
322 | {
323 | "cell_type": "markdown",
324 | "metadata": {},
325 | "source": [
326 | "Analyze the number of book reviews by year and find the distribution of customer ratings."
327 | ]
328 | },
329 | {
330 | "cell_type": "code",
331 | "execution_count": null,
332 | "metadata": {},
333 | "outputs": [],
334 | "source": [
335 | "import matplotlib.pyplot as plt\n",
336 | "plt.clf()\n",
337 | "num_of_reviews_by_year.plot(kind=\"area\", x=\"year\",y=\"count\", rot=70, color=\"#bc5090\", legend=None, figsize=(8,6))\n",
338 | "plt.xticks(num_of_reviews_by_year.year)\n",
339 | "plt.xlim(1995, 2015)\n",
340 | "plt.title(\"Number of reviews across years\")\n",
341 | "plt.xlabel(\"Year\")\n",
342 | "plt.ylabel(\"Number of Reviews\")"
343 | ]
344 | },
345 | {
346 | "cell_type": "markdown",
347 | "metadata": {},
348 | "source": [
349 | "The preceding commands render the plot on the attached EMR cluster. To visualize the plot within your notebook, use `%matplot` magic."
350 | ]
351 | },
352 | {
353 | "cell_type": "code",
354 | "execution_count": null,
355 | "metadata": {},
356 | "outputs": [],
357 | "source": [
358 | "%matplot plt"
359 | ]
360 | },
361 | {
362 | "cell_type": "markdown",
363 | "metadata": {},
364 | "source": [
365 | "Analyze the distribution of star ratings and visualize it using a pie chart."
366 | ]
367 | },
368 | {
369 | "cell_type": "code",
370 | "execution_count": null,
371 | "metadata": {},
372 | "outputs": [],
373 | "source": [
374 | "# Distribution of overall star ratings\n",
375 | "product_ratings_dist = df.groupBy(\"star_rating\").count().orderBy(\"count\").toPandas()"
376 | ]
377 | },
378 | {
379 | "cell_type": "code",
380 | "execution_count": null,
381 | "metadata": {},
382 | "outputs": [],
383 | "source": [
384 | "plt.clf()\n",
385 | "labels = [f\"Star Rating: {rating}\" for rating in product_ratings_dist[\"star_rating\"]]\n",
386 | "reviews = [num_reviews for num_reviews in product_ratings_dist[\"count\"]]\n",
387 | "colors = [\"#00876c\", \"#89c079\", \"#fff392\", \"#fc9e5a\", \"#de425b\"]\n",
388 | "fig, ax = plt.subplots(figsize=(8,5))\n",
389 | "w,a,b = ax.pie(reviews, autopct=\"%1.1f%%\", colors=colors)\n",
390 | "plt.title(\"Distribution of star ratings for books\")\n",
391 | "ax.legend(w, labels, title=\"Star Ratings\", loc=\"center left\", bbox_to_anchor=(1, 0, 0.5, 1))"
392 | ]
393 | },
394 | {
395 | "cell_type": "markdown",
396 | "metadata": {},
397 | "source": [
398 | "Render the pie chart in the notebook using the `%matplot` magic."
399 | ]
400 | },
401 | {
402 | "cell_type": "code",
403 | "execution_count": null,
404 | "metadata": {},
405 | "outputs": [],
406 | "source": [
407 | "%matplot plt"
408 | ]
409 | },
410 | {
411 | "cell_type": "markdown",
412 | "metadata": {},
413 | "source": [
414 | "The pie chart shows that 80% of users gave a rating of 4 or higher. Approximately 10% of users rated their books 2 or lower. In general, customers are happy about their book purchases from Amazon."
415 | ]
416 | },
417 | {
418 | "cell_type": "markdown",
419 | "metadata": {},
420 | "source": [
421 | "***\n",
422 | "## Cleanup"
423 | ]
424 | },
425 | {
426 | "cell_type": "markdown",
427 | "metadata": {},
428 | "source": [
429 | "Lastly, use the `uninstall_package` PySpark API to uninstall the `pandas` and `matplotlib` libraries that were installed earlier using the `install_pypi_package` API."
430 | ]
431 | },
432 | {
433 | "cell_type": "code",
434 | "execution_count": null,
435 | "metadata": {},
436 | "outputs": [],
437 | "source": [
438 | "sc.uninstall_package(\"pandas\")\n",
439 | "sc.uninstall_package(\"matplotlib\")"
440 | ]
441 | },
442 | {
443 | "cell_type": "code",
444 | "execution_count": null,
445 | "metadata": {},
446 | "outputs": [],
447 | "source": [
448 | "sc.list_packages()"
449 | ]
450 | },
451 | {
452 | "cell_type": "markdown",
453 | "metadata": {},
454 | "source": [
455 | "After you close your notebook, the Pandas and Matplotlib libraries that you installed on the cluster using the `install_pypi_package` API are garbage collected and removed from the cluster."
456 | ]
457 | }
458 | ],
459 | "metadata": {
460 | "celltoolbar": "Tags",
461 | "kernelspec": {
462 | "display_name": "PySpark",
463 | "language": "",
464 | "name": "pysparkkernel"
465 | },
466 | "language_info": {
467 | "codemirror_mode": {
468 | "name": "python",
469 | "version": 3
470 | },
471 | "mimetype": "text/x-python",
472 | "name": "pyspark",
473 | "pygments_lexer": "python3"
474 | },
475 | "papermill": {
476 | "duration": null,
477 | "end_time": null,
478 | "environment_variables": {},
479 | "exception": null,
480 | "input_path": "/home/notebook/work/demo_pyspark.ipynb",
481 | "output_path": "/home/notebook/work/executions/ex-IZXBKZLT803GPIP3MMFA31DW8ASYM/demo_pyspark.ipynb",
482 | "parameters": {
483 | "DATE": "10-20-2020",
484 | "TOP_K": 6,
485 | "US_STATES": [
486 | "Wisconsin",
487 | "Texas",
488 | "Nevada"
489 | ]
490 | },
491 | "start_time": "2020-10-22T17:25:45.424746",
492 | "version": "1.2.1"
493 | }
494 | },
495 | "nbformat": 4,
496 | "nbformat_minor": 4
497 | }
498 |
--------------------------------------------------------------------------------