├── .gitignore
├── LICENSE
├── Labs
    ├── Lab 1 Midway RCC and mpi4py
    │   ├── midway_cheat_sheet.md
    │   ├── mpi.sbatch
    │   ├── mpi_multi_job.sbatch
    │   └── mpi_rand_walk.py
    ├── Lab 10 Launching a Scalable API
    │   ├── api_demo
    │   │   ├── api.zip
    │   │   ├── api
    │   │   │   ├── application.py
    │   │   │   ├── requirements.txt
    │   │   │   └── templates
    │   │   │   │   └── index.html
    │   │   ├── launch_db.py
    │   │   └── terminate_db.py
    │   └── flask_basics
    │   │   ├── app.py
    │   │   └── templates
    │   │       ├── base.html
    │   │       └── scores.html
    ├── Lab 11 Visualizing Large Data
    │   └── 9W_VisualizingLargeData.ipynb
    ├── Lab 2 PyOpenCL
    │   ├── Lab_2_PyOpenCL_Random_Walk_Tutorial.ipynb
    │   ├── gpu.sbatch
    │   ├── gpu_rand_walk.py
    │   └── print_gpu_info.py
    ├── Lab 3 AWS EC2 and PyWren
    │   ├── Lab_3_PyWren.ipynb
    │   ├── isbn.txt
    │   └── pywren_workflow.png
    ├── Lab 4 Storing and Structuring Large Data
    │   ├── 5W_large_scale_databases.ipynb
    │   └── Lab 4 Working with Large Data Sources in S3.ipynb
    ├── Lab 5 Ingesting and Processing Large-Scale Data
    │   ├── Part I MapReduce
    │   │   ├── .mrjob.conf
    │   │   ├── mapreduce_lab5.py
    │   │   ├── mrjob_cheatsheet.md
    │   │   └── sample_us.tsv
    │   └── Part II Kinesis
    │   │   ├── Lab 5 Kinesis.ipynb
    │   │   ├── consumer.py
    │   │   ├── consumer_feed.png
    │   │   ├── producer.py
    │   │   └── simple_kinesis_architecture.png
    ├── Lab 6 PySpark EDA and ML
    │   ├── 7M_PySpark_Midway.ipynb
    │   ├── 7M_PySpark_Midway.py
    │   ├── Lab_6.ipynb
    │   ├── Local_Colab_Spark_Setup.ipynb
    │   ├── ec2_limit_increase.png
    │   ├── gpu_cluster_instructions.md
    │   ├── my-bootstrap-action.sh
    │   ├── my-configurations.json
    │   └── spark.sbatch
    ├── Lab 7 Exploring the Larger Spark Ecosystem
    │   ├── Lab_7_GraphFrames.ipynb
    │   ├── bootstrap
    │   ├── edges.csv
    │   ├── nodes.csv
    │   ├── spark_nlp.ipynb
    │   └── spark_streaming_emr.ipynb
    ├── Lab 8 Parallel Computing with Dask
    │   ├── Part I - Dask on EMR
    │   │   ├── Dask on an EMR Cluster.ipynb
    │   │   ├── bootstrap-dask
    │   │   └── dask_bootstrap_workflow.md
    │   └── Part II - Dask on Midway
    │   │   └── Dask ML on Midway.ipynb
    └── Lab 9 Accelerating Dask
    │   ├── 8W_Dask_Numba.ipynb
    │   ├── 8W_Dask_Rapids.ipynb
    │   ├── listings_bos.csv
    │   ├── listings_chi.csv
    │   ├── listings_sf.csv
    │   └── test_listings.csv
└── README.md


/.gitignore:
--------------------------------------------------------------------------------
 1 | *.pyc 
 2 | build/*
 3 | dist/*
 4 | *.aux
 5 | *.bbl
 6 | *.blg
 7 | *.fdb_latexmk
 8 | *.idx
 9 | *.ilg
10 | *.ind
11 | *.lof
12 | *.log
13 | *.lot
14 | *.out
15 | *.pdfsync
16 | *.synctex.gz
17 | *.toc
18 | *.swp
19 | *.asv
20 | *.nav
21 | *.snm
22 | *.gz
23 | *.bib.bak
24 | *.fls
25 | *.m~
26 | *.sublime*
27 | .DS_Store
28 | *.dta
29 | *.ipynb_checkpoints*
30 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2021 Jon Clindaniel
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/Labs/Lab 1 Midway RCC and mpi4py/midway_cheat_sheet.md:
--------------------------------------------------------------------------------
 1 | # Login, Configuration, and Copying Files to the Midway RCC
 2 | See [Midway RCC user guide](https://rcc.uchicago.edu/docs/) for login details specific to your system and additional options. The instructions below assume a Unix-style command line interface.
 3 | 
 4 | ### SSH into Cluster with cNetID and Password
 5 | ```
 6 | ssh your_cnet_id@midway2.rcc.uchicago.edu
 7 | ```
 8 | Note that you'll now be able to move around your home directory using standard unix commands (`cd`, `pwd`, `ls`, etc.). If you are on a Windows machine, I (Jon) recommend enabling [Windows Subsystem for Linux and installing Ubuntu 18.04](https://docs.microsoft.com/en-us/windows/wsl/install-win10) to complete all of these tasks. This is what I do and I find this makes my life a lot easier than having to mentally transition between DOS and Unix commands (or needing to use an additional third-party tool).
 9 | 
10 | ### SCP files to/from local directory
11 | If you need to transfer files to and from Midway's storage, I find it easiest to just copy files via the `scp` command. Here, I copy a local file `local_file` in my current directory to my home directory on Midway. Then, I copy a file `remote_file` from my Midway home directory back to my current directory on my local machine. If you prefer not to use this command line approach, there are tutorials in the Midway documentation with [alternative approaches](https://rcc.uchicago.edu/docs/data-transfer/index.html).
12 | 
13 | ```
14 | scp local_file your_cnet_id@midway2.rcc.uchicago.edu:
15 | ```
16 | ```
17 | scp your_cnet_id@midway2.rcc.uchicago.edu:remote_file .
18 | ```
19 | 
20 | To copy an entire directory, use the `-r` flag:
21 | ```
22 | scp -r your_cnet_id@midway2.rcc.uchicago.edu:remote_directory .
23 | ```
24 | 
25 | ### Clone GitHub Repository
26 | Another good option is to clone a GitHub repository (for instance this public course repository, or your personal fork of it) to your home directory on Midway and access files/code this way. You can then pull updates from the course repository as new material is added.
27 | 
28 | ```
29 | git clone https://github.com/jonclindaniel/LargeScaleComputing_S21.git
30 | ```
31 | 
32 | ### Adjustments to text on the Command Line
33 | If need to make any adjustments to text files before/after running them (or create new ones) on Midway, can do so on the command line with text editing tools such as nano, vim, etc.
34 | ```
35 | nano mpi.sbatch
36 | ```
37 | 
38 | # Loading Modules and installing programs
39 | 
40 | ### Load appropriate modules for Python, MPI, and OpenCL development
41 | ```
42 | module load cuda
43 | module load python/anaconda-2019.03
44 | module load intelmpi/2018.2.199+intel-18.0
45 | ```
46 | 
47 | ### Install additional Python packages to local user directory from login node
48 | To install packages that are not already included in a module (such as installing mpi4py and PyOpenCL, which we will be using in this class, along with Mako, which is used in PyOpenCL programs), you can use `pip` to install these packages in your local user directory from the login node on Midway. Make sure you have loaded the modules above before you try to run the below commands to ensure you install the correct versions of the packages.
49 | 
50 | ```
51 | pip install --user mpi4py
52 | pip install --user pyopencl
53 | pip install --user mako
54 | ```
55 | 
56 | # Running jobs
57 | 
58 | Resource sharing and job scheduling is handled by Slurm software on the Midway RCC. You can see detailed information in the Midway documentation, but Slurm allows you to see which partitions are available at any given time via the `sinfo` command, submit batch jobs via the `sbatch` command (which will allow you to request specific node/interconnect resources for a period of time and run your code on those resources), and schedule interactive sessions via the `sinteractive` command. Listed below are some of the fundamental commands you will need to know how to perform for this class.
59 | 
60 | ### Run Batch jobs with Slurm scripts
61 | You will use Slurm scripts to request computational resources for a period of time and run your code. These scripts can be run with the `sbatch` commands, as demonstrated below. You can check out sample Slurm scripts in our GitHub course repository, for more detail on how these scripts are structured.
62 | 
63 | ```
64 | sbatch mpi.sbatch
65 | sbatch gpu.sbatch
66 | ```
67 | 
68 | Note, if you wrote your sbatch file on a Windows machine, you might need to convert the text into a Unix format for it to run properly on the Midway RCC. To do so, you can install `dos2unix` on WSL via `apt-get install dos2unix` and then run the converter on your sbatch file.
69 | 
70 | ```
71 | dos2unix gpu.sbatch
72 | ```
73 | 
74 | You can monitor the progress of your job with (the sbatch command will print out your job ID):
75 | ```
76 | squeue --job your_job_id
77 | ```
78 | 
79 | You can also cancel jobs with:
80 | ```
81 | scancel your_job_id
82 | ```
83 | 
84 | ### Check results of your batch job (assuming write to `.out` file)
85 | In your Slurm script, you will specify a `.out` file, where the output of your program will be written. You can download the file to your local machine to look at the results (via `scp`), or you can check the results on the Midway command line with the standard Unix `cat` command.
86 | 
87 | ```
88 | cat gpu.out
89 | ```
90 | 
91 | ### Run interactive jobs
92 | You should not perform intensive computational tasks on the login nodes. Use the `sinteractive` command to set up an interactive session on other Midway nodes if you want to have an interactive command line experience (you can specify exactly which nodes you would like to access; see the documentation for syntax). Here we request 4 cores for 15 minutes. Additionally, you can use the interactive mode to run Jupyter notebooks, which you can view in your browser (see documentation for more details).
93 | 
94 | ```
95 | sinteractive --time=00:15:00 --ntasks=4
96 | ```
97 | 


--------------------------------------------------------------------------------
/Labs/Lab 1 Midway RCC and mpi4py/mpi.sbatch:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | #SBATCH --job-name=mpi
 4 | #SBATCH --output=mpi.out
 5 | #SBATCH --ntasks=4
 6 | #SBATCH --partition=broadwl
 7 | #SBATCH --constraint=fdr
 8 | 
 9 | # Load Python and MPI modules
10 | module load python/anaconda-2019.03
11 | module load intelmpi/2018.2.199+intel-18.0
12 | 
13 | # Run the python program with mpirun. The -n flag is not required;
14 | # mpirun will automatically figure out the best configuration from the
15 | # Slurm environment variables.
16 | mpirun python3 ./mpi_rand_walk.py
17 | 


--------------------------------------------------------------------------------
/Labs/Lab 1 Midway RCC and mpi4py/mpi_multi_job.sbatch:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | #SBATCH --job-name=mpi_multi_job
 4 | #SBATCH --ntasks=11
 5 | #SBATCH --partition=broadwl
 6 | #SBATCH --constraint=fdr
 7 | 
 8 | # Load Python and MPI modules
 9 | module load python/anaconda-2019.03
10 | module load intelmpi/2018.2.199+intel-18.0
11 | 
12 | # Run the python program with mpirun, using & to run jobs at the same time
13 | mpirun -n 1 python3 ./mpi_rand_walk.py > ./mpi_nprocs1.out &
14 | mpirun -n 10 python3 ./mpi_rand_walk.py > ./mpi_nprocs10.out &
15 | 
16 | # Wait until all simultaneous mpiruns are done
17 | wait
18 | 


--------------------------------------------------------------------------------
/Labs/Lab 1 Midway RCC and mpi4py/mpi_rand_walk.py:
--------------------------------------------------------------------------------
 1 | from mpi4py import MPI
 2 | import matplotlib.pyplot as plt
 3 | import numpy as np
 4 | import time
 5 | 
 6 | def sim_rand_walks_parallel(n_runs):
 7 |     # Get rank of process and overall size of communicator:
 8 |     comm = MPI.COMM_WORLD
 9 |     rank = comm.Get_rank()
10 |     size = comm.Get_size()
11 | 
12 |     # Start time:
13 |     t0 = time.time()
14 | 
15 |     # Evenly distribute number of simulation runs across processes
16 |     N = int(n_runs / size)
17 | 
18 |     # Simulate N random walks on each MPI Process and specify as a NumPy Array
19 |     r_walks = []
20 |     for i in range(N):
21 |         steps = np.random.normal(loc = 0, scale = 1, size = 100)
22 |         steps[0] = 0
23 |         r_walks.append(100 + np.cumsum(steps))
24 |     r_walks_array = np.array(r_walks)
25 | 
26 |     # Gather all simulation arrays to buffer of expected size/dtype on rank 0
27 |     r_walks_all = None
28 |     if rank == 0:
29 |         r_walks_all = np.empty([N * size, 100], dtype = 'float')
30 |     comm.Gather(sendbuf = r_walks_array, recvbuf = r_walks_all, root = 0)
31 | 
32 |     # Print/plot simulation results on rank 0
33 |     if rank == 0:
34 |         # Calculate time elapsed after computing mean and std
35 |         average_finish = np.mean(r_walks_all[:, -1])
36 |         std_finish = np.std(r_walks_all[:, -1])
37 |         time_elapsed = time.time() - t0
38 | 
39 |         # Print time elapsed + simulation results
40 |         print("Simulated %d Random Walks in: %f seconds on %d MPI processes"
41 |                 % (n_runs, time_elapsed, size))
42 |         print("Average final position: %f, Standard Deviation: %f"
43 |                 % (average_finish, std_finish))
44 | 
45 |         # Plot Simulations and save to file
46 |         plt.plot(r_walks_all.transpose())
47 |         plt.savefig("r_walk_nprocs%d_nruns%d.png" % (size, n_runs))
48 | 
49 |     return
50 | 
51 | def main():
52 |     sim_rand_walks_parallel(n_runs = 10000)
53 | 
54 | if __name__ == '__main__':
55 |     main()
56 | 


--------------------------------------------------------------------------------
/Labs/Lab 10 Launching a Scalable API/api_demo/api.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jonclindaniel/LargeScaleComputing_S21/1416f4c405dbe386ed8a0238f0eccaa7b087b1ca/Labs/Lab 10 Launching a Scalable API/api_demo/api.zip


--------------------------------------------------------------------------------
/Labs/Lab 10 Launching a Scalable API/api_demo/api/application.py:
--------------------------------------------------------------------------------
 1 | from flask import Flask, render_template, jsonify
 2 | import boto3
 3 | 
 4 | # Create an instance of Flask class (represents our application)
 5 | # Pass in name of application's module (__name__ evaluates to current module name)
 6 | app = Flask(__name__)
 7 | application = app # AWS EB requires it to be called "application"
 8 | 
 9 | # on EC2, needs to know region name as well; no config
10 | dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
11 | table = dynamodb.Table('books')
12 | 
13 | # Provide a landing page with some documentation on how to use API
14 | @app.route("/")
15 | def home():
16 |     return render_template('index.html')
17 | 
18 | # Get items from DynamoDB `books` table
19 | # Can provide additional API functionality with more complicated SQL queries
20 | @app.route("/api/isbn:<isbn>")
21 | def isbn(isbn):
22 |     response = table.get_item(Key={'isbn': str(isbn)})
23 |     return jsonify(response['Item'])
24 | 
25 | if __name__ == "__main__":
26 |     application.run()
27 | 


--------------------------------------------------------------------------------
/Labs/Lab 10 Launching a Scalable API/api_demo/api/requirements.txt:
--------------------------------------------------------------------------------
1 | flask
2 | boto3
3 | 


--------------------------------------------------------------------------------
/Labs/Lab 10 Launching a Scalable API/api_demo/api/templates/index.html:
--------------------------------------------------------------------------------
 1 | <!--
 2 | Note that you can provide CSS style info in addition to standard HTML:
 3 | -->
 4 | <head>
 5 |   <style>
 6 |     .header {
 7 |       background-color: steelblue;
 8 |       color: white;
 9 |       padding: 10px;
10 |     }
11 |   </style>
12 | </head>
13 | 
14 | <body>
15 |   <h2 class="header">
16 |     My Books API
17 |   </h2>
18 |   <p>
19 |     This is my books API and here's some documentation on how to use it:
20 |   </p>
21 |   <p>
22 |     Example Query: <code>URL_BASE/api/isbn:043591010</code>
23 |   </p>
24 | </body>
25 | 


--------------------------------------------------------------------------------
/Labs/Lab 10 Launching a Scalable API/api_demo/launch_db.py:
--------------------------------------------------------------------------------
 1 | import boto3
 2 | 
 3 | def create_books_table():
 4 |     dynamodb = boto3.resource('dynamodb')
 5 | 
 6 |     table = dynamodb.create_table(
 7 |         TableName='books',
 8 |         KeySchema=[
 9 |             {
10 |                 'AttributeName': 'isbn',
11 |                 'KeyType': 'HASH'
12 |             }
13 |         ],
14 |         AttributeDefinitions=[
15 |             {
16 |                 'AttributeName': 'isbn',
17 |                 'AttributeType': 'S'
18 |             }
19 |         ],
20 |         ProvisionedThroughput={
21 |             'ReadCapacityUnits': 1,
22 |             'WriteCapacityUnits': 1
23 |         }
24 |     )
25 | 
26 |     # Wait until AWS confirms that table exists before moving on
27 |     table.meta.client.get_waiter('table_exists').wait(TableName='books')
28 | 
29 |     # Put some data into the table
30 |     table.put_item(
31 |         Item={
32 |              'isbn': '043591010',
33 |              'title': "Opening Spaces: An Anthology of Contemporary African Women's Writing",
34 |              'author': 'Yvonne Vera',
35 |              'year': '1999'
36 |         }
37 |     )
38 | 
39 |     table.put_item(
40 |         Item={
41 |              'isbn': '0394721179',
42 |              'title': "African Folktales: Traditional Stories of the Black World",
43 |              'author': 'Roger D. Abrahams',
44 |              'year': '1983'
45 |         }
46 |     )
47 | 
48 | if __name__ == "__main__":
49 |     create_books_table()
50 |     print("DynamoDB 'books' table has been created and populated")
51 | 


--------------------------------------------------------------------------------
/Labs/Lab 10 Launching a Scalable API/api_demo/terminate_db.py:
--------------------------------------------------------------------------------
 1 | import boto3
 2 | 
 3 | def delete_books_table():
 4 |     dynamodb = boto3.resource('dynamodb')
 5 |     table = dynamodb.Table('books')
 6 |     table.delete()
 7 | 
 8 | if __name__ == "__main__":
 9 |     delete_books_table()
10 |     print("DynamoDB 'books' table has been terminated")
11 | 


--------------------------------------------------------------------------------
/Labs/Lab 10 Launching a Scalable API/flask_basics/app.py:
--------------------------------------------------------------------------------
 1 | '''
 2 | Run application with `python app.py` or `flask run` command
 3 | in terminal window
 4 | '''
 5 | from flask import Flask, render_template
 6 | import numpy as np
 7 | 
 8 | # Create an instance of Flask class (represents our application)
 9 | # Pass in name of application's module (__name__ evaluates to current module name)
10 | app = Flask(__name__)
11 | 
12 | # Define Python functions that will be triggered if we go to defined URL paths
13 | # Anything we `return` is rendered in our browser as HTML by default
14 | @app.route("/")
15 | def hello_world():
16 |     return "<p>Hello, World!</p>"
17 | 
18 | # We're not limited to short HTML phrases or a single URL path:
19 | @app.route("/scores")
20 | def scores():
21 |     student = {'name': 'Jon'}
22 |     assignments = [
23 |         {
24 |             'name': 'A1',
25 |             'score': 89
26 |         },
27 |         {
28 |             'name': 'A2',
29 |             'score': 82
30 |         },
31 |         {
32 |             'name': 'A3',
33 |             'score': 95
34 |         }
35 |     ]
36 | 
37 |     # Compute the average score:
38 |     avg = np.mean([assignment['score'] for assignment in assignments])
39 | 
40 |     # Render results using HTML template (Flask automatically looks
41 |     # for HTML templates in app/templates/ directory)
42 |     return render_template('scores.html',
43 |                             student=student,
44 |                             assignments=assignments,
45 |                             avg=avg)
46 | 
47 | # We can also run arbitrary Python code and return the results
48 | # of our computations:
49 | @app.route("/random")
50 | def random():
51 |     ran = np.random.randint(1, 10)
52 |     return "<p>Your random number is {}</p>".format(ran)
53 | 
54 | # And return results of computations based on user-defined input:
55 | @app.route("/num_chars/<word>")
56 | def num_chars(word):
57 |     return "<p> There are {} characters in {} </p>".format(len(word), word)
58 | 
59 | if __name__ == "__main__":
60 |     app.run() # allows us to run app via `python app.py`
61 | 


--------------------------------------------------------------------------------
/Labs/Lab 10 Launching a Scalable API/flask_basics/templates/base.html:
--------------------------------------------------------------------------------
 1 | <html>
 2 |     <head>
 3 |       {% if student['name'] %}
 4 |       <title>{{ student['name'] }}'s Scores</title>
 5 |       {% else %}
 6 |       <title>MACS 30123 Scores</title>
 7 |       {% endif %}
 8 |     </head>
 9 |     <body>
10 |         <div><a href="/">Home</a></div>
11 |         <hr>
12 |         {% block content %}{% endblock %}
13 |     </body>
14 | </html>
15 | 


--------------------------------------------------------------------------------
/Labs/Lab 10 Launching a Scalable API/flask_basics/templates/scores.html:
--------------------------------------------------------------------------------
 1 | {% extends "base.html" %}
 2 | 
 3 | {% block content %}
 4 |     <h1>MACS 30123 Scores</h1>
 5 |     <table>
 6 |       <tr>
 7 |         <th>Assignment</th>
 8 |         <th>Score</th>
 9 |       </tr>
10 |       {% for assignment in assignments %}
11 |       <tr>
12 |         <td> {{ assignment['name'] }} </td>
13 |         <td> {{ assignment['score'] }} </td>
14 |       </tr>
15 |       {% endfor %}
16 |       </table>
17 |       <p> Your average score is {{ avg }}% </p>
18 | {% endblock %}
19 | 


--------------------------------------------------------------------------------
/Labs/Lab 2 PyOpenCL/gpu.sbatch:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | #SBATCH --job-name=gpu   # job name
 3 | #SBATCH --output=gpu.out # output log file
 4 | #SBATCH --error=gpu.err  # error file
 5 | #SBATCH --time=00:05:00  # 5 minutes of wall time
 6 | #SBATCH --nodes=1        # 1 GPU node
 7 | #SBATCH --partition=gpu2 # GPU2 partition
 8 | #SBATCH --ntasks=1       # 1 CPU core to drive GPU
 9 | #SBATCH --gres=gpu:1     # Request 1 GPU
10 | 
11 | module load cuda
12 | module load python/anaconda-2019.03
13 | 
14 | python3 ./print_gpu_info.py
15 | # python3 ./gpu_rand_walk.py
16 | 


--------------------------------------------------------------------------------
/Labs/Lab 2 PyOpenCL/gpu_rand_walk.py:
--------------------------------------------------------------------------------
 1 | import numpy as np
 2 | import pyopencl as cl
 3 | import pyopencl.array as cl_array
 4 | import pyopencl.clrandom as clrand
 5 | import pyopencl.tools as cltools
 6 | from pyopencl.scan import GenericScanKernel
 7 | import matplotlib.pyplot as plt
 8 | import time
 9 | 
10 | def sim_rand_walks(n_runs):
11 |     # Set up context and command queue
12 |     ctx = cl.create_some_context()
13 |     queue = cl.CommandQueue(ctx)
14 | 
15 |     # mem_pool = cltools.MemoryPool(cltools.ImmediateAllocator(queue))
16 | 
17 |     t0 = time.time()
18 | 
19 |     # Generate an array of Normal Random Numbers on GPU of length n_sims*n_steps
20 |     n_steps = 100
21 |     rand_gen = clrand.PhiloxGenerator(ctx)
22 |     ran = rand_gen.normal(queue, (n_runs*n_steps), np.float32, mu=0, sigma=1)
23 | 
24 |     # Establish boundaries for each simulated walk (i.e. start and end)
25 |     # Necessary so that we perform scan only within rand walks and not between
26 |     seg_boundaries = [1] + [0]*(n_steps-1)
27 |     seg_boundaries = np.array(seg_boundaries, dtype=np.uint8)
28 |     seg_boundary_flags = np.tile(seg_boundaries, int(n_runs))
29 |     seg_boundary_flags = cl_array.to_device(queue, seg_boundary_flags)
30 | 
31 |     # GPU: Define Segmented Scan Kernel, scanning simulations: f(n-1) + f(n)
32 |     prefix_sum = GenericScanKernel(ctx, np.float32,
33 |                 arguments="__global float *ary, __global char *segflags, "
34 |                     "__global float *out",
35 |                 input_expr="ary[i]",
36 |                 scan_expr="across_seg_boundary ? b : (a+b)", neutral="0",
37 |                 is_segment_start_expr="segflags[i]",
38 |                 output_statement="out[i] = item + 100",
39 |                 options=[])
40 | 
41 |     # Allocate space for result of kernel on device
42 |     '''
43 |     Note: use a Memory Pool (commented out above and below) if you're invoking
44 |     multiple times to avoid wasting time creating brand new memory areas each
45 |     time you invoke the kernel: https://documen.tician.de/pyopencl/tools.html
46 |     '''
47 |     # dev_result = cl_array.arange(queue, len(ran), dtype=np.float32,
48 |     #                                allocator=mem_pool)
49 |     dev_result = cl_array.empty_like(ran)
50 | 
51 |     # Enqueue and Run Scan Kernel
52 |     prefix_sum(ran, seg_boundary_flags, dev_result)
53 | 
54 |     # Get results back on CPU to plot and do final calcs, just as in Lab 1
55 |     r_walks_all = (dev_result.get()
56 |                          .reshape(n_runs, n_steps)
57 |                          .transpose()
58 |                          )
59 | 
60 |     average_finish = np.mean(r_walks_all[-1])
61 |     std_finish = np.std(r_walks_all[-1])
62 |     final_time = time.time()
63 |     time_elapsed = final_time - t0
64 | 
65 |     print("Simulated %d Random Walks in: %f seconds"
66 |                 % (n_runs, time_elapsed))
67 |     print("Average final position: %f, Standard Deviation: %f"
68 |                 % (average_finish, std_finish))
69 | 
70 |     # Plot Random Walk Paths
71 |     '''
72 |     Note: Scan already only starts scanning at the second entry, but for the
73 |     sake of the plot, let's set all of our random walk starting positions to 100
74 |     and then plot the random walk paths.
75 |     '''
76 |     r_walks_all[0] = [100]*n_runs
77 |     plt.plot(r_walks_all)
78 |     plt.savefig("r_walk_nruns%d_gpu.png" % n_runs)
79 | 
80 |     return
81 | 
82 | def main():
83 |     sim_rand_walks(n_runs = 10000)
84 | 
85 | if __name__ == '__main__':
86 |     main()
87 | 


--------------------------------------------------------------------------------
/Labs/Lab 2 PyOpenCL/print_gpu_info.py:
--------------------------------------------------------------------------------
 1 | import pyopencl as cl
 2 | 
 3 | def print_device_info() :
 4 |  print('\n' + '=' * 60 + '\nOpenCL Platforms and Devices')
 5 |  for platform in cl.get_platforms():
 6 |   print('=' * 60)
 7 |   print('Platform - Name: ' + platform.name)
 8 |   print('Platform - Vendor: ' + platform.vendor)
 9 |   print('Platform - Version: ' + platform.version)
10 |   print('Platform - Profile: ' + platform.profile)
11 |  for device in platform.get_devices():
12 |   print(' ' + '-' * 56)
13 |   print(' Device - Name: ' \
14 |     + device.name)
15 |   print(' Device - Type: ' \
16 |     + cl.device_type.to_string(device.type))
17 |   print(' Device - Max Clock Speed: {0} Mhz'\
18 |     .format(device.max_clock_frequency))
19 |   print(' Device - Compute Units: {0}'\
20 |     .format(device.max_compute_units))
21 |   print(' Device - Local Memory: {0:.0f} KB'\
22 |     .format(device.local_mem_size/1024.0))
23 |   print(' Device - Constant Memory: {0:.0f} KB'\
24 |     .format(device.max_constant_buffer_size/1024.0))
25 |   print(' Device - Global Memory: {0:.0f} GB'\
26 |     .format(device.global_mem_size/1073741824.0))
27 |   print(' Device - Max Buffer/Image Size: {0:.0f} MB'\
28 |     .format(device.max_mem_alloc_size/1048576.0))
29 |   print(' Device - Max Work Group Size: {0:.0f}'\
30 |     .format(device.max_work_group_size))
31 |   print('\n')
32 | 
33 | if __name__ == "__main__":
34 |   print_device_info()
35 | 


--------------------------------------------------------------------------------
/Labs/Lab 3 AWS EC2 and PyWren/isbn.txt:
--------------------------------------------------------------------------------
  1 | 0435910108
  2 | 1906523371
  3 | 0394721179
  4 | 0813190762
  5 | 1558615008
  6 | 190652324X
  7 | 1592211518
  8 | 0803211023
  9 | 0253211107
 10 | 0912469099
 11 | 0140100040
 12 | 1558615342
 13 | 0795701845
 14 | 0471380601
 15 | 0814781438
 16 | 1906523142
 17 | 0844212024
 18 | 186914001X
 19 | 1770091459
 20 | 1558614060
 21 | 0739105620
 22 | 0325070253
 23 | 1588264912
 24 | 1594606471
 25 | 1904456731
 26 | 0795701063
 27 | 0869809180
 28 | 1563976986
 29 | 1919931236
 30 | 0325002118
 31 | 0739105639
 32 | 0814781446
 33 | 0939691027
 34 | 044657922X
 35 | 0547241631
 36 | 1607886154
 37 | 0486285537
 38 | 0393930572
 39 | 161552164X
 40 | 0195092627
 41 | 0618982728
 42 | 0393929949
 43 | 1400500281
 44 | 0486296040
 45 | 006059537X
 46 | 0393979210
 47 | 1439181454
 48 | 0451528247
 49 | 0547241607
 50 | 0393326659
 51 | 0618155872
 52 | 0385000197
 53 | 0393927393
 54 | 0393979202
 55 | 0393977781
 56 | 0312463197
 57 | 0393061817
 58 | 0142437530
 59 | 1557837600
 60 | 0425234282
 61 | 1435114760
 62 | 1400030935
 63 | 0205655106
 64 | 0061626392
 65 | 1598530208
 66 | 0060540427
 67 | 0061728942
 68 | 1565841476
 69 | 0486457575
 70 | 1402221118
 71 | 0679745254
 72 | 0393930157
 73 | 0135145287
 74 | 0553262637
 75 | 0385044011
 76 | 0312421001
 77 | 0806645776
 78 | 0073384895
 79 | 0312452837
 80 | 0393927415
 81 | 1598530593
 82 | 014023778X
 83 | 0806627158
 84 | 0205763103
 85 | 054720180X
 86 | 0140170367
 87 | 0321484894
 88 | 0486272842
 89 | 0321012690
 90 | 0395765285
 91 | 0820334316
 92 | 0132216477
 93 | 048627165X
 94 | 0802150322
 95 | 0393930149
 96 | 1598530216
 97 | 1616836865
 98 | 1616823763
 99 | 0743299779
100 | 039333645X
101 | 0689808690
102 | 0451529154
103 | 0060882506
104 | 0142001945
105 | 0060927631
106 | 0641974124
107 | 0393927423
108 | 0312474911
109 | 1558850031
110 | 0521540283
111 | 0822220555
112 | 0077239040
113 | 0307268349
114 | 1598530690
115 | 0609808400
116 | 1560255501
117 | 1934633941
118 | 0618197338
119 | 044990508X
120 | 1571312692
121 | 1598530410
122 | 193108209X
123 | 0896086267
124 | 142360699X
125 | 0806639628
126 | 0375761276
127 | 023108028X
128 | 1598530313
129 | 0061174076
130 | 0321113411
131 | 0205779395
132 | 054720194X
133 | 0375415041
134 | 0679767991
135 | 0684823071
136 | 0451529634
137 | 081297509X
138 | 188184708X
139 | 0618197354
140 | 0814775721
141 | 0451527828
142 | 1400077184
143 | 155783749X
144 | 039331698X
145 | 0813538629
146 | 0486294501
147 | 0618751181
148 | 1440151512
149 | 0822220768
150 | 0896086992
151 | 1581825234
152 | 1598530518
153 | 0618379029
154 | 0618427058
155 | 1590170385
156 | 0077239059
157 | 0195144562
158 | 1569472181
159 | 0312412088
160 | 0393318737
161 | 1893996387
162 | 0312115520
163 | 0618983228
164 | 0743299752
165 | 1598530429
166 | 0806628448
167 | 0393040011
168 | 1615580190
169 | 0811844943
170 | 0826217796
171 | 0321116240
172 | 1558490418
173 | 0963290622
174 | 1573833320
175 | 0060924209
176 | 019516251X
177 | 014310506X
178 | 0345395026
179 | 1595580557
180 | 0231109091
181 | 0385423098
182 | 0440378648
183 | 1557837481
184 | 0805211144
185 | 0684868695
186 | 0688171613
187 | 1596911484
188 | 037575881X
189 | 0807131512
190 | 032101149X
191 | 0618256636
192 | 0140390871
193 | 0767918460
194 | 0393930130
195 | 0375755020
196 | 1428262504
197 | 0131937928
198 | 0679736336
199 | 0813545757
200 | 0451528735
201 | 0674027639
202 | 0385491182
203 | 0393318281
204 | 0393927431
205 | 0321088999
206 | 1573241385
207 | 1558853995
208 | 0963290614
209 | 0813533309
210 | 032101006X
211 | 0981519423
212 | 0395926882
213 | 0231112599
214 | 0618902813
215 | 1575257629
216 | 0375703004
217 | 0385422431
218 | 1575255898
219 | 1604330376
220 | 0553375180
221 | 0965834441
222 | 0375414584
223 | 0807028231
224 | 0874177596
225 | 0822220423
226 | 0451527445
227 | 1557836965
228 | 0321093836
229 | 1555837476
230 | 0195122712
231 | 0786869186
232 | 0877970602
233 | 0618532986
234 | 0791419045
235 | 1572331321
236 | 1555839975
237 | 1575252716
238 | 0814727476
239 | 0415924448
240 | 0374525218
241 | 1405188626
242 | 1604732148
243 | 1880000768
244 | 0313308608
245 | 0804111677
246 | 0345383176
247 | 0345382226
248 | 1931082561
249 | 0140390146
250 | 0820321044
251 | 0918949254
252 | 0393310906
253 | 1400033217
254 | 159853033X
255 | 1589801768
256 | 0321116232
257 | 0631204490
258 | 0813531640
259 | 0816513848
260 | 0786436840
261 | 0312596227
262 | 0072564024
263 | 0812979923
264 | 0231081227
265 | 1575256185
266 | 0253206375
267 | 1575255596
268 | 0295969741
269 | 0881508594
270 | 0140273050
271 | 1570758468
272 | 0060874112
273 | 156584744X
274 | 0811807843
275 | 019955465X
276 | 0982338201
277 | 1934633194
278 | 0940450674
279 | 0813529301
280 | 0673469565
281 | 0140435875
282 | 1590304551
283 | 0375413324
284 | 1594482829
285 | 157525591X
286 | 1416537465
287 | 0060923520
288 | 0140296395
289 | 1579127738
290 | 097662964X
291 | 1887366768
292 | 0743299736
293 | 0743257596
294 | 0814757219
295 | 0743243501
296 | 0520222121
297 | 1598530054
298 | 1597140015
299 | 156368019X
300 | 1887178988
301 | 1558853804
302 | 0816527938
303 | 0691133247
304 | 1931082332
305 | 0195093607
306 | 091478370X
307 | 080712320X
308 | 0688175864
309 | 0393316858
310 | 1439181470
311 | 0395986052
312 | 0078454816
313 | 0195122135
314 | 0618131736
315 | 0914783599
316 | 0230620655
317 | 0809329549
318 | 0486410986
319 | 1551117282
320 | 094045078X
321 | 1563681277
322 | 0312191766
323 | 0874519187
324 | 0385003587
325 | 061890283X
326 | 0809015641
327 | 0470237694
328 | 0743476972
329 | 1559362758
330 | 159853047X
331 | 0802132790
332 | 1929642377
333 | 0195125630
334 | 0786713879
335 | 1593761945
336 | 0820332259
337 | 1598530151
338 | 1416906355
339 | 1560373792
340 | 0945353820
341 | 159714049X
342 | 0226322742
343 | 0195395921
344 | 962634248X
345 | 0816524939
346 | 0595362974
347 | 0813538866
348 | 0870498746
349 | 1931082278
350 | 0976876027
351 | 1400076757
352 | 0879054700
353 | 1555536778
354 | 1880399989
355 | 1933859865
356 | 0631206523
357 | 0743299760
358 | 0813529662
359 | 0920661599
360 | 0743203887
361 | 0967023432
362 | 0761131655
363 | 1557837473
364 | 0816645612
365 | 0395980755
366 | 0932379192
367 | 0072400196
368 | 0979485223
369 | 1879960583
370 | 0813527260
371 | 1931010471
372 | 0813520185
373 | 0073221538
374 | 0618570489
375 | 0873514289
376 | 0673469778
377 | 0446676918
378 | 0879103663
379 | 0061175358
380 | 0130266876
381 | 039304677X
382 | 157525249X
383 | 089672638X
384 | 0813531624
385 | 1403979995
386 | 0978625110
387 | 0872863131
388 | 0815320809
389 | 1590212282
390 | 0230606326
391 | 0873384660
392 | 1559363320
393 | 0979485215
394 | 0395875099
395 | 0321198433
396 | 0573695393
397 | 0842028293
398 | 0936839228
399 | 1573441880
400 | 1558853316
401 | 1932663177
402 | 0393316718
403 | 023105419X
404 | 0940450666
405 | 061853301X
406 | 0060749539
407 | 029598824X
408 | 0674030915
409 | 1555533019
410 | 158648009X
411 | 0807136360
412 | 0070113963
413 | 0415234689
414 | 0252064100
415 | 0874807905
416 | 157525297X
417 | 1575254247
418 | 0814776558
419 | 0684802406
420 | 080213694X
421 | 1598530488
422 | 0873515420
423 | 1888160381
424 | 1555838278
425 | 0813122791
426 | 1932183701
427 | 1575254468
428 | 0671047566
429 | 1598875787
430 | 0910043078
431 | 1597140376
432 | 189251401X
433 | 0195020588
434 | 0806513497
435 | 0875652530
436 | 0809327635
437 | 0879058323
438 | 1567920780
439 | 0813007267
440 | 1587298716
441 | 0820327751
442 | 1583220224
443 | 0813190665
444 | 155936291X
445 | 188736675X
446 | 0873514378
447 | 1931082766
448 | 0156135396
449 | 1883011760
450 | 089886710X
451 | 0743257588
452 | 0872865010
453 | 0872865282
454 | 0415936829
455 | 0201846705
456 | 1888160438
457 | 082632889X
458 | 0826348181
459 | 0875653421
460 | 1557837120
461 | 1558853464
462 | 0271027215
463 | 0452011744
464 | 0870498762
465 | 0804011079
466 | 0674006038
467 | 0618249338
468 | 0806133678
469 | 0140230262
470 | 0673990176
471 | 1558491112
472 | 0573697558
473 | 1931082499
474 | 0945575939
475 | 1565849868
476 | 1580051588
477 | 0268041342
478 | 0881461830
479 | 0253214920
480 | 0060953225
481 | 1575253860
482 | 1585360953
483 | 0316347108
484 | 1566890012
485 | 1892514117
486 | 1566891418
487 | 0944350615
488 | 1931082529
489 | 0199273154
490 | 0814321429
491 | 0295977469
492 | 0822955601
493 | 0878054790
494 | 0978848934
495 | 0520216849
496 | 0275935671
497 | 0877457956
498 | 159179188X
499 | 1573240370
500 | 1560253789
501 | 


--------------------------------------------------------------------------------
/Labs/Lab 3 AWS EC2 and PyWren/pywren_workflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jonclindaniel/LargeScaleComputing_S21/1416f4c405dbe386ed8a0238f0eccaa7b087b1ca/Labs/Lab 3 AWS EC2 and PyWren/pywren_workflow.png


--------------------------------------------------------------------------------
/Labs/Lab 4 Storing and Structuring Large Data/5W_large_scale_databases.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Implementing Large-Scale Database Solutions\n",
  8 |     "## DynamoDB\n",
  9 |     "First, let's create a DynamoDB table. Let's say that we're collecting and storing streaming Twitter data in our database. We'll use Twitter 'username' as our primary key here, since this will be unique to each user and will make for a good input for DynamoDB's hash function (you can also specify a sort key if you would like, though). We'll also set our Read and Write Capacity down to the minimum for this demo, but you can [scale this up](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadWriteCapacityMode.html) if you need more throughput for your application."
 10 |    ]
 11 |   },
 12 |   {
 13 |    "cell_type": "code",
 14 |    "execution_count": 1,
 15 |    "metadata": {},
 16 |    "outputs": [
 17 |     {
 18 |      "name": "stdout",
 19 |      "output_type": "stream",
 20 |      "text": [
 21 |       "0\n",
 22 |       "2021-04-24 12:55:12.129000-05:00\n"
 23 |      ]
 24 |     }
 25 |    ],
 26 |    "source": [
 27 |     "import boto3\n",
 28 |     "\n",
 29 |     "dynamodb = boto3.resource('dynamodb')\n",
 30 |     "\n",
 31 |     "table = dynamodb.create_table(\n",
 32 |     "    TableName='twitter',\n",
 33 |     "    KeySchema=[\n",
 34 |     "        {\n",
 35 |     "            'AttributeName': 'username',\n",
 36 |     "            'KeyType': 'HASH'\n",
 37 |     "        }\n",
 38 |     "    ],\n",
 39 |     "    AttributeDefinitions=[\n",
 40 |     "        {\n",
 41 |     "            'AttributeName': 'username',\n",
 42 |     "            'AttributeType': 'S'\n",
 43 |     "        }\n",
 44 |     "    ],\n",
 45 |     "    ProvisionedThroughput={\n",
 46 |     "        'ReadCapacityUnits': 1,\n",
 47 |     "        'WriteCapacityUnits': 1\n",
 48 |     "    }    \n",
 49 |     ")\n",
 50 |     "\n",
 51 |     "# Wait until AWS confirms that table exists before moving on\n",
 52 |     "table.meta.client.get_waiter('table_exists').wait(TableName='twitter')\n",
 53 |     "\n",
 54 |     "# get data about table (should currently be no items in table)\n",
 55 |     "print(table.item_count)\n",
 56 |     "print(table.creation_date_time)"
 57 |    ]
 58 |   },
 59 |   {
 60 |    "cell_type": "markdown",
 61 |    "metadata": {},
 62 |    "source": [
 63 |     "OK, so we currently have an empty DynamoDB table. Let's actually put some items into our table:"
 64 |    ]
 65 |   },
 66 |   {
 67 |    "cell_type": "code",
 68 |    "execution_count": 2,
 69 |    "metadata": {},
 70 |    "outputs": [
 71 |     {
 72 |      "data": {
 73 |       "text/plain": [
 74 |        "{'ResponseMetadata': {'RequestId': '0BDCGRSJQE4725IML53BV0RJAJVV4KQNSO5AEMVJF66Q9ASUAAJG',\n",
 75 |        "  'HTTPStatusCode': 200,\n",
 76 |        "  'HTTPHeaders': {'server': 'Server',\n",
 77 |        "   'date': 'Sat, 24 Apr 2021 17:55:32 GMT',\n",
 78 |        "   'content-type': 'application/x-amz-json-1.0',\n",
 79 |        "   'content-length': '2',\n",
 80 |        "   'connection': 'keep-alive',\n",
 81 |        "   'x-amzn-requestid': '0BDCGRSJQE4725IML53BV0RJAJVV4KQNSO5AEMVJF66Q9ASUAAJG',\n",
 82 |        "   'x-amz-crc32': '2745614147'},\n",
 83 |        "  'RetryAttempts': 0}}"
 84 |       ]
 85 |      },
 86 |      "execution_count": 2,
 87 |      "metadata": {},
 88 |      "output_type": "execute_result"
 89 |     }
 90 |    ],
 91 |    "source": [
 92 |     "table.put_item(\n",
 93 |     "   Item={\n",
 94 |     "        'username': 'macs30123',\n",
 95 |     "        'num_followers': 100,\n",
 96 |     "        'num_tweets': 5\n",
 97 |     "    }\n",
 98 |     ")\n",
 99 |     "\n",
100 |     "table.put_item(\n",
101 |     "   Item={\n",
102 |     "        'username': 'jon_c',\n",
103 |     "        'num_followers': 10,\n",
104 |     "        'num_tweets': 0\n",
105 |     "    }\n",
106 |     ")"
107 |    ]
108 |   },
109 |   {
110 |    "cell_type": "markdown",
111 |    "metadata": {},
112 |    "source": [
113 |     "We can then easily get items from our table using the `get_item` method and providing our key:"
114 |    ]
115 |   },
116 |   {
117 |    "cell_type": "code",
118 |    "execution_count": 3,
119 |    "metadata": {},
120 |    "outputs": [
121 |     {
122 |      "name": "stdout",
123 |      "output_type": "stream",
124 |      "text": [
125 |       "{'num_tweets': Decimal('5'), 'num_followers': Decimal('100'), 'username': 'macs30123'}\n"
126 |      ]
127 |     }
128 |    ],
129 |    "source": [
130 |     "response = table.get_item(\n",
131 |     "    Key={\n",
132 |     "        'username': 'macs30123'\n",
133 |     "    }\n",
134 |     ")\n",
135 |     "item = response['Item']\n",
136 |     "print(item)"
137 |    ]
138 |   },
139 |   {
140 |    "cell_type": "markdown",
141 |    "metadata": {},
142 |    "source": [
143 |     "We can also update existing items using the `update_item` method:"
144 |    ]
145 |   },
146 |   {
147 |    "cell_type": "code",
148 |    "execution_count": 4,
149 |    "metadata": {},
150 |    "outputs": [
151 |     {
152 |      "data": {
153 |       "text/plain": [
154 |        "{'ResponseMetadata': {'RequestId': '8632M7FSUI78RENPC4O73IIN37VV4KQNSO5AEMVJF66Q9ASUAAJG',\n",
155 |        "  'HTTPStatusCode': 200,\n",
156 |        "  'HTTPHeaders': {'server': 'Server',\n",
157 |        "   'date': 'Sat, 24 Apr 2021 17:55:32 GMT',\n",
158 |        "   'content-type': 'application/x-amz-json-1.0',\n",
159 |        "   'content-length': '2',\n",
160 |        "   'connection': 'keep-alive',\n",
161 |        "   'x-amzn-requestid': '8632M7FSUI78RENPC4O73IIN37VV4KQNSO5AEMVJF66Q9ASUAAJG',\n",
162 |        "   'x-amz-crc32': '2745614147'},\n",
163 |        "  'RetryAttempts': 0}}"
164 |       ]
165 |      },
166 |      "execution_count": 4,
167 |      "metadata": {},
168 |      "output_type": "execute_result"
169 |     }
170 |    ],
171 |    "source": [
172 |     "table.update_item(\n",
173 |     "    Key={\n",
174 |     "        'username': 'macs30123'\n",
175 |     "    },\n",
176 |     "    UpdateExpression='SET num_tweets = :val1',\n",
177 |     "    ExpressionAttributeValues={\n",
178 |     "        ':val1': 6\n",
179 |     "    }\n",
180 |     ")"
181 |    ]
182 |   },
183 |   {
184 |    "cell_type": "markdown",
185 |    "metadata": {},
186 |    "source": [
187 |     "Then, if we take a look again at this item, we'll see that it's been updated (note, though, that DynamoDB tables are [*eventually consistent* unless we specify otherwise](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadConsistency.html), so this might not always return the expected result immediately):"
188 |    ]
189 |   },
190 |   {
191 |    "cell_type": "code",
192 |    "execution_count": 5,
193 |    "metadata": {},
194 |    "outputs": [
195 |     {
196 |      "name": "stdout",
197 |      "output_type": "stream",
198 |      "text": [
199 |       "{'num_followers': Decimal('100'), 'num_tweets': Decimal('6'), 'username': 'macs30123'}\n"
200 |      ]
201 |     }
202 |    ],
203 |    "source": [
204 |     "response = table.get_item(\n",
205 |     "    Key={\n",
206 |     "        'username': 'macs30123'\n",
207 |     "    }\n",
208 |     ")\n",
209 |     "item = response['Item']\n",
210 |     "print(item)"
211 |    ]
212 |   },
213 |   {
214 |    "cell_type": "markdown",
215 |    "metadata": {},
216 |    "source": [
217 |     "Note as well, that even though it is not optimal to perform complicated queries in DynamoDB tables, we can write and run SQL-like queries to run again our DynamoDB tables if we want to:"
218 |    ]
219 |   },
220 |   {
221 |    "cell_type": "code",
222 |    "execution_count": 6,
223 |    "metadata": {},
224 |    "outputs": [
225 |     {
226 |      "name": "stdout",
227 |      "output_type": "stream",
228 |      "text": [
229 |       "[{'num_followers': Decimal('100'), 'num_tweets': Decimal('6'), 'username': 'macs30123'}]\n"
230 |      ]
231 |     }
232 |    ],
233 |    "source": [
234 |     "response = table.meta.client.execute_statement(\n",
235 |     "    Statement='''\n",
236 |     "              SELECT *\n",
237 |     "              FROM twitter\n",
238 |     "              WHERE num_followers > 20\n",
239 |     "              '''\n",
240 |     ")\n",
241 |     "item = response['Items']\n",
242 |     "print(item)"
243 |    ]
244 |   },
245 |   {
246 |    "cell_type": "markdown",
247 |    "metadata": {},
248 |    "source": [
249 |     "Finally, you should make sure to delete your table (if you no longer plan to use it), so that you do not incur further charges while it is running:"
250 |    ]
251 |   },
252 |   {
253 |    "cell_type": "code",
254 |    "execution_count": 7,
255 |    "metadata": {},
256 |    "outputs": [
257 |     {
258 |      "data": {
259 |       "text/plain": [
260 |        "{'TableDescription': {'TableName': 'twitter',\n",
261 |        "  'TableStatus': 'DELETING',\n",
262 |        "  'ProvisionedThroughput': {'NumberOfDecreasesToday': 0,\n",
263 |        "   'ReadCapacityUnits': 1,\n",
264 |        "   'WriteCapacityUnits': 1},\n",
265 |        "  'TableSizeBytes': 0,\n",
266 |        "  'ItemCount': 0,\n",
267 |        "  'TableArn': 'arn:aws:dynamodb:us-east-1:009068789081:table/twitter',\n",
268 |        "  'TableId': '7bad4d6e-c716-4cf6-a3fa-5fd0738f3258'},\n",
269 |        " 'ResponseMetadata': {'RequestId': 'S6NO2AI592GHJMIH01Q6IO594BVV4KQNSO5AEMVJF66Q9ASUAAJG',\n",
270 |        "  'HTTPStatusCode': 200,\n",
271 |        "  'HTTPHeaders': {'server': 'Server',\n",
272 |        "   'date': 'Sat, 24 Apr 2021 17:55:32 GMT',\n",
273 |        "   'content-type': 'application/x-amz-json-1.0',\n",
274 |        "   'content-length': '316',\n",
275 |        "   'connection': 'keep-alive',\n",
276 |        "   'x-amzn-requestid': 'S6NO2AI592GHJMIH01Q6IO594BVV4KQNSO5AEMVJF66Q9ASUAAJG',\n",
277 |        "   'x-amz-crc32': '2358790140'},\n",
278 |        "  'RetryAttempts': 0}}"
279 |       ]
280 |      },
281 |      "execution_count": 7,
282 |      "metadata": {},
283 |      "output_type": "execute_result"
284 |     }
285 |    ],
286 |    "source": [
287 |     "table.delete()"
288 |    ]
289 |   },
290 |   {
291 |    "cell_type": "markdown",
292 |    "metadata": {},
293 |    "source": [
294 |     "## RDS\n",
295 |     "\n",
296 |     "We can also create and interact with scalable cloud relational databases via `boto3`. Let's launch a MySQL database via AWS's RDS service. Note that we can explicitly scale up the hardware (e.g. instance class, and allocated storage) for our database via the `create_db_instance` parameters. We can also add additional read replicas of our database instance that we launch via [the `create_db_instance_read_replica` method](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/rds.html#RDS.Client.create_db_instance_read_replica) or create a cluster of a certain size from the start using [the `create_db_cluster` method](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/rds.html#RDS.Client.create_db_cluster)."
297 |    ]
298 |   },
299 |   {
300 |    "cell_type": "code",
301 |    "execution_count": 8,
302 |    "metadata": {},
303 |    "outputs": [
304 |     {
305 |      "name": "stdout",
306 |      "output_type": "stream",
307 |      "text": [
308 |       "relational-db is available at relational-db.cr2k4l3gqki8.us-east-1.rds.amazonaws.com on Port 3306\n"
309 |      ]
310 |     }
311 |    ],
312 |    "source": [
313 |     "rds = boto3.client('rds')\n",
314 |     "\n",
315 |     "response = rds.create_db_instance(\n",
316 |     "    DBInstanceIdentifier='relational-db',\n",
317 |     "    DBName='twitter',\n",
318 |     "    MasterUsername='username',\n",
319 |     "    MasterUserPassword='password',\n",
320 |     "    DBInstanceClass='db.t2.micro',\n",
321 |     "    Engine='MySQL',\n",
322 |     "    AllocatedStorage=5\n",
323 |     ")\n",
324 |     "\n",
325 |     "# Wait until DB is available to continue\n",
326 |     "rds.get_waiter('db_instance_available').wait(DBInstanceIdentifier='relational-db')\n",
327 |     "\n",
328 |     "# Describe where DB is available and on what port\n",
329 |     "db = rds.describe_db_instances()['DBInstances'][0]\n",
330 |     "ENDPOINT = db['Endpoint']['Address']\n",
331 |     "PORT = db['Endpoint']['Port']\n",
332 |     "DBID = db['DBInstanceIdentifier']\n",
333 |     "\n",
334 |     "print(DBID,\n",
335 |     "      \"is available at\", ENDPOINT,\n",
336 |     "      \"on Port\", PORT,\n",
337 |     "     )   "
338 |    ]
339 |   },
340 |   {
341 |    "cell_type": "markdown",
342 |    "metadata": {},
343 |    "source": [
344 |     "In order to access our MySQL database, we'll need to adjust some security settings associated with our server, though. By default, we're not able to access port 3306 on our database server over the internet and we will need to change this setting in order to connect to our database from our local machine. In practice, you should limit the allowed IP range as much as possible (to your home or office, for example) to avoid intruders from connecting to your databases. For the purposes of this demo, though, I am going to make it possible to connect to my database from anywhere on the internet (IP range 0.0.0.0/0):"
345 |    ]
346 |   },
347 |   {
348 |    "cell_type": "code",
349 |    "execution_count": 9,
350 |    "metadata": {},
351 |    "outputs": [
352 |     {
353 |      "name": "stdout",
354 |      "output_type": "stream",
355 |      "text": [
356 |       "Permissions already adjusted.\n"
357 |      ]
358 |     }
359 |    ],
360 |    "source": [
361 |     "# Get Name of Security Group\n",
362 |     "SGNAME = db['VpcSecurityGroups'][0]['VpcSecurityGroupId']\n",
363 |     "\n",
364 |     "# Adjust Permissions for that security group so that we can access it on Port 3306\n",
365 |     "# If already SG is already adjusted, print this out\n",
366 |     "try:\n",
367 |     "    ec2 = boto3.client('ec2')\n",
368 |     "    data = ec2.authorize_security_group_ingress(\n",
369 |     "            GroupId=SGNAME,\n",
370 |     "            IpPermissions=[\n",
371 |     "                {'IpProtocol': 'tcp',\n",
372 |     "                 'FromPort': PORT,\n",
373 |     "                 'ToPort': PORT,\n",
374 |     "                 'IpRanges': [{'CidrIp': '0.0.0.0/0'}]}\n",
375 |     "            ]\n",
376 |     "    )\n",
377 |     "except ec2.exceptions.ClientError as e:\n",
378 |     "    if e.response[\"Error\"][\"Code\"] == 'InvalidPermission.Duplicate':\n",
379 |     "        print(\"Permissions already adjusted.\")\n",
380 |     "    else:\n",
381 |     "        print(e)"
382 |    ]
383 |   },
384 |   {
385 |    "cell_type": "markdown",
386 |    "metadata": {},
387 |    "source": [
388 |     "Alright, we're ready to connect to our database! This is a MySQL database, so let's install a Python package that will allow us to effectively handle this connection:"
389 |    ]
390 |   },
391 |   {
392 |    "cell_type": "code",
393 |    "execution_count": 10,
394 |    "metadata": {},
395 |    "outputs": [
396 |     {
397 |      "name": "stdout",
398 |      "output_type": "stream",
399 |      "text": [
400 |       "Requirement already satisfied: mysql-connector-python in ./anaconda3/envs/macs30123/lib/python3.7/site-packages (8.0.24)\r\n",
401 |       "Requirement already satisfied: protobuf>=3.0.0 in ./anaconda3/envs/macs30123/lib/python3.7/site-packages (from mysql-connector-python) (3.14.0)\r\n",
402 |       "Requirement already satisfied: six>=1.9 in ./anaconda3/envs/macs30123/lib/python3.7/site-packages (from protobuf>=3.0.0->mysql-connector-python) (1.15.0)\r\n"
403 |      ]
404 |     }
405 |    ],
406 |    "source": [
407 |     "! pip install mysql-connector-python # Install mysql-connector if you haven't already"
408 |    ]
409 |   },
410 |   {
411 |    "cell_type": "markdown",
412 |    "metadata": {},
413 |    "source": [
414 |     "Then, we can just connect to the database and run queries in the same way that you have seen while working with SQLite databases in CS 122 (using the SQLite3 package). Very cool!"
415 |    ]
416 |   },
417 |   {
418 |    "cell_type": "code",
419 |    "execution_count": 11,
420 |    "metadata": {},
421 |    "outputs": [],
422 |    "source": [
423 |     "import mysql.connector\n",
424 |     "conn =  mysql.connector.connect(host=ENDPOINT, user=\"username\", passwd=\"password\", port=PORT, database='twitter')\n",
425 |     "cur = conn.cursor()"
426 |    ]
427 |   },
428 |   {
429 |    "cell_type": "code",
430 |    "execution_count": 12,
431 |    "metadata": {},
432 |    "outputs": [],
433 |    "source": [
434 |     "create_table = '''\n",
435 |     "               CREATE TABLE IF NOT EXISTS users (\n",
436 |     "                   username VARCHAR(10),\n",
437 |     "                   num_followers INT,\n",
438 |     "                   num_tweets INT,\n",
439 |     "                   PRIMARY KEY (username)\n",
440 |     "               )\n",
441 |     "               '''\n",
442 |     "insert_data  = '''\n",
443 |     "               INSERT INTO users (username, num_followers, num_tweets)\n",
444 |     "               VALUES \n",
445 |     "                   ('macs30123', 100, 5),\n",
446 |     "                   ('jon_c', 10, 0)\n",
447 |     "               '''\n",
448 |     "\n",
449 |     "for op in [create_table, insert_data]:\n",
450 |     "    cur.execute(op)"
451 |    ]
452 |   },
453 |   {
454 |    "cell_type": "markdown",
455 |    "metadata": {},
456 |    "source": [
457 |     "Our relational database is optimized for performing small, fast queries like these and will tend to out-perform our DynamoDB table at these kinds of operations:"
458 |    ]
459 |   },
460 |   {
461 |    "cell_type": "code",
462 |    "execution_count": 13,
463 |    "metadata": {},
464 |    "outputs": [
465 |     {
466 |      "name": "stdout",
467 |      "output_type": "stream",
468 |      "text": [
469 |       "[('jon_c', 10, 0), ('macs30123', 100, 5)]\n"
470 |      ]
471 |     }
472 |    ],
473 |    "source": [
474 |     "cur.execute('''SELECT * FROM users''')\n",
475 |     "query_results = cur.fetchall()\n",
476 |     "print(query_results)"
477 |    ]
478 |   },
479 |   {
480 |    "cell_type": "code",
481 |    "execution_count": 14,
482 |    "metadata": {},
483 |    "outputs": [
484 |     {
485 |      "name": "stdout",
486 |      "output_type": "stream",
487 |      "text": [
488 |       "[('macs30123',)]\n"
489 |      ]
490 |     }
491 |    ],
492 |    "source": [
493 |     "cur.execute('''SELECT username FROM users WHERE num_followers > 20''')\n",
494 |     "query_results = cur.fetchall()\n",
495 |     "print(query_results)"
496 |    ]
497 |   },
498 |   {
499 |    "cell_type": "markdown",
500 |    "metadata": {},
501 |    "source": [
502 |     "Once we're done executing SQL queries on our MySQL database, we can close our connection to the database and delete the database on AWS so that we're no longer charged for it:"
503 |    ]
504 |   },
505 |   {
506 |    "cell_type": "code",
507 |    "execution_count": 15,
508 |    "metadata": {},
509 |    "outputs": [
510 |     {
511 |      "name": "stdout",
512 |      "output_type": "stream",
513 |      "text": [
514 |       "deleting\n",
515 |       "RDS Database has been deleted\n"
516 |      ]
517 |     }
518 |    ],
519 |    "source": [
520 |     "conn.close()\n",
521 |     "response = rds.delete_db_instance(DBInstanceIdentifier='relational-db',\n",
522 |     "                       SkipFinalSnapshot=True\n",
523 |     "                      )\n",
524 |     "print(response['DBInstance']['DBInstanceStatus'])\n",
525 |     "\n",
526 |     "# wait until DB is deleted before proceeding\n",
527 |     "rds.get_waiter('db_instance_deleted').wait(DBInstanceIdentifier='relational-db')\n",
528 |     "print(\"RDS Database has been deleted\")"
529 |    ]
530 |   },
531 |   {
532 |    "cell_type": "markdown",
533 |    "metadata": {},
534 |    "source": [
535 |     "# Data Warehousing with Redshift\n",
536 |     "\n",
537 |     "When you need to run especially big queries against large datasets, it can make sense to perform these in a Data Warehouse like AWS Redshift. Recall that Redshift clusters organize our data in columnar storage (instead of rows, like a standard relational database) and can efficiently perform operations on these columns in parallel.\n",
538 |     "\n",
539 |     "Let's spin up a Redshift cluster to see how this works (for our small Twitter demonstration data). Notice that we do need to provide the particular type of hardware that we want each one of our nodes to be, as well as the number of nodes that we want to include in our cluster (we can increase this for greater parallelism and storage capacity). For this demo, let's just select a two of one of the smaller nodes."
540 |    ]
541 |   },
542 |   {
543 |    "cell_type": "code",
544 |    "execution_count": 16,
545 |    "metadata": {},
546 |    "outputs": [
547 |     {
548 |      "name": "stdout",
549 |      "output_type": "stream",
550 |      "text": [
551 |       "mycluster is available at mycluster.cvbbnglvrhqs.us-east-1.redshift.amazonaws.com on Port 5439\n"
552 |      ]
553 |     }
554 |    ],
555 |    "source": [
556 |     "redshift = boto3.client('redshift')\n",
557 |     "\n",
558 |     "response = redshift.create_cluster(\n",
559 |     "    ClusterIdentifier='myCluster',\n",
560 |     "    DBName='twitter',\n",
561 |     "    NodeType='dc1.large',\n",
562 |     "    NumberOfNodes=2,\n",
563 |     "    MasterUsername='username',\n",
564 |     "    MasterUserPassword='Password123'\n",
565 |     ")\n",
566 |     "\n",
567 |     "# Wait until cluster is available before proceeding\n",
568 |     "redshift.get_waiter('cluster_available').wait(ClusterIdentifier='myCluster')\n",
569 |     "\n",
570 |     "# Describe where cluster is available and on what port\n",
571 |     "cluster = redshift.describe_clusters(ClusterIdentifier='myCluster')['Clusters'][0]\n",
572 |     "ENDPOINT = cluster['Endpoint']['Address']\n",
573 |     "PORT = cluster['Endpoint']['Port']\n",
574 |     "CLUSTERID = cluster['ClusterIdentifier']\n",
575 |     "\n",
576 |     "print(CLUSTERID,\n",
577 |     "      \"is available at\", ENDPOINT,\n",
578 |     "      \"on Port\", PORT,\n",
579 |     "     )"
580 |    ]
581 |   },
582 |   {
583 |    "cell_type": "markdown",
584 |    "metadata": {},
585 |    "source": [
586 |     "Again, we'll need to make sure that we can connect with our cluster from our local machine. For the purposes of this demo, we'll open the port up to the Internet (although, again, you should only allow a narrow IP range in your own applications)."
587 |    ]
588 |   },
589 |   {
590 |    "cell_type": "code",
591 |    "execution_count": 17,
592 |    "metadata": {},
593 |    "outputs": [
594 |     {
595 |      "name": "stdout",
596 |      "output_type": "stream",
597 |      "text": [
598 |       "Permissions already adjusted.\n"
599 |      ]
600 |     }
601 |    ],
602 |    "source": [
603 |     "# Get Name of Security Group\n",
604 |     "SGNAME = cluster['VpcSecurityGroups'][0]['VpcSecurityGroupId']\n",
605 |     "\n",
606 |     "# Adjust Permissions for that security group so that we can access it on Port 3306\n",
607 |     "# If already SG is already adjusted, print this out\n",
608 |     "try:\n",
609 |     "    ec2 = boto3.client('ec2')\n",
610 |     "    data = ec2.authorize_security_group_ingress(\n",
611 |     "            GroupId=SGNAME,\n",
612 |     "            IpPermissions=[\n",
613 |     "                {'IpProtocol': 'tcp',\n",
614 |     "                 'FromPort': PORT,\n",
615 |     "                 'ToPort': PORT,\n",
616 |     "                 'IpRanges': [{'CidrIp': '0.0.0.0/0'}]}\n",
617 |     "            ]\n",
618 |     "    )\n",
619 |     "except ec2.exceptions.ClientError as e:\n",
620 |     "    if e.response[\"Error\"][\"Code\"] == 'InvalidPermission.Duplicate':\n",
621 |     "        print(\"Permissions already adjusted.\")\n",
622 |     "    else:\n",
623 |     "        print(e)"
624 |    ]
625 |   },
626 |   {
627 |    "cell_type": "markdown",
628 |    "metadata": {},
629 |    "source": [
630 |     "Redshift was originally forked from PostgreSQL, so the best way to connect with it is via a PostgreSQL Python adaptor (rather than the MySQL adaptor we used previously). We'll use `psycopg2` here."
631 |    ]
632 |   },
633 |   {
634 |    "cell_type": "code",
635 |    "execution_count": 18,
636 |    "metadata": {},
637 |    "outputs": [
638 |     {
639 |      "name": "stdout",
640 |      "output_type": "stream",
641 |      "text": [
642 |       "Requirement already satisfied: psycopg2 in ./anaconda3/envs/macs30123/lib/python3.7/site-packages (2.7.7)\r\n"
643 |      ]
644 |     }
645 |    ],
646 |    "source": [
647 |     "! pip install psycopg2"
648 |    ]
649 |   },
650 |   {
651 |    "cell_type": "markdown",
652 |    "metadata": {},
653 |    "source": [
654 |     "Note that once we import the package and connect, we can use the same workflow that we used for our MySQL database (and our local SQLite databases in CS 122) to execute SQL queries:"
655 |    ]
656 |   },
657 |   {
658 |    "cell_type": "code",
659 |    "execution_count": 19,
660 |    "metadata": {},
661 |    "outputs": [
662 |     {
663 |      "name": "stderr",
664 |      "output_type": "stream",
665 |      "text": [
666 |       "/home/jclindaniel/anaconda3/envs/macs30123/lib/python3.7/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use \"pip install psycopg2-binary\" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.\n",
667 |       "  \"\"\")\n"
668 |      ]
669 |     }
670 |    ],
671 |    "source": [
672 |     "import psycopg2\n",
673 |     "conn = psycopg2.connect(dbname='twitter', host=ENDPOINT, user=\"username\", password=\"Password123\", port=PORT)\n",
674 |     "cur = conn.cursor()\n",
675 |     "\n",
676 |     "for op in [create_table, insert_data]:\n",
677 |     "    cur.execute(op)"
678 |    ]
679 |   },
680 |   {
681 |    "cell_type": "code",
682 |    "execution_count": 20,
683 |    "metadata": {},
684 |    "outputs": [
685 |     {
686 |      "name": "stdout",
687 |      "output_type": "stream",
688 |      "text": [
689 |       "[('macs30123', 100, 5), ('jon_c', 10, 0)]\n"
690 |      ]
691 |     }
692 |    ],
693 |    "source": [
694 |     "cur.execute('''SELECT * FROM users''')\n",
695 |     "query_results = cur.fetchall()\n",
696 |     "print(query_results)"
697 |    ]
698 |   },
699 |   {
700 |    "cell_type": "code",
701 |    "execution_count": 21,
702 |    "metadata": {},
703 |    "outputs": [
704 |     {
705 |      "name": "stdout",
706 |      "output_type": "stream",
707 |      "text": [
708 |       "[('macs30123',)]\n"
709 |      ]
710 |     }
711 |    ],
712 |    "source": [
713 |     "cur.execute('''SELECT username FROM users WHERE num_followers > 20''')\n",
714 |     "query_results = cur.fetchall()\n",
715 |     "print(query_results)"
716 |    ]
717 |   },
718 |   {
719 |    "cell_type": "markdown",
720 |    "metadata": {},
721 |    "source": [
722 |     "Then, once we're done, we can close our connection and delete our Redshift cluster in the same way as our RDS instance:"
723 |    ]
724 |   },
725 |   {
726 |    "cell_type": "code",
727 |    "execution_count": 22,
728 |    "metadata": {},
729 |    "outputs": [
730 |     {
731 |      "name": "stdout",
732 |      "output_type": "stream",
733 |      "text": [
734 |       "deleting\n",
735 |       "Redshift Cluster has been deleted\n"
736 |      ]
737 |     }
738 |    ],
739 |    "source": [
740 |     "conn.close()\n",
741 |     "response = redshift.delete_cluster(ClusterIdentifier='myCluster',\n",
742 |     "                       SkipFinalClusterSnapshot=True\n",
743 |     "                      )\n",
744 |     "print(response['Cluster']['ClusterStatus'])\n",
745 |     "\n",
746 |     "redshift.get_waiter('cluster_deleted').wait(ClusterIdentifier='myCluster')\n",
747 |     "print(\"Redshift Cluster has been deleted\")"
748 |    ]
749 |   }
750 |  ],
751 |  "metadata": {
752 |   "@webio": {
753 |    "lastCommId": null,
754 |    "lastKernelId": null
755 |   },
756 |   "kernelspec": {
757 |    "display_name": "Python 3",
758 |    "language": "python",
759 |    "name": "python3"
760 |   },
761 |   "language_info": {
762 |    "codemirror_mode": {
763 |     "name": "ipython",
764 |     "version": 3
765 |    },
766 |    "file_extension": ".py",
767 |    "mimetype": "text/x-python",
768 |    "name": "python",
769 |    "nbconvert_exporter": "python",
770 |    "pygments_lexer": "ipython3",
771 |    "version": "3.7.10"
772 |   }
773 |  },
774 |  "nbformat": 4,
775 |  "nbformat_minor": 4
776 | }
777 | 


--------------------------------------------------------------------------------
/Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part I MapReduce/.mrjob.conf:
--------------------------------------------------------------------------------
 1 | runners:
 2 |   emr:
 3 |     # Specify a pem key to start up an EMR cluster on your behalf
 4 |     ec2_key_pair: MACS_30123
 5 |     ec2_key_pair_file: ~/MACS_30123.pem
 6 | 
 7 |     # Specify type/# of EC2 instances you want your code to run on
 8 |     core_instance_type: m5.xlarge
 9 |     num_core_instances: 3
10 |     region: us-east-1
11 | 
12 |     # if cluster idles longer than 60 minutes, terminate the cluster
13 |     max_mins_idle: 60.0
14 | 
15 |     # to read from/write to S3; note colons instead of "=":
16 |     aws_access_key_id: <your key ID>
17 |     aws_secret_access_key: <your secret>
18 |     aws_session_token: <your session token>
19 | 


--------------------------------------------------------------------------------
/Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part I MapReduce/mapreduce_lab5.py:
--------------------------------------------------------------------------------
 1 | '''
 2 | Lab 5 (Part I): Batch Processing Data with MapReduce
 3 | 
 4 | What's the most used word in 5-star customer reviews on Amazon?
 5 | 
 6 | We can answer this question using the mrjob package to investigate customer
 7 | reviews available as a part of Amazon's public S3 customer reviews dataset.
 8 | 
 9 | For this demo, we'll use a small sample of this 100m+ review dataset that
10 | Amazon provides (s3://amazon-reviews-pds/tsv/sample_us.tsv).
11 | 
12 | In order to run the code below, be sure to `pip install mrjob` if you have not 
13 | done so already.
14 | '''
15 | 
16 | from mrjob.job import MRJob
17 | from mrjob.step import MRStep
18 | import re
19 | 
20 | WORD_RE = re.compile(r"[\w']+")
21 | 
22 | class MRMostUsedWord(MRJob):
23 | 
24 |     def mapper_get_words(self, _, row):
25 |         '''
26 |         If a review's star rating is 5, yield all of the words in the review
27 |         '''
28 |         data = row.split('\t')
29 |         if data[7] == '5':
30 |             for word in WORD_RE.findall(data[13]):
31 |                 yield (word.lower(), 1)
32 | 
33 |     def combiner_count_words(self, word, counts):
34 |         '''
35 |         Sum all of the words available so far
36 |         '''
37 |         yield (word, sum(counts))
38 | 
39 |     def reducer_count_words(self, word, counts):
40 |         '''
41 |         Arrive at a total count for each word in the 5 star reviews
42 |         '''
43 |         yield None, (sum(counts), word)
44 | 
45 |     # discard the key; it is just None
46 |     def reducer_find_max_word(self, _, word_count_pairs):
47 |         '''
48 |         Yield the word that occurs the most in the 5 star reviews
49 |         '''
50 |         yield max(word_count_pairs)
51 | 
52 |     def steps(self):
53 |         return [
54 |             MRStep(mapper=self.mapper_get_words,
55 |                    combiner=self.combiner_count_words,
56 |                    reducer=self.reducer_count_words),
57 |             MRStep(reducer=self.reducer_find_max_word)
58 |         ]
59 | 
60 | if __name__ == '__main__':
61 |     MRMostUsedWord.run()
62 | 


--------------------------------------------------------------------------------
/Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part I MapReduce/mrjob_cheatsheet.md:
--------------------------------------------------------------------------------
 1 | # mrjob Cheat Sheet
 2 | 
 3 | To run/debug `mrjob` code locally from your command line:
 4 | 
 5 | ```
 6 | python mapreduce_lab5.py sample_us.tsv
 7 | ```
 8 | 
 9 | To run your `mrjob` code on an AWS EMR cluster, you should first ensure that your configuration file is set with your EC2 pem file name and file location, as well as your current credentials from AWS Education/Vocareum. Note that the credentials are listed with ":" here and not "=" as they are in your `credentials` file. `mrjob` assumes that this (`.mrjob.conf`) file will be located in your home directory (at `~/.mrjob.conf`), so you will need to put the file there. Otherwise, you will need to designate your configuration [as a command line option](https://mrjob.readthedocs.io/en/latest/cmd.html#create-cluster) when you start your `mrjob` job using the `-c` flag.
10 | 
11 | ~/.mrjob.conf
12 | ```
13 | runners:
14 |   emr:
15 |     # Specify a pem key to start up an EMR cluster on your behalf
16 |     ec2_key_pair: MACS_30123
17 |     ec2_key_pair_file: ~/MACS_30123.pem
18 | 
19 |     # Specify type/# of EC2 instances you want your code to run on
20 |     core_instance_type: m5.xlarge
21 |     num_core_instances: 3
22 |     region: us-east-1
23 | 
24 |     # to read from/write to S3; note ":" instead of "=" from `credentials`:
25 |     aws_access_key_id: <your key ID>
26 |     aws_secret_access_key: <your secret>
27 |     aws_session_token: <your session token>
28 | ```
29 | 
30 | To run your `mrjob` code on an EMR cluster (of the size and type specified in your configuration file), you can run the following command on the command line (Note: before running this code, ensure that you have created the default IAM "EMR roles" for your cluster via the AWS console by following the instructions in the lab video for today of starting up and terminating an EMR cluster). Note that this command will start up a cluster, run your job, and then automatically terminate the cluster for you.
31 | 
32 | ```
33 | python mapreduce_lab5.py -r emr sample_us.tsv
34 | ```
35 | 
36 | To create a stand-alone cluster that you can use to run multiple jobs on, you can run:
37 | ```
38 | mrjob create-cluster
39 | ```
40 | 
41 | When you create a cluster, `mrjob` will print out the ID number for the cluster ("j" followed by a bunch of numbers and characters). If you copy this job-id number and specify it after the `--cluster-id` flag on the command line, your code will be run on this already-running cluster. Here, we additionally write the results of our job out to a text file (`mr.out`) via the `> mr.out` addendum at the end of the line.
42 | 
43 | ```
44 | python mapreduce_lab5.py -r emr sample_us.tsv --cluster-id=<cluster-id #> > mr.out
45 | ```
46 | 
47 | When you're done running jobs on your cluster, you can terminate the cluster with the following command so that you don't have to pay for it any longer than you need to:
48 | 
49 | ```
50 | mrjob terminate-cluster <cluster-id #>
51 | ```
52 | 
53 | For additional configuration options, consult the [`mrjob` documentation](https://mrjob.readthedocs.io/en/latest/index.html).
54 | 


--------------------------------------------------------------------------------
/Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part I MapReduce/sample_us.tsv:
--------------------------------------------------------------------------------
 1 | marketplace	customer_id	review_id	product_id	product_parent	product_title	product_category	star_rating	helpful_votes	total_votes	vine	verified_purchase	review_headline	review_body	review_date
 2 | US	18778586	RDIJS7QYB6XNR	B00EDBY7X8	122952789	Monopoly Junior Board Game	Toys	5	0	0	N	Y	Five Stars	Excellent!!!	2015-08-31
 3 | US	24769659	R36ED1U38IELG8	B00D7JFOPC	952062646	56 Pieces of Wooden Train Track Compatible with All Major Train Brands	Toys	5	0	0	N	Y	Good quality track at excellent price	Great quality wooden track (better than some others we have tried). Perfect match to the various vintages of Thomas track that we already have. There is enough track here to have fun and get creative incorporating your key pieces with track splits, loops and bends.	2015-08-31
 4 | US	44331596	R1UE3RPRGCOLD	B002LHA74O	818126353	Super Jumbo Playing Cards by S&S Worldwide	Toys	2	1	1	N	Y	Two Stars	Cards are not as big as pictured.	2015-08-31
 5 | US	23310293	R298788GS6I901	B00ARPLCGY	261944918	Barbie Doll and Fashions Barbie Gift Set	Toys	5	0	0	N	Y	my daughter loved it and i liked the price and it came ...	my daughter loved it and i liked the price and it came to me rather than shopping with a ton of people around me. Amazon is the Best way to shop!	2015-08-31
 6 | US	38745832	RNX4EXOBBPN5	B00UZOPOFW	717410439	Emazing Lights eLite Flow Glow Sticks - Spinning Light LED Toy	Toys	1	1	1	N	Y	DONT BUY THESE!	Do not buy these! They break very fast I spun then for 15 minutes and the end flew off don't waste your money. They are made from cheap plastic and have cracks in them. Buy the poi balls they work a lot better if you only have limited funds.	2015-08-31
 7 | US	13394189	R3BPETL222LMIM	B009B7F6CA	873028700	Melissa & Doug Water Wow Coloring Book - Vehicles	Toys	5	0	0	N	Y	Five Stars	Great item. Pictures pop thru and add detail as &#34;painted.&#34;  Pictures dry and it can be repainted.	2015-08-31
 8 | US	2749569	R3SORMPJZO3F2J	B0101EHRSM	723424342	Big Bang Cosmic Pegasus (Pegasis) Metal 4D High Performance Generic Battling Top BB-105	Toys	3	2	2	N	Y	Three Stars	To keep together, had to use crazy  glue.	2015-08-31
 9 | US	41137196	R2RDOJQ0WBZCF6	B00407S11Y	383363775	Fun Express Insect Finger Puppets 12ct Toy	Toys	5	0	0	N	Y	Five Stars	I was pleased with the product.	2015-08-31
10 | US	433677	R2B8VBEPB4YEZ7	B00FGPU7U2	780517568	Fisher-Price Octonauts Shellington's On-The-Go Pod Toy	Toys	5	0	0	N	Y	Five Stars	Children like it	2015-08-31
11 | US	1297934	R1CB783I7B0U52	B0013OY0S0	269360126	Claw Climber Goliath/ Disney's Gargoyles	Toys	1	0	1	N	Y	Shame on the seller !!!	Showed up not how it's shown . Was someone's old toy. with paint on it.	2015-08-31
12 | US	52006292	R2D90RQQ3V8LH	B00519PJTW	493486387	100 Foot Multicolor Pennant Banner	Toys	5	0	0	N	Y	Five Stars	Really liked these. They were a little larger than I thought, but still fun.	2015-08-31
13 | US	32071052	R1Y4ZOUGFMJ327	B001TCY2DO	459122467	Pig Jumbo Foil Balloon	Toys	5	0	0	N	Y	Nice huge balloon	Nice huge balloon! Had my local grocery store fill it up for a very small fee, it was totally worth it!	2015-08-31
14 | US	7360347	R2BUV9QJI2A00X	B00DOQCWF8	226984155	Minecraft Animal Toy (6-Pack)	Toys	5	0	1	N	Y	Five Stars	Great deal	2015-08-31
15 | US	11613707	RSUHRJFJIRB3Z	B004C04I4I	375659886	Disney Baby: Eeyore Large Plush	Toys	4	0	0	N	Y	Four Stars	As Advertised	2015-08-31
16 | US	13545982	R1T96CG98BBA15	B00NWGEKBY	933734136	Team Losi 8IGHT-E RTR AVC Electric 4WD Buggy Vehicle (1/8 Scale)	Toys	3	2	4	N	Y	... servo so expect to spend 150 more on a good servo immediately be the stock one breaks right	Comes w a 15$ servo so expect to spend 150 more on a good servo immediately be the stock one breaks right away	2015-08-31
17 | US	43880421	R2ATXF4QQ30YW	B00000JS5S	341842639	Hot Wheels 48- Car storage Case With Easy Grip Carrying Case	Toys	5	0	0	N	Y	Five Stars	awesome ! Thanks!	2015-08-31
18 | US	1662075	R1YS3DS218NNMD	B00XPWXYDK	210135375	ZuZo 2.4GHz 4 CH 6 Axis Gyro RC Quadcopter Drone with Camera & LED Lights, 38 x 38 x 7cm	Toys	5	4	4	N	N	The closest relevance I have to items like these is while in the army I was trained ...	I got this item for me and my son to play around with. The closest relevance I have to items like these is while in the army I was trained in the camera rc bots. This thing is awesome we tested the range and got somewhere close to 50 yards without an issue. Getting the controls is a bit tricky at first but after about twenty minutes you get the feel for it. The drone comes just about fly ready you just have to sync the controller. I am definitely a fan of the drones now. Only concern I have is maybe a little more silent but other than that great buy.<br /><br />*Disclaimer I received this product at a discount for my unbiased review.	2015-08-31
19 | US	18461411	R2SDXLTLF92O0H	B00VPXX92W	705054378	Teenage Mutant Ninja Turtles T-Machines Tiger Claw in Safari Truck Diecast Vehicle	Toys	5	0	0	N	Y	Five Stars	It was a birthday present for my grandson and he LOVES IT!!	2015-08-31
20 | US	27225859	R4R337CCDWLNG	B00YRA3H4U	223420727	Franklin Sports MLB Fold Away Batting Tee	Toys	3	0	1	Y	N	Got wrong product in the shipment	Got a wrong product from Amazon Vine and unable to provide a good review. We received a pair of cute girls gloves and a baseball ball instead, while we were expecting a boys batting tee. The gloves are cute, however made for at least 6+ yrs or above...more likely 8-9 yrs old girls.<br /><br />Can't provide a fair review as we were not able to use the product.	2015-08-31
21 | US	20494593	R32Z6UA4S5Q630	B009T8BSQY	787701676	Alien Frontiers: Factions	Toys	1	0	0	N	Y	Overpriced.	You need expansion packs 3-5 if you want access to the player aids for the Factions expansion. The base game of Alien Frontiers just plays so much smoother than adding Factions with the expansion packs. All this will do is pigeonhole you into a certain path to victory.	2015-08-31
22 | US	6762003	R1H1HOVB44808I	B00PXWS1CY	996611871	Holy Stone F180C Mini RC Quadcopter Drone with Camera 2.4GHz 6-Axis Gyro Bonus Battery and 8 Blades	Toys	5	1	1	N	N	Five Stars	Awesome customer service and a cool little drone! Especially for the price!	2015-08-31
23 | US	25402244	R4UVQIRZ5T1FM	1591749352	741582499	Klutz Sticker Design Studio: Create Your Own Custom Stickers Craft Kit	Toys	4	1	2	N	Y	Great product for little girls!	I got these for my daughters for plane trip. I liked that the zipper pouch was attached for markers.  However, that pouch fell off.  But the girls have loved coloring their own stickers. Would def buy this again.	2015-08-31
24 | US	32910511	R226K8IJLRPTIR	B00V5DM3RE	587799706	Yoga Joes - Green Army Men Toys	Toys	5	0	1	N	Y	Creative and fun!	My girlfriend and I are both into yoga and I gave her a set of the Yoga Joes for her new home yoga room. When she saw them, she was impressed that I had found little green army men like her brother used to play with. Then she realized they were doing yoga and she almost exploded with delight. You should have seen the look on her face. Needless to say, the gift was a huge hit. They are absolutely brilliant!	2015-08-31
25 | US	18206299	R3031Q42BKAN7J	B00UMSVHD4	135383196	Lalaloopsy Girls Basic Doll- Prairie Dusty Trails	Toys	4	1	1	N	N	i like it but i absoloutely hate that some dolls don't ...	i like it but i absoloutely hate that some dolls don't have pets like this one so I'm not stoked and i really would have liked to see her pet	2015-08-31
26 | US	26599182	R44NP0QG6E98W	B00JLKI69W	375626298	WOW Toys Town Advent Calendar	Toys	3	1	1	N	Y	We love how well they are made	We have MANY Wow toys in our home.  We love how well they are made.  The advent calendar is an exception.  The plastic is thinner than our other wow toys and the barn animals won't even stand up on their own due to the head weighing more than the body and uneven base of the toy.  Very disappointing when we love the concept.  The story to read along with everyday is great.  I would have preferred quality (and would have paid more for it) instead of a cheap &#34;knock-off&#34; of the rest of their toy line.  It is very obvious by the weight of the toys sloppy paint jobs which items in our &#34;Wow Toys&#34; bin are part advent calendar vs. the rest of the toys we have.  Also there is a lot of overlap between the wow town advent calendar & winter wonderland calendar, which makes me want to look for alternatives for this year's advent calendar options.	2015-08-31
27 | US	128540	R24VKWVWUMV3M3	B004S8F7QM	829220659	Cards Against Humanity	Toys	5	0	0	N	Y	Five Stars	Tons of fun	2015-08-31
28 | US	125518	R2MW3TEBPWKENS	B00MZ6BR3Q	145562057	Monster High Haunted Student Spirits Porter Geiss Doll	Toys	5	0	0	N	Y	Five Stars	Love it great add to collecton	2015-08-31
29 | US	15048896	R3N01IESYEYW01	B001CTOC8O	278247652	Star Wars Clone Wars Clone Trooper Child's Deluxe Captain Rex Costume	Toys	5	0	1	N	Y	Five Stars	Exactly as described.  Fits my 7-yr old well!	2015-08-31
30 | US	12191231	RKLAK7EPEG5S6	B00BMKL5WY	906199996	LEGO Creative Tower Building Kit XXL 1600 Pieces 10664	Toys	5	1	2	N	Y	The children LOVED them and best part was that it helped the ...	Purchased these Lego's to help aid me with teaching my Sunday School class. The children LOVED them and best part was that it helped the children remember past lessons. The true Lego brand seem to work/snap together/fit together better than the generic brands. (Plus I had some Lego snobs in my class that only would use the real Lego brand and shunned the generic brand). Only wish Lego's were a little cheaper but you get what you pay for and I would recommend for quality to purchase this set.	2015-08-31
31 | US	18409006	R1HOJ5GOA2JWM0	B00L71H0F4	692305292	Barkology Princess the Poodle Hand Puppet	Toys	2	1	1	N	Y	My little dog can bite my hand right through the puupet.	IT's OK, but not as good as the old Bite Meez puppets. This puppet is very thin. My little yorkie often bites my hands right through the stuffing. It not durable enough to really play with.	2015-08-31
32 | US	42523709	RO5VL1EAPX6O3	B004CLZRM4	59085350	Intex Mesh Lounge (Colors May Vary)	Toys	1	0	0	N	Y	Save your money...don't buy!	This was to be a gift for my husband for our new pool. Did not receive the color I ordered but most of all after only one month of use (not continuously) the mesh pulled away from the material and the inflatable side. Completely shredded and no longer of use. It was stored properly and was not kept outside or in the pool. Poorly made, better off going to W**-M*** and getting something on clearance.	2015-08-31
33 | US	45601416	R3OSJU70OIBWVE	B000PEOMC8	895316207	Intex River Run I Sport Lounge, Inflatable Water Float, 53 in Diameter	Toys	5	0	0	N	Y	but I've bought one in the past and loved	Ended up sending this guy back because I didnt need it, but I've bought one in the past and loved it	2015-08-31
34 | US	47546726	R3NFZZCJSROBT4	B008W1BPWQ	397107238	Peppa Pig 7 Wood Puzzles In Wooden Storage Box (styles will vary)	Toys	3	0	0	N	Y	Three Stars	The product is good, but the box was broken.	2015-08-31
35 | US	21448082	R47XBGQFP039N	B00FZX71BI	480992295	Paraboard - Parallel Charging Board for Lipos with EC5 Connectors	Toys	5	0	0	N	Y	CAN SAVE TME IF YOU UNDESTAND HOW IT WORKS .	Works well ,quality product but this style of board will charge multiple batteries at the same time SAFELY  ( IF )  ALL BATTERIES ARE OF THE SAME CELL COUNT , THE SAME BATTERY COMPOSITION ( LIPO, NIMH-etc ) AND THEY MUST HAVE INDIVIDUAL CELL VOLTAGES THAT ARE VERY CLOSE AND EQUAL TO EACH BATTERY CONNECTED AT THE SAME TIME . When board is connected to most if not all chargers it can only read total and individual cell voltage of ONE OF THE BATTERIES AND MAY OVER OR UNDER CHARGE THE OTHERS TO SOME DEGREE , TOTAL RATE OF CHARGE IS DIVIDED EQUALLY BETWEEN BATTERIES CONNECTED AT THE SAME TIME . Close monitoring is a must  when using like all high discharge batteies . I  have only personally expeienced one lipo battery meltdown and it is a very  SHORT IF NOT  NON EXISTANT WINDOW OF OPPERUNITY TO PREVENT OR MINIMISE THE COLATTERAL DAMAGE ONCE THE PROCESS STARTS . Read and understand all charging and battery instructions .	2015-08-31
36 | US	12612039	R1JS8G26X4RM2G	B00D4NJSJE	408940178	The Game of Life Money and Asset Board Game, Fame Edition	Toys	5	0	1	N	Y	Five Stars	Great gift!	2015-08-31
37 | US	44928701	R1ORWPFQ9EDYA0	B000HZZT7W	967346376	LCR Dice Game (Red Chips)	Toys	5	0	0	N	Y	Five Stars	We play this game quite a bit with friends.	2015-08-31
38 | US	43173394	R1YIX4SO32U0GT	B002G54DAA	57447684	BCW - Deluxe Currency Slab - Regular Bill - Dollar / Currency Collecting Supplies	Toys	5	0	1	N	Y	BCW - Deluxe Currency Slab	Fits my $20 bill perfectly.	2015-08-31
39 | US	11210951	R1W3QQZ8JKECCI	B003JT0L4Y	876626440	Ocean Life Stamps Birthday Party Supplies Loot Bag Accessories 24 Pieces per Unit	Toys	5	0	0	N	Y	Fun for birthday party favor	I ordered these for my 3 year old son's birthday party as party favors. They were a huge hit & the perfect fit for a 3 year old!	2015-08-31
40 | US	12918717	RZX17JIYIPAR	B00KQUNNZ8	644368246	New Age Scare Halloween Party Pumpkin and Bat Hanging Round Lantern Decoration, Paper, 9" Pack of 3	Toys	5	0	0	N	N	Love the prints!	These paper lanterns are adorable! The colors are bright, the patterns are fun & trendy & they're a good size! They came well packaged & are easy to assemble, they're also really easy to take apart & flatten back down! We'll get to use them for a few Halloweens I'm sure! They even came with the string needed to hang! I'm glad I grabbed them, they're really cute, my kids are excited to add to the Halloween decor & are already asking to hang them!<br />I received this item at a discount for my unbiased review.	2015-08-31
41 | US	47781982	RIDVQ4P3WJR42	B00WTGGGRO	162262449	Pokemon - Double Dragon Energy (97/108) - XY Roaring Skies	Toys	5	1	1	N	Y	Five Stars	My Grandson loves these cards.  Thank you	2015-08-31
42 | US	34874898	R1WQ3ME3JAG2O1	B00WAKEQLW	824555589	Whiffer Sniffers Mystery Pack 1 Scented Backpack Clip	Toys	1	0	6	N	Y	One Star	Received a pineapple rather than the advertised s'more	2015-08-31
43 | US	20962528	RNTPOUDQIICBF	B00M5AT30G	548190970	AmiGami Fox and Owl Figure 2-Pack	Toys	4	0	0	N	Y	Four Stars	Christmas gift for 6yr	2015-08-31
44 | US	47781982	R3AHZWWOL0IAV0	B00GNDY40U	438056479	Pokemon - Gyarados (31/113) - Legendary Treasures	Toys	5	0	0	N	Y	Five Stars	My Grandson loves these cards.  Thank you	2015-08-31
45 | US	13328687	R3PDXKS9O2Z20B	B00WJ1OPMW	120071056	LeapFrog LeapTV Letter Factory Adventures Educational, Active Video Game	Toys	5	0	0	N	N	they LOVE this game	Even though both of my kids are at the top of this age recommendation level, they LOVE this game!  I love how it caters to the kinesthetic learner by asking them to move their bodies into the shape of the letters.  It even takes teamwork as sometimes two people are required to finish the letter.  My kids know all of their letter sounds and shapes, but this didn't stop them from playing the game over and over.	2015-08-31
46 | US	16245463	R23URALWA7IHWL	B00IGXV9UI	765869385	Disney Planes: Fire & Rescue Scoop & Spray Firefighter Dusty	Toys	5	0	0	N	Y	Five Stars	My 5 year old son loves this.	2015-08-31
47 | US	11916403	R36L8VKT9ZSUY6	B00JVY9J1M	771795950	Winston Zeddmore & Ecto-1: Funko POP! Rides x Ghostbusters Vinyl Figure	Toys	5	0	0	N	Y	Five Stars	love it	2015-08-31
48 | US	5543658	R23JRQR6VMY4TV	B008AL15M8	211944547	Yu-Gi-Oh! - Solemn Judgment (GLD5-EN045) - Gold Series: Haunted Mine - Limited Edition - Ghost/Gold Hybrid Rare	Toys	5	0	0	N	Y	Absolutely one of the best traps in the game	Absolutely one of the best traps in the game. It is never a dead and always live since you can always pay half your lifepoints for its cost. It's main power is that it can stop any card. Hopefully this card comes off the Forbidden/Limited list soon.	2015-08-31
49 | US	41168357	R3T73PQZZ9F6GT	B00CAEEDC0	72805974	Seat Pets Car Seat Toy	Toys	5	0	0	N	Y	Five Stars	really soft and cute	2015-08-31
50 | US	32866903	R300I65NW30Y19	B000TFLAZA	149264874	Baby Einstein Octoplush	Toys	5	0	0	N	Y	Five Stars	baby loved it - so attractive and very nice	2015-08-31
51 | 


--------------------------------------------------------------------------------
/Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/Lab 5 Kinesis.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Lab 5 (Part II): Ingesting Streaming Data with Kinesis\n",
  8 |     "### MACS 30123: Large-Scale Computing for the Social Sciences\n",
  9 |     "\n",
 10 |     "In this second part of the lab, we'll explore how we can use Kinesis to ingest streaming text data, of the sort we might encounter on Twitter.\n",
 11 |     "\n",
 12 |     "To avoid requiring you to set up Twitter API access, we will create Twitter-like text and metadata using the `testdata` package to perform this demonstration. It should be easy enough to plug your streaming Twitter feed into this workflow if you desire to do so as an individual exercise (for instance, as a part of a final project!). Additionally, once you have this pipeline running, you can scale it up even further to include many more producers and consumers, if you would like, as discussed in lecture and the readings.\n",
 13 |     "\n",
 14 |     "Recall from the lecture and readings that in a Kinesis workflow, \"producers\" send data into a Kinesis stream and \"consumers\" draw data out of that stream to perform operations on it (i.e. real-time processing, archiving raw data, etc.). To make this a bit more concrete, we are going to implement a simplified version of this workflow in this lab, in which we spin up Producer and Consumer (t2.nano) EC2 Instances and create a Kinesis stream. Our Producer instance will run a producer script (which writes our Twitter-like text data into a Kinesis stream) and our Consumer instance will run a consumer script (which reads the Twitter-like data and calculates a simple demo statistic -- the average unique word count per tweet, as a real-time running average).\n",
 15 |     "\n",
 16 |     "You can visualize this data pipeline, like so:\n",
 17 |     "\n",
 18 |     "<img src=\"simple_kinesis_architecture.png\" width=\"800\" align=\"left\" />\n"
 19 |    ]
 20 |   },
 21 |   {
 22 |    "cell_type": "markdown",
 23 |    "metadata": {},
 24 |    "source": [
 25 |     "To begin implementing this pipeline, let's import `boto3` and initialize the AWS services we'll be using in this lab (EC2 and Kinesis)."
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "code",
 30 |    "execution_count": 1,
 31 |    "metadata": {},
 32 |    "outputs": [],
 33 |    "source": [
 34 |     "import boto3\n",
 35 |     "import time\n",
 36 |     "\n",
 37 |     "session = boto3.Session()\n",
 38 |     "\n",
 39 |     "kinesis = session.client('kinesis')\n",
 40 |     "ec2 = session.resource('ec2')\n",
 41 |     "ec2_client = session.client('ec2')"
 42 |    ]
 43 |   },
 44 |   {
 45 |    "cell_type": "markdown",
 46 |    "metadata": {},
 47 |    "source": [
 48 |     "Then, we need to create the Kinesis stream that our Producer EC2 instance will write streaming tweets to. Because we're only setting this up to handle traffic from one consumer and one producer, we'll just use one shard, but we could increase our throughput capacity by increasing the ShardCount if we wanted to do so."
 49 |    ]
 50 |   },
 51 |   {
 52 |    "cell_type": "code",
 53 |    "execution_count": 2,
 54 |    "metadata": {},
 55 |    "outputs": [],
 56 |    "source": [
 57 |     "response = kinesis.create_stream(StreamName = 'test_stream',\n",
 58 |     "                                 ShardCount = 1\n",
 59 |     "                                )\n",
 60 |     "\n",
 61 |     "# Is the stream active and ready to be written to/read from? Wait until it exists before moving on:\n",
 62 |     "waiter = kinesis.get_waiter('stream_exists')\n",
 63 |     "waiter.wait(StreamName='test_stream')"
 64 |    ]
 65 |   },
 66 |   {
 67 |    "cell_type": "markdown",
 68 |    "metadata": {},
 69 |    "source": [
 70 |     "OK, now we're ready to set up our producer and consumer EC2 instances that will write to and read from this Kinesis stream. Let's spin up our two EC2 instances (specified by the `MaxCount` parameter) using one of the Amazon Linux AMIs.  Notice here that you will need to specify your `.pem` file for the `KeyName` parameter, as well as create a custom security group/group ID. Designating a security group is necessary because, by default, AWS does not allow inbound ssh traffic into EC2 instances (they create custom ssh-friendly security groups each time you run the GUI wizard in the console). Thus, if you don't set this parameter, you will not be able to ssh into the EC2 instances that you create here with `boto3`. You can follow along in the lab video for further instructions on how you can set up one of these security groups.\n",
 71 |     "\n",
 72 |     "Also, we need to specify an IAM Instance Profile so that our EC2 instances will have the permissions necessary to interact with other AWS services on our behalf. Here, I'm using one of the profiles we create in Part I of Lab 5 (a default AWS profile for launching EC2 instances within an EMR cluster), as this gives us all of the necessary permissions"
 73 |    ]
 74 |   },
 75 |   {
 76 |    "cell_type": "code",
 77 |    "execution_count": 3,
 78 |    "metadata": {},
 79 |    "outputs": [],
 80 |    "source": [
 81 |     "instances = ec2.create_instances(ImageId='ami-0915e09cc7ceee3ab',\n",
 82 |     "                                 MinCount=1,\n",
 83 |     "                                 MaxCount=2,\n",
 84 |     "                                 InstanceType='t2.micro',\n",
 85 |     "                                 KeyName='MACS_30123',\n",
 86 |     "                                 SecurityGroupIds=['sg-0e921f64abdfac4e6'],\n",
 87 |     "                                 SecurityGroups=['MACS_30123'],\n",
 88 |     "                                 IamInstanceProfile=\n",
 89 |     "                                     {'Name': 'EMR_EC2_DefaultRole'},\n",
 90 |     "                                )\n",
 91 |     "\n",
 92 |     "# Wait until EC2 instances are running before moving on\n",
 93 |     "waiter = ec2_client.get_waiter('instance_running')\n",
 94 |     "waiter.wait(InstanceIds=[instance.id for instance in instances])"
 95 |    ]
 96 |   },
 97 |   {
 98 |    "cell_type": "markdown",
 99 |    "metadata": {},
100 |    "source": [
101 |     "While we wait for these instances to start running, let's set up the Python scripts that we want to run on each instance. First of all, we have to define a script for our Producer instance, which continuously produces Twitter-like data using the `testdata` package and puts that data into our Kinesis stream."
102 |    ]
103 |   },
104 |   {
105 |    "cell_type": "code",
106 |    "execution_count": 4,
107 |    "metadata": {},
108 |    "outputs": [
109 |     {
110 |      "name": "stdout",
111 |      "output_type": "stream",
112 |      "text": [
113 |       "Overwriting producer.py\n"
114 |      ]
115 |     }
116 |    ],
117 |    "source": [
118 |     "%%file producer.py\n",
119 |     "\n",
120 |     "import boto3\n",
121 |     "import testdata\n",
122 |     "import json\n",
123 |     "\n",
124 |     "kinesis = boto3.client('kinesis', region_name='us-east-1')\n",
125 |     "\n",
126 |     "# Continously write Twitter-like data into Kinesis stream\n",
127 |     "while 1 == 1:\n",
128 |     "    test_tweet = {'username': testdata.get_username(),\n",
129 |     "                  'tweet':    testdata.get_ascii_words(280)\n",
130 |     "                  }\n",
131 |     "    kinesis.put_record(StreamName = \"test_stream\",\n",
132 |     "                       Data = json.dumps(test_tweet),\n",
133 |     "                       PartitionKey = \"partitionkey\"\n",
134 |     "                      )"
135 |    ]
136 |   },
137 |   {
138 |    "cell_type": "markdown",
139 |    "metadata": {},
140 |    "source": [
141 |     "Then, we can define a script for our Consumer instance that gets the latest tweet out of the stream, one at a time. After processing each tweet, we then print out the average unique word count per processed tweet as a running average, before jumping on to the next indexed tweet in our Kinesis stream shard to do the same thing for as long as our program is running."
142 |    ]
143 |   },
144 |   {
145 |    "cell_type": "code",
146 |    "execution_count": 5,
147 |    "metadata": {},
148 |    "outputs": [
149 |     {
150 |      "name": "stdout",
151 |      "output_type": "stream",
152 |      "text": [
153 |       "Overwriting consumer.py\n"
154 |      ]
155 |     }
156 |    ],
157 |    "source": [
158 |     "%%file consumer.py\n",
159 |     "\n",
160 |     "import boto3\n",
161 |     "import time\n",
162 |     "import json\n",
163 |     "\n",
164 |     "kinesis = boto3.client('kinesis', region_name='us-east-1')\n",
165 |     "\n",
166 |     "shard_it = kinesis.get_shard_iterator(StreamName = \"test_stream\",\n",
167 |     "                                     ShardId = 'shardId-000000000000',\n",
168 |     "                                     ShardIteratorType = 'LATEST'\n",
169 |     "                                     )[\"ShardIterator\"]\n",
170 |     "\n",
171 |     "i = 0\n",
172 |     "s = 0\n",
173 |     "while 1==1:\n",
174 |     "    out = kinesis.get_records(ShardIterator = shard_it,\n",
175 |     "                              Limit = 1\n",
176 |     "                             )\n",
177 |     "    for o in out['Records']:\n",
178 |     "        jdat = json.loads(o['Data'])\n",
179 |     "        s = s + len(set(jdat['tweet'].split()))\n",
180 |     "        i = i + 1\n",
181 |     "\n",
182 |     "    if i != 0:\n",
183 |     "        print(\"Average Unique Word Count Per Tweet: \" + str(s/i))\n",
184 |     "        print(\"Sample of Current Tweet: \" + jdat['tweet'][:20])\n",
185 |     "        print(\"\\n\")\n",
186 |     "        \n",
187 |     "    shard_it = out['NextShardIterator']\n",
188 |     "    time.sleep(0.2)"
189 |    ]
190 |   },
191 |   {
192 |    "cell_type": "markdown",
193 |    "metadata": {},
194 |    "source": [
195 |     "As our final preparation step, we'll grab all of the public DNS names of our instances (web addresses that you normally copy from the GUI console to manually ssh into  and record the names of our code files, so that we can easily ssh/scp into the instances and pass them our Python scripts to run."
196 |    ]
197 |   },
198 |   {
199 |    "cell_type": "code",
200 |    "execution_count": 6,
201 |    "metadata": {},
202 |    "outputs": [],
203 |    "source": [
204 |     "instance_dns = [instance.public_dns_name \n",
205 |     "                 for instance in ec2.instances.all() \n",
206 |     "                 if instance.state['Name'] == 'running'\n",
207 |     "               ]\n",
208 |     "\n",
209 |     "code = ['producer.py', 'consumer.py']"
210 |    ]
211 |   },
212 |   {
213 |    "cell_type": "markdown",
214 |    "metadata": {},
215 |    "source": [
216 |     "To copy our files over to our instances and programmatically run commands on them, we can use Python's `scp` and `paramiko` packages. You'll need to install these via `pip install paramiko scp` if you have not already done so."
217 |    ]
218 |   },
219 |   {
220 |    "cell_type": "code",
221 |    "execution_count": 7,
222 |    "metadata": {},
223 |    "outputs": [
224 |     {
225 |      "name": "stdout",
226 |      "output_type": "stream",
227 |      "text": [
228 |       "Requirement already satisfied: paramiko in /home/jclindaniel/anaconda3/envs/macs30123/lib/python3.7/site-packages (2.7.2)\r\n",
229 |       "Requirement already satisfied: scp in /home/jclindaniel/anaconda3/envs/macs30123/lib/python3.7/site-packages (0.13.3)\r\n",
230 |       "Requirement already satisfied: cryptography>=2.5 in /home/jclindaniel/anaconda3/envs/macs30123/lib/python3.7/site-packages (from paramiko) (3.4.6)\r\n",
231 |       "Requirement already satisfied: pynacl>=1.0.1 in /home/jclindaniel/anaconda3/envs/macs30123/lib/python3.7/site-packages (from paramiko) (1.4.0)\r\n",
232 |       "Requirement already satisfied: bcrypt>=3.1.3 in /home/jclindaniel/anaconda3/envs/macs30123/lib/python3.7/site-packages (from paramiko) (3.2.0)\r\n",
233 |       "Requirement already satisfied: cffi>=1.1 in /home/jclindaniel/anaconda3/envs/macs30123/lib/python3.7/site-packages (from bcrypt>=3.1.3->paramiko) (1.14.5)\r\n",
234 |       "Requirement already satisfied: six>=1.4.1 in /home/jclindaniel/anaconda3/envs/macs30123/lib/python3.7/site-packages (from bcrypt>=3.1.3->paramiko) (1.15.0)\r\n",
235 |       "Requirement already satisfied: pycparser in /home/jclindaniel/anaconda3/envs/macs30123/lib/python3.7/site-packages (from cffi>=1.1->bcrypt>=3.1.3->paramiko) (2.20)\r\n"
236 |      ]
237 |     }
238 |    ],
239 |    "source": [
240 |     "! pip install paramiko scp"
241 |    ]
242 |   },
243 |   {
244 |    "cell_type": "markdown",
245 |    "metadata": {},
246 |    "source": [
247 |     "Once we have `scp` and `paramiko` installed, we can copy our producer and consumer Python scripts over to the EC2 instances (designating our first EC2 instance in `instance_dns` as the producer and second EC2 instance as the consumer instance). If you have a slower (or more unstable) internet connection, you might need to increase the time.sleep() time in the code and try to run this code several times in order for it to fully run.\n",
248 |     "\n",
249 |     "Note that, on each instance, we install `boto3` (so that we can access Kinesis through our scripts) and then copy our producer/consumer Python code over to our producer/consumer EC2 instance via `scp`. After we've done this, we install the `testdata` package on the producer instance (which it needs in order to create fake tweets) and instruct it to run our Python producer script. This will write tweets into our Kinesis stream until we stop the script and terminate the producer EC2 instance.\n",
250 |     "\n",
251 |     "We could also instruct our consumer to get tweets from the stream immediately after this command and this would automatically collect and process the tweets according to the consumer.py script. For the purposes of this demonstration, though, we'll manually ssh into that instance and run the code from the terminal so that we can see the real-time consumption a bit more easily."
252 |    ]
253 |   },
254 |   {
255 |    "cell_type": "code",
256 |    "execution_count": 9,
257 |    "metadata": {},
258 |    "outputs": [
259 |     {
260 |      "name": "stdout",
261 |      "output_type": "stream",
262 |      "text": [
263 |       "Producer Instance is Running producer.py\n",
264 |       ".........................................\n",
265 |       "Connect to Consumer Instance by running: ssh -i \"MACS_30123.pem\" ec2-user@ec2-54-89-73-205.compute-1.amazonaws.com\n"
266 |      ]
267 |     }
268 |    ],
269 |    "source": [
270 |     "import paramiko\n",
271 |     "from scp import SCPClient\n",
272 |     "ssh_producer, ssh_consumer = paramiko.SSHClient(), paramiko.SSHClient()\n",
273 |     "\n",
274 |     "# Initialization of SSH tunnels takes a bit of time; otherwise get connection error on first attempt\n",
275 |     "time.sleep(5)\n",
276 |     "\n",
277 |     "# Install boto3 on each EC2 instance and Copy our producer/consumer code onto producer/consumer EC2 instances\n",
278 |     "instance = 0\n",
279 |     "stdin, stdout, stderr = [[None, None] for i in range(3)]\n",
280 |     "for ssh in [ssh_producer, ssh_consumer]:\n",
281 |     "    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())\n",
282 |     "    ssh.connect(instance_dns[instance],\n",
283 |     "                username = 'ec2-user',\n",
284 |     "                key_filename='/home/jclindaniel/MACS_30123.pem')\n",
285 |     "    \n",
286 |     "    with SCPClient(ssh.get_transport()) as scp:\n",
287 |     "        scp.put(code[instance])\n",
288 |     "    \n",
289 |     "    if instance == 0:\n",
290 |     "        stdin[instance], stdout[instance], stderr[instance] = \\\n",
291 |     "            ssh.exec_command(\"sudo pip install boto3 testdata\")\n",
292 |     "    else:\n",
293 |     "        stdin[instance], stdout[instance], stderr[instance] = \\\n",
294 |     "            ssh.exec_command(\"sudo pip install boto3\")\n",
295 |     "\n",
296 |     "    instance += 1\n",
297 |     "\n",
298 |     "# Block until Producer has installed boto3 and testdata, then start running Producer script:\n",
299 |     "producer_exit_status = stdout[0].channel.recv_exit_status() \n",
300 |     "if producer_exit_status == 0:\n",
301 |     "    ssh_producer.exec_command(\"python %s\" % code[0])\n",
302 |     "    print(\"Producer Instance is Running producer.py\\n.........................................\")\n",
303 |     "else:\n",
304 |     "    print(\"Error\", producer_exit_status)\n",
305 |     "\n",
306 |     "# Close ssh and show connection instructions for manual access to Consumer Instance\n",
307 |     "ssh_consumer.close; ssh_producer.close()\n",
308 |     "\n",
309 |     "print(\"Connect to Consumer Instance by running: ssh -i \\\"MACS_30123.pem\\\" ec2-user@%s\" % instance_dns[1])"
310 |    ]
311 |   },
312 |   {
313 |    "cell_type": "markdown",
314 |    "metadata": {},
315 |    "source": [
316 |     "If you run the command above (with the correct path to your actual `.pem` file), you should be inside your Consumer EC2 instance. If you run `python consumer.py`, you should also see a real-time count of the average number of unique words per tweet (along with a sample of the text in the most recent tweet), as in the screenshot:\n",
317 |     "\n",
318 |     "![](consumer_feed.png)\n",
319 |     "\n",
320 |     "Cool! Now we can scale this basic architecture up to perform any number of real-time data analyses, if we so desire. Also, if we execute our consumer code remotely via paramiko as well, the process will be entirely remote, so we don't need to keep any local resources running in order to keep streaming/processing real-time data.\n",
321 |     "\n",
322 |     "As a final note, when you are finished observing the real-time feed from your consumer instance, **be sure to terminate your EC2 instances and delete your Kinesis stream**. You don't want to be paying for these to run continuously! You can do so programmatically by running the following `boto3` code:"
323 |    ]
324 |   },
325 |   {
326 |    "cell_type": "code",
327 |    "execution_count": 10,
328 |    "metadata": {},
329 |    "outputs": [
330 |     {
331 |      "name": "stdout",
332 |      "output_type": "stream",
333 |      "text": [
334 |       "EC2 Instances Successfully Terminated\n",
335 |       "Kinesis Stream Successfully Deleted\n"
336 |      ]
337 |     }
338 |    ],
339 |    "source": [
340 |     "# Terminate EC2 Instances:\n",
341 |     "ec2_client.terminate_instances(InstanceIds=[instance.id for instance in instances])\n",
342 |     "\n",
343 |     "# Confirm that EC2 instances were terminated:\n",
344 |     "waiter = ec2_client.get_waiter('instance_terminated')\n",
345 |     "waiter.wait(InstanceIds=[instance.id for instance in instances])\n",
346 |     "print(\"EC2 Instances Successfully Terminated\")\n",
347 |     "\n",
348 |     "# Delete Kinesis Stream (if it currently exists):\n",
349 |     "try:\n",
350 |     "    response = kinesis.delete_stream(StreamName='test_stream')\n",
351 |     "except kinesis.exceptions.ResourceNotFoundException:\n",
352 |     "    pass\n",
353 |     "\n",
354 |     "# Confirm that Kinesis Stream was deleted:\n",
355 |     "waiter = kinesis.get_waiter('stream_not_exists')\n",
356 |     "waiter.wait(StreamName='test_stream')\n",
357 |     "print(\"Kinesis Stream Successfully Deleted\")"
358 |    ]
359 |   }
360 |  ],
361 |  "metadata": {
362 |   "kernelspec": {
363 |    "display_name": "Python 3",
364 |    "language": "python",
365 |    "name": "python3"
366 |   },
367 |   "language_info": {
368 |    "codemirror_mode": {
369 |     "name": "ipython",
370 |     "version": 3
371 |    },
372 |    "file_extension": ".py",
373 |    "mimetype": "text/x-python",
374 |    "name": "python",
375 |    "nbconvert_exporter": "python",
376 |    "pygments_lexer": "ipython3",
377 |    "version": "3.7.10"
378 |   }
379 |  },
380 |  "nbformat": 4,
381 |  "nbformat_minor": 4
382 | }
383 | 


--------------------------------------------------------------------------------
/Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/consumer.py:
--------------------------------------------------------------------------------
 1 | 
 2 | import boto3
 3 | import time
 4 | import json
 5 | 
 6 | kinesis = boto3.client('kinesis', region_name='us-east-1')
 7 | 
 8 | shard_it = kinesis.get_shard_iterator(StreamName = "test_stream",
 9 |                                      ShardId = 'shardId-000000000000',
10 |                                      ShardIteratorType = 'LATEST'
11 |                                      )["ShardIterator"]
12 | 
13 | i = 0
14 | s = 0
15 | while 1==1:
16 |     out = kinesis.get_records(ShardIterator = shard_it,
17 |                               Limit = 1
18 |                              )
19 |     for o in out['Records']:
20 |         jdat = json.loads(o['Data'])
21 |         s = s + len(set(jdat['tweet'].split()))
22 |         i = i + 1
23 | 
24 |     if i != 0:
25 |         print("Average Unique Word Count Per Tweet: " + str(s/i))
26 |         print("Sample of Current Tweet: " + jdat['tweet'][:20])
27 |         print("\n")
28 |         
29 |     shard_it = out['NextShardIterator']
30 |     time.sleep(0.2)
31 | 


--------------------------------------------------------------------------------
/Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/consumer_feed.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jonclindaniel/LargeScaleComputing_S21/1416f4c405dbe386ed8a0238f0eccaa7b087b1ca/Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/consumer_feed.png


--------------------------------------------------------------------------------
/Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/producer.py:
--------------------------------------------------------------------------------
 1 | 
 2 | import boto3
 3 | import testdata
 4 | import json
 5 | 
 6 | kinesis = boto3.client('kinesis', region_name='us-east-1')
 7 | 
 8 | # Continously write Twitter-like data into Kinesis stream
 9 | while 1 == 1:
10 |     test_tweet = {'username': testdata.get_username(),
11 |                   'tweet':    testdata.get_ascii_words(280)
12 |                   }
13 |     kinesis.put_record(StreamName = "test_stream",
14 |                        Data = json.dumps(test_tweet),
15 |                        PartitionKey = "partitionkey"
16 |                       )
17 | 


--------------------------------------------------------------------------------
/Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/simple_kinesis_architecture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jonclindaniel/LargeScaleComputing_S21/1416f4c405dbe386ed8a0238f0eccaa7b087b1ca/Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/simple_kinesis_architecture.png


--------------------------------------------------------------------------------
/Labs/Lab 6 PySpark EDA and ML/7M_PySpark_Midway.py:
--------------------------------------------------------------------------------
 1 | from pyspark.sql import SparkSession
 2 | from pyspark.sql.functions import *
 3 | 
 4 | # Start Spark Session
 5 | spark = SparkSession.builder.getOrCreate()
 6 | 
 7 | # Read data
 8 | data = spark.read.csv('/project2/macs30123/AWS_book_reviews/*.csv',
 9 |                       header='true',
10 |                       inferSchema='true')
11 | 
12 | # Recast columns to correct data type
13 | data = (data.withColumn('star_rating', col('star_rating').cast('int'))
14 |             .withColumn('total_votes', col('total_votes').cast('int'))
15 |             .withColumn('helpful_votes', col('helpful_votes').cast('int'))
16 |        )
17 | 
18 | # Summarize data by star_rating
19 | stars_votes = (data.groupBy('star_rating')
20 |                   .sum('total_votes', 'helpful_votes')
21 |                   .sort('star_rating', ascending=False)
22 |              )
23 | 
24 | # Drop rows with NaN values and then print out resulting data:
25 | stars_votes_clean = stars_votes.dropna()
26 | stars_votes_clean.show()
27 | 


--------------------------------------------------------------------------------
/Labs/Lab 6 PySpark EDA and ML/Local_Colab_Spark_Setup.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {
  6 |     "id": "yoPc0xms3gmI"
  7 |    },
  8 |    "source": [
  9 |     "<a href=\"https://colab.research.google.com/drive/1zIS7W6gldB9osDXPlLdJ5pd8p20Owevl?usp=sharing\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
 10 |    ]
 11 |   },
 12 |   {
 13 |    "cell_type": "markdown",
 14 |    "metadata": {
 15 |     "id": "qpaNU3ETh0Qc"
 16 |    },
 17 |    "source": [
 18 |     "\n",
 19 |     "\n",
 20 |     "# Setting up PySpark in a Colab Notebook\n",
 21 |     "\n",
 22 |     "You can run Spark both locally and on a cluster. Here, I'll demonstrate how you can set up Spark to run in a Colab notebook for debugging purposes.\n",
 23 |     "\n",
 24 |     "You can also set up Spark locally in a similar way if you want to take advantage of multiple CPU cores (and/or GPU) on your laptop (the setup will vary slightly, though, depending on your operating system and you'll need to figure out these specifics on your own; however, this setup does work in WSL for me if I run the follow bash script in my terminal window using `sudo`). This being said, this local option should be for testing purposes on sample datasets only. If you want to run big PySpark jobs, you will want to run these in an EMR notebook (with an EMR cluster as your backend) or on the Midway Cluster.\n",
 25 |     "\n",
 26 |     "First, we need to install Spark and PySpark, by running the following commands:"
 27 |    ]
 28 |   },
 29 |   {
 30 |    "cell_type": "code",
 31 |    "execution_count": null,
 32 |    "metadata": {
 33 |     "id": "R8f1D7wfaCgF"
 34 |    },
 35 |    "outputs": [],
 36 |    "source": [
 37 |     "%%bash\n",
 38 |     "apt-get update\n",
 39 |     "apt-get install -y openjdk-8-jdk-headless -qq > /dev/null\n",
 40 |     "\n",
 41 |     "wget -q \"https://downloads.apache.org/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz\" > /dev/null\n",
 42 |     "tar -xvf spark-3.1.1-bin-hadoop2.7.tgz > /dev/null\n",
 43 |     "\n",
 44 |     "pip install pyspark findspark"
 45 |    ]
 46 |   },
 47 |   {
 48 |    "cell_type": "markdown",
 49 |    "metadata": {
 50 |     "id": "fHjZLeId0nvR"
 51 |    },
 52 |    "source": [
 53 |     "OK, now that we have Spark, we need to set a path to it, so PySpark knows where to find it. We do this using the `os` Python library below.\n",
 54 |     "\n",
 55 |     "On my machine (WSL, Ubuntu 20.04), where I unpacked Spark in my home directory, this can be achieved with:\n",
 56 |     "```\n",
 57 |     "os.environ[\"SPARK_HOME\"] = \"/home/jclindaniel/spark-3.1.1-bin-hadoop2.7\"\n",
 58 |     "```\n",
 59 |     "\n",
 60 |     "In Colab, it is automatically downloaded to the `/content` directory, so we indicate that as its location here. Then, we run `findspark` to find Spark for us on the machine, and finally start up a SparkSession running on all available cores (`local[4]` means your code will run on 4 threads locally, `local[*]` means that your code will run as many threads as there are logical cores on your machine)."
 61 |    ]
 62 |   },
 63 |   {
 64 |    "cell_type": "code",
 65 |    "execution_count": 2,
 66 |    "metadata": {
 67 |     "id": "CljIupW0aE06"
 68 |    },
 69 |    "outputs": [],
 70 |    "source": [
 71 |     "# Set path to Spark\n",
 72 |     "import os\n",
 73 |     "os.environ[\"SPARK_HOME\"] = \"/content/spark-3.1.1-bin-hadoop2.7\"\n",
 74 |     "\n",
 75 |     "# Find Spark so that we can access session within our notebook\n",
 76 |     "import findspark\n",
 77 |     "findspark.init()\n",
 78 |     "\n",
 79 |     "# Start SparkSession on all available cores\n",
 80 |     "from pyspark.sql import SparkSession\n",
 81 |     "spark = SparkSession.builder.master(\"local[*]\").getOrCreate()"
 82 |    ]
 83 |   },
 84 |   {
 85 |    "cell_type": "markdown",
 86 |    "metadata": {
 87 |     "id": "VNTQOBLthDrC"
 88 |    },
 89 |    "source": [
 90 |     "Now that we've installed everything and set up our paths correctly, we can run (small) Spark jobs both in Colab notebooks and locally (for bigger jobs, you will want to run these jobs on an EMR cluster, though. Remember, for instance, that Google only allocates us one CPU core and up to one GPU for free)!\n",
 91 |     "\n",
 92 |     "Let's make sure our setup is working by doing couple of simple things with the pyspark.sql package on the Amazon Customer Review Sample Dataset."
 93 |    ]
 94 |   },
 95 |   {
 96 |    "cell_type": "code",
 97 |    "execution_count": null,
 98 |    "metadata": {
 99 |     "id": "fbXWBQfSAX8q"
100 |    },
101 |    "outputs": [],
102 |    "source": [
103 |     "! pip install wget\n",
104 |     "import wget\n",
105 |     "\n",
106 |     "wget.download('https://s3.amazonaws.com/amazon-reviews-pds/tsv/sample_us.tsv', 'sample_data/sample_us.tsv')"
107 |    ]
108 |   },
109 |   {
110 |    "cell_type": "code",
111 |    "execution_count": 4,
112 |    "metadata": {
113 |     "id": "KrXWEMxjeFx1"
114 |    },
115 |    "outputs": [],
116 |    "source": [
117 |     "# Read TSV file from default data download directory in Colab\n",
118 |     "data = spark.read.csv('sample_data/sample_us.tsv',\n",
119 |     "                      sep=\"\\t\",\n",
120 |     "                      header=True,\n",
121 |     "                      inferSchema=True)"
122 |    ]
123 |   },
124 |   {
125 |    "cell_type": "code",
126 |    "execution_count": 5,
127 |    "metadata": {
128 |     "colab": {
129 |      "base_uri": "https://localhost:8080/"
130 |     },
131 |     "id": "2qvOOIYqeWw9",
132 |     "outputId": "44f68fd3-ab6a-442a-a94d-7c41c37ed175"
133 |    },
134 |    "outputs": [
135 |     {
136 |      "name": "stdout",
137 |      "output_type": "stream",
138 |      "text": [
139 |       "root\n",
140 |       " |-- marketplace: string (nullable = true)\n",
141 |       " |-- customer_id: integer (nullable = true)\n",
142 |       " |-- review_id: string (nullable = true)\n",
143 |       " |-- product_id: string (nullable = true)\n",
144 |       " |-- product_parent: integer (nullable = true)\n",
145 |       " |-- product_title: string (nullable = true)\n",
146 |       " |-- product_category: string (nullable = true)\n",
147 |       " |-- star_rating: integer (nullable = true)\n",
148 |       " |-- helpful_votes: integer (nullable = true)\n",
149 |       " |-- total_votes: integer (nullable = true)\n",
150 |       " |-- vine: string (nullable = true)\n",
151 |       " |-- verified_purchase: string (nullable = true)\n",
152 |       " |-- review_headline: string (nullable = true)\n",
153 |       " |-- review_body: string (nullable = true)\n",
154 |       " |-- review_date: string (nullable = true)\n",
155 |       "\n"
156 |      ]
157 |     }
158 |    ],
159 |    "source": [
160 |     "data.printSchema()"
161 |    ]
162 |   },
163 |   {
164 |    "cell_type": "code",
165 |    "execution_count": 6,
166 |    "metadata": {
167 |     "colab": {
168 |      "base_uri": "https://localhost:8080/"
169 |     },
170 |     "id": "ngb25JINcUNr",
171 |     "outputId": "8f3f2dd2-b47d-449b-d6c0-34ead56d5e7d"
172 |    },
173 |    "outputs": [
174 |     {
175 |      "name": "stdout",
176 |      "output_type": "stream",
177 |      "text": [
178 |       "+-----------+----------------+\n",
179 |       "|star_rating|sum(total_votes)|\n",
180 |       "+-----------+----------------+\n",
181 |       "|          5|              13|\n",
182 |       "|          4|               3|\n",
183 |       "|          3|               8|\n",
184 |       "|          2|               2|\n",
185 |       "|          1|               8|\n",
186 |       "+-----------+----------------+\n",
187 |       "\n"
188 |      ]
189 |     }
190 |    ],
191 |    "source": [
192 |     "(data.groupBy('star_rating')\n",
193 |     "     .sum('total_votes')\n",
194 |     "     .sort('star_rating', ascending=False)\n",
195 |     "     .show()\n",
196 |     ")"
197 |    ]
198 |   }
199 |  ],
200 |  "metadata": {
201 |   "colab": {
202 |    "collapsed_sections": [],
203 |    "name": "Local/Colab Spark Setup",
204 |    "provenance": [],
205 |    "toc_visible": true
206 |   },
207 |   "kernelspec": {
208 |    "display_name": "Python 3",
209 |    "language": "python",
210 |    "name": "python3"
211 |   },
212 |   "language_info": {
213 |    "codemirror_mode": {
214 |     "name": "ipython",
215 |     "version": 3
216 |    },
217 |    "file_extension": ".py",
218 |    "mimetype": "text/x-python",
219 |    "name": "python",
220 |    "nbconvert_exporter": "python",
221 |    "pygments_lexer": "ipython3",
222 |    "version": "3.7.6"
223 |   }
224 |  },
225 |  "nbformat": 4,
226 |  "nbformat_minor": 1
227 | }
228 | 


--------------------------------------------------------------------------------
/Labs/Lab 6 PySpark EDA and ML/ec2_limit_increase.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jonclindaniel/LargeScaleComputing_S21/1416f4c405dbe386ed8a0238f0eccaa7b087b1ca/Labs/Lab 6 PySpark EDA and ML/ec2_limit_increase.png


--------------------------------------------------------------------------------
/Labs/Lab 6 PySpark EDA and ML/gpu_cluster_instructions.md:
--------------------------------------------------------------------------------
 1 | ## Launching a Spark GPU Cluster via AWS EMR
 2 | 
 3 | In order to launch a Spark GPU cluster via AWS EMR (and interact with it via EMR notebook), you will first need to create a personal AWS account. AWS doesn't currently allow us to launch these GPU clusters from within our free AWS educate accounts.
 4 | 
 5 | Once you have a personal account, you will need to request a limit increase on the number of GPU instances you can use from within the account (to follow along with these instructions, you should request 16 vCPUs on G-series EC2 instances), as displayed in the limit increase request form available in the EC2 dashboard:
 6 | 
 7 | ![](ec2_limit_increase.png)
 8 | 
 9 | Once AWS has increased your allowed number of GPU EC2 instances:
10 | 1. Log into your AWS Console with an IAM User account (you may encounter problems starting a PySpark kernel in your EMR Notebook if you do not complete this step).
11 | 2. Go to the EMR dashboard and click "Create Cluster," and then "Advanced"
12 | 3. In "Step 1: Software and Steps," Select EMR release label 6.2.0, and ensure that the only applications that are selected are Hadoop, Spark, Livy, and JupyterEnterpriseGateway.
13 | 4. Also in "Step 1: Software and Steps," but in the "Edit software settings" section, copy and paste everything from `my-configurations.json` (located in this directory) into the text box. Alternatively, you can upload the JSON to S3 and read it in.
14 | 5. In "Step 2: Hardware", select 1 Master node of style "m5.xlarge," 1 core node of type "g4dn.2xlarge," and 1 task node of type "g4dn.2xlarge" (for a total of 2 NVIDIA T4 GPUs in your cluster).
15 | 6. In "Step 3: General Cluster Settings," Enter bootstrap file location in the "Bootstrap Actions" section by clicking "Custom action" and then "Configure and add" (the necessary bootstrap file is available in a public S3 bucket at: `s3://macs30123/my-bootstrap-action.sh` and you can also modify the file and upload it to your own S3 bucket).
16 | 7. Accept all other defaults (but select an EC2 key pair/PEM if you would like to `ssh` into your cluster) and click the "Next" button until your cluster launches.
17 | 8. Then, once your cluster has launched, you can connect it to an EMR Notebook workspace like normal and run your PySpark code on a GPU cluster! Just as a heads-up these GPU instances each cost $0.75/hr, so be careful how long you run your cluster!
18 | 


--------------------------------------------------------------------------------
/Labs/Lab 6 PySpark EDA and ML/my-bootstrap-action.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | 
3 | set -ex
4 | 
5 | sudo chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct
6 | sudo chmod a+rwx -R /sys/fs/cgroup/devices
7 | 


--------------------------------------------------------------------------------
/Labs/Lab 6 PySpark EDA and ML/my-configurations.json:
--------------------------------------------------------------------------------
 1 | [
 2 | 	{
 3 | 		"Classification":"spark",
 4 | 		"Properties":{
 5 | 			"enableSparkRapids":"true"
 6 | 		}
 7 | 	},
 8 | 	{
 9 | 		"Classification":"yarn-site",
10 | 		"Properties":{
11 | 			"yarn.nodemanager.resource-plugins":"yarn.io/gpu",
12 | 			"yarn.resource-types":"yarn.io/gpu",
13 | 			"yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices":"auto",
14 | 			"yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables":"/usr/bin",
15 | 			"yarn.nodemanager.linux-container-executor.cgroups.mount":"true",
16 | 			"yarn.nodemanager.linux-container-executor.cgroups.mount-path":"/sys/fs/cgroup",
17 | 			"yarn.nodemanager.linux-container-executor.cgroups.hierarchy":"yarn",
18 | 			"yarn.nodemanager.container-executor.class":"org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor"
19 | 		}
20 | 	},
21 | 	{
22 | 		"Classification":"container-executor",
23 | 		"Properties":{
24 | 
25 | 		},
26 | 		"Configurations":[
27 | 			{
28 | 				"Classification":"gpu",
29 | 				"Properties":{
30 | 					"module.enabled":"true"
31 | 				}
32 | 			},
33 | 			{
34 | 				"Classification":"cgroups",
35 | 				"Properties":{
36 | 					"root":"/sys/fs/cgroup",
37 | 					"yarn-hierarchy":"yarn"
38 | 				}
39 | 			}
40 | 		]
41 | 	},
42 | 	{
43 |         "Classification":"spark-defaults",
44 |         "Properties":{
45 |         "spark.plugins":"com.nvidia.spark.SQLPlugin",
46 |         "spark.sql.sources.useV1SourceList":"",
47 |         "spark.executor.resource.gpu.discoveryScript":"/usr/lib/spark/scripts/gpu/getGpusResources.sh",
48 |         "spark.submit.pyFiles":"/usr/lib/spark/jars/xgboost4j-spark_3.0-1.0.0-0.2.0.jar",
49 |         "spark.executor.extraLibraryPath":"/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native",
50 |         "spark.rapids.sql.concurrentGpuTasks":"2",
51 |         "spark.executor.resource.gpu.amount":"1",
52 |         "spark.executor.cores":"8",
53 |         "spark.task.cpus ":"1",
54 |         "spark.task.resource.gpu.amount":"0.125",
55 |         "spark.rapids.memory.pinnedPool.size":"2G",
56 |         "spark.executor.memoryOverhead":"2G",
57 |         "spark.locality.wait":"0s",
58 |         "spark.sql.shuffle.partitions":"200",
59 |         "spark.sql.files.maxPartitionBytes":"256m",
60 |         "spark.sql.adaptive.enabled":"false"
61 |         }
62 | 	},
63 | 	{
64 | 		"Classification":"capacity-scheduler",
65 | 		"Properties":{
66 | 			"yarn.scheduler.capacity.resource-calculator":"org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
67 | 		}
68 | 	}
69 | ]
70 | 


--------------------------------------------------------------------------------
/Labs/Lab 6 PySpark EDA and ML/spark.sbatch:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | #SBATCH --job-name=spark
 4 | #SBATCH --partition=broadwl
 5 | #SBATCH --output=spark.out
 6 | #SBATCH --error=spark.err
 7 | #SBATCH --ntasks=10
 8 | 
 9 | module load python/anaconda-2019.03
10 | module load spark/2.4.3
11 | 
12 | export PYSPARK_DRIVER_PYTHON=/software/Anaconda3-2019.03-el7-x86_64/bin/python
13 | 
14 | spark-submit --master local[*] 7M_PySpark_Midway.py
15 | 


--------------------------------------------------------------------------------
/Labs/Lab 7 Exploring the Larger Spark Ecosystem/bootstrap:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | set -x -e
 3 | 
 4 | echo -e 'export PYSPARK_PYTHON=/usr/bin/python3
 5 | export HADOOP_CONF_DIR=/etc/hadoop/conf
 6 | export SPARK_JARS_DIR=/usr/lib/spark/jars
 7 | export SPARK_HOME=/usr/lib/spark' >> $HOME/.bashrc && source $HOME/.bashrc
 8 | 
 9 | sudo python3 -m pip install awscli boto spark-nlp
10 | 
11 | set +x
12 | exit 0
13 | 


--------------------------------------------------------------------------------
/Labs/Lab 7 Exploring the Larger Spark Ecosystem/spark_streaming_emr.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "id": "93541f0c",
  6 |    "metadata": {},
  7 |    "source": [
  8 |     "## Querying Streaming Spark DataFrames in an EMR Notebook\n",
  9 |     "\n",
 10 |     "In this notebook, we will read data from a modified version of the Kinesis stream from Lab 5 into a Spark streaming DataFrame. Once we've loaded our streaming DataFrame, we'll perform a simple query on it and write the results of our query to S3 for further analysis.\n",
 11 |     "\n",
 12 |     "We've modified the producer from Lab 5 to send tweet-like JSON data into our `test_stream` Kinesis stream in the form of `{\"username\": ..., \"age\": ..., \"num_followers\": ..., \"tweet\": ...}` (by adding additional test data for `age` and `num_followers`). If you're following along with the code in this notebook, be sure to use a similar producer script to put data into a Kinesis stream:\n",
 13 |     "\n",
 14 |     "```\n",
 15 |     "import boto3\n",
 16 |     "import testdata\n",
 17 |     "import json\n",
 18 |     "\n",
 19 |     "kinesis = boto3.client('kinesis', region_name='us-east-1')\n",
 20 |     "\n",
 21 |     "# Continously write Twitter-like data into Kinesis stream\n",
 22 |     "while 1 == 1:\n",
 23 |     "    test_tweet = {'username': testdata.get_username(),\n",
 24 |     "                  'age': testdata.get_int(18, 100),\n",
 25 |     "                  'num_followers': testdata.get_int(0, 10000),\n",
 26 |     "                  'tweet':    testdata.get_ascii_words(280)\n",
 27 |     "                  }\n",
 28 |     "    kinesis.put_record(StreamName = \"test_stream\",\n",
 29 |     "                       Data = json.dumps(test_tweet),\n",
 30 |     "                       PartitionKey = \"partitionkey\"\n",
 31 |     "                      )\n",
 32 |     "```\n",
 33 |     "\n",
 34 |     "First, let's add the [Spark Structured Streaming package](https://spark.apache.org/docs/2.4.7/structured-streaming-programming-guide.html) to our session configuration (we'll specifically add a version that makes it possible to interact with Kinesis streams):"
 35 |    ]
 36 |   },
 37 |   {
 38 |    "cell_type": "code",
 39 |    "execution_count": null,
 40 |    "id": "15f174dc",
 41 |    "metadata": {},
 42 |    "outputs": [],
 43 |    "source": [
 44 |     "%%configure -f\n",
 45 |     "{ \"conf\": {\"spark.jars.packages\": \"com.qubole.spark/spark-sql-kinesis_2.11/1.1.3-spark_2.4\" }}"
 46 |    ]
 47 |   },
 48 |   {
 49 |    "cell_type": "markdown",
 50 |    "id": "b7fd35cd",
 51 |    "metadata": {},
 52 |    "source": [
 53 |     "Then, we're ready to start reading from our Kinesis stream. For this demonstration, we'll start with the latest data in the stream, but we could get more granular if we would like to do so as well:"
 54 |    ]
 55 |   },
 56 |   {
 57 |    "cell_type": "code",
 58 |    "execution_count": 2,
 59 |    "id": "43beb294",
 60 |    "metadata": {},
 61 |    "outputs": [
 62 |     {
 63 |      "data": {
 64 |       "application/vnd.jupyter.widget-view+json": {
 65 |        "model_id": "4e7d52b9d2c543dfbe8fdd0faaeaee6c",
 66 |        "version_major": 2,
 67 |        "version_minor": 0
 68 |       },
 69 |       "text/plain": [
 70 |        "VBox()"
 71 |       ]
 72 |      },
 73 |      "metadata": {},
 74 |      "output_type": "display_data"
 75 |     },
 76 |     {
 77 |      "name": "stdout",
 78 |      "output_type": "stream",
 79 |      "text": [
 80 |       "Starting Spark application\n"
 81 |      ]
 82 |     },
 83 |     {
 84 |      "data": {
 85 |       "text/html": [
 86 |        "<table>\n",
 87 |        "<tr><th>ID</th><th>YARN Application ID</th><th>Kind</th><th>State</th><th>Spark UI</th><th>Driver log</th><th>Current session?</th></tr><tr><td>13</td><td>application_1620656441511_0014</td><td>pyspark</td><td>idle</td><td><a target=\"_blank\" href=\"http://ip-172-31-20-214.ec2.internal:20888/proxy/application_1620656441511_0014/\" class=\"emr-proxy-link\" emr-resource=\"j-3QJ7YCEW096AW\n",
 88 |        "\" application-id=\"application_1620656441511_0014\">Link</a></td><td><a target=\"_blank\" href=\"http://ip-172-31-20-214.ec2.internal:8042/node/containerlogs/container_1620656441511_0014_01_000001/livy\" >Link</a></td><td>✔</td></tr></table>"
 89 |       ],
 90 |       "text/plain": [
 91 |        "<IPython.core.display.HTML object>"
 92 |       ]
 93 |      },
 94 |      "metadata": {},
 95 |      "output_type": "display_data"
 96 |     },
 97 |     {
 98 |      "data": {
 99 |       "application/vnd.jupyter.widget-view+json": {
100 |        "model_id": "",
101 |        "version_major": 2,
102 |        "version_minor": 0
103 |       },
104 |       "text/plain": [
105 |        "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…"
106 |       ]
107 |      },
108 |      "metadata": {},
109 |      "output_type": "display_data"
110 |     },
111 |     {
112 |      "name": "stdout",
113 |      "output_type": "stream",
114 |      "text": [
115 |       "SparkSession available as 'spark'.\n"
116 |      ]
117 |     },
118 |     {
119 |      "data": {
120 |       "application/vnd.jupyter.widget-view+json": {
121 |        "model_id": "",
122 |        "version_major": 2,
123 |        "version_minor": 0
124 |       },
125 |       "text/plain": [
126 |        "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…"
127 |       ]
128 |      },
129 |      "metadata": {},
130 |      "output_type": "display_data"
131 |     },
132 |     {
133 |      "name": "stdout",
134 |      "output_type": "stream",
135 |      "text": [
136 |       "======================\n",
137 |       "DataFrame is streaming"
138 |      ]
139 |     }
140 |    ],
141 |    "source": [
142 |     "from pyspark.sql import SparkSession\n",
143 |     "from pyspark.sql.functions import from_json, col, json_tuple\n",
144 |     "import time\n",
145 |     "\n",
146 |     "stream_df = spark.readStream \\\n",
147 |     "                 .format('kinesis') \\\n",
148 |     "                 .option('streamName', 'test_stream') \\\n",
149 |     "                 .option('endpointUrl', 'https://kinesis.us-east-1.amazonaws.com')\\\n",
150 |     "                 .option('region', 'us-east-1') \\\n",
151 |     "                 .option('startingposition', 'LATEST')\\\n",
152 |     "                 .load()\n",
153 |     "\n",
154 |     "if stream_df.isStreaming:\n",
155 |     "    print('======================')\n",
156 |     "    print('DataFrame is streaming')"
157 |    ]
158 |   },
159 |   {
160 |    "cell_type": "markdown",
161 |    "id": "3dce2a28",
162 |    "metadata": {},
163 |    "source": [
164 |     "Now that we have our streaming DataFrame ready, let's use Spark SQL `select` and `where` methods to query our streaming DataFrame. We'll then write this data out to one of an S3 bucket (you'll need to specify your own and then append it with `/data` and `/checkpoints` directories to follow along). Individual CSVs will be produced for each set of data that is processed in a micro-batch."
165 |    ]
166 |   },
167 |   {
168 |    "cell_type": "code",
169 |    "execution_count": 3,
170 |    "id": "a697c944",
171 |    "metadata": {},
172 |    "outputs": [
173 |     {
174 |      "data": {
175 |       "application/vnd.jupyter.widget-view+json": {
176 |        "model_id": "80acf397249c4d71924d15b417143965",
177 |        "version_major": 2,
178 |        "version_minor": 0
179 |       },
180 |       "text/plain": [
181 |        "VBox()"
182 |       ]
183 |      },
184 |      "metadata": {},
185 |      "output_type": "display_data"
186 |     },
187 |     {
188 |      "data": {
189 |       "application/vnd.jupyter.widget-view+json": {
190 |        "model_id": "",
191 |        "version_major": 2,
192 |        "version_minor": 0
193 |       },
194 |       "text/plain": [
195 |        "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…"
196 |       ]
197 |      },
198 |      "metadata": {},
199 |      "output_type": "display_data"
200 |     }
201 |    ],
202 |    "source": [
203 |     "# start process of querying streaming data\n",
204 |     "query = stream_df.selectExpr('CAST(data AS STRING)', 'CAST(approximateArrivalTimestamp as TIMESTAMP)') \\\n",
205 |     "    .select('approximateArrivalTimestamp', \n",
206 |     "            json_tuple(col('data'), 'username', 'age', 'num_followers', 'tweet'\n",
207 |     "                      ).alias('username', 'age', 'num_followers', 'tweet')) \\\n",
208 |     "    .select('approximateArrivalTimestamp', 'username', 'age') \\\n",
209 |     "    .where('age > 35') \\\n",
210 |     "    .writeStream \\\n",
211 |     "    .queryName('counts') \\\n",
212 |     "    .outputMode('append') \\\n",
213 |     "    .format('csv') \\\n",
214 |     "    .option('path', 's3://mrjob-9caa69460249cdb9/data') \\\n",
215 |     "    .option('checkpointLocation','s3://mrjob-9caa69460249cdb9/checkpoints') \\\n",
216 |     "    .start()\n",
217 |     "\n",
218 |     "# let streaming query run for 15 seconds (and continue sending results to CSV in S3), then stop it\n",
219 |     "time.sleep(15)\n",
220 |     "\n",
221 |     "# Stop query; look at results of micro-batch queries in S3 bucket in `/data` directory\n",
222 |     "query.stop()"
223 |    ]
224 |   },
225 |   {
226 |    "cell_type": "markdown",
227 |    "id": "dc9c0e66",
228 |    "metadata": {},
229 |    "source": [
230 |     "Cool! If we take a look at one of our resulting CSVs over in our S3 bucket (see head below), we can see that it produces the expected results (a selection of columns from the streaming data that is filtered by age). This is a great way to quickly process streaming data!\n",
231 |     "\n",
232 |     "```\n",
233 |     "2021-05-10T22:03:40.787Z,Hailiejade,83\n",
234 |     "2021-05-10T22:03:40.824Z,Fischer,79\n",
235 |     "2021-05-10T22:03:40.866Z,Leonard,46\n",
236 |     "2021-05-10T22:03:40.902Z,Vasquez,65\n",
237 |     "2021-05-10T22:03:40.937Z,Porter,86\n",
238 |     "2021-05-10T22:03:40.978Z,Joan,56\n",
239 |     "```"
240 |    ]
241 |   }
242 |  ],
243 |  "metadata": {
244 |   "kernelspec": {
245 |    "display_name": "PySpark",
246 |    "language": "",
247 |    "name": "pysparkkernel"
248 |   },
249 |   "language_info": {
250 |    "codemirror_mode": {
251 |     "name": "python",
252 |     "version": 3
253 |    },
254 |    "mimetype": "text/x-python",
255 |    "name": "pyspark",
256 |    "pygments_lexer": "python3"
257 |   }
258 |  },
259 |  "nbformat": 4,
260 |  "nbformat_minor": 5
261 | }
262 | 


--------------------------------------------------------------------------------
/Labs/Lab 8 Parallel Computing with Dask/Part I - Dask on EMR/Dask on an EMR Cluster.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Dask on an AWS EMR Cluster\n",
  8 |     "\n",
  9 |     "This notebook is intended to be run on an AWS EMR cluster, configured using the steps listed in dask_bootstrap_workflow.md tutorial in this lab directory. The EMR cluster used in this tutorial has two worker m5.xlarge instances within it, each of which has 4 virtual CPU cores and 16 GB of memory (you're welcome to scale your cluster beyond this, though!). If you would like to learn more about working with Dask on EMR clusters, [check out the dask-yarn documentation](https://yarn.dask.org/en/latest/aws-emr.html)."
 10 |    ]
 11 |   },
 12 |   {
 13 |    "cell_type": "code",
 14 |    "execution_count": 7,
 15 |    "metadata": {},
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "from dask_yarn import YarnCluster\n",
 19 |     "from dask.distributed import Client"
 20 |    ]
 21 |   },
 22 |   {
 23 |    "cell_type": "code",
 24 |    "execution_count": 15,
 25 |    "metadata": {},
 26 |    "outputs": [
 27 |     {
 28 |      "name": "stderr",
 29 |      "output_type": "stream",
 30 |      "text": [
 31 |       "distributed.scheduler - INFO - Clear task state\n",
 32 |       "distributed.scheduler - INFO -   Scheduler at: tcp://172.31.21.225:33211\n",
 33 |       "distributed.scheduler - INFO -   dashboard at:                    :45301\n",
 34 |       "distributed.scheduler - INFO - Receive client connection: Client-d2dd3a2c-2534-11eb-b6af-632ed99d0685\n",
 35 |       "distributed.core - INFO - Starting established connection\n",
 36 |       "distributed.scheduler - INFO - Register worker <Worker 'tcp://172.31.21.182:38719', name: dask.worker_3, memory: 0, processing: 0>\n",
 37 |       "distributed.scheduler - INFO - Starting worker compute stream, tcp://172.31.21.182:38719\n",
 38 |       "distributed.core - INFO - Starting established connection\n",
 39 |       "distributed.scheduler - INFO - Register worker <Worker 'tcp://172.31.25.119:40597', name: dask.worker_6, memory: 0, processing: 0>\n",
 40 |       "distributed.scheduler - INFO - Starting worker compute stream, tcp://172.31.25.119:40597\n",
 41 |       "distributed.core - INFO - Starting established connection\n",
 42 |       "distributed.scheduler - INFO - Register worker <Worker 'tcp://172.31.25.119:37545', name: dask.worker_2, memory: 0, processing: 0>\n",
 43 |       "distributed.scheduler - INFO - Starting worker compute stream, tcp://172.31.25.119:37545\n",
 44 |       "distributed.core - INFO - Starting established connection\n",
 45 |       "distributed.scheduler - INFO - Register worker <Worker 'tcp://172.31.20.204:44979', name: dask.worker_1, memory: 0, processing: 0>\n",
 46 |       "distributed.scheduler - INFO - Starting worker compute stream, tcp://172.31.20.204:44979\n",
 47 |       "distributed.core - INFO - Starting established connection\n",
 48 |       "distributed.scheduler - INFO - Register worker <Worker 'tcp://172.31.20.204:45273', name: dask.worker_5, memory: 0, processing: 0>\n",
 49 |       "distributed.scheduler - INFO - Starting worker compute stream, tcp://172.31.20.204:45273\n",
 50 |       "distributed.core - INFO - Starting established connection\n",
 51 |       "distributed.scheduler - INFO - Register worker <Worker 'tcp://172.31.27.243:40497', name: dask.worker_0, memory: 0, processing: 0>\n",
 52 |       "distributed.scheduler - INFO - Starting worker compute stream, tcp://172.31.27.243:40497\n",
 53 |       "distributed.core - INFO - Starting established connection\n",
 54 |       "distributed.scheduler - INFO - Register worker <Worker 'tcp://172.31.27.243:43533', name: dask.worker_4, memory: 0, processing: 0>\n",
 55 |       "distributed.scheduler - INFO - Starting worker compute stream, tcp://172.31.27.243:43533\n",
 56 |       "distributed.core - INFO - Starting established connection\n"
 57 |      ]
 58 |     }
 59 |    ],
 60 |    "source": [
 61 |     "# Create a cluster where each worker has 1 cores and 4 GiB of memory:\n",
 62 |     "cluster = YarnCluster(environment=\"/home/hadoop/environment.tar.gz\",\n",
 63 |     "                      worker_vcores = 1,\n",
 64 |     "                      worker_memory = \"4GiB\"\n",
 65 |     "                      )\n",
 66 |     "\n",
 67 |     "# Scale cluster out to 8 such workers:\n",
 68 |     "cluster.scale(8)\n",
 69 |     "\n",
 70 |     "# Connect to the cluster (before proceeding, you should wait for workers to be registered by the dask scheduler, as below):\n",
 71 |     "client = Client(cluster)"
 72 |    ]
 73 |   },
 74 |   {
 75 |    "cell_type": "markdown",
 76 |    "metadata": {},
 77 |    "source": [
 78 |     "Once everyone is registered, we can see our workers, virtual cores, and the amount of memory that our cluster is using overall; we can adjust all of this as need be if it doesn't match our hardware well. You click the link below to show your task graphs and execution status of your code as you run it."
 79 |    ]
 80 |   },
 81 |   {
 82 |    "cell_type": "code",
 83 |    "execution_count": 18,
 84 |    "metadata": {},
 85 |    "outputs": [
 86 |     {
 87 |      "data": {
 88 |       "text/html": [
 89 |        "<table style=\"border: 2px solid white;\">\n",
 90 |        "<tr>\n",
 91 |        "<td style=\"vertical-align: top; border: 0px solid white\">\n",
 92 |        "<h3 style=\"text-align: left;\">Client</h3>\n",
 93 |        "<ul style=\"text-align: left; list-style: none; margin: 0; padding: 0;\">\n",
 94 |        "  <li><b>Scheduler: </b>tcp://172.31.21.225:33211</li>\n",
 95 |        "  <li><b>Dashboard: </b><a href='/proxy/45301/status' target='_blank'>/proxy/45301/status</a></li>\n",
 96 |        "</ul>\n",
 97 |        "</td>\n",
 98 |        "<td style=\"vertical-align: top; border: 0px solid white\">\n",
 99 |        "<h3 style=\"text-align: left;\">Cluster</h3>\n",
100 |        "<ul style=\"text-align: left; list-style:none; margin: 0; padding: 0;\">\n",
101 |        "  <li><b>Workers: </b>7</li>\n",
102 |        "  <li><b>Cores: </b>7</li>\n",
103 |        "  <li><b>Memory: </b>30.06 GB</li>\n",
104 |        "</ul>\n",
105 |        "</td>\n",
106 |        "</tr>\n",
107 |        "</table>"
108 |       ],
109 |       "text/plain": [
110 |        "<Client: 'tcp://172.31.21.225:33211' processes=7 threads=7, memory=30.06 GB>"
111 |       ]
112 |      },
113 |      "execution_count": 18,
114 |      "metadata": {},
115 |      "output_type": "execute_result"
116 |     }
117 |    ],
118 |    "source": [
119 |     "client"
120 |    ]
121 |   },
122 |   {
123 |    "cell_type": "markdown",
124 |    "metadata": {},
125 |    "source": [
126 |     "To start, let's do some simple Dask array operations to demonstrate how arrays and array operations are split up into equal chunks across our 8 workers:"
127 |    ]
128 |   },
129 |   {
130 |    "cell_type": "code",
131 |    "execution_count": 19,
132 |    "metadata": {},
133 |    "outputs": [
134 |     {
135 |      "data": {
136 |       "text/html": [
137 |        "<table>\n",
138 |        "<tr>\n",
139 |        "<td>\n",
140 |        "<table>\n",
141 |        "  <thead>\n",
142 |        "    <tr><td> </td><th> Array </th><th> Chunk </th></tr>\n",
143 |        "  </thead>\n",
144 |        "  <tbody>\n",
145 |        "    <tr><th> Bytes </th><td> 800 B </td> <td> 112 B </td></tr>\n",
146 |        "    <tr><th> Shape </th><td> (100,) </td> <td> (14,) </td></tr>\n",
147 |        "    <tr><th> Count </th><td> 8 Tasks </td><td> 8 Chunks </td></tr>\n",
148 |        "    <tr><th> Type </th><td> float64 </td><td> numpy.ndarray </td></tr>\n",
149 |        "  </tbody>\n",
150 |        "</table>\n",
151 |        "</td>\n",
152 |        "<td>\n",
153 |        "<svg width=\"170\" height=\"75\" style=\"stroke:rgb(0,0,0);stroke-width:1\" >\n",
154 |        "\n",
155 |        "  <!-- Horizontal lines -->\n",
156 |        "  <line x1=\"0\" y1=\"0\" x2=\"120\" y2=\"0\" style=\"stroke-width:2\" />\n",
157 |        "  <line x1=\"0\" y1=\"25\" x2=\"120\" y2=\"25\" style=\"stroke-width:2\" />\n",
158 |        "\n",
159 |        "  <!-- Vertical lines -->\n",
160 |        "  <line x1=\"0\" y1=\"0\" x2=\"0\" y2=\"25\" style=\"stroke-width:2\" />\n",
161 |        "  <line x1=\"16\" y1=\"0\" x2=\"16\" y2=\"25\" />\n",
162 |        "  <line x1=\"33\" y1=\"0\" x2=\"33\" y2=\"25\" />\n",
163 |        "  <line x1=\"50\" y1=\"0\" x2=\"50\" y2=\"25\" />\n",
164 |        "  <line x1=\"67\" y1=\"0\" x2=\"67\" y2=\"25\" />\n",
165 |        "  <line x1=\"84\" y1=\"0\" x2=\"84\" y2=\"25\" />\n",
166 |        "  <line x1=\"100\" y1=\"0\" x2=\"100\" y2=\"25\" />\n",
167 |        "  <line x1=\"117\" y1=\"0\" x2=\"117\" y2=\"25\" />\n",
168 |        "  <line x1=\"120\" y1=\"0\" x2=\"120\" y2=\"25\" style=\"stroke-width:2\" />\n",
169 |        "\n",
170 |        "  <!-- Colored Rectangle -->\n",
171 |        "  <polygon points=\"0.0,0.0 120.0,0.0 120.0,25.412616514582485 0.0,25.412616514582485\" style=\"fill:#ECB172A0;stroke-width:0\"/>\n",
172 |        "\n",
173 |        "  <!-- Text -->\n",
174 |        "  <text x=\"60.000000\" y=\"45.412617\" font-size=\"1.0rem\" font-weight=\"100\" text-anchor=\"middle\" >100</text>\n",
175 |        "  <text x=\"140.000000\" y=\"12.706308\" font-size=\"1.0rem\" font-weight=\"100\" text-anchor=\"middle\" transform=\"rotate(0,140.000000,12.706308)\">1</text>\n",
176 |        "</svg>\n",
177 |        "</td>\n",
178 |        "</tr>\n",
179 |        "</table>"
180 |       ],
181 |       "text/plain": [
182 |        "dask.array<ones, shape=(100,), dtype=float64, chunksize=(14,), chunktype=numpy.ndarray>"
183 |       ]
184 |      },
185 |      "execution_count": 19,
186 |      "metadata": {},
187 |      "output_type": "execute_result"
188 |     }
189 |    ],
190 |    "source": [
191 |     "import dask.array as da\n",
192 |     "\n",
193 |     "n = len(client.scheduler_info()['workers'])\n",
194 |     "a = da.ones(100, chunks=int(100/n))\n",
195 |     "a"
196 |    ]
197 |   },
198 |   {
199 |    "cell_type": "code",
200 |    "execution_count": 22,
201 |    "metadata": {
202 |     "scrolled": true
203 |    },
204 |    "outputs": [
205 |     {
206 |      "data": {
207 |       "text/plain": [
208 |        "100.0"
209 |       ]
210 |      },
211 |      "execution_count": 22,
212 |      "metadata": {},
213 |      "output_type": "execute_result"
214 |     }
215 |    ],
216 |    "source": [
217 |     "a.sum().compute()"
218 |    ]
219 |   },
220 |   {
221 |    "cell_type": "code",
222 |    "execution_count": 23,
223 |    "metadata": {},
224 |    "outputs": [
225 |     {
226 |      "data": {
227 |       "text/plain": [
228 |        "1.0000125081975038"
229 |       ]
230 |      },
231 |      "execution_count": 23,
232 |      "metadata": {},
233 |      "output_type": "execute_result"
234 |     }
235 |    ],
236 |    "source": [
237 |     "x = da.random.random((10000, 10000), chunks=(1000, 1000))\n",
238 |     "y = x + x.T\n",
239 |     "y.mean().compute()"
240 |    ]
241 |   },
242 |   {
243 |    "cell_type": "markdown",
244 |    "metadata": {},
245 |    "source": [
246 |     "We can also interact with large data sources in S3 via Dask DataFrames, using a lot of the familiar methods we employ in smaller scale applications in Pandas. Here, we read in 10GB of Amazon's customer book data and perform a few simple operations. If you take a look at the Dask task graph while this is running, you can see that our groupby and sum operations are being performed in parallel by our workers."
247 |    ]
248 |   },
249 |   {
250 |    "cell_type": "code",
251 |    "execution_count": 24,
252 |    "metadata": {},
253 |    "outputs": [],
254 |    "source": [
255 |     "import dask.dataframe as dd\n",
256 |     "\n",
257 |     "df = dd.read_parquet(\"s3://amazon-reviews-pds/parquet/product_category=Books/*.parquet\",\n",
258 |     "                     storage_options={'anon': True, 'use_ssl': False},\n",
259 |     "                     engine='fastparquet')"
260 |    ]
261 |   },
262 |   {
263 |    "cell_type": "code",
264 |    "execution_count": 25,
265 |    "metadata": {},
266 |    "outputs": [
267 |     {
268 |      "data": {
269 |       "text/html": [
270 |        "<div>\n",
271 |        "<style scoped>\n",
272 |        "    .dataframe tbody tr th:only-of-type {\n",
273 |        "        vertical-align: middle;\n",
274 |        "    }\n",
275 |        "\n",
276 |        "    .dataframe tbody tr th {\n",
277 |        "        vertical-align: top;\n",
278 |        "    }\n",
279 |        "\n",
280 |        "    .dataframe thead th {\n",
281 |        "        text-align: right;\n",
282 |        "    }\n",
283 |        "</style>\n",
284 |        "<table border=\"1\" class=\"dataframe\">\n",
285 |        "  <thead>\n",
286 |        "    <tr style=\"text-align: right;\">\n",
287 |        "      <th></th>\n",
288 |        "      <th>helpful_votes</th>\n",
289 |        "    </tr>\n",
290 |        "    <tr>\n",
291 |        "      <th>star_rating</th>\n",
292 |        "      <th></th>\n",
293 |        "    </tr>\n",
294 |        "  </thead>\n",
295 |        "  <tbody>\n",
296 |        "    <tr>\n",
297 |        "      <th>1</th>\n",
298 |        "      <td>10985502</td>\n",
299 |        "    </tr>\n",
300 |        "    <tr>\n",
301 |        "      <th>2</th>\n",
302 |        "      <td>5581929</td>\n",
303 |        "    </tr>\n",
304 |        "    <tr>\n",
305 |        "      <th>3</th>\n",
306 |        "      <td>7021927</td>\n",
307 |        "    </tr>\n",
308 |        "    <tr>\n",
309 |        "      <th>4</th>\n",
310 |        "      <td>11100563</td>\n",
311 |        "    </tr>\n",
312 |        "    <tr>\n",
313 |        "      <th>5</th>\n",
314 |        "      <td>44825468</td>\n",
315 |        "    </tr>\n",
316 |        "  </tbody>\n",
317 |        "</table>\n",
318 |        "</div>"
319 |       ],
320 |       "text/plain": [
321 |        "             helpful_votes\n",
322 |        "star_rating               \n",
323 |        "1                 10985502\n",
324 |        "2                  5581929\n",
325 |        "3                  7021927\n",
326 |        "4                 11100563\n",
327 |        "5                 44825468"
328 |       ]
329 |      },
330 |      "execution_count": 25,
331 |      "metadata": {},
332 |      "output_type": "execute_result"
333 |     }
334 |    ],
335 |    "source": [
336 |     "helpful_by_star = (df[['star_rating', 'helpful_votes']].groupby('star_rating')\n",
337 |     "                                                       .sum())\n",
338 |     "helpful_df = helpful_by_star.compute() # returns Pandas DataFrame\n",
339 |     "helpful_df"
340 |    ]
341 |   },
342 |   {
343 |    "cell_type": "markdown",
344 |    "metadata": {},
345 |    "source": [
346 |     "We can also plot and explore our data using standard Matplotlib plotting functionality:"
347 |    ]
348 |   },
349 |   {
350 |    "cell_type": "code",
351 |    "execution_count": 26,
352 |    "metadata": {},
353 |    "outputs": [
354 |     {
355 |      "data": {
356 |       "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWoAAAEPCAYAAABr4Y4KAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAAASZklEQVR4nO3df7CVdZ3A8fcnYMUWAkap1XC75KyWImLc0CwVtIANxnQmAgfT0mT9sSpt1uS4s+qUrbbMtqW5ruMqZo6Y2M5q1jZNhT9W0+6VK0loZsuut0wvKiCCxI/P/nEOcMWr94D3nPPl3vdrhuHec55zzmeeC28envM950RmIkkq19uaPYAk6c0ZakkqnKGWpMIZakkqnKGWpMIZakkqXN1CHRE3RsTzEfF4Ddt+IyI6qr9+ExGr6zWXJO1pol7rqCPiWGAd8J3MHLcLtzsfOCIzz6jLYJK0h6nbEXVm3ge82P2yiDgwIv4rItoj4v6IeF8PNz0FuK1ec0nSnmZwgx/veuDszHwqIo4ErgWO33ZlRLwHGAv8rMFzSVKxGhbqiBgGHA3cERHbLt5rp83mAIszc0uj5pKk0jXyiPptwOrMnPAm28wBzmvMOJK0Z2jY8rzMXAv8T0TMAoiKw7ddHxEHA6OAhxo1kyTtCeq5PO82KtE9OCI6I+JMYC5wZkQ8BiwHPtHtJqcAi9K385Ok16jb8jxJUt/wlYmSVDhDLUmFq8uqj3333TdbWlrqcdeS1C+1t7evyszRPV1Xl1C3tLTQ1tZWj7uWpH4pIv73ja7z1IckFc5QS1LhDLUkFa5hLyHftGkTnZ2dvPrqq416SPVi6NChjBkzhiFDhjR7FElvomGh7uzsZPjw4bS0tNDtTZnUJJnJCy+8QGdnJ2PHjm32OJLeRMNOfbz66qvss88+RroQEcE+++zj/3CkPUBDz1Eb6bL485D2DD6ZKEmFa/QnvGzX8uV7+vT+Vl45482vX7mSmTNn8vjjvX7WLgCXXXYZw4YN46KLLnrDbTZu3MiMGTNYtWoVF198MbNnz+5xu4ULF9LW1sY111xT02P3ZuHChUydOpX999+/T+5P2pP1dUt2V28NeiuaFur+YOnSpWzatImOjo6GPu7ChQsZN26coZYGiAF16mPLli2cddZZHHrooUydOpUNGzbw9NNPM336dCZOnMgxxxzDE0888brbTZ48mfnz53P00Uczbtw4HnnkEZ5//nlOPfVUOjo6mDBhAk8//TQtLS2sWrUKgLa2NiZPntzrTGvWrKGlpYWtW7cCsH79eg444IDt/wAcddRRjB8/npNPPpmXXnqJxYsX09bWxty5c5kwYQIbNmygvb2d4447jokTJzJt2jSeffZZAL71rW9xyCGHMH78eObMmdN3O1JSQw2oUD/11FOcd955LF++nJEjR3LnnXcyb948rr76atrb21mwYAHnnntuj7d95ZVXePDBB7n22ms544wzeOc738kNN9zAMcccQ0dHBwceeOBuzTRixAgOP/xw7r33XgDuvvtupk2bxpAhQzjttNO46qqrWLZsGYcddhiXX345n/zkJ2ltbeXWW2+lo6ODwYMHc/7557N48WLa29s544wzuOSSSwC48sorWbp0KcuWLeO6667bvZ0mqekG1KmPsWPHMmHCBAAmTpzIypUrefDBB5k1a9b2bTZu3NjjbU855RQAjj32WNauXcvq1av7bK7Zs2dz++23M2XKFBYtWsS5557LmjVrWL16NccddxwAp59++mvm3ObJJ5/k8ccf52Mf+xhQ+V/DfvvtB8D48eOZO3cuJ510EieddFKfzSupsQZUqPfaa8eHng8aNIjnnnuOkSNH1nSOeeelbD0tbRs8ePD2Uxi7sj75xBNP5OKLL+bFF1+kvb2d448/nnXr1tV028zk0EMP5aGHXv9Rk/fccw/33Xcfd911F1/5yldYvnw5gwcPqB+51C8MqFMfO3vHO97B2LFjueOOO4BK9B577LEet7399tsBeOCBBxgxYgQjRox43TYtLS20t7cDcOedd9Y8x7Bhw5g0aRIXXnghM2fOZNCgQYwYMYJRo0Zx//33A3DLLbdsP7oePnw4L7/8MgAHH3wwXV1d20O9adMmli9fztatW3nmmWeYMmUKX//611m9enXN8ZdUlqYdXtVzKcuuuPXWWznnnHP46le/yqZNm5gzZw6HH37467YbNWoURx99NGvXruXGG2/s8b4uvfRSzjzzTL72ta9x5JFH7tIcs2fPZtasWSxZsmT7ZTfffDNnn30269ev573vfS833XQTAJ/5zGc4++yz2XvvvXnooYdYvHgxF1xwAWvWrGHz5s3Mnz+fgw46iFNPPZU1a9aQmXz+859n5MiRuzSTpDLU5cNtW1tbc+cPDlixYgXvf//7+/yxGmHy5MksWLCA1tbWZo/S5/bkn4sE/WcddUS0Z2aPkRnQpz4kaU/gM0s16H464q244oortp8P32bWrFnbl9NJUk8MdQNdcsklRlnSLmvoqY96nA/X7vPnIe0ZGhbqoUOH8sILLxiHQmz74IChQ4c2exRJvWjYqY8xY8bQ2dlJV1dXox5Svdj2UVySytawUA8ZMsSPfJKk3eDyPEkqnKGWpMIZakkqXM2hjohBEbE0In5Qz4EkSa+1K0fUFwIr6jWIJKlnNYU6IsYAM4Ab6juOJGlntR5R/wvwJWBr/UaRJPWk11BHxEzg+cxs72W7eRHRFhFtvqhFkvpOLUfUHwZOjIiVwCLg+Ij47s4bZeb1mdmama2jR4/u4zElaeDqNdSZeXFmjsnMFmAO8LPMPLXuk0mSANdRS1Lxdum9PjJzCbCkLpNIknrkEbUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFa7XUEfE0Ih4JCIei4jlEXF5IwaTJFUMrmGbjcDxmbkuIoYAD0TEjzLzF3WeTZJEDaHOzATWVb8dUv2V9RxKkrRDTeeoI2JQRHQAzwM/ycyH6zqVJGm7mkKdmVsycwIwBpgUEeN23iYi5kVEW0S0dXV19fGYkjRw7dKqj8xcDSwBpvdw3fWZ2ZqZraNHj+6b6SRJNa36GB0RI6tf7w18FHiiznNJkqpqWfWxH3BzRAyiEvbvZeYP6juWJGmbWlZ9LAOOaMAskqQe+MpESSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSpcr6GOiAMi4ucRsSIilkfEhY0YTJJUMbiGbTYDX8jMRyNiONAeET/JzF/XeTZJEjUcUWfms5n5aPXrl4EVwLvrPZgkqWKXzlFHRAtwBPBwXaaRJL1OzaGOiGHAncD8zFzbw/XzIqItItq6urr6ckZJGtBqCnVEDKES6Vsz8/s9bZOZ12dma2a2jh49ui9nlKQBrZZVHwH8O7AiM/+5/iNJkrqr5Yj6w8CngeMjoqP66+N1nkuSVNXr8rzMfACIBswiSeqBr0yUpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkq3OBmDyBp17V8+Z5mjwDAyitnNHuEAaHXUEfEjcBM4PnMHFf/kfxDKEnd1XLqYyEwvc5zSJLeQK+hzsz7gBcbMIskqQc+mShJheuzUEfEvIhoi4i2rq6uvrpbSRrw+izUmXl9ZrZmZuvo0aP76m4lacDz1IckFa7XUEfEbcBDwMER0RkRZ9Z/LEnSNr2uo87MUxoxiCSpZ576kKTCGWpJKpyhlqTCGWpJKpyhlqTCGWpJKpyhlqTCGWpJKpyhlqTC+VFc2mP4yT8aqDyilqTCGWpJKpyhlqTCGWpJKpyhlqTCGWpJKpzL8wrnkjRJHlFLUuEMtSQVzlBLUuEMtSQVzlBLUuEMtSQVzlBLUuEMtSQVzlBLUuEMtSQVzlBLUuEMtSQVzlBLUuEMtSQVzlBLUuEMtSQVzlBLUuEMtSQVzlBLUuEMtSQVzlBLUuFqCnVETI+IJyPitxHx5XoPJUnaoddQR8Qg4NvAXwOHAKdExCH1HkySVFHLEfUk4LeZ+bvM/BOwCPhEfceSJG1TS6jfDTzT7fvO6mWSpAaIzHzzDSJmAdMy83PV7z8NTMrM83fabh4wr/rtwcCTfT/uLtkXWNXkGUrhvtjBfbGD+2KHEvbFezJzdE9XDK7hxp3AAd2+HwP8YeeNMvN64PrdGq8OIqItM1ubPUcJ3Bc7uC92cF/sUPq+qOXUxy+Bv4qIsRHxZ8Ac4K76jiVJ2qbXI+rM3BwRfwv8GBgE3JiZy+s+mSQJqO3UB5n5Q+CHdZ6lrxVzGqYA7osd3Bc7uC92KHpf9PpkoiSpuXwJuSQVzlBLUuEMdT8UEe+LiBMiYthOl09v1kzNEhGTIuKD1a8PiYi/i4iPN3uuZouI7zR7hlJExEeqfy6mNnuWN9Lvz1FHxGcz86Zmz9EoEXEBcB6wApgAXJiZ/1m97tHM/EATx2uoiLiUynvUDAZ+AhwJLAE+Cvw4M69o3nSNExE7L6cNYArwM4DMPLHhQzVRRDySmZOqX59F5e/LfwBTgbsz88pmzteTgRDq/8vMv2z2HI0SEb8CPpSZ6yKiBVgM3JKZ34yIpZl5RHMnbJzqvpgA7AX8ERiTmWsjYm/g4cwc38z5GiUiHgV+DdwAJJVQ30blNRFk5r3Nm67xuv89iIhfAh/PzK6I+HPgF5l5WHMnfL2alueVLiKWvdFVwLsaOUsBBmXmOoDMXBkRk4HFEfEeKvtjINmcmVuA9RHxdGauBcjMDRGxtcmzNVIrcCFwCfDFzOyIiA0DLdDdvC0iRlE59RuZ2QWQma9ExObmjtazfhFqKjGeBry00+UBPNj4cZrqjxExITM7AKpH1jOBG4HijhTq7E8R8fbMXA9M3HZhRIwABkyoM3Mr8I2IuKP6+3P0n7/7u2ME0E6lDxkRf5GZf6w+p1PkwUx/+WH9ABi2LU7dRcSShk/TXKcBrzkqyMzNwGkR8W/NGalpjs3MjbA9VtsMAU5vzkjNk5mdwKyImAGsbfY8zZKZLW9w1Vbg5AaOUrN+f45akvZ0Ls+TpMIZakkqnKGWpMIZahUrIuZHxNsb/JiTI+Lobt+fHRGnNXIGaWc+mahiRcRKoDUza/6IpIgYVF07/WbbDK6uhOnpusuAdZm5YFdmlerJUKsI1VeFfY/KR70NAu6g8gKNJ4FVmTklIv4V+CCwN7A4My+t3nYllXXiU4FrMnNRD/e/hMqa+g9T+YSi3wB/D/wZ8AIwt3q/vwC2AF3A+cAJVMNdvY+Hqbz8eiRwZmbeXz3qXwi8j8pL91uA8zKzrY92jwa4/rKOWnu+6cAfMnMGbH9RymeBKd2OqC/JzBcjYhDw04gYn5nbXpX6amZ+pJfHGJmZx1XvfxRwVGZmRHwO+FJmfiEirqPbEXVEnLDTfQzOzEnVN3a6lMr7hpwLvJSZ4yNiHNDxFvaD9Dqeo1YpfgV8NCKuiohjMnNND9t8qvq+FUuBQ4FDul13ew2P0X2bMcCPq+8H8sXq/dXi+9Xf26kcOQN8BFgEkJmPA2/0lgbSbjHUKkJm/obKy7x/BfxjRPxD9+sjYixwEXBC9c2U7gGGdtvklRoepvs2V1M5TXIY8Dc73deb2Vj9fQs7/kda5MuO1X8YahUhIvYH1mfmd4EFwAeAl4Hh1U3eQSW0ayLiXVTevvStGAH8vvp195eTd3/MWj0AfAoq73nNwHtPFdWZ56hVisOAf6q+q90m4BzgQ8CPIuLZ6pOJS4HlwO+A/36Lj3cZcEdE/J7KE4hjq5ffTeXdBj9B5cnEWlwL3Fx9F8elVE599HTqRtotrvqQ3qLqk5tDMvPViDgQ+ClwUGb+qcmjqZ/wiFp6694O/DwihlA5X32OkVZf8oha/UpEfJvKWunuvjmQPo5N/Y+hlqTCuepDkgpnqCWpcIZakgpnqCWpcIZakgr3/6h7+/dH/MSEAAAAAElFTkSuQmCC\n",
357 |       "text/plain": [
358 |        "<Figure size 432x288 with 1 Axes>"
359 |       ]
360 |      },
361 |      "metadata": {
362 |       "needs_background": "light"
363 |      },
364 |      "output_type": "display_data"
365 |     }
366 |    ],
367 |    "source": [
368 |     "%matplotlib inline\n",
369 |     "import matplotlib.pyplot as plt\n",
370 |     "\n",
371 |     "helpful_df.plot(kind=\"bar\");"
372 |    ]
373 |   }
374 |  ],
375 |  "metadata": {
376 |   "kernelspec": {
377 |    "display_name": "Python 3",
378 |    "language": "python",
379 |    "name": "python3"
380 |   },
381 |   "language_info": {
382 |    "codemirror_mode": {
383 |     "name": "ipython",
384 |     "version": 3
385 |    },
386 |    "file_extension": ".py",
387 |    "mimetype": "text/x-python",
388 |    "name": "python",
389 |    "nbconvert_exporter": "python",
390 |    "pygments_lexer": "ipython3",
391 |    "version": "3.7.6"
392 |   }
393 |  },
394 |  "nbformat": 4,
395 |  "nbformat_minor": 4
396 | }
397 | 


--------------------------------------------------------------------------------
/Labs/Lab 8 Parallel Computing with Dask/Part I - Dask on EMR/bootstrap-dask:
--------------------------------------------------------------------------------
  1 | #!/bin/bash
  2 | HELP="Usage: bootstrap-dask [OPTIONS]
  3 | 
  4 | Adapted from Example AWS EMR Bootstrap Action to install and configure Dask and
  5 | Jupyter:
  6 | https://github.com/dask/dask-yarn/blob/master/deployment_resources/aws-emr/bootstrap-dask
  7 | 
  8 | By default it does the following things:
  9 | - Installs miniconda
 10 | - Installs dask, distributed, dask-yarn, pyarrow, and s3fs. This list can be
 11 |   extended using the --conda-packages flag below.
 12 | - Packages this environment for distribution to the workers.
 13 | - Installs and starts a jupyter notebook server running on port 8888. This can
 14 |   be disabled with the --no-jupyter flag below.
 15 | 
 16 | Options:
 17 |     --jupyter / --no-jupyter    Whether to also install and start a Jupyter
 18 |                                 Notebook Server. Default is True.
 19 |     --password, -pw             Set the password for the Jupyter Notebook
 20 |                                 Server. Default is 'dask-user'.
 21 |     --conda-packages            Extra packages to install from conda.
 22 | "
 23 | 
 24 | set -e
 25 | 
 26 | # Parse Inputs. This is specific to this script, and can be ignored
 27 | # -----------------------------------------------------------------
 28 | JUPYTER_PASSWORD="dask-user"
 29 | EXTRA_CONDA_PACKAGES=""
 30 | JUPYTER="true"
 31 | 
 32 | while [[ $# -gt 0 ]]; do
 33 |     case $1 in
 34 |         -h|--help)
 35 |             echo "$HELP"
 36 |             exit 0
 37 |             ;;
 38 |         --no-jupyter)
 39 |             JUPYTER="false"
 40 |             shift
 41 |             ;;
 42 |         --jupyter)
 43 |             JUPYTER="true"
 44 |             shift
 45 |             ;;
 46 |         -pw|--password)
 47 |             JUPYTER_PASSWORD="$2"
 48 |             shift
 49 |             shift
 50 |             ;;
 51 |         --conda-packages)
 52 |             shift
 53 |             PACKAGES=()
 54 |             while [[ $# -gt 0 ]]; do
 55 |                 case $1 in
 56 |                     -*)
 57 |                         break
 58 |                         ;;
 59 |                     *)
 60 |                         PACKAGES+=($1)
 61 |                         shift
 62 |                         ;;
 63 |                 esac
 64 |             done
 65 |             EXTRA_CONDA_PACKAGES="${PACKAGES[@]}"
 66 |             ;;
 67 |         *)
 68 |             echo "error: unrecognized argument: $1"
 69 |             exit 2
 70 |             ;;
 71 |     esac
 72 | done
 73 | 
 74 | 
 75 | # -----------------------------------------------------------------------------
 76 | # 1. Check if running on the master node. If not, there's nothing do.
 77 | # -----------------------------------------------------------------------------
 78 | grep -q '"isMaster": true' /mnt/var/lib/info/instance.json \
 79 | || { echo "Not running on master node, nothing to do" && exit 0; }
 80 | 
 81 | 
 82 | # -----------------------------------------------------------------------------
 83 | # 2. Install Miniconda
 84 | # -----------------------------------------------------------------------------
 85 | echo "Installing Miniconda"
 86 | curl https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.3-Linux-x86_64.sh -o /tmp/miniconda.sh
 87 | bash /tmp/miniconda.sh -b -p $HOME/miniconda
 88 | rm /tmp/miniconda.sh
 89 | echo -e '\nexport PATH=$HOME/miniconda/bin:$PATH' >> $HOME/.bashrc
 90 | source $HOME/.bashrc
 91 | conda update conda -y
 92 | 
 93 | 
 94 | # -----------------------------------------------------------------------------
 95 | # 3. Install packages to use in packaged environment
 96 | #
 97 | # We install a few packages by default, and allow users to extend this list
 98 | # with a CLI flag:
 99 | #
100 | # - dask-yarn, for deploying Dask on YARN.
101 | # - pyarrow for working with hdfs, parquet, ORC, etc...
102 | # - s3fs for access to s3 (Dask DF requires version 0.4.0 or higher)
103 | # - conda-pack for packaging the environment for distribution
104 | # -----------------------------------------------------------------------------
105 | echo "Installing base packages"
106 | conda install \
107 | -c conda-forge \
108 | -y \
109 | -q \
110 | dask-yarn=0.9.0 \
111 | pyarrow=4.0.0 \
112 | s3fs=0.4.0 \
113 | conda-pack=0.6.0 \
114 | tornado=6.1 \
115 | $EXTRA_CONDA_PACKAGES
116 | 
117 | # -----------------------------------------------------------------------------
118 | # 4. Package the environment to be distributed to worker nodes
119 | # -----------------------------------------------------------------------------
120 | echo "Packaging environment"
121 | conda pack -q -o $HOME/environment.tar.gz
122 | 
123 | 
124 | # -----------------------------------------------------------------------------
125 | # 5. List all packages in the worker environment
126 | # -----------------------------------------------------------------------------
127 | echo "Packages installed in the worker environment:"
128 | conda list
129 | 
130 | 
131 | # -----------------------------------------------------------------------------
132 | # 6. Configure Dask
133 | #
134 | # This isn't necessary, but for this particular bootstrap script it will make a
135 | # few things easier:
136 | #
137 | # - Configure the cluster's dashboard link to show the proxied version through
138 | #   jupyter-server-proxy. This allows access to the dashboard with only an ssh
139 | #   tunnel to the notebook.
140 | #
141 | # - Specify the pre-packaged python environment, so users don't have to
142 | #
143 | # - Set the default deploy-mode to local, so the dashboard proxying works
144 | #
145 | # - Specify the location of the native libhdfs library so pyarrow can find it
146 | #   on the workers and the client (if submitting applications).
147 | # ------------------------------------------------------------------------------
148 | echo "Configuring Dask"
149 | mkdir -p $HOME/.config/dask
150 | cat <<EOT >> $HOME/.config/dask/config.yaml
151 | distributed:
152 |   dashboard:
153 |     link: "/proxy/{port}/status"
154 | 
155 | yarn:
156 |   environment: /home/hadoop/environment.tar.gz
157 |   deploy-mode: local
158 | 
159 |   worker:
160 |     env:
161 |       ARROW_LIBHDFS_DIR: /usr/lib/hadoop/lib/native/
162 | 
163 |   client:
164 |     env:
165 |       ARROW_LIBHDFS_DIR: /usr/lib/hadoop/lib/native/
166 | EOT
167 | # Also set ARROW_LIBHDFS_DIR in ~/.bashrc so it's set for the local user
168 | echo -e '\nexport ARROW_LIBHDFS_DIR=/usr/lib/hadoop/lib/native' >> $HOME/.bashrc
169 | 
170 | 
171 | # -----------------------------------------------------------------------------
172 | # 7. If Jupyter isn't requested, we're done
173 | # -----------------------------------------------------------------------------
174 | if [[ "$JUPYTER" == "false" ]]; then
175 |     exit 0
176 | fi
177 | 
178 | 
179 | # -----------------------------------------------------------------------------
180 | # 8. Install jupyter notebook server and dependencies
181 | #
182 | # We do this after packaging the worker environments to keep the tar.gz as
183 | # small as possible.
184 | #
185 | # We install the following packages:
186 | #
187 | # - notebook: the Jupyter Notebook Server
188 | # - ipywidgets: used to provide an interactive UI for the YarnCluster objects
189 | # - jupyter-server-proxy: used to proxy the dask dashboard through the notebook server
190 | # -----------------------------------------------------------------------------
191 | if [[ "$JUPYTER" == "true" ]]; then
192 |     echo "Installing Jupyter"
193 |     conda install \
194 |     -c conda-forge \
195 |     -y \
196 |     -q \
197 |     notebook \
198 |     ipywidgets \
199 |     jupyter-server-proxy
200 | fi
201 | 
202 | 
203 | # -----------------------------------------------------------------------------
204 | # 9. List all packages in the client environment
205 | # -----------------------------------------------------------------------------
206 | echo "Packages installed in the client environment:"
207 | conda list
208 | 
209 | 
210 | # -----------------------------------------------------------------------------
211 | # 10. Configure Jupyter Notebook
212 | # -----------------------------------------------------------------------------
213 | echo "Configuring Jupyter"
214 | mkdir -p $HOME/.jupyter
215 | HASHED_PASSWORD=`python -c "from notebook.auth import passwd; print(passwd('$JUPYTER_PASSWORD'))"`
216 | cat <<EOF >> $HOME/.jupyter/jupyter_notebook_config.py
217 | c.NotebookApp.password = u'$HASHED_PASSWORD'
218 | c.NotebookApp.open_browser = False
219 | c.NotebookApp.ip = '0.0.0.0'
220 | EOF
221 | 
222 | 
223 | # -----------------------------------------------------------------------------
224 | # 11. Define an upstart service for the Jupyter Notebook Server
225 | #
226 | # This sets the notebook server up to properly run as a background service.
227 | # -----------------------------------------------------------------------------
228 | echo "Configuring Jupyter Notebook Upstart Service"
229 | cat <<EOF > /tmp/jupyter-notebook.conf
230 | description "Jupyter Notebook Server"
231 | start on runlevel [2345]
232 | stop on runlevel [016]
233 | respawn
234 | respawn limit unlimited
235 | exec su - hadoop -c "jupyter notebook" >> /var/log/jupyter-notebook.log 2>&1
236 | EOF
237 | sudo mv /tmp/jupyter-notebook.conf /etc/init/
238 | 
239 | # -----------------------------------------------------------------------------
240 | # 12. Start the Jupyter Notebook Server
241 | # -----------------------------------------------------------------------------
242 | echo "Starting Jupyter Notebook Server"
243 | sudo initctl reload-configuration
244 | sudo initctl start jupyter-notebook
245 | 


--------------------------------------------------------------------------------
/Labs/Lab 8 Parallel Computing with Dask/Part I - Dask on EMR/dask_bootstrap_workflow.md:
--------------------------------------------------------------------------------
 1 | # Launching a Dask Cluster with AWS EMR
 2 | 
 3 | To launch a Dask-compatible EMR cluster, you should first upload the bootstrap-dask script provided in this lab directory to an S3 bucket that you have access to (I uploaded mine to the aws-emr-resources bucket that AWS automatically created for storing EMR notebooks for Labs 7 and 8). This bootstrap script installs Miniconda, Dask, and some other Python packages to your EMR cluster (you can add additional packages via the --bootstrap-actions "Args", as in the example below). It also launches a Jupyter Server on the cluster for you, so that you can use Dask in a Jupyter notebook environment.
 4 | 
 5 | Run the following on your local terminal (wherever you installed AWS CLI) to launch your cluster. Note that you should provide your unique S3 paths for the location that your logs will be written to (the --log-uri field) and the spot that you uploaded your bootstrap script. You should also provide the name of your PEM file in the --ec2-attributes "KeyName" field. Note as well that you can launch whatever types of instances and counts that you wish. Here, we launch 3 m5.xlarge instances in our cluster: 1 instance for the EMR "Master" and 2 worker instances.
 6 | 
 7 | ```
 8 | aws emr create-cluster --name "Dask-Cluster" \
 9 |     --release-label emr-5.29.0 \
10 |     --applications Name=Hadoop \
11 |     --log-uri s3://aws-logs-545721747693-us-east-1/elasticmapreduce/ \
12 |     --instance-type m5.xlarge \
13 |     --instance-count 3 \
14 |     --bootstrap-actions Path=s3://aws-emr-resources-545721747693-us-east-1/bootstrap-dask,Args="[--conda-packages,bokeh,fastparquet,python-snappy,snappy,matplotlib]" \
15 |     --use-default-roles \
16 |     --region us-east-1 \
17 |     --ec2-attributes '{"KeyName":"MACS_30123"}'
18 | ```
19 | 
20 | It can take up to ten minutes for your cluster to launch. While the cluster is launching, you'll want to adjust the permissions of the Security Group associated with your "Master" EC2 instance in the cluster so that it will allow you to ssh into it. To do so, you'll need to access the Summary tab of your EMR cluster in the AWS Console and scroll down to the "Security and access" header. Click the security group ID number that is beside the "Security groups for Master" description. On the next screen, click this same security group ID again (corresponding to the name "ElasticMapReduce-master"). Now click the "Edit inbound rules" button and add an inbound rule of type "SSH" and Source type "Anywhere" (you can also limit ssh access to your IP if you'd like for this to be more secure) to your cluster. Finally, click "save rules" to save your new inbound rule.
21 | 
22 | Once your cluster has finished bootstrapping and is running, you can issue the following command on your local machine terminal (assuming a Unix-style terminal) to forward port 8888 from the Jupyter server running on the cluster to your local port 8888 (entering the path to your PEM file along with the correct address for your master node, which is available after the "Master public DNS" header in your cluster's summary tab in the AWS EMR console):
23 | 
24 | ```
25 | ssh -N -L localhost:8888:localhost:8888 -i ~/YOUR_PEM.pem hadoop@YOUR_ADDRESS_HERE
26 | ```
27 | 
28 | Be sure to keep this local terminal window open to maintain your connection to the cluster. Now, if you go to `localhost:8888` in your web browser on your local machine, you can log in to your Dask cluster! The default password for the Jupyter server is `dask-user`. To test to see if Dask is correctly configured, try running the short "Dask on an EMR Cluster" Jupyter notebook provided here in this lab directory (just upload it from within your Jupyter window and run it). The notebook demonstrates how you can launch your cluster and perform basic Dask array and DataFrame operations that scale up the capabilities of Python libraries like NumPy and Pandas.
29 | 
30 | When you're done, remember to save your work and download anything you want to save back to your local machine (you can do this by clicking File>Download as>.ipynb), or wherever you are saving your files. When you have everything saved, remember to terminate your EMR cluster (you can do this via the AWS CLI or the AWS Console). You will be charged for as long as your EC2 instances are running.


--------------------------------------------------------------------------------
/Labs/Lab 9 Accelerating Dask/8W_Dask_Rapids.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |   "nbformat": 4,
  3 |   "nbformat_minor": 0,
  4 |   "metadata": {
  5 |     "colab": {
  6 |       "name": "8W_dask_rapids.ipynb",
  7 |       "provenance": [],
  8 |       "collapsed_sections": [],
  9 |       "toc_visible": true
 10 |     },
 11 |     "kernelspec": {
 12 |       "name": "python3",
 13 |       "display_name": "Python 3"
 14 |     },
 15 |     "language_info": {
 16 |       "name": "python"
 17 |     },
 18 |     "accelerator": "GPU"
 19 |   },
 20 |   "cells": [
 21 |     {
 22 |       "cell_type": "markdown",
 23 |       "metadata": {
 24 |         "id": "UhCHb7X_nOyq"
 25 |       },
 26 |       "source": [
 27 |         "# Accelerating Dask with GPUs (via RAPIDS)\n",
 28 |         "\n",
 29 |         "<a href=\"https://colab.research.google.com/drive/1UgKTV1ehX2Uh-JPRV9UO5ZYkNsyaLMJq?usp=sharing\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
 30 |         "\n",
 31 |         "We've seen in lecture how the [RAPIDS libraries](https://rapids.ai/) make it possible to accelerate common analytical workflows on GPUs using libraries like `cudf` (for GPU DataFrames) and `cuml` (for basic GPU machine learning operations on DataFrames). When your data gets especially large (e.g. exceeding the memory capacity of a single GPU) or your computations get especially cumbersome, Dask makes it possible to scale these workflows out even further -- distributing work out across a cluster of GPUs.\n",
 32 |         "\n",
 33 |         "In this notebook, we'll work on Colab (with a single GPU in our Dask GPU cluster) so that you can follow along. In AWS Educate, recall that we cannot create GPU clusters. However, this notebook should be runnable on multi-GPU EC2 instances and clusters (on AWS) if you use a personal account to request these resources.\n",
 34 |         "\n",
 35 |         "To run this notebook in Colab, let's make sure that we have a GPU allocated. In Colab, click \"Runtime\" > \"Change Runtime Type\" to confirm that your \"Hardware Accelerator\" type is \"GPU.\" [Note: if you are allocated a K80 GPU (like we have access to on Midway), you will not be able to run this notebook and you will need to terminate (and restart) your session until you receive a newer GPU].\n",
 36 |         "\n",
 37 |         "If we run the command below, you'll see the type of GPU that is being used:"
 38 |       ]
 39 |     },
 40 |     {
 41 |       "cell_type": "code",
 42 |       "metadata": {
 43 |         "colab": {
 44 |           "base_uri": "https://localhost:8080/"
 45 |         },
 46 |         "id": "8zKj9laOmtoq",
 47 |         "outputId": "f0a9a260-3cfa-447a-b1e2-0a4abd5515e2"
 48 |       },
 49 |       "source": [
 50 |         "!nvidia-smi"
 51 |       ],
 52 |       "execution_count": 1,
 53 |       "outputs": [
 54 |         {
 55 |           "output_type": "stream",
 56 |           "text": [
 57 |             "Tue May 18 19:47:23 2021       \n",
 58 |             "+-----------------------------------------------------------------------------+\n",
 59 |             "| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |\n",
 60 |             "|-------------------------------+----------------------+----------------------+\n",
 61 |             "| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |\n",
 62 |             "| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |\n",
 63 |             "|                               |                      |               MIG M. |\n",
 64 |             "|===============================+======================+======================|\n",
 65 |             "|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |\n",
 66 |             "| N/A   60C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |\n",
 67 |             "|                               |                      |                  N/A |\n",
 68 |             "+-------------------------------+----------------------+----------------------+\n",
 69 |             "                                                                               \n",
 70 |             "+-----------------------------------------------------------------------------+\n",
 71 |             "| Processes:                                                                  |\n",
 72 |             "|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |\n",
 73 |             "|        ID   ID                                                   Usage      |\n",
 74 |             "|=============================================================================|\n",
 75 |             "|  No running processes found                                                 |\n",
 76 |             "+-----------------------------------------------------------------------------+\n"
 77 |           ],
 78 |           "name": "stdout"
 79 |         }
 80 |       ]
 81 |     },
 82 |     {
 83 |       "cell_type": "markdown",
 84 |       "metadata": {
 85 |         "id": "o51ycJzBrHHm"
 86 |       },
 87 |       "source": [
 88 |         "Then, we need to install RAPIDS by running the following cell [Note: this will take some time (and take up much of your available disk-space), so be patient!]"
 89 |       ]
 90 |     },
 91 |     {
 92 |       "cell_type": "code",
 93 |       "metadata": {
 94 |         "colab": {
 95 |           "base_uri": "https://localhost:8080/"
 96 |         },
 97 |         "id": "pi7fFRHfnMNX",
 98 |         "outputId": "864223a0-8986-4970-d6dd-cd8812cc5f06"
 99 |       },
100 |       "source": [
101 |         "# Install RAPIDS\n",
102 |         "!git clone https://github.com/rapidsai/rapidsai-csp-utils.git\n",
103 |         "!bash rapidsai-csp-utils/colab/rapids-colab.sh stable\n",
104 |         "\n",
105 |         "import sys, os, shutil\n",
106 |         "\n",
107 |         "sys.path.append('/usr/local/lib/python3.7/site-packages/')\n",
108 |         "os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'\n",
109 |         "os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'\n",
110 |         "os.environ[\"CONDA_PREFIX\"] = \"/usr/local\"\n",
111 |         "for so in ['cudf', 'rmm', 'nccl', 'cuml', 'cugraph', 'xgboost', 'cuspatial']:\n",
112 |         "  fn = 'lib'+so+'.so'\n",
113 |         "  source_fn = '/usr/local/lib/'+fn\n",
114 |         "  dest_fn = '/usr/lib/'+fn\n",
115 |         "  if os.path.exists(source_fn):\n",
116 |         "    print(f'Copying {source_fn} to {dest_fn}')\n",
117 |         "    shutil.copyfile(source_fn, dest_fn)\n",
118 |         "if not os.path.exists('/usr/lib64'):\n",
119 |         "    os.makedirs('/usr/lib64')\n",
120 |         "for so_file in os.listdir('/usr/local/lib'):\n",
121 |         "  if 'libstdc' in so_file:\n",
122 |         "    shutil.copyfile('/usr/local/lib/'+so_file, '/usr/lib64/'+so_file)\n",
123 |         "    shutil.copyfile('/usr/local/lib/'+so_file, '/usr/lib/x86_64-linux-gnu/'+so_file)"
124 |       ],
125 |       "execution_count": 1,
126 |       "outputs": [
127 |       ]
128 |     },
129 |     {
130 |       "cell_type": "markdown",
131 |       "metadata": {
132 |         "id": "ws9pAFs2ER4o"
133 |       },
134 |       "source": [
135 |         "OK, with RAPIDS installed, let's use `dask_cuda`'s API to launch a GPU cluster and pass this cluster object to our `dask.distributed` client. `LocalCUDACluster()` will count each available GPU in our cluster (in this case, 1 GPU) as a Dask worker and assign it work."
136 |       ]
137 |     },
138 |     {
139 |       "cell_type": "code",
140 |       "metadata": {
141 |         "id": "L2BX7q1CnDgG"
142 |       },
143 |       "source": [
144 |         "from dask_cuda import LocalCUDACluster\n",
145 |         "from dask.distributed import Client\n",
146 |         "\n",
147 |         "cluster = LocalCUDACluster() # Identify all available GPUs\n",
148 |         "client = Client(cluster)"
149 |       ],
150 |       "execution_count": 2,
151 |       "outputs": []
152 |     },
153 |     {
154 |       "cell_type": "markdown",
155 |       "metadata": {
156 |         "id": "0GkgjBR7FPsG"
157 |       },
158 |       "source": [
159 |         "From here, we can use `dask_cudf` to automate the process of partitioning our data across our GPU workers and instantiating a GPU-based DataFrame on our GPU that we can work with. Let's load in the same AirBnB data that we were working with in the `numba` + `dask` CPU demonstration:"
160 |       ]
161 |     },
162 |     {
163 |       "cell_type": "code",
164 |       "metadata": {
165 |         "colab": {
166 |           "base_uri": "https://localhost:8080/",
167 |           "height": 450
168 |         },
169 |         "id": "BND3eVWathF_",
170 |         "outputId": "8046f29a-76a8-4a0a-af88-104fae24248b"
171 |       },
172 |       "source": [
173 |         "import dask_cudf\n",
174 |         "\n",
175 |         "df = dask_cudf.read_csv('listings*.csv')\n",
176 |         "df.head()"
177 |       ],
178 |       "execution_count": 3,
179 |       "outputs": [
180 |         {
181 |           "output_type": "execute_result",
182 |           "data": {
183 |             "text/html": [
184 |               "<div>\n",
185 |               "<style scoped>\n",
186 |               "    .dataframe tbody tr th:only-of-type {\n",
187 |               "        vertical-align: middle;\n",
188 |               "    }\n",
189 |               "\n",
190 |               "    .dataframe tbody tr th {\n",
191 |               "        vertical-align: top;\n",
192 |               "    }\n",
193 |               "\n",
194 |               "    .dataframe thead th {\n",
195 |               "        text-align: right;\n",
196 |               "    }\n",
197 |               "</style>\n",
198 |               "<table border=\"1\" class=\"dataframe\">\n",
199 |               "  <thead>\n",
200 |               "    <tr style=\"text-align: right;\">\n",
201 |               "      <th></th>\n",
202 |               "      <th>id</th>\n",
203 |               "      <th>name</th>\n",
204 |               "      <th>host_id</th>\n",
205 |               "      <th>host_name</th>\n",
206 |               "      <th>neighbourhood_group</th>\n",
207 |               "      <th>neighbourhood</th>\n",
208 |               "      <th>latitude</th>\n",
209 |               "      <th>longitude</th>\n",
210 |               "      <th>room_type</th>\n",
211 |               "      <th>price</th>\n",
212 |               "      <th>minimum_nights</th>\n",
213 |               "      <th>number_of_reviews</th>\n",
214 |               "      <th>last_review</th>\n",
215 |               "      <th>reviews_per_month</th>\n",
216 |               "      <th>calculated_host_listings_count</th>\n",
217 |               "      <th>availability_365</th>\n",
218 |               "    </tr>\n",
219 |               "  </thead>\n",
220 |               "  <tbody>\n",
221 |               "    <tr>\n",
222 |               "      <th>0</th>\n",
223 |               "      <td>3781</td>\n",
224 |               "      <td>HARBORSIDE-Walk to subway</td>\n",
225 |               "      <td>4804</td>\n",
226 |               "      <td>Frank</td>\n",
227 |               "      <td>&lt;NA&gt;</td>\n",
228 |               "      <td>East Boston</td>\n",
229 |               "      <td>42.36413</td>\n",
230 |               "      <td>-71.02991</td>\n",
231 |               "      <td>Entire home/apt</td>\n",
232 |               "      <td>125</td>\n",
233 |               "      <td>32</td>\n",
234 |               "      <td>19</td>\n",
235 |               "      <td>2021-02-26</td>\n",
236 |               "      <td>0.27</td>\n",
237 |               "      <td>1</td>\n",
238 |               "      <td>106</td>\n",
239 |               "    </tr>\n",
240 |               "    <tr>\n",
241 |               "      <th>1</th>\n",
242 |               "      <td>6695</td>\n",
243 |               "      <td>$99 Special!! Home Away! Condo</td>\n",
244 |               "      <td>8229</td>\n",
245 |               "      <td>Terry</td>\n",
246 |               "      <td>&lt;NA&gt;</td>\n",
247 |               "      <td>Roxbury</td>\n",
248 |               "      <td>42.32802</td>\n",
249 |               "      <td>-71.09387</td>\n",
250 |               "      <td>Entire home/apt</td>\n",
251 |               "      <td>169</td>\n",
252 |               "      <td>29</td>\n",
253 |               "      <td>115</td>\n",
254 |               "      <td>2019-11-02</td>\n",
255 |               "      <td>0.81</td>\n",
256 |               "      <td>4</td>\n",
257 |               "      <td>40</td>\n",
258 |               "    </tr>\n",
259 |               "    <tr>\n",
260 |               "      <th>2</th>\n",
261 |               "      <td>10813</td>\n",
262 |               "      <td>Back Bay Apt-blocks to subway, Newbury St, The...</td>\n",
263 |               "      <td>38997</td>\n",
264 |               "      <td>Michelle</td>\n",
265 |               "      <td>&lt;NA&gt;</td>\n",
266 |               "      <td>Back Bay</td>\n",
267 |               "      <td>42.35061</td>\n",
268 |               "      <td>-71.08787</td>\n",
269 |               "      <td>Entire home/apt</td>\n",
270 |               "      <td>96</td>\n",
271 |               "      <td>29</td>\n",
272 |               "      <td>5</td>\n",
273 |               "      <td>2020-12-02</td>\n",
274 |               "      <td>0.08</td>\n",
275 |               "      <td>11</td>\n",
276 |               "      <td>307</td>\n",
277 |               "    </tr>\n",
278 |               "    <tr>\n",
279 |               "      <th>3</th>\n",
280 |               "      <td>10986</td>\n",
281 |               "      <td>North End (Waterfront area)  CLOSE TO MGH &amp; SU...</td>\n",
282 |               "      <td>38997</td>\n",
283 |               "      <td>Michelle</td>\n",
284 |               "      <td>&lt;NA&gt;</td>\n",
285 |               "      <td>North End</td>\n",
286 |               "      <td>42.36377</td>\n",
287 |               "      <td>-71.05206</td>\n",
288 |               "      <td>Entire home/apt</td>\n",
289 |               "      <td>96</td>\n",
290 |               "      <td>29</td>\n",
291 |               "      <td>2</td>\n",
292 |               "      <td>2016-05-23</td>\n",
293 |               "      <td>0.03</td>\n",
294 |               "      <td>11</td>\n",
295 |               "      <td>293</td>\n",
296 |               "    </tr>\n",
297 |               "    <tr>\n",
298 |               "      <th>4</th>\n",
299 |               "      <td>13247</td>\n",
300 |               "      <td>Back Bay studio apartment</td>\n",
301 |               "      <td>51637</td>\n",
302 |               "      <td>Susan</td>\n",
303 |               "      <td>&lt;NA&gt;</td>\n",
304 |               "      <td>Back Bay</td>\n",
305 |               "      <td>42.35164</td>\n",
306 |               "      <td>-71.08752</td>\n",
307 |               "      <td>Entire home/apt</td>\n",
308 |               "      <td>75</td>\n",
309 |               "      <td>91</td>\n",
310 |               "      <td>0</td>\n",
311 |               "      <td>&lt;NA&gt;</td>\n",
312 |               "      <td>&lt;NA&gt;</td>\n",
313 |               "      <td>2</td>\n",
314 |               "      <td>0</td>\n",
315 |               "    </tr>\n",
316 |               "  </tbody>\n",
317 |               "</table>\n",
318 |               "</div>"
319 |             ],
320 |             "text/plain": [
321 |               "      id  ... availability_365\n",
322 |               "0   3781  ...              106\n",
323 |               "1   6695  ...               40\n",
324 |               "2  10813  ...              307\n",
325 |               "3  10986  ...              293\n",
326 |               "4  13247  ...                0\n",
327 |               "\n",
328 |               "[5 rows x 16 columns]"
329 |             ]
330 |           },
331 |           "metadata": {
332 |             "tags": []
333 |           },
334 |           "execution_count": 3
335 |         }
336 |       ]
337 |     },
338 |     {
339 |       "cell_type": "markdown",
340 |       "metadata": {
341 |         "id": "y4ikXLjh4RRW"
342 |       },
343 |       "source": [
344 |         "Once we have that data, we can perform many of the standard DataFrame operations we perform on CPUs -- just accelerated by our GPU cluster!"
345 |       ]
346 |     },
347 |     {
348 |       "cell_type": "code",
349 |       "metadata": {
350 |         "colab": {
351 |           "base_uri": "https://localhost:8080/"
352 |         },
353 |         "id": "GdtSdyF0u8xs",
354 |         "outputId": "cd6b5fde-cf8f-483b-ddce-13b4a33066df"
355 |       },
356 |       "source": [
357 |         "df.groupby(['neighbourhood', 'room_type']) \\\n",
358 |         "  .price \\\n",
359 |         "  .mean() \\\n",
360 |         "  .compute()"
361 |       ],
362 |       "execution_count": 4,
363 |       "outputs": [
364 |         {
365 |           "output_type": "execute_result",
366 |           "data": {
367 |             "text/plain": [
368 |               "neighbourhood   room_type      \n",
369 |               "North Center    Private room        75.818182\n",
370 |               "Ashburn         Entire home/apt    100.857143\n",
371 |               "Edgewater       Entire home/apt    140.142857\n",
372 |               "South Lawndale  Entire home/apt     79.826087\n",
373 |               "Auburn Gresham  Entire home/apt    135.000000\n",
374 |               "                                      ...    \n",
375 |               "Lakeshore       Entire home/apt    205.500000\n",
376 |               "Brighton Park   Shared room         39.000000\n",
377 |               "Lake View       Hotel room         656.400000\n",
378 |               "North Beach     Shared room         31.900000\n",
379 |               "Clearing        Entire home/apt     90.000000\n",
380 |               "Name: price, Length: 341, dtype: float64"
381 |             ]
382 |           },
383 |           "metadata": {
384 |             "tags": []
385 |           },
386 |           "execution_count": 4
387 |         }
388 |       ]
389 |     },
390 |     {
391 |       "cell_type": "markdown",
392 |       "metadata": {
393 |         "id": "mXoQOrdD-vz9"
394 |       },
395 |       "source": [
396 |         "One thing to note, though, is that not all of the functionality we might expect out of CPU clusters is available yet in the `cudf`/`dask_cudf` DataFrame implementation.\n",
397 |         "\n",
398 |         "For instance (and of particular note!), our ability to apply custom functions is still pretty limited. `cudf` uses Numba's CUDA compiler to translate this code for the GPU and [many standard `numpy` operations are not supported](https://numba.pydata.org/numba-doc/dev/cuda/cudapysupported.html#numpy-support) (for instance, if you try to apply the distance calculation with performed in the Numba+Dask CPU demonstration notebook for today, this will fail to compile correctly for the GPU).\n",
399 |         "\n",
400 |         "That being said, we can perform many base-Python operations inside of custom functions, so if you can express your custom functions in this way, it might be worth your while to do this work on a GPU. For example, let's create a custom price index that indicates whether an AirBnB is \"Cheap\" (0), \"Moderately Expensive\" (1), or \"Very Expensive\" (2) using `cudf`'s [`apply_rows` method](https://docs.rapids.ai/api/cudf/stable/guide-to-udfs.html#DataFrame-UDFs):\n"
401 |       ]
402 |     },
403 |     {
404 |       "cell_type": "code",
405 |       "metadata": {
406 |         "colab": {
407 |           "base_uri": "https://localhost:8080/",
408 |           "height": 195
409 |         },
410 |         "id": "DOo_LkCI0fhl",
411 |         "outputId": "8c3f2db9-d4de-4d5a-e1f8-a9fc144ac6b6"
412 |       },
413 |       "source": [
414 |         "def expensive(x, price_index):\n",
415 |         "    # passed through Numba's CUDA compiler and auto-parallelized for GPU\n",
416 |         "    # for loop is automatically parallelized\n",
417 |         "    for i, price in enumerate(x):\n",
418 |         "        if price < 50:\n",
419 |         "            price_index[i] = 0\n",
420 |         "        elif price < 100:\n",
421 |         "            price_index[i] = 1\n",
422 |         "        else:\n",
423 |         "            price_index[i] = 2\n",
424 |         "\n",
425 |         "# Use cudf's `apply_rows` API for applying function to every row in DataFrame\n",
426 |         "df = df.apply_rows(expensive,\n",
427 |         "                   incols={'price':'x'},\n",
428 |         "                   outcols={'price_index': int})\n",
429 |         "\n",
430 |         "# Confirm that price index created correctly\n",
431 |         "df[['price', 'price_index']].head()"
432 |       ],
433 |       "execution_count": 5,
434 |       "outputs": [
435 |         {
436 |           "output_type": "execute_result",
437 |           "data": {
438 |             "text/html": [
439 |               "<div>\n",
440 |               "<style scoped>\n",
441 |               "    .dataframe tbody tr th:only-of-type {\n",
442 |               "        vertical-align: middle;\n",
443 |               "    }\n",
444 |               "\n",
445 |               "    .dataframe tbody tr th {\n",
446 |               "        vertical-align: top;\n",
447 |               "    }\n",
448 |               "\n",
449 |               "    .dataframe thead th {\n",
450 |               "        text-align: right;\n",
451 |               "    }\n",
452 |               "</style>\n",
453 |               "<table border=\"1\" class=\"dataframe\">\n",
454 |               "  <thead>\n",
455 |               "    <tr style=\"text-align: right;\">\n",
456 |               "      <th></th>\n",
457 |               "      <th>price</th>\n",
458 |               "      <th>price_index</th>\n",
459 |               "    </tr>\n",
460 |               "  </thead>\n",
461 |               "  <tbody>\n",
462 |               "    <tr>\n",
463 |               "      <th>0</th>\n",
464 |               "      <td>125</td>\n",
465 |               "      <td>2</td>\n",
466 |               "    </tr>\n",
467 |               "    <tr>\n",
468 |               "      <th>1</th>\n",
469 |               "      <td>169</td>\n",
470 |               "      <td>2</td>\n",
471 |               "    </tr>\n",
472 |               "    <tr>\n",
473 |               "      <th>2</th>\n",
474 |               "      <td>96</td>\n",
475 |               "      <td>1</td>\n",
476 |               "    </tr>\n",
477 |               "    <tr>\n",
478 |               "      <th>3</th>\n",
479 |               "      <td>96</td>\n",
480 |               "      <td>1</td>\n",
481 |               "    </tr>\n",
482 |               "    <tr>\n",
483 |               "      <th>4</th>\n",
484 |               "      <td>75</td>\n",
485 |               "      <td>1</td>\n",
486 |               "    </tr>\n",
487 |               "  </tbody>\n",
488 |               "</table>\n",
489 |               "</div>"
490 |             ],
491 |             "text/plain": [
492 |               "   price  price_index\n",
493 |               "0    125            2\n",
494 |               "1    169            2\n",
495 |               "2     96            1\n",
496 |               "3     96            1\n",
497 |               "4     75            1"
498 |             ]
499 |           },
500 |           "metadata": {
501 |             "tags": []
502 |           },
503 |           "execution_count": 5
504 |         }
505 |       ]
506 |     },
507 |     {
508 |       "cell_type": "markdown",
509 |       "metadata": {
510 |         "id": "IB5kFM5C3TTC"
511 |       },
512 |       "source": [
513 |         "In addition to preprocessing and analyzing data on GPUs, we can also train (a limited set of) Machine Learning models directly on our GPU cluster using the `cuml` library in the RAPIDS ecoystem as well. \n",
514 |         "\n",
515 |         "For instance, let's train a linear regression model based on our data from San Francisco, Chicago, and Boston to predict the price of an AirBnB based on other values in its listing information (e.g. \"reviews per month\" and \"minimum nights\"). We'll then use this model to make predictions about the price of AirBnBs in another city (NYC):"
516 |       ]
517 |     },
518 |     {
519 |       "cell_type": "code",
520 |       "metadata": {
521 |         "id": "3frzCDx-Pkzb"
522 |       },
523 |       "source": [
524 |         "from cuml.dask.linear_model import LinearRegression\n",
525 |         "import numpy as np\n",
526 |         "\n",
527 |         "X = df[['reviews_per_month', 'minimum_nights']].astype(np.float32).dropna()\n",
528 |         "y = df[['price']].astype(np.float32).dropna()\n",
529 |         "fit = LinearRegression().fit(X, y)"
530 |       ],
531 |       "execution_count": 6,
532 |       "outputs": []
533 |     },
534 |     {
535 |       "cell_type": "markdown",
536 |       "metadata": {
537 |         "id": "fJR31Zss2VgY"
538 |       },
539 |       "source": [
540 |         "Then, we can read in the NYC dataset and make predictions about what prices will be in NYC on the basis of the model we trained on data from our three original cities:"
541 |       ]
542 |     },
543 |     {
544 |       "cell_type": "code",
545 |       "metadata": {
546 |         "colab": {
547 |           "base_uri": "https://localhost:8080/"
548 |         },
549 |         "id": "7C18rvfJDjM_",
550 |         "outputId": "f8483688-d395-40ce-959a-c23858a4d1e4"
551 |       },
552 |       "source": [
553 |         "df_nyc = dask_cudf.read_csv('test*.csv')\n",
554 |         "X_test = df_nyc[['reviews_per_month', 'minimum_nights']].astype(np.float32) \\\n",
555 |         "                                                        .dropna()\n",
556 |         "fit.predict(X_test) \\\n",
557 |         "   .compute() \\\n",
558 |         "   .head()"
559 |       ],
560 |       "execution_count": 7,
561 |       "outputs": [
562 |         {
563 |           "output_type": "execute_result",
564 |           "data": {
565 |             "text/plain": [
566 |               "0    184.802887\n",
567 |               "1    188.286636\n",
568 |               "2    184.802887\n",
569 |               "3    183.658218\n",
570 |               "4    186.646774\n",
571 |               "dtype: float32"
572 |             ]
573 |           },
574 |           "metadata": {
575 |             "tags": []
576 |           },
577 |           "execution_count": 7
578 |         }
579 |       ]
580 |     }
581 |   ]
582 | }
583 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Large-Scale Computing for the Social Sciences
  2 | ### Spring 2021 - MACS 30123/MAPS 30123/PLSC 30123
  3 | 
  4 | | Instructor Information       | **TA Information**      | **TA Information**     | Course Information     |
  5 | | :-------------               | :-------------          | :-------------         | :------------          |
  6 | | Jon Clindaniel               | Yongfei Lu              | Luxin Tian             | Location: [Online](https://canvas.uchicago.edu/courses/34598)   |
  7 | | 1155 E. 60th Street, Rm. 215 |                         |                        | Monday/Wednesday       |
  8 | | jclindaniel@uchicago.edu     | yongfeilu@uchicago.edu | luxintian@uchicago.edu | 9:10-11:10 AM (CDT)  |
  9 | | **Office Hours:** [Schedule](https://appoint.ly/s/jclindaniel/office-hours)\* | **Office Hours:** [Schedule](https://appoint.ly/s/yongfeilu/20-min)\* | **Office Hours:** [Schedule](https://appoint.ly/s/luxintian/macs30123)\*| **Lab:** Prerecorded, [Online](https://canvas.uchicago.edu/courses/34598) |
 10 | 
 11 | \* Office Hours held via Zoom
 12 | 
 13 | ## Course Description
 14 | Computational social scientists increasingly need to grapple with data that is either too big for a local machine and/or code that is too resource intensive to process on a local machine. In this course, students will learn how to effectively scale their computational methods beyond their local machines. The focus of the course will be social scientific applications, ranging from training machine learning models on large economic time series to processing and analyzing social media data in real-time. Students will be introduced to several large-scale computing frameworks such as MPI, MapReduce, Spark, Dask, and OpenCL, with a special emphasis on employing these frameworks using cloud resources and the Python programming language.
 15 | 
 16 | *Prerequisites: MACS 30121 and MACS 30122, or equivalent (e.g. CAPP 30121 and CAPP 30122).*
 17 | 
 18 | ## Course Structure
 19 | This course is structured into several modules, or overarching thematic learning units, focused on teaching students fundamental large-scale computing concepts, as well as giving them the opportunity to apply these concepts to Computational Social Science research problems. Students can access the content in these modules on the [Canvas course site](https://canvas.uchicago.edu/courses/34598). Each module consists of a series of asynchronous lectures, readings, and assignments, which will make up a portion of the instruction time each week. If students have any questions about the asynchronous content, they should post these questions in the Ed Discussion forum for the course, which they can access by clicking the "Ed Discussion" tab on the left side of the screen on the Canvas course site. To see an overall schedule and syllabus for the course, as well as access additional course-related files, please visit the GitHub Course Repository, available here.
 20 | 
 21 | Additionally, students will attend short, interactive Zoom sessions during regular class hours (starting at 9:10 AM CDT) on Mondays and Wednesdays meant to give them the opportunity to discuss and apply the skills they have learned asynchronously and receive live instructor feedback. Attendance to the synchronous class sessions is mandatory and is an important component of the final course grade. Students can access the Zoom sessions by clicking on the "Zoom" tab on the left side of the screen on the Canvas course site. Students should prepare for these classes by reading the assigned readings ahead of every session. All readings are available online and are linked in the course schedule below.
 22 | 
 23 | Students will also virtually participate in a hands-on Python lab section each week, meant to instill practical large-scale computing skills related to the week’s topic. These labs are accessible on the Canvas course site and related code will be posted here in this GitHub repository. In order to practice these large-scale computing skills and complete the course assignments, students will be given free access to UChicago's [Midway Cluster](https://rcc.uchicago.edu/docs/), [Amazon Web Services (AWS)](https://aws.amazon.com/) cloud computing resources, and [DataCamp](https://www.datacamp.com/). More information about accessing these resources will be provided to registered students in the first several weeks of the quarter.
 24 | 
 25 | ## Grading
 26 | There will be an assignment due at the end of each unit (3 in total). Each assignment is worth 20% of the overall grade, with all assignments together worth a total of 60%. Additionally, attendance and participation will be worth 10% of the overall grade. Finally, students will complete a final project that is worth 30% of the overall grade (25% for the project itself, and 5% for an end-of-quarter presentation).
 27 | 
 28 | | Course Component         | Grade Percentage  |
 29 | | :-------------           | :-------------    |
 30 | | Assignments (Total: 3)   | 60%               |
 31 | | Attendance/Participation | 10%               |
 32 | | Final Project            | 5% (Presentation) |
 33 | |                          | 25% (Project)     |
 34 | 
 35 | Grades are not curved in this class or, at least, not in the traditional sense. We use a standard set of grade boundaries:
 36 | * 95-100: A
 37 | * 90-95: A-
 38 | * 85-90: B+
 39 | * 80-85: B
 40 | * 75-80: B-
 41 | * 70-75: C+
 42 | * <70: Dealt on a case-by-case basis
 43 | 
 44 | We curve only to the extent we might lower the boundaries for one or more letter grades, depending on the distribution of the raw scores. We will not raise the boundaries in response to the distribution.
 45 | 
 46 | So, for example, if you have a total score of 82 in the course, you are guaranteed to get, at least, a B (but may potentially get a higher grade if the boundary for a B+ is lowered).
 47 | 
 48 | ## Final Project
 49 | For their final project (due June 4th, 2021), students will write large-scale computing code that solves a social science research problem of their choosing. For instance, students might perform a computationally intensive demographic simulation, or they may choose to collect, analyze, and visualize large, streaming social media data, or do something else that employs large-scale computing strategies. Students will additionally record a short video presentation about their project. Detailed descriptions and grading rubrics for the project and presentation are available [on the Canvas course site.](https://canvas.uchicago.edu/courses/34598)
 50 | 
 51 | ## Late Assignments/Projects
 52 | Unexcused Late Assignment/Project Submissions will be penalized 10 percentage points for every hour they are late. For example, if an assignment is due on Wednesday at 2:00pm, the following percentage points will be deducted based on the time stamp of the last commit.
 53 | 
 54 | | Example last commit |	Percentage points deducted         |
 55 | | ----                | ----                               |
 56 | | 2:01pm to 3:00pm    |	-10 percentage points              |
 57 | | 3:01pm to 4:00pm    |-20 percentage points               |
 58 | | 4:01pm to 5:00pm    | -30 percentage points              |
 59 | | 5:01pm to 6:00pm    | -40 percentage points              |
 60 | | ...                 |	...                                |
 61 | | 11:01pm and beyond  |	-100 percentage points (no credit) |
 62 | 
 63 | ## Plagiarism on Assignments/Projects
 64 | Academic honesty is an extremely important principle in academia and at the University of Chicago.
 65 | * Writing assignments must quote and cite any excerpts taken from another work.
 66 | * If the cited work is the particular paper referenced in the Assignment, no works cited or references are necessary at the end of the composition.
 67 | * If the cited work is not the particular paper referenced in the Assignment, you MUST include a works cited or references section at the end of the composition.
 68 | * Any copying of other students' work will result in a zero grade and potential further academic discipline.
 69 | If you have any questions about citations and references, consult with your instructor.
 70 | 
 71 | ## Statement of Diversity and Inclusion
 72 | The University of Chicago is committed to diversity and rigorous inquiry from multiple perspectives. The MAPSS, CIR, and Computation programs share this commitment and seek to foster productive learning environments based upon inclusion, open communication, and mutual respect for a diverse range of identities, experiences, and positions.
 73 | 
 74 | Any suggestions for how we might further such objectives both in and outside the classroom are appreciated and will be given serious consideration. Please share your suggestions or concerns with your instructor, your preceptor, or your program’s Diversity and Inclusion representatives: Darcy Heuring (MAPSS), Matthias Staisch (CIR), and Chad Cyrenne (Computation). You are also welcome and encouraged to contact the Faculty Director of your program.
 75 | 
 76 | This course is open to all students who meet the academic requirements for participation. Any student who has a documented need for accommodation should contact Student Disability Services (773-702-6000 or disabilities@uchicago.edu) and the instructor as soon as possible.
 77 | 
 78 | ## Course Schedule
 79 | | Unit   | Week | Day | Topic | Readings | Assignment |
 80 | | --- | --- | --- | --- | --- |  --- |
 81 | | Fundamentals of Large-Scale Computing | Week 1: Introduction to Large-Scale Computing for the Social Sciences | 3/29/2021 | Introduction to the Course | | |
 82 | |   |   | 3/31/2021 | General Considerations for Large-Scale Computing  | [Robey and Zamora 2020 (Chapter 1)](https://livebook.manning.com/book/parallel-and-high-performance-computing/chapter-1), [Faster code via static typing (Cython)](http://docs.cython.org/en/latest/src/quickstart/cythonize.html), [A ~5 minute guide to Numba](https://numba.readthedocs.io/en/stable/user/5minguide.html) |   |
 83 | | | Week 2: On-Premise Large-Scale CPU-computing with MPI | 4/5/2021 | An Introduction to CPUs and Computing Clusters | [Pacheco 2011](https://canvas.uchicago.edu/files/5233897/download?download_frd=1) (Ch. 1-2), [Midway RCC User Guide](https://rcc.uchicago.edu/docs/) | |
 84 | | | | 4/7/2021 | Cluster Computing via Message Passing Interface (MPI) for Python | [Pacheco 2011](https://canvas.uchicago.edu/files/5233897/download?download_frd=1) (Ch. 3), [Dalcín et al. 2008](https://www-sciencedirect-com.proxy.uchicago.edu/science/article/pii/S0743731507001712?via%3Dihub) | |
 85 | |||| Lab: Hands-on Introduction to UChicago's Midway Computing Cluster and mpi4py |||
 86 | | | Week 3: On-Premise GPU-computing with OpenCL | 4/12/2021 | An Introduction to GPUs and Heterogenous Computing with OpenCL | [Scarpino 2012](https://canvas.uchicago.edu/files/5233898/download?download_frd=1) (Read Ch. 1, Skim Ch. 2-5,9) | |
 87 | | | | 4/14/2021 | Harnessing GPUs with PyOpenCL | [Klöckner et al. 2012](https://arxiv.org/pdf/0911.3456.pdf) | |
 88 | |||| Lab: Introduction to PyOpenCL and GPU Computing on UChicago's Midway Computing Cluster |||
 89 | | Architecting Computational Social Science Data Solutions in the Cloud | Week 4: An Introduction to Cloud Computing and Cloud HPC Architectures | 4/19/2021 | An Introduction to the Cloud Computing Landscape and AWS | [Jorissen and Bouffler 2017](https://canvas.uchicago.edu/files/5233891/download?download_frd=1) (Read Ch. 1, Skim Ch. 4-7), [Armbrust et al. 2009](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf), [Jonas et al. 2019](https://arxiv.org/pdf/1902.03383.pdf) | |
 90 | | | | 4/21/2021 | Architectures for Large-Scale Computation in the Cloud | [Introduction to HPC on AWS](https://d1.awsstatic.com/whitepapers/Intro_to_HPC_on_AWS.pdf), [HPC Architectural Best Practices](https://d1.awsstatic.com/whitepapers/architecture/AWS-HPC-Lens.pdf) | Due: Assignment 1 (11:59 PM) |
 91 | |||| Lab: Running "Serverless" HPC Jobs in the AWS Cloud |||
 92 | | | Week 5: Architecting Large-Scale Data Solutions in the Cloud | 4/26/2021 | "Data Lake" Architectures | [Data Lakes and Analytics on AWS](https://aws.amazon.com/big-data/datalakes-and-analytics/), [AWS Data Lake Whitepaper](https://d1.awsstatic.com/whitepapers/Storage/data-lake-on-aws.pdf), [*Introduction to AWS Boto in Python*](https://campus.datacamp.com/courses/introduction-to-aws-boto-in-python) (DataCamp Course; Practice working with S3 Data Lake in Python) | |
 93 | | | | 4/28/2021 | Architectures for Large-Scale Data Structuring and Storage | General, Open Source: ["What is a Database?" (YouTube),](https://www.youtube.com/watch?v=Tk1t3WKK-ZY) ["How to Choose the Right Database?" (YouTube),](https://www.youtube.com/watch?v=v5e_PasMdXc)  [“NoSQL Explained,”](https://www.mongodb.com/nosql-explained) AWS-specific solutions: ["Which Database to Use When?" (YouTube),](https://youtu.be/KWOSGVtHWqA) Optional: [Data Warehousing on AWS Whitepaper](https://d0.awsstatic.com/whitepapers/enterprise-data-warehousing-on-aws.pdf), [AWS Big Data Whitepaper](https://d1.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS.pdf) | |
 94 | |||| Lab: Exploring Large Data Sources in an S3 Data Lake with the AWS Python SDK, Boto |||
 95 | | | Week 6: Large-Scale Data Ingestion and Processing | 5/3/2021 | Stream Ingestion and Processing with Apache Kafka, AWS Kinesis | [Narkhede et al. 2017](https://canvas.uchicago.edu/files/5233889/download?download_frd=1) (Read Ch. 1, Skim 3-6,11), [Dean and Crettaz 2019 (Ch. 4)](https://livebook.manning.com/book/event-streams-in-action/chapter-4/), [AWS Kinesis Whitepaper](https://d0.awsstatic.com/whitepapers/whitepaper-streaming-data-solutions-on-aws-with-amazon-kinesis.pdf) ||
 96 | | | | 5/5/2021 | Batch Processing with Apache Hadoop and MapReduce | [White 2015](https://canvas.uchicago.edu/files/5233885/download?download_frd=1) (read Ch. 1-2, Skim 3-4), [Dean and Ghemawat 2004](https://www.usenix.org/legacy/publications/library/proceedings/osdi04/tech/full_papers/dean/dean.pdf), ["What is Amazon EMR?"](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html), Running MapReduce Jobs with Python’s “mrjob” package on EMR ([Fundamentals](https://mrjob.readthedocs.io/en/latest/guides/quickstart.html) and [Elastic MapReduce Quickstart](https://mrjob.readthedocs.io/en/latest/guides/emr-quickstart.html)) | |
 97 | |||| Lab: Ingesting and Processing Batch Data with MapReduce on AWS EMR and Streaming Data with AWS Kinesis | ||
 98 | | High-Level Paradigms for Large-Scale Data Analysis, Prediction, and Presentation | Week 7: Spark | 5/10/2021 | Large-Scale Data Processing and Analysis with Spark | [Karau et al. 2015](https://canvas.uchicago.edu/files/5233908/download?download_frd=1) (Read Ch. 1-4, Skim 9-11), [*Introduction to PySpark*](https://learn.datacamp.com/courses/introduction-to-pyspark) (DataCamp Course) | |
 99 | | | | 5/12/2021 | A Deeper Dive into Spark (with a survey of Social Science Machine Learning, NLP, and Network Analysis Applications) | [*Machine Learning with PySpark*](https://campus.datacamp.com/courses/machine-learning-with-pyspark) (DataCamp Course), [Guller 2015](https://canvas.uchicago.edu/files/5233903/download?download_frd=1), [Hunter 2017](https://www.youtube.com/watch?v=NmbKst7ny5Q) (Spark Summit Talk), [GraphFrames Documentation for Python](https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-python.html), [Spark NLP Documentation](https://nlp.johnsnowlabs.com/), Optional: [*Feature Engineering with PySpark*](https://learn.datacamp.com/courses/feature-engineering-with-pyspark) (DataCamp Course), Videos about accelerating Spark with GPUs (via [Horovod](https://www.youtube.com/watch?v=D1By2hy4Ecw) for deep learning, and the RAPIDS libraries for both [ETL and ML acceleration in Spark 3.0](https://www.youtube.com/watch?v=4MI_LYah900)) | Due: Assignment 2 (11:59 PM) |
100 | |||| Lab: Running PySpark in an AWS EMR Notebook for Large-Scale Data Analysis and Prediction |||
101 | | | Week 8: Dask | 5/17/2021 | Introduction to Dask | [“Why Dask,”](https://docs.dask.org/en/latest/why.html) [Dask Slide Deck](https://dask.org/slides.html),  [*Parallel Programming with Dask*](https://learn.datacamp.com/courses/parallel-programming-with-dask-in-python) (DataCamp Course) | |
102 | | | | 5/19/2021 | Natively Scaling the Python Ecosystem with Dask |  | |
103 | |||| Lab: Scaling up Python Analytical Workflows in the Cloud and on the RCC in a Jupyter Notebook with Dask | ||
104 | | | Week 9: Presenting Data and Insights from Large-Scale Data Pipelines | 5/24/2021 | Building and Deploying (Scalable) Public APIs and Web Applications with Flask and AWS Elastic Beanstalk | Documentation for [Flask](https://flask-doc.readthedocs.io/en/latest/index.html), and [Elastic Beanstalk](https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/Welcome.html) | |
105 | || | 5/26/2021 | Visualizing Large Data | Documentation for [DataShader](https://datashader.org/index.html) and [Bokeh](https://bokeh.org/), and integrating the two libraries using [HoloViews](http://holoviews.org/user_guide/Large_Data.html) |  |
106 | |   |   | 5/28/2021  |  |   | Due: Assignment 3 (11:59 PM) |
107 | | Student Projects | Week 10: Final Projects | 6/4/2021 ||| Due: Final Project + Presentation Video (11:59 PM) |
108 | 
109 | ## Works Cited
110 | 
111 | "A ~5 minute guide to Numba." https://numba.readthedocs.io/en/stable/user/5minguide.html. Accessed 3/2021.
112 | 
113 | Armbrust, Michael, Fox, Armando, Griffith, Rean, Joseph, Anthony D., Katz, Randy H., Konwinski, Andrew, Lee, Gunho, Patterson, David A., Rabkin, Ariel, Stoica, Ion, and Matei Zaharia. 2009. "Above the Clouds: A Berkeley View of Cloud Computing." Technical report, EECS Department, University of California, Berkeley.
114 | 
115 | ["AWS Big Data Analytics Options on AWS." December 2018.)](https://d1.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS.pdf) AWS Whitepaper.
116 | 
117 | "AWS Elastic Beanstalk Developer Guide." https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/Welcome.html. Accessed 3/2021.
118 | 
119 | ["Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility." July 2017.](https://d1.awsstatic.com/whitepapers/Storage/data-lake-on-aws.pdf)
120 | 
121 | Dalcín, Lisandro, Paz, Rodrigo, Storti, Mario, and Jorge D'Elía. 2008. "MPI for Python: Performance improvements and MPI-2 extensions." *J. Parallel Distrib. Comput.* 68: 655-662.
122 | 
123 | ["Data Warehousing on AWS." March 2016.](https://d0.awsstatic.com/whitepapers/enterprise-data-warehousing-on-aws.pdf) AWS Whitepaper.
124 | 
125 | "DataShader Documentation." https://datashader.org/index.html. Accessed 3/2021.
126 | 
127 | Dean, Alexander, and Valentin Crettaz. 2019. *Event Streams in Action: Real-time event systems with Kafka and Kinesis*. Shelter Island, NY: Manning.
128 | 
129 | Dean, Jeffrey, and Sanjay Ghemawat. 2004. "MapReduce: Simplified data processing on large clusters." In *Proceedings of Operating Systems Design and Implementation (OSDI)*. San Francisco, CA. 137-150.
130 | 
131 | Evans, Robert and Jason Lowe. "Deep Dive into GPU Support in Apache Spark 3.x." https://www.youtube.com/watch?v=4MI_LYah900. Accessed 3/2021.
132 | 
133 | "Faster code via static typing." http://docs.cython.org/en/latest/src/quickstart/cythonize.html. Accessed 3/2021
134 | 
135 | *Feature Engineering with PySpark*. https://learn.datacamp.com/courses/feature-engineering-with-pyspark. Accessed 3/2020.
136 | 
137 | "Flask Documentation." https://flask-doc.readthedocs.io/en/latest/index.html. Accessed 3/2021.
138 | 
139 | "GraphFrames user guide - Python." https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-python.html. Accessed 3/2020.
140 | 
141 | Guller, Mohammed. 2015. "Graph Processing with Spark." In *Big Data Analytics with Spark*. New York: Apress.
142 | 
143 | ["High Performance Computing Lens AWS Well-Architected Framework." December 2019.](https://d1.awsstatic.com/whitepapers/architecture/AWS-HPC-Lens.pdf) AWS Whitepaper.
144 | 
145 | Hunter, Tim. October 26, 2017. "GraphFrames: Scaling Web-Scale Graph Analytics with Apache Spark." https://www.youtube.com/watch?v=NmbKst7ny5Q.
146 | 
147 | *Introduction to AWS Boto in Python*. https://campus.datacamp.com/courses/introduction-to-aws-boto-in-python. Accessed 3/2020.
148 | 
149 | ["Introduction to HPC on AWS." n.d.](https://d1.awsstatic.com/whitepapers/Intro_to_HPC_on_AWS.pdf) AWS Whitepaper.
150 | 
151 | *Introduction to PySpark*. https://learn.datacamp.com/courses/introduction-to-pyspark. Accessed 3/2020.
152 | 
153 | Jonas, Eric, Schleier-Smith, Johann, Sreekanti, Vikram, and Chia-Che Tsai. 2019. "Cloud Programming Simplified: A Berkeley View on Serverless Computing." Technical report, EECS Department, University of California, Berkeley.
154 | 
155 | Jorissen, Kevin, and Brendan Bouffler. 2017. *AWS Research Cloud Program: Researcher's Handbook*. Amazon Web Services.
156 | 
157 | Kane, Frank. March 23, 2018. ["How to Choose the Right Database? - MongoDB, Cassandra, MySQL, HBase"](https://www.youtube.com/watch?v=v5e_PasMdXc). https://www.youtube.com/watch?v=v5e_PasMdXc
158 | 
159 | Karau, Holden, Konwinski, Andy, Wendell, Patrick, and Matei Zaharia. 2015. *Learning Spark*. Sebastopol, CA: O'Reilly.
160 | 
161 | Klöckner, Andreas, Pinto, Nicolas, Lee, Yunsup, Catanzaro, Bryan, Ivanov, Paul, and Ahmed Fasih. 2012. "PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation." *Parallel Computing* 38(3): 157-174.
162 | 
163 | Linux Academy. July 10, 2019. ["What is a Database?"](https://www.youtube.com/watch?v=Tk1t3WKK-ZY). https://www.youtube.com/watch?v=Tk1t3WKK-ZY
164 | 
165 | *Machine Learning with PySpark*. https://campus.datacamp.com/courses/machine-learning-with-pyspark. Accessed 3/2020.
166 | 
167 | "mrjob v0.7.1 documentation." https://mrjob.readthedocs.io/en/latest/index.html. Accessed 3/2020.
168 | 
169 | Narkhede, Neha, Shapira, Gwen, and Todd Palino. 2017. *Kafka: The Definitive Guide*. Sebastopol, CA: O'Reilly.
170 | 
171 | “NoSQL Explained.” https://www.mongodb.com/nosql-explained. Accessed 3/2020.
172 | 
173 | Pacheco, Peter. 2011. *An Introduction to Parallel Programming*. Burlington, MA: Morgan Kaufmann.
174 | 
175 | *Parallel Programming with Dask*. https://learn.datacamp.com/courses/parallel-programming-with-dask-in-python. Accessed 3/2020.
176 | 
177 | Petrossian, Tony, and Ian Meyers. November 30, 2017. "Which Database to Use When?" https://youtu.be/KWOSGVtHWqA. AWS re:Invent 2017.
178 | 
179 | "RCC User Guide." rcc.uchicago.edu/docs/. Accessed March 2020.
180 | 
181 | Robey, Robert and Yuliana Zamora. 2020. *Parallel and High Performance Computing*. Manning Early Access Program.
182 | 
183 | Scarpino, Matthew. 2012. *OpenCL in Action*. Shelter Island, NY: Manning.
184 | 
185 | Sergeev, Alex. March 28, 2019. "Distributed Deep Learning with Horovod." https://www.youtube.com/watch?v=D1By2hy4Ecw.
186 | 
187 | "Spark NLP Documentation." https://nlp.johnsnowlabs.com/. Accessed 3/2021.
188 | 
189 | ["Streaming Data Solutions on AWS with Amazon Kinesis." July 2017.](https://d0.awsstatic.com/whitepapers/whitepaper-streaming-data-solutions-on-aws-with-amazon-kinesis.pdf) AWS Whitepaper.
190 | 
191 | "The Bokeh Visualization Library Documentation." https://bokeh.org/. Accessed 3/2021.
192 | 
193 | "What is Amazon EMR." https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html. Accessed 3/2020.
194 | 
195 | White, Tom. 2015. *Hadoop: The Definitive Guide*. Sebastopol, CA: O'Reilly.
196 | 
197 | "Why Dask." https://docs.dask.org/en/latest/why.html. Accessed 3/2020.
198 | 
199 | "Working with large data using datashader." http://holoviews.org/user_guide/Large_Data.html. Accessed 3/2021.
200 | 


--------------------------------------------------------------------------------