├── .gitignore ├── LICENSE ├── Labs ├── Lab 1 Midway RCC and mpi4py │ ├── midway_cheat_sheet.md │ ├── mpi.sbatch │ ├── mpi_multi_job.sbatch │ └── mpi_rand_walk.py ├── Lab 2 PyOpenCL │ ├── Lab_2_PyOpenCL_Random_Walk_Tutorial.ipynb │ ├── gpu.sbatch │ ├── gpu_rand_walk.py │ └── print_gpu_info.py ├── Lab 3 AWS EC2 and PyWren │ ├── Lab_3_PyWren.ipynb │ └── pywren_workflow.png ├── Lab 4 Accessing Large-Scale Data in S3 │ └── Lab 4 Working with Large Data Sources in S3.ipynb ├── Lab 5 Ingesting and Processing Large-Scale Data │ ├── Part I MapReduce │ │ ├── .mrjob.conf │ │ ├── mapreduce_lab5.py │ │ ├── mrjob_cheatsheet.md │ │ └── sample_us.tsv │ └── Part II Kinesis │ │ ├── Lab 5 Kinesis.ipynb │ │ ├── consumer.py │ │ ├── consumer_feed.png │ │ ├── producer.py │ │ └── simple_kinesis_architecture.png ├── Lab 6 PySpark EDA and ML in an EMR Notebook │ ├── Lab_6.ipynb │ └── Local_Colab_Spark_Setup.ipynb └── Lab 7 Large-Scale Graph Processing with PySpark │ ├── Lab_7_GraphFrames.ipynb │ ├── edges.csv │ └── nodes.csv └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | build/* 3 | dist/* 4 | *.aux 5 | *.bbl 6 | *.blg 7 | *.fdb_latexmk 8 | *.idx 9 | *.ilg 10 | *.ind 11 | *.lof 12 | *.log 13 | *.lot 14 | *.out 15 | *.pdfsync 16 | *.synctex.gz 17 | *.toc 18 | *.swp 19 | *.asv 20 | *.nav 21 | *.snm 22 | *.gz 23 | *.bib.bak 24 | *.fls 25 | *.m~ 26 | *.sublime* 27 | .DS_Store 28 | *.dta 29 | *.ipynb_checkpoints* 30 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Jon Clindaniel 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /Labs/Lab 1 Midway RCC and mpi4py/midway_cheat_sheet.md: -------------------------------------------------------------------------------- 1 | # Logging in and Copying Files to Midway 2 | See [RCC user guide](https://rcc.uchicago.edu/docs/) for login details specific to your system and additional options. The instructions for copying files via `scp` and logging in via `ssh` below assume a Unix-style command line interface. 
3 | 4 | ### SSH into Cluster with cNetID and Password 5 | ``` 6 | ssh your_cnet_id@midway2.rcc.uchicago.edu 7 | ``` 8 | Note that you'll now be able to move around your home directory using standard unix commands (`cd`, `pwd`, `ls`, etc.). If you are on a Windows machine, I (Jon) recommend enabling [Windows Subsystem for Linux and installing Ubuntu 18.04](https://docs.microsoft.com/en-us/windows/wsl/install-win10) to complete all of these tasks. This is what I do and I find this makes my life a lot easier than having to mentally transition between DOS and Unix commands (or needing to use an additional third-party tool). 9 | 10 | ### SCP files to/from local directory 11 | If you need to transfer files to and from Midway's storage, I find it easiest to just copy files via the `scp` command. Here, I copy a local file `local_file` in my current directory to my home directory on Midway. Then, I copy a file `remote_file` from my Midway home directory back to my current directory on my local machine. If you prefer not to use this command line approach, there are tutorials in the Midway documentation with [alternative approaches](https://rcc.uchicago.edu/docs/data-transfer/index.html). 12 | 13 | ``` 14 | scp local_file your_cnet_id@midway2.rcc.uchicago.edu: 15 | ``` 16 | ``` 17 | scp your_cnet_id@midway2.rcc.uchicago.edu:remote_file . 18 | ``` 19 | 20 | To copy an entire directory, use the `-r` flag: 21 | ``` 22 | scp -r your_cnet_id@midway2.rcc.uchicago.edu:remote_directory . 23 | ``` 24 | 25 | ### Clone GitHub Repository 26 | Another good option is to clone a GitHub repository (for instance this public course repository) to your home directory on Midway and access files/code this way. 27 | 28 | ``` 29 | git clone https://github.com/jonclindaniel/LargeScaleComputing_S20.git 30 | ``` 31 | 32 | ### Adjustments to text on the Command Line 33 | If need to make any adjustments to text files before/after running them (or create new ones) on Midway, can do so on the command line with text editing tools such as nano, vim, etc. 34 | ``` 35 | nano mpi.sbatch 36 | ``` 37 | 38 | # Loading Modules and installing programs 39 | 40 | ### Load appropriate modules for MPI CPU* and OpenCL GPU dev 41 | ``` 42 | module load cuda 43 | module load mpi4py/3.0.1a0_py3 44 | ``` 45 | 46 | ### Install additional Python packages to local user directory from login node 47 | To install packages that are not already included in a module (such as upgrading matplotlib, and installing PyOpenCL, which we will be using in this class), you can use `pip` to install these packages in your local user directory from the login node on Midway. Make sure you have loaded the cuda and mpi4py modules above before you try to run the below commands. 48 | 49 | ``` 50 | pip install -U --user matplotlib 51 | pip install --user pyopencl 52 | ``` 53 | 54 | # Running jobs 55 | 56 | Resource sharing and job scheduling is handled by Slurm software on the Midway RCC. You can see detailed information in the Midway documentation, but Slurm allows you to see which partitions are available at any given time via the `sinfo` command, submit batch jobs via the `sbatch` command (which will allow you to request specific node/interconnect resources for a period of time and run your code on those resources), and schedule interactive sessions via the `sinteractive` command. Listed below are some of the fundamental commands you will need to know how to perform for this class. 
57 | 58 | ### Run Batch jobs with Slurm scripts 59 | You will use Slurm scripts to request computational resources for a period of time and run your code. These scripts can be run with the `sbatch` commands, as demonstrated below. You can check out sample Slurm scripts in our GitHub course repository, for more detail on how these scripts are structured. 60 | 61 | ``` 62 | sbatch mpi.sbatch 63 | sbatch gpu.sbatch 64 | ``` 65 | 66 | Note, if you wrote your sbatch file on a Windows machine, you might need to convert the text into a Unix format for it to run properly on the Midway RCC. To do so, you can install `dos2unix` on WSL via `apt-get install dos2unix` and then run the converter on your sbatch file. 67 | 68 | ``` 69 | dos2unix gpu.sbatch 70 | ``` 71 | 72 | You can monitor the progress of your job with (the sbatch command will print out your job ID): 73 | ``` 74 | squeue -j your_job_id 75 | ``` 76 | 77 | You can also cancel jobs with: 78 | ``` 79 | scancel your_job_id 80 | ``` 81 | 82 | ### Check results of your batch job (assuming write to `.out` file) 83 | In your Slurm script, you will specify a `.out` file, where the output of your program will be written. You can download the file to your local machine to look at the results (via `scp`), or you can check the results on the Midway command line with the standard Unix `cat` command. 84 | 85 | ``` 86 | cat gpu.out 87 | ``` 88 | 89 | ### Run interactive jobs 90 | You should not perform intensive computational tasks on the login nodes. Use the `sinteractive` command to set up an interactive session on other Midway nodes if you want to have an interactive command line experience (you can specify exactly which nodes you would like to access; see the documentation for syntax). Here we request 4 cores for 15 minutes. Additionally, you can use the interactive mode to run Jupyter notebooks, which you can view in your browser (see documentation for more details). 91 | 92 | ``` 93 | sinteractive --time=00:15:00 --ntasks=4 94 | ``` 95 | 96 | \* Note: this setup allows for use of mpi4py, PyOpenCL, as well as PySpark (which is included in the mpi4py module) 97 | -------------------------------------------------------------------------------- /Labs/Lab 1 Midway RCC and mpi4py/mpi.sbatch: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #SBATCH --job-name=mpi 4 | #SBATCH --output=mpi.out 5 | #SBATCH --ntasks=4 6 | #SBATCH --partition=broadwl 7 | #SBATCH --constraint=fdr 8 | 9 | # Load the default mpi4py/Anaconda module. 10 | module load mpi4py/3.0.1a0_py3 11 | 12 | # Run the python program with mpirun. The -n flag is not required; 13 | # mpirun will automatically figure out the best configuration from the 14 | # Slurm environment variables. 15 | mpirun python ./mpi_rand_walk.py 16 | -------------------------------------------------------------------------------- /Labs/Lab 1 Midway RCC and mpi4py/mpi_multi_job.sbatch: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #SBATCH --job-name=mpi_multi_job 4 | #SBATCH --ntasks=11 5 | #SBATCH --partition=broadwl 6 | #SBATCH --constraint=fdr 7 | 8 | # Load the default mpi4py/Anaconda module. 
9 | module load mpi4py/3.0.1a0_py3 10 | 11 | # Run the python program with mpirun, using & to run jobs at the same time 12 | mpirun -n 1 python ./mpi_rand_walk.py > ./mpi_nprocs1.out & 13 | mpirun -n 10 python ./mpi_rand_walk.py > ./mpi_nprocs10.out & 14 | 15 | # Wait until all simultaneous mpiruns are done 16 | wait 17 | -------------------------------------------------------------------------------- /Labs/Lab 1 Midway RCC and mpi4py/mpi_rand_walk.py: -------------------------------------------------------------------------------- 1 | from mpi4py import MPI 2 | import matplotlib.pyplot as plt 3 | import numpy as np 4 | import time 5 | 6 | def sim_rand_walks_parallel(n_runs): 7 | # Get rank of process and overall size of communicator: 8 | comm = MPI.COMM_WORLD 9 | rank = comm.Get_rank() 10 | size = comm.Get_size() 11 | 12 | # Start time: 13 | t0 = time.time() 14 | 15 | # Evenly distribute number of simulation runs across processes 16 | N = int(n_runs/size) 17 | 18 | # Simulate N random walks and specify as a NumPy Array 19 | r_walks = [] 20 | for i in range(N): 21 | steps = np.random.normal(loc=0, scale=1, size=100) 22 | steps[0] = 0 23 | r_walks.append(100 + np.cumsum(steps)) 24 | r_walks_array = np.array(r_walks) 25 | 26 | # Gather all simulation arrays to buffer of expected size/dtype on rank 0 27 | r_walks_all = None 28 | if rank == 0: 29 | r_walks_all = np.empty([N*size, 100], dtype='float') 30 | comm.Gather(sendbuf = r_walks_array, recvbuf = r_walks_all, root=0) 31 | 32 | # Print/plot simulation results on rank 0 33 | if rank == 0: 34 | # Calculate time elapsed after computing mean and std 35 | average_finish = np.mean(r_walks_all[:,-1]) 36 | std_finish = np.std(r_walks_all[:,-1]) 37 | time_elapsed = time.time() - t0 38 | 39 | # Print time elapsed + simulation results 40 | print("Simulated %d Random Walks in: %f seconds on %d MPI processes" 41 | % (n_runs, time_elapsed, size)) 42 | print("Average final position: %f, Standard Deviation: %f" 43 | % (average_finish, std_finish)) 44 | 45 | # Plot Simulations and save to file 46 | plt.plot(r_walks_all.transpose()) 47 | plt.savefig("r_walk_nprocs%d_nruns%d.png" % (size, n_runs)) 48 | 49 | return 50 | 51 | def main(): 52 | sim_rand_walks_parallel(n_runs = 10000) 53 | 54 | if __name__ == '__main__': 55 | main() 56 | -------------------------------------------------------------------------------- /Labs/Lab 2 PyOpenCL/gpu.sbatch: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=gpu # job name 3 | #SBATCH --output=gpu.out # output log file 4 | #SBATCH --error=gpu.err # error file 5 | #SBATCH --time=00:05:00 # 5 minutes of wall time 6 | #SBATCH --nodes=1 # 1 GPU node 7 | #SBATCH --partition=gpu2 # GPU2 partition 8 | #SBATCH --ntasks=1 # 1 CPU core to drive GPU 9 | #SBATCH --gres=gpu:1 # Request 1 GPU 10 | 11 | module load cuda 12 | module load mpi4py/3.0.1a0_py3 13 | 14 | python ./print_gpu_info.py 15 | # python ./gpu_rand_walk.py 16 | -------------------------------------------------------------------------------- /Labs/Lab 2 PyOpenCL/gpu_rand_walk.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pyopencl as cl 3 | import pyopencl.array as cl_array 4 | import pyopencl.clrandom as clrand 5 | import pyopencl.tools as cltools 6 | from pyopencl.scan import GenericScanKernel 7 | import matplotlib.pyplot as plt 8 | import time 9 | 10 | def sim_rand_walks(n_runs): 11 | # Set up context and command queue 12 | ctx 
= cl.create_some_context() 13 | queue = cl.CommandQueue(ctx) 14 | 15 | # mem_pool = cltools.MemoryPool(cltools.ImmediateAllocator(queue)) 16 | 17 | t0 = time.time() 18 | 19 | # Generate an array of Normal Random Numbers on GPU of length n_sims*n_steps 20 | n_steps = 100 21 | rand_gen = clrand.PhiloxGenerator(ctx) 22 | ran = rand_gen.normal(queue, (n_runs*n_steps), np.float32, mu=0, sigma=1) 23 | 24 | # Establish boundaries for each simulated walk (i.e. start and end) 25 | # Necessary so that we perform scan only within rand walks and not between 26 | seg_boundaries = [1] + [0]*(n_steps-1) 27 | seg_boundaries = np.array(seg_boundaries, dtype=np.uint8) 28 | seg_boundary_flags = np.tile(seg_boundaries, int(n_runs)) 29 | seg_boundary_flags = cl_array.to_device(queue, seg_boundary_flags) 30 | 31 | # GPU: Define Segmented Scan Kernel, scanning simulations: f(n-1) + f(n) 32 | prefix_sum = GenericScanKernel(ctx, np.float32, 33 | arguments="__global float *ary, __global char *segflags, " 34 | "__global float *out", 35 | input_expr="ary[i]", 36 | scan_expr="across_seg_boundary ? b : (a+b)", neutral="0", 37 | is_segment_start_expr="segflags[i]", 38 | output_statement="out[i] = item + 100", 39 | options=[]) 40 | 41 | # Allocate space for result of kernel on device 42 | ''' 43 | Note: use a Memory Pool (commented out above and below) if you're invoking 44 | multiple times to avoid wasting time creating brand new memory areas each 45 | time you invoke the kernel: https://documen.tician.de/pyopencl/tools.html 46 | ''' 47 | # dev_result = cl_array.arange(queue, len(ran), dtype=np.float32, 48 | # allocator=mem_pool) 49 | dev_result = cl_array.empty_like(ran) 50 | 51 | # Enqueue and Run Scan Kernel 52 | prefix_sum(ran, seg_boundary_flags, dev_result) 53 | 54 | # Get results back on CPU to plot and do final calcs, just as in Lab 1 55 | r_walks_all = (dev_result.get() 56 | .reshape(n_runs, n_steps) 57 | .transpose() 58 | ) 59 | 60 | average_finish = np.mean(r_walks_all[-1]) 61 | std_finish = np.std(r_walks_all[-1]) 62 | final_time = time.time() 63 | time_elapsed = final_time - t0 64 | 65 | print("Simulated %d Random Walks in: %f seconds" 66 | % (n_runs, time_elapsed)) 67 | print("Average final position: %f, Standard Deviation: %f" 68 | % (average_finish, std_finish)) 69 | 70 | # Plot Random Walk Paths 71 | ''' 72 | Note: Scan already only starts scanning at the second entry, but for the 73 | sake of the plot, let's set all of our random walk starting positions to 100 74 | and then plot the random walk paths. 
75 | ''' 76 | r_walks_all[0] = [100]*n_runs 77 | plt.plot(r_walks_all) 78 | plt.savefig("r_walk_nruns%d_gpu.png" % n_runs) 79 | 80 | return 81 | 82 | def main(): 83 | sim_rand_walks(n_runs = 10000) 84 | 85 | if __name__ == '__main__': 86 | main() 87 | -------------------------------------------------------------------------------- /Labs/Lab 2 PyOpenCL/print_gpu_info.py: -------------------------------------------------------------------------------- 1 | import pyopencl as cl 2 | 3 | def print_device_info() : 4 | print('\n' + '=' * 60 + '\nOpenCL Platforms and Devices') 5 | for platform in cl.get_platforms(): 6 | print('=' * 60) 7 | print('Platform - Name: ' + platform.name) 8 | print('Platform - Vendor: ' + platform.vendor) 9 | print('Platform - Version: ' + platform.version) 10 | print('Platform - Profile: ' + platform.profile) 11 | for device in platform.get_devices(): 12 | print(' ' + '-' * 56) 13 | print(' Device - Name: ' \ 14 | + device.name) 15 | print(' Device - Type: ' \ 16 | + cl.device_type.to_string(device.type)) 17 | print(' Device - Max Clock Speed: {0} Mhz'\ 18 | .format(device.max_clock_frequency)) 19 | print(' Device - Compute Units: {0}'\ 20 | .format(device.max_compute_units)) 21 | print(' Device - Local Memory: {0:.0f} KB'\ 22 | .format(device.local_mem_size/1024.0)) 23 | print(' Device - Constant Memory: {0:.0f} KB'\ 24 | .format(device.max_constant_buffer_size/1024.0)) 25 | print(' Device - Global Memory: {0:.0f} GB'\ 26 | .format(device.global_mem_size/1073741824.0)) 27 | print(' Device - Max Buffer/Image Size: {0:.0f} MB'\ 28 | .format(device.max_mem_alloc_size/1048576.0)) 29 | print(' Device - Max Work Group Size: {0:.0f}'\ 30 | .format(device.max_work_group_size)) 31 | print('\n') 32 | 33 | if __name__ == "__main__": 34 | print_device_info() 35 | -------------------------------------------------------------------------------- /Labs/Lab 3 AWS EC2 and PyWren/pywren_workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jonclindaniel/LargeScaleComputing_S20/b1474da32282eb72fdc8206ad16a76f23e2e5c4a/Labs/Lab 3 AWS EC2 and PyWren/pywren_workflow.png -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part I MapReduce/.mrjob.conf: -------------------------------------------------------------------------------- 1 | runners: 2 | emr: 3 | # Specify a pem key to start up an EMR cluster on your behalf 4 | ec2_key_pair: MACS_30123 5 | ec2_key_pair_file: ~/MACS_30123.pem 6 | 7 | # Specify type/# of EC2 instances you want your code to run on 8 | core_instance_type: m5.xlarge 9 | num_core_instances: 3 10 | region: us-east-1 11 | 12 | # to read from/write to S3; note colons instead of "=": 13 | aws_access_key_id: 14 | aws_secret_access_key: 15 | aws_session_token: 16 | -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part I MapReduce/mapreduce_lab5.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Lab 5 (Part I): Batch Processing Data with MapReduce 3 | 4 | What's the most used word in 5-star customer reviews on Amazon? 5 | 6 | We can answer this question using the mrjob package to investigate customer 7 | reviews available as a part of Amazon's public S3 customer reviews dataset. 
8 | 9 | For this demo, we'll use a small sample of this 100m+ review dataset that 10 | Amazon provides (s3://amazon-reviews-pds/tsv/sample_us.tsv). 11 | 12 | In order to run the code below, be sure to `pip install mrjob` if you have not 13 | done so already. 14 | ''' 15 | 16 | from mrjob.job import MRJob 17 | from mrjob.step import MRStep 18 | import re 19 | 20 | WORD_RE = re.compile(r"[\w']+") 21 | 22 | class MRMostUsedWord(MRJob): 23 | 24 | def mapper_get_words(self, _, row): 25 | ''' 26 | If a review's star rating is 5, yield all of the words in the review 27 | ''' 28 | data = row.split('\t') 29 | if data[7] == '5': 30 | for word in WORD_RE.findall(data[13]): 31 | yield (word.lower(), 1) 32 | 33 | def combiner_count_words(self, word, counts): 34 | ''' 35 | Sum all of the words available so far 36 | ''' 37 | yield (word, sum(counts)) 38 | 39 | def reducer_count_words(self, word, counts): 40 | ''' 41 | Arrive at a total count for each word in the 5 star reviews 42 | ''' 43 | yield None, (sum(counts), word) 44 | 45 | # discard the key; it is just None 46 | def reducer_find_max_word(self, _, word_count_pairs): 47 | ''' 48 | Yield the word that occurs the most in the 5 star reviews 49 | ''' 50 | yield max(word_count_pairs) 51 | 52 | def steps(self): 53 | return [ 54 | MRStep(mapper=self.mapper_get_words, 55 | combiner=self.combiner_count_words, 56 | reducer=self.reducer_count_words), 57 | MRStep(reducer=self.reducer_find_max_word) 58 | ] 59 | 60 | if __name__ == '__main__': 61 | MRMostUsedWord.run() 62 | -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part I MapReduce/mrjob_cheatsheet.md: -------------------------------------------------------------------------------- 1 | # mrjob Cheat Sheet 2 | 3 | To run/debug `mrjob` code locally from your command line: 4 | 5 | ``` 6 | python mapreduce_lab5.py sample_us.tsv 7 | ``` 8 | 9 | To run your `mrjob` code on an AWS EMR cluster, you should first ensure that your configuration file is set with your EC2 pem file name and file location, as well as your current credentials from AWS Education/Vocareum. Note that the credentials are listed with ":" here and not "=" as they are in your `credentials` file. `mrjob` assumes that this (`.mrjob.conf`) file will be located in your home directory (at `~/.mrjob.conf`), so you will need to put the file there. Otherwise, you will need to designate your configuration [as a command line option](https://mrjob.readthedocs.io/en/latest/cmd.html#create-cluster) when you start your `mrjob` job using the `-c` flag. 10 | 11 | ~/.mrjob.conf 12 | ``` 13 | runners: 14 | emr: 15 | # Specify a pem key to start up an EMR cluster on your behalf 16 | ec2_key_pair: MACS_30123 17 | ec2_key_pair_file: ~/MACS_30123.pem 18 | 19 | # Specify type/# of EC2 instances you want your code to run on 20 | core_instance_type: m5.xlarge 21 | num_core_instances: 3 22 | region: us-east-1 23 | 24 | # to read from/write to S3; note ":" instead of "=" from `credentials`: 25 | aws_access_key_id: 26 | aws_secret_access_key: 27 | aws_session_token: 28 | ``` 29 | 30 | To run your `mrjob` code on an EMR cluster (of the size and type specified in your configuration file), you can run the following command on the command line (Note: before running this code, ensure that you have created the default IAM "EMR roles" for your cluster via the AWS console by following the instructions in the lab video for today of starting up and terminating an EMR cluster). 
Note that this command will start up a cluster, run your job, and then automatically terminate the cluster for you. 31 | 32 | ``` 33 | python mapreduce_lab5.py -r emr sample_us.tsv 34 | ``` 35 | 36 | To create a stand-alone cluster that you can use to run multiple jobs on, you can run: 37 | ``` 38 | mrjob create-cluster 39 | ``` 40 | 41 | When you create a cluster, `mrjob` will print out the ID number for the cluster ("j" followed by a bunch of numbers and characters). If you copy this job-id number and specify it after the `--cluster-id` flag on the command line, your code will be run on this already-running cluster. Here, we additionally write the results of our job out to a text file (`mr.out`) via the `> mr.out` addendum at the end of the line. 42 | 43 | ``` 44 | python mapreduce_lab5.py -r emr sample_us.tsv --cluster-id= > mr.out 45 | ``` 46 | 47 | When you're done running jobs on your cluster, you can terminate the cluster with the following command so that you don't have to pay for it any longer than you need to: 48 | 49 | ``` 50 | mrjob terminate-cluster 51 | ``` 52 | 53 | For additional configuration options, consult the [`mrjob` documentation](https://mrjob.readthedocs.io/en/latest/index.html). 54 | -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part I MapReduce/sample_us.tsv: -------------------------------------------------------------------------------- 1 | marketplace customer_id review_id product_id product_parent product_title product_category star_rating helpful_votes total_votes vine verified_purchase review_headline review_body review_date 2 | US 18778586 RDIJS7QYB6XNR B00EDBY7X8 122952789 Monopoly Junior Board Game Toys 5 0 0 N Y Five Stars Excellent!!! 2015-08-31 3 | US 24769659 R36ED1U38IELG8 B00D7JFOPC 952062646 56 Pieces of Wooden Train Track Compatible with All Major Train Brands Toys 5 0 0 N Y Good quality track at excellent price Great quality wooden track (better than some others we have tried). Perfect match to the various vintages of Thomas track that we already have. There is enough track here to have fun and get creative incorporating your key pieces with track splits, loops and bends. 2015-08-31 4 | US 44331596 R1UE3RPRGCOLD B002LHA74O 818126353 Super Jumbo Playing Cards by S&S Worldwide Toys 2 1 1 N Y Two Stars Cards are not as big as pictured. 2015-08-31 5 | US 23310293 R298788GS6I901 B00ARPLCGY 261944918 Barbie Doll and Fashions Barbie Gift Set Toys 5 0 0 N Y my daughter loved it and i liked the price and it came ... my daughter loved it and i liked the price and it came to me rather than shopping with a ton of people around me. Amazon is the Best way to shop! 2015-08-31 6 | US 38745832 RNX4EXOBBPN5 B00UZOPOFW 717410439 Emazing Lights eLite Flow Glow Sticks - Spinning Light LED Toy Toys 1 1 1 N Y DONT BUY THESE! Do not buy these! They break very fast I spun then for 15 minutes and the end flew off don't waste your money. They are made from cheap plastic and have cracks in them. Buy the poi balls they work a lot better if you only have limited funds. 2015-08-31 7 | US 13394189 R3BPETL222LMIM B009B7F6CA 873028700 Melissa & Doug Water Wow Coloring Book - Vehicles Toys 5 0 0 N Y Five Stars Great item. Pictures pop thru and add detail as "painted." Pictures dry and it can be repainted. 
2015-08-31 8 | US 2749569 R3SORMPJZO3F2J B0101EHRSM 723424342 Big Bang Cosmic Pegasus (Pegasis) Metal 4D High Performance Generic Battling Top BB-105 Toys 3 2 2 N Y Three Stars To keep together, had to use crazy glue. 2015-08-31 9 | US 41137196 R2RDOJQ0WBZCF6 B00407S11Y 383363775 Fun Express Insect Finger Puppets 12ct Toy Toys 5 0 0 N Y Five Stars I was pleased with the product. 2015-08-31 10 | US 433677 R2B8VBEPB4YEZ7 B00FGPU7U2 780517568 Fisher-Price Octonauts Shellington's On-The-Go Pod Toy Toys 5 0 0 N Y Five Stars Children like it 2015-08-31 11 | US 1297934 R1CB783I7B0U52 B0013OY0S0 269360126 Claw Climber Goliath/ Disney's Gargoyles Toys 1 0 1 N Y Shame on the seller !!! Showed up not how it's shown . Was someone's old toy. with paint on it. 2015-08-31 12 | US 52006292 R2D90RQQ3V8LH B00519PJTW 493486387 100 Foot Multicolor Pennant Banner Toys 5 0 0 N Y Five Stars Really liked these. They were a little larger than I thought, but still fun. 2015-08-31 13 | US 32071052 R1Y4ZOUGFMJ327 B001TCY2DO 459122467 Pig Jumbo Foil Balloon Toys 5 0 0 N Y Nice huge balloon Nice huge balloon! Had my local grocery store fill it up for a very small fee, it was totally worth it! 2015-08-31 14 | US 7360347 R2BUV9QJI2A00X B00DOQCWF8 226984155 Minecraft Animal Toy (6-Pack) Toys 5 0 1 N Y Five Stars Great deal 2015-08-31 15 | US 11613707 RSUHRJFJIRB3Z B004C04I4I 375659886 Disney Baby: Eeyore Large Plush Toys 4 0 0 N Y Four Stars As Advertised 2015-08-31 16 | US 13545982 R1T96CG98BBA15 B00NWGEKBY 933734136 Team Losi 8IGHT-E RTR AVC Electric 4WD Buggy Vehicle (1/8 Scale) Toys 3 2 4 N Y ... servo so expect to spend 150 more on a good servo immediately be the stock one breaks right Comes w a 15$ servo so expect to spend 150 more on a good servo immediately be the stock one breaks right away 2015-08-31 17 | US 43880421 R2ATXF4QQ30YW B00000JS5S 341842639 Hot Wheels 48- Car storage Case With Easy Grip Carrying Case Toys 5 0 0 N Y Five Stars awesome ! Thanks! 2015-08-31 18 | US 1662075 R1YS3DS218NNMD B00XPWXYDK 210135375 ZuZo 2.4GHz 4 CH 6 Axis Gyro RC Quadcopter Drone with Camera & LED Lights, 38 x 38 x 7cm Toys 5 4 4 N N The closest relevance I have to items like these is while in the army I was trained ... I got this item for me and my son to play around with. The closest relevance I have to items like these is while in the army I was trained in the camera rc bots. This thing is awesome we tested the range and got somewhere close to 50 yards without an issue. Getting the controls is a bit tricky at first but after about twenty minutes you get the feel for it. The drone comes just about fly ready you just have to sync the controller. I am definitely a fan of the drones now. Only concern I have is maybe a little more silent but other than that great buy.

*Disclaimer I received this product at a discount for my unbiased review. 2015-08-31 19 | US 18461411 R2SDXLTLF92O0H B00VPXX92W 705054378 Teenage Mutant Ninja Turtles T-Machines Tiger Claw in Safari Truck Diecast Vehicle Toys 5 0 0 N Y Five Stars It was a birthday present for my grandson and he LOVES IT!! 2015-08-31 20 | US 27225859 R4R337CCDWLNG B00YRA3H4U 223420727 Franklin Sports MLB Fold Away Batting Tee Toys 3 0 1 Y N Got wrong product in the shipment Got a wrong product from Amazon Vine and unable to provide a good review. We received a pair of cute girls gloves and a baseball ball instead, while we were expecting a boys batting tee. The gloves are cute, however made for at least 6+ yrs or above...more likely 8-9 yrs old girls.

Can't provide a fair review as we were not able to use the product. 2015-08-31 21 | US 20494593 R32Z6UA4S5Q630 B009T8BSQY 787701676 Alien Frontiers: Factions Toys 1 0 0 N Y Overpriced. You need expansion packs 3-5 if you want access to the player aids for the Factions expansion. The base game of Alien Frontiers just plays so much smoother than adding Factions with the expansion packs. All this will do is pigeonhole you into a certain path to victory. 2015-08-31 22 | US 6762003 R1H1HOVB44808I B00PXWS1CY 996611871 Holy Stone F180C Mini RC Quadcopter Drone with Camera 2.4GHz 6-Axis Gyro Bonus Battery and 8 Blades Toys 5 1 1 N N Five Stars Awesome customer service and a cool little drone! Especially for the price! 2015-08-31 23 | US 25402244 R4UVQIRZ5T1FM 1591749352 741582499 Klutz Sticker Design Studio: Create Your Own Custom Stickers Craft Kit Toys 4 1 2 N Y Great product for little girls! I got these for my daughters for plane trip. I liked that the zipper pouch was attached for markers. However, that pouch fell off. But the girls have loved coloring their own stickers. Would def buy this again. 2015-08-31 24 | US 32910511 R226K8IJLRPTIR B00V5DM3RE 587799706 Yoga Joes - Green Army Men Toys Toys 5 0 1 N Y Creative and fun! My girlfriend and I are both into yoga and I gave her a set of the Yoga Joes for her new home yoga room. When she saw them, she was impressed that I had found little green army men like her brother used to play with. Then she realized they were doing yoga and she almost exploded with delight. You should have seen the look on her face. Needless to say, the gift was a huge hit. They are absolutely brilliant! 2015-08-31 25 | US 18206299 R3031Q42BKAN7J B00UMSVHD4 135383196 Lalaloopsy Girls Basic Doll- Prairie Dusty Trails Toys 4 1 1 N N i like it but i absoloutely hate that some dolls don't ... i like it but i absoloutely hate that some dolls don't have pets like this one so I'm not stoked and i really would have liked to see her pet 2015-08-31 26 | US 26599182 R44NP0QG6E98W B00JLKI69W 375626298 WOW Toys Town Advent Calendar Toys 3 1 1 N Y We love how well they are made We have MANY Wow toys in our home. We love how well they are made. The advent calendar is an exception. The plastic is thinner than our other wow toys and the barn animals won't even stand up on their own due to the head weighing more than the body and uneven base of the toy. Very disappointing when we love the concept. The story to read along with everyday is great. I would have preferred quality (and would have paid more for it) instead of a cheap "knock-off" of the rest of their toy line. It is very obvious by the weight of the toys sloppy paint jobs which items in our "Wow Toys" bin are part advent calendar vs. the rest of the toys we have. Also there is a lot of overlap between the wow town advent calendar & winter wonderland calendar, which makes me want to look for alternatives for this year's advent calendar options. 2015-08-31 27 | US 128540 R24VKWVWUMV3M3 B004S8F7QM 829220659 Cards Against Humanity Toys 5 0 0 N Y Five Stars Tons of fun 2015-08-31 28 | US 125518 R2MW3TEBPWKENS B00MZ6BR3Q 145562057 Monster High Haunted Student Spirits Porter Geiss Doll Toys 5 0 0 N Y Five Stars Love it great add to collecton 2015-08-31 29 | US 15048896 R3N01IESYEYW01 B001CTOC8O 278247652 Star Wars Clone Wars Clone Trooper Child's Deluxe Captain Rex Costume Toys 5 0 1 N Y Five Stars Exactly as described. Fits my 7-yr old well! 
2015-08-31 30 | US 12191231 RKLAK7EPEG5S6 B00BMKL5WY 906199996 LEGO Creative Tower Building Kit XXL 1600 Pieces 10664 Toys 5 1 2 N Y The children LOVED them and best part was that it helped the ... Purchased these Lego's to help aid me with teaching my Sunday School class. The children LOVED them and best part was that it helped the children remember past lessons. The true Lego brand seem to work/snap together/fit together better than the generic brands. (Plus I had some Lego snobs in my class that only would use the real Lego brand and shunned the generic brand). Only wish Lego's were a little cheaper but you get what you pay for and I would recommend for quality to purchase this set. 2015-08-31 31 | US 18409006 R1HOJ5GOA2JWM0 B00L71H0F4 692305292 Barkology Princess the Poodle Hand Puppet Toys 2 1 1 N Y My little dog can bite my hand right through the puupet. IT's OK, but not as good as the old Bite Meez puppets. This puppet is very thin. My little yorkie often bites my hands right through the stuffing. It not durable enough to really play with. 2015-08-31 32 | US 42523709 RO5VL1EAPX6O3 B004CLZRM4 59085350 Intex Mesh Lounge (Colors May Vary) Toys 1 0 0 N Y Save your money...don't buy! This was to be a gift for my husband for our new pool. Did not receive the color I ordered but most of all after only one month of use (not continuously) the mesh pulled away from the material and the inflatable side. Completely shredded and no longer of use. It was stored properly and was not kept outside or in the pool. Poorly made, better off going to W**-M*** and getting something on clearance. 2015-08-31 33 | US 45601416 R3OSJU70OIBWVE B000PEOMC8 895316207 Intex River Run I Sport Lounge, Inflatable Water Float, 53 in Diameter Toys 5 0 0 N Y but I've bought one in the past and loved Ended up sending this guy back because I didnt need it, but I've bought one in the past and loved it 2015-08-31 34 | US 47546726 R3NFZZCJSROBT4 B008W1BPWQ 397107238 Peppa Pig 7 Wood Puzzles In Wooden Storage Box (styles will vary) Toys 3 0 0 N Y Three Stars The product is good, but the box was broken. 2015-08-31 35 | US 21448082 R47XBGQFP039N B00FZX71BI 480992295 Paraboard - Parallel Charging Board for Lipos with EC5 Connectors Toys 5 0 0 N Y CAN SAVE TME IF YOU UNDESTAND HOW IT WORKS . Works well ,quality product but this style of board will charge multiple batteries at the same time SAFELY ( IF ) ALL BATTERIES ARE OF THE SAME CELL COUNT , THE SAME BATTERY COMPOSITION ( LIPO, NIMH-etc ) AND THEY MUST HAVE INDIVIDUAL CELL VOLTAGES THAT ARE VERY CLOSE AND EQUAL TO EACH BATTERY CONNECTED AT THE SAME TIME . When board is connected to most if not all chargers it can only read total and individual cell voltage of ONE OF THE BATTERIES AND MAY OVER OR UNDER CHARGE THE OTHERS TO SOME DEGREE , TOTAL RATE OF CHARGE IS DIVIDED EQUALLY BETWEEN BATTERIES CONNECTED AT THE SAME TIME . Close monitoring is a must when using like all high discharge batteies . I have only personally expeienced one lipo battery meltdown and it is a very SHORT IF NOT NON EXISTANT WINDOW OF OPPERUNITY TO PREVENT OR MINIMISE THE COLATTERAL DAMAGE ONCE THE PROCESS STARTS . Read and understand all charging and battery instructions . 2015-08-31 36 | US 12612039 R1JS8G26X4RM2G B00D4NJSJE 408940178 The Game of Life Money and Asset Board Game, Fame Edition Toys 5 0 1 N Y Five Stars Great gift! 2015-08-31 37 | US 44928701 R1ORWPFQ9EDYA0 B000HZZT7W 967346376 LCR Dice Game (Red Chips) Toys 5 0 0 N Y Five Stars We play this game quite a bit with friends. 
2015-08-31 38 | US 43173394 R1YIX4SO32U0GT B002G54DAA 57447684 BCW - Deluxe Currency Slab - Regular Bill - Dollar / Currency Collecting Supplies Toys 5 0 1 N Y BCW - Deluxe Currency Slab Fits my $20 bill perfectly. 2015-08-31 39 | US 11210951 R1W3QQZ8JKECCI B003JT0L4Y 876626440 Ocean Life Stamps Birthday Party Supplies Loot Bag Accessories 24 Pieces per Unit Toys 5 0 0 N Y Fun for birthday party favor I ordered these for my 3 year old son's birthday party as party favors. They were a huge hit & the perfect fit for a 3 year old! 2015-08-31 40 | US 12918717 RZX17JIYIPAR B00KQUNNZ8 644368246 New Age Scare Halloween Party Pumpkin and Bat Hanging Round Lantern Decoration, Paper, 9" Pack of 3 Toys 5 0 0 N N Love the prints! These paper lanterns are adorable! The colors are bright, the patterns are fun & trendy & they're a good size! They came well packaged & are easy to assemble, they're also really easy to take apart & flatten back down! We'll get to use them for a few Halloweens I'm sure! They even came with the string needed to hang! I'm glad I grabbed them, they're really cute, my kids are excited to add to the Halloween decor & are already asking to hang them!
I received this item at a discount for my unbiased review. 2015-08-31 41 | US 47781982 RIDVQ4P3WJR42 B00WTGGGRO 162262449 Pokemon - Double Dragon Energy (97/108) - XY Roaring Skies Toys 5 1 1 N Y Five Stars My Grandson loves these cards. Thank you 2015-08-31 42 | US 34874898 R1WQ3ME3JAG2O1 B00WAKEQLW 824555589 Whiffer Sniffers Mystery Pack 1 Scented Backpack Clip Toys 1 0 6 N Y One Star Received a pineapple rather than the advertised s'more 2015-08-31 43 | US 20962528 RNTPOUDQIICBF B00M5AT30G 548190970 AmiGami Fox and Owl Figure 2-Pack Toys 4 0 0 N Y Four Stars Christmas gift for 6yr 2015-08-31 44 | US 47781982 R3AHZWWOL0IAV0 B00GNDY40U 438056479 Pokemon - Gyarados (31/113) - Legendary Treasures Toys 5 0 0 N Y Five Stars My Grandson loves these cards. Thank you 2015-08-31 45 | US 13328687 R3PDXKS9O2Z20B B00WJ1OPMW 120071056 LeapFrog LeapTV Letter Factory Adventures Educational, Active Video Game Toys 5 0 0 N N they LOVE this game Even though both of my kids are at the top of this age recommendation level, they LOVE this game! I love how it caters to the kinesthetic learner by asking them to move their bodies into the shape of the letters. It even takes teamwork as sometimes two people are required to finish the letter. My kids know all of their letter sounds and shapes, but this didn't stop them from playing the game over and over. 2015-08-31 46 | US 16245463 R23URALWA7IHWL B00IGXV9UI 765869385 Disney Planes: Fire & Rescue Scoop & Spray Firefighter Dusty Toys 5 0 0 N Y Five Stars My 5 year old son loves this. 2015-08-31 47 | US 11916403 R36L8VKT9ZSUY6 B00JVY9J1M 771795950 Winston Zeddmore & Ecto-1: Funko POP! Rides x Ghostbusters Vinyl Figure Toys 5 0 0 N Y Five Stars love it 2015-08-31 48 | US 5543658 R23JRQR6VMY4TV B008AL15M8 211944547 Yu-Gi-Oh! - Solemn Judgment (GLD5-EN045) - Gold Series: Haunted Mine - Limited Edition - Ghost/Gold Hybrid Rare Toys 5 0 0 N Y Absolutely one of the best traps in the game Absolutely one of the best traps in the game. It is never a dead and always live since you can always pay half your lifepoints for its cost. It's main power is that it can stop any card. Hopefully this card comes off the Forbidden/Limited list soon. 2015-08-31 49 | US 41168357 R3T73PQZZ9F6GT B00CAEEDC0 72805974 Seat Pets Car Seat Toy Toys 5 0 0 N Y Five Stars really soft and cute 2015-08-31 50 | US 32866903 R300I65NW30Y19 B000TFLAZA 149264874 Baby Einstein Octoplush Toys 5 0 0 N Y Five Stars baby loved it - so attractive and very nice 2015-08-31 51 | -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/Lab 5 Kinesis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Lab 5 (Part II): Ingesting Streaming Data with Kinesis\n", 8 | "### MACS 30123: Large-Scale Computing for the Social Sciences (Spring 2020)\n", 9 | "\n", 10 | "In this second part of the lab, we'll explore how we can use Kinesis to ingest streaming text data, of the sort we might encounter on Twitter.\n", 11 | "\n", 12 | "To avoid requiring you to set up Twitter API access, we will create Twitter-like text and metadata using the `testdata` package to perform this demonstration. It should be easy enough to plug your streaming Twitter feed into this workflow if you desire to do so as an individual exercise (for instance, as a part of a final project!). 
Additionally, once you have this pipeline running, you can scale it up even further to include many more producers and consumers, if you would like, as discussed in lecture and the readings.\n", 13 | "\n", 14 | "Recall from the lecture and readings that in a Kinesis workflow, \"producers\" send data into a Kinesis stream and \"consumers\" draw data out of that stream to perform operations on it (i.e. real-time processing, archiving raw data, etc.). To make this a bit more concrete, we are going to implement a simplified version of this workflow in this lab, in which we spin up Producer and Consumer (t2.nano) EC2 Instances and create a Kinesis stream. Our Producer instance will run a producer script (which writes our Twitter-like text data into a Kinesis stream) and our Consumer instance will run a consumer script (which reads the Twitter-like data and calculates a simple demo statistic -- the average unique word count per tweet, as a real-time running average).\n", 15 | "\n", 16 | "You can visualize this data pipeline, like so:\n", 17 | "\n", 18 | "\n" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "To begin implementing this pipeline, let's import `boto3` and initialize the AWS services we'll be using in this lab (EC2 and Kinesis)." 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 1, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "import boto3\n", 35 | "import time\n", 36 | "\n", 37 | "session = boto3.Session()\n", 38 | "\n", 39 | "kinesis = session.client('kinesis')\n", 40 | "ec2 = session.resource('ec2')\n", 41 | "ec2_client = session.client('ec2')" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "Then, we need to create the Kinesis stream that our Producer EC2 instance will write streaming tweets to. Because we're only setting this up to handle traffic from one consumer and one producer, we'll just use one shard, but we could increase our throughput capacity by increasing the ShardCount if we wanted to do so." 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 2, 54 | "metadata": {}, 55 | "outputs": [], 56 | "source": [ 57 | "response = kinesis.create_stream(StreamName = 'test_stream',\n", 58 | " ShardCount = 1\n", 59 | " )\n", 60 | "\n", 61 | "# Is the stream active and ready to be written to/read from? Wait until it exists before moving on:\n", 62 | "waiter = kinesis.get_waiter('stream_exists')\n", 63 | "waiter.wait(StreamName='test_stream')" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "OK, now we're ready to set up our producer and consumer EC2 instances that will write to and read from this Kinesis stream. Let's spin up our two EC2 instances (specified by the `MaxCount` parameter) using one of the Amazon Linux AMIs. Notice here that you will need to specify your `.pem` file for the `KeyName` parameter, as well as create a custom security group/group ID. Designating a security group is necessary because, by default, AWS does not allow inbound ssh traffic into EC2 instances (they create custom ssh-friendly security groups each time you run the GUI wizard in the console). Thus, if you don't set this parameter, you will not be able to ssh into the EC2 instances that you create here with `boto3`. 
You can follow along in the lab video for further instructions on how you can set up one of these security groups.\n", 71 | "\n", 72 | "Also, we need to specify an IAM Instance Profile so that our EC2 instances will have the permissions necessary to interact with other AWS services on our behalf. Here, I'm using one of the profiles we create in Part I of Lab 5 (a default AWS profile for launching EC2 instances within an EMR cluster), as this gives us all of the necessary permissions" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 3, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "instances = ec2.create_instances(ImageId='ami-0915e09cc7ceee3ab',\n", 82 | " MinCount=1,\n", 83 | " MaxCount=2,\n", 84 | " InstanceType='t2.micro',\n", 85 | " KeyName='MACS_30123',\n", 86 | " SecurityGroupIds=['sg-02248fb2c9eac8bef'],\n", 87 | " SecurityGroups=['MACS_30123'],\n", 88 | " IamInstanceProfile=\n", 89 | " {'Name': 'EMR_EC2_DefaultRole'},\n", 90 | " )\n", 91 | "\n", 92 | "# Wait until EC2 instances are running before moving on\n", 93 | "waiter = ec2_client.get_waiter('instance_running')\n", 94 | "waiter.wait(InstanceIds=[instance.id for instance in instances])" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "While we wait for these instances to start running, let's set up the Python scripts that we want to run on each instance. First of all, we have to define a script for our Producer instance, which continuously produces Twitter-like data using the `testdata` package and puts that data into our Kinesis stream." 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 4, 107 | "metadata": {}, 108 | "outputs": [ 109 | { 110 | "name": "stdout", 111 | "output_type": "stream", 112 | "text": [ 113 | "Overwriting producer.py\n" 114 | ] 115 | } 116 | ], 117 | "source": [ 118 | "%%file producer.py\n", 119 | "\n", 120 | "import boto3\n", 121 | "import testdata\n", 122 | "import json\n", 123 | "\n", 124 | "kinesis = boto3.client('kinesis', region_name='us-east-1')\n", 125 | "\n", 126 | "# Continously write Twitter-like data into Kinesis stream\n", 127 | "while 1 == 1:\n", 128 | " test_tweet = {'username': testdata.get_username(),\n", 129 | " 'tweet': testdata.get_ascii_words(280)\n", 130 | " }\n", 131 | " kinesis.put_record(StreamName = \"test_stream\",\n", 132 | " Data = json.dumps(test_tweet),\n", 133 | " PartitionKey = \"partitionkey\"\n", 134 | " )" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "Then, we can define a script for our Consumer instance that gets the latest tweet out of the stream, one at a time. After processing each tweet, we then print out the average unique word count per processed tweet as a running average, before jumping on to the next indexed tweet in our Kinesis stream shard to do the same thing for as long as our program is running." 
142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 5, 147 | "metadata": {}, 148 | "outputs": [ 149 | { 150 | "name": "stdout", 151 | "output_type": "stream", 152 | "text": [ 153 | "Overwriting consumer.py\n" 154 | ] 155 | } 156 | ], 157 | "source": [ 158 | "%%file consumer.py\n", 159 | "\n", 160 | "import boto3\n", 161 | "import time\n", 162 | "import json\n", 163 | "\n", 164 | "kinesis = boto3.client('kinesis', region_name='us-east-1')\n", 165 | "\n", 166 | "shard_it = kinesis.get_shard_iterator(StreamName = \"test_stream\",\n", 167 | " ShardId = 'shardId-000000000000',\n", 168 | " ShardIteratorType = 'LATEST'\n", 169 | " )[\"ShardIterator\"]\n", 170 | "\n", 171 | "i = 0\n", 172 | "s = 0\n", 173 | "while 1==1:\n", 174 | " out = kinesis.get_records(ShardIterator = shard_it,\n", 175 | " Limit = 1\n", 176 | " )\n", 177 | " for o in out['Records']:\n", 178 | " jdat = json.loads(o['Data'])\n", 179 | " s = s + len(set(jdat['tweet'].split()))\n", 180 | " i = i + 1\n", 181 | "\n", 182 | " if i != 0:\n", 183 | " print(\"Average Unique Word Count Per Tweet: \" + str(s/i))\n", 184 | " print(\"Sample of Current Tweet: \" + jdat['tweet'][:20])\n", 185 | " print(\"\\n\")\n", 186 | " \n", 187 | " shard_it = out['NextShardIterator']\n", 188 | " time.sleep(0.2)" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "As our final preparation step, we'll grab all of the public DNS names of our instances (web addresses that you normally copy from the GUI console to manually ssh into and record the names of our code files, so that we can easily ssh/scp into the instances and pass them our Python scripts to run." 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 6, 201 | "metadata": {}, 202 | "outputs": [], 203 | "source": [ 204 | "instance_dns = [instance.public_dns_name \n", 205 | " for instance in ec2.instances.all() \n", 206 | " if instance.state['Name'] == 'running'\n", 207 | " ]\n", 208 | "\n", 209 | "code = ['producer.py', 'consumer.py']" 210 | ] 211 | }, 212 | { 213 | "cell_type": "markdown", 214 | "metadata": {}, 215 | "source": [ 216 | "To copy our files over to our instances and programmatically run commands on them, we can use Python's `scp` and `paramiko` packages. You'll need to install these via `pip install paramiko scp` if you have not already done so." 
217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": 7, 222 | "metadata": {}, 223 | "outputs": [ 224 | { 225 | "name": "stdout", 226 | "output_type": "stream", 227 | "text": [ 228 | "Requirement already satisfied: paramiko in /home/jclindaniel/anaconda3/lib/python3.7/site-packages (2.7.1)\n", 229 | "Requirement already satisfied: scp in /home/jclindaniel/anaconda3/lib/python3.7/site-packages (0.13.2)\n", 230 | "Requirement already satisfied: bcrypt>=3.1.3 in /home/jclindaniel/anaconda3/lib/python3.7/site-packages (from paramiko) (3.1.7)\n", 231 | "Requirement already satisfied: pynacl>=1.0.1 in /home/jclindaniel/anaconda3/lib/python3.7/site-packages (from paramiko) (1.3.0)\n", 232 | "Requirement already satisfied: cryptography>=2.5 in /home/jclindaniel/anaconda3/lib/python3.7/site-packages (from paramiko) (2.8)\n", 233 | "Requirement already satisfied: six>=1.4.1 in /home/jclindaniel/anaconda3/lib/python3.7/site-packages (from bcrypt>=3.1.3->paramiko) (1.14.0)\n", 234 | "Requirement already satisfied: cffi>=1.1 in /home/jclindaniel/anaconda3/lib/python3.7/site-packages (from bcrypt>=3.1.3->paramiko) (1.14.0)\n", 235 | "Requirement already satisfied: pycparser in /home/jclindaniel/anaconda3/lib/python3.7/site-packages (from cffi>=1.1->bcrypt>=3.1.3->paramiko) (2.19)\n" 236 | ] 237 | } 238 | ], 239 | "source": [ 240 | "! pip install paramiko scp" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "Once we have `scp` and `paramiko` installed, we can copy our producer and consumer Python scripts over to the EC2 instances (designating our first EC2 instance in `instance_dns` as the producer and second EC2 instance as the consumer instance).\n", 248 | "\n", 249 | "Note that, on each instance, we install `boto3` (so that we can access Kinesis through our scripts) and then copy our producer/consumer Python code over to our producer/consumer EC2 instance via `scp`. After we've done this, we install the `testdata` package on the producer instance (which it needs in order to create fake tweets) and instruct it to run our Python producer script. This will write tweets into our Kinesis stream until we stop the script and terminate the producer EC2 instance.\n", 250 | "\n", 251 | "We could also instruct our consumer to get tweets from the stream immediately after this command and this would automatically collect and process the tweets according to the consumer.py script. For the purposes of this demonstration, though, we'll manually ssh into that instance and run the code from the terminal so that we can see the real-time consumption a bit more easily." 
252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": 7, 257 | "metadata": {}, 258 | "outputs": [ 259 | { 260 | "name": "stdout", 261 | "output_type": "stream", 262 | "text": [ 263 | "Producer Instance is Running producer.py\n", 264 | ".........................................\n", 265 | "Connect to Consumer Instance by running: ssh -i \"MACS_30123.pem\" ec2-user@ec2-52-207-178-178.compute-1.amazonaws.com\n" 266 | ] 267 | } 268 | ], 269 | "source": [ 270 | "import paramiko\n", 271 | "from scp import SCPClient\n", 272 | "ssh_producer, ssh_consumer = paramiko.SSHClient(), paramiko.SSHClient()\n", 273 | "\n", 274 | "# Initialization of SSH tunnels takes a bit of time; otherwise get connection error on first attempt\n", 275 | "time.sleep(5)\n", 276 | "\n", 277 | "# Install boto3 on each EC2 instance and Copy our producer/consumer code onto producer/consumer EC2 instances\n", 278 | "instance = 0\n", 279 | "stdin, stdout, stderr = [[None, None] for i in range(3)]\n", 280 | "for ssh in [ssh_producer, ssh_consumer]:\n", 281 | " ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())\n", 282 | " ssh.connect(instance_dns[instance],\n", 283 | " username = 'ec2-user',\n", 284 | " key_filename='/home/jclindaniel/MACS_30123.pem')\n", 285 | " \n", 286 | " with SCPClient(ssh.get_transport()) as scp:\n", 287 | " scp.put(code[instance])\n", 288 | " \n", 289 | " if instance == 0:\n", 290 | " stdin[instance], stdout[instance], stderr[instance] = \\\n", 291 | " ssh.exec_command(\"sudo pip install boto3 testdata\")\n", 292 | " else:\n", 293 | " stdin[instance], stdout[instance], stderr[instance] = \\\n", 294 | " ssh.exec_command(\"sudo pip install boto3\")\n", 295 | "\n", 296 | " instance += 1\n", 297 | "\n", 298 | "# Block until Producer has installed boto3 and testdata, then start running Producer script:\n", 299 | "producer_exit_status = stdout[0].channel.recv_exit_status() \n", 300 | "if producer_exit_status == 0:\n", 301 | " ssh_producer.exec_command(\"python %s\" % code[0])\n", 302 | " print(\"Producer Instance is Running producer.py\\n.........................................\")\n", 303 | "else:\n", 304 | " print(\"Error\", producer_exit_status)\n", 305 | "\n", 306 | "# Close ssh and show connection instructions for manual access to Consumer Instance\n", 307 | "ssh_consumer.close; ssh_producer.close()\n", 308 | "\n", 309 | "print(\"Connect to Consumer Instance by running: ssh -i \\\"MACS_30123.pem\\\" ec2-user@%s\" % instance_dns[1])" 310 | ] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "metadata": {}, 315 | "source": [ 316 | "If you run the command above (with the correct path to your actual `.pem` file), you should be inside your Consumer EC2 instance. If you run `python consumer.py`, you should also see a real-time count of the average number of unique words per tweet (along with a sample of the text in the most recent tweet), as in the screenshot:\n", 317 | "\n", 318 | "![](consumer_feed.png)\n", 319 | "\n", 320 | "Cool! Now we can scale this basic architecture up to perform any number of real-time data analyses, if we so desire. Also, if we execute our consumer code remotely via paramiko as well, the process will be entirely remote, so we don't need to keep any local resources running in order to keep streaming/processing real-time data.\n", 321 | "\n", 322 | "As a final note, when you are finished observing the real-time feed from your consumer instance, **be sure to terminate your EC2 instances and delete your Kinesis stream**. 
You don't want to be paying for these to run continuously! You can do so programmatically by running the following `boto3` code:" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": 8, 328 | "metadata": {}, 329 | "outputs": [ 330 | { 331 | "name": "stdout", 332 | "output_type": "stream", 333 | "text": [ 334 | "EC2 Instances Successfully Terminated\n", 335 | "Kinesis Stream Successfully Deleted\n" 336 | ] 337 | } 338 | ], 339 | "source": [ 340 | "# Terminate EC2 Instances:\n", 341 | "ec2_client.terminate_instances(InstanceIds=[instance.id for instance in instances])\n", 342 | "\n", 343 | "# Confirm that EC2 instances were terminated:\n", 344 | "waiter = ec2_client.get_waiter('instance_terminated')\n", 345 | "waiter.wait(InstanceIds=[instance.id for instance in instances])\n", 346 | "print(\"EC2 Instances Successfully Terminated\")\n", 347 | "\n", 348 | "# Delete Kinesis Stream (if it currently exists):\n", 349 | "try:\n", 350 | " response = kinesis.delete_stream(StreamName='test_stream')\n", 351 | "except kinesis.exceptions.ResourceNotFoundException:\n", 352 | " pass\n", 353 | "\n", 354 | "# Confirm that Kinesis Stream was deleted:\n", 355 | "waiter = kinesis.get_waiter('stream_not_exists')\n", 356 | "waiter.wait(StreamName='test_stream')\n", 357 | "print(\"Kinesis Stream Successfully Deleted\")" 358 | ] 359 | } 360 | ], 361 | "metadata": { 362 | "kernelspec": { 363 | "display_name": "Python 3", 364 | "language": "python", 365 | "name": "python3" 366 | }, 367 | "language_info": { 368 | "codemirror_mode": { 369 | "name": "ipython", 370 | "version": 3 371 | }, 372 | "file_extension": ".py", 373 | "mimetype": "text/x-python", 374 | "name": "python", 375 | "nbconvert_exporter": "python", 376 | "pygments_lexer": "ipython3", 377 | "version": "3.7.3" 378 | } 379 | }, 380 | "nbformat": 4, 381 | "nbformat_minor": 4 382 | } 383 | -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/consumer.py: -------------------------------------------------------------------------------- 1 | 2 | import boto3 3 | import time 4 | import json 5 | 6 | kinesis = boto3.client('kinesis', region_name='us-east-1') 7 | 8 | shard_it = kinesis.get_shard_iterator(StreamName = "test_stream", 9 | ShardId = 'shardId-000000000000', 10 | ShardIteratorType = 'LATEST' 11 | )["ShardIterator"] 12 | 13 | i = 0 14 | s = 0 15 | while 1==1: 16 | out = kinesis.get_records(ShardIterator = shard_it, 17 | Limit = 1 18 | ) 19 | for o in out['Records']: 20 | jdat = json.loads(o['Data']) 21 | s = s + len(set(jdat['tweet'].split())) 22 | i = i + 1 23 | 24 | if i != 0: 25 | print("Average Unique Word Count Per Tweet: " + str(s/i)) 26 | print("Sample of Current Tweet: " + jdat['tweet'][:20]) 27 | print("\n") 28 | 29 | shard_it = out['NextShardIterator'] 30 | time.sleep(0.2) 31 | -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/consumer_feed.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jonclindaniel/LargeScaleComputing_S20/b1474da32282eb72fdc8206ad16a76f23e2e5c4a/Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/consumer_feed.png -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/producer.py: 
-------------------------------------------------------------------------------- 1 | 2 | import boto3 3 | import testdata 4 | import json 5 | 6 | kinesis = boto3.client('kinesis', region_name='us-east-1') 7 | 8 | # Continously write Twitter-like data into Kinesis stream 9 | while 1 == 1: 10 | test_tweet = {'username': testdata.get_username(), 11 | 'tweet': testdata.get_ascii_words(280) 12 | } 13 | kinesis.put_record(StreamName = "test_stream", 14 | Data = json.dumps(test_tweet), 15 | PartitionKey = "partitionkey" 16 | ) 17 | -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/simple_kinesis_architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jonclindaniel/LargeScaleComputing_S20/b1474da32282eb72fdc8206ad16a76f23e2e5c4a/Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/simple_kinesis_architecture.png -------------------------------------------------------------------------------- /Labs/Lab 6 PySpark EDA and ML in an EMR Notebook/Lab_6.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Lab 6: Exploratory Data Analysis and Machine Learning in an EMR Notebook\n", 8 | "\n", 9 | "Today, we'll explore how you can use your PySpark coding skills in an [EMR Notebook](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks.html) on AWS. Specifically, we'll be working with all of the customer reviews for books in [Amazon's large customer reviews dataset on S3](https://s3.amazonaws.com/amazon-reviews-pds/readme.html). Note that this notebook is meant to be run on an EMR cluster, using a PySpark kernel, as demonstrated in the Lab video.\n", 10 | "\n", 11 | "First, let's load the customer book reviews data from S3 into our Spark session. The books data is spread across multiple parquet files (as we can see via the AWS CLI below), so we use the wildcard (\\*) to indicate that we want the data from all of these files to be included within our dataframe, spread out over our EMR cluster in partitions." 
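(A hedged aside, not a cell from the original notebook: one quick way to see the "spread out over partitions" point in practice is to ask Spark how many partitions it created once the DataFrame is loaded. `rdd.getNumPartitions()` is a standard PySpark call; the S3 path is the same public Books prefix used in this lab, and the variable name `data` matches the cell where the notebook actually loads the DataFrame.)

```
# Sketch only -- assumes the EMR PySpark kernel's pre-built `spark` session, as in this notebook
data = spark.read.parquet('s3://amazon-reviews-pds/parquet/product_category=Books/*.parquet')

# Each partition is processed independently by executors across the EMR cluster
print('Number of partitions:', data.rdd.getNumPartitions())
```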
12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 1, 17 | "metadata": {}, 18 | "outputs": [ 19 | { 20 | "name": "stdout", 21 | "output_type": "stream", 22 | "text": [ 23 | "2018-04-09 06:35:58 1.0 GiB part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet\n", 24 | "2018-04-09 06:35:59 1.0 GiB part-00001-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet\n", 25 | "2018-04-09 06:36:00 1.0 GiB part-00002-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet\n", 26 | "2018-04-09 06:36:00 1.0 GiB part-00003-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet\n", 27 | "2018-04-09 06:36:00 1.0 GiB part-00004-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet\n", 28 | "2018-04-09 06:36:33 1.0 GiB part-00005-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet\n", 29 | "2018-04-09 06:36:35 1.0 GiB part-00006-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet\n", 30 | "2018-04-09 06:36:35 1.0 GiB part-00007-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet\n", 31 | "2018-04-09 06:36:35 1.0 GiB part-00008-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet\n", 32 | "2018-04-09 06:36:35 1.0 GiB part-00009-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet\n", 33 | "\n", 34 | "Total Objects: 10\n", 35 | " Total Size: 10.2 GiB\n" 36 | ] 37 | } 38 | ], 39 | "source": [ 40 | "%%bash \n", 41 | "aws s3 ls s3://amazon-reviews-pds/parquet/product_category=Books/ --summarize --human-readable" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 2, 47 | "metadata": {}, 48 | "outputs": [ 49 | { 50 | "data": { 51 | "application/vnd.jupyter.widget-view+json": { 52 | "model_id": "6fc7ede6213243089f51b4bd4cd1ce5f", 53 | "version_major": 2, 54 | "version_minor": 0 55 | }, 56 | "text/plain": [ 57 | "VBox()" 58 | ] 59 | }, 60 | "metadata": {}, 61 | "output_type": "display_data" 62 | }, 63 | { 64 | "name": "stdout", 65 | "output_type": "stream", 66 | "text": [ 67 | "Starting Spark application\n" 68 | ] 69 | }, 70 | { 71 | "data": { 72 | "text/html": [ 73 | "\n", 74 | "
ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session?
0 | application_1589820935408_0001 | pyspark | idle | Link | Link |
" 75 | ], 76 | "text/plain": [ 77 | "" 78 | ] 79 | }, 80 | "metadata": {}, 81 | "output_type": "display_data" 82 | }, 83 | { 84 | "data": { 85 | "application/vnd.jupyter.widget-view+json": { 86 | "model_id": "", 87 | "version_major": 2, 88 | "version_minor": 0 89 | }, 90 | "text/plain": [ 91 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 92 | ] 93 | }, 94 | "metadata": {}, 95 | "output_type": "display_data" 96 | }, 97 | { 98 | "name": "stdout", 99 | "output_type": "stream", 100 | "text": [ 101 | "SparkSession available as 'spark'.\n" 102 | ] 103 | }, 104 | { 105 | "data": { 106 | "application/vnd.jupyter.widget-view+json": { 107 | "model_id": "", 108 | "version_major": 2, 109 | "version_minor": 0 110 | }, 111 | "text/plain": [ 112 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 113 | ] 114 | }, 115 | "metadata": {}, 116 | "output_type": "display_data" 117 | } 118 | ], 119 | "source": [ 120 | "data = spark.read.parquet('s3://amazon-reviews-pds/parquet/product_category=Books/*.parquet')" 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": {}, 126 | "source": [ 127 | "Now that we have our data loaded in a DataFrame, let's take a look at its structure and contents. We can see that there is a lot of data here (10 GB worth!), even in this small subset of the overall (~50 GB) Amazon Customer Reviews dataset." 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 3, 133 | "metadata": {}, 134 | "outputs": [ 135 | { 136 | "data": { 137 | "application/vnd.jupyter.widget-view+json": { 138 | "model_id": "b571022b0d504d04baef37ff63f163b6", 139 | "version_major": 2, 140 | "version_minor": 0 141 | }, 142 | "text/plain": [ 143 | "VBox()" 144 | ] 145 | }, 146 | "metadata": {}, 147 | "output_type": "display_data" 148 | }, 149 | { 150 | "data": { 151 | "application/vnd.jupyter.widget-view+json": { 152 | "model_id": "", 153 | "version_major": 2, 154 | "version_minor": 0 155 | }, 156 | "text/plain": [ 157 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 158 | ] 159 | }, 160 | "metadata": {}, 161 | "output_type": "display_data" 162 | }, 163 | { 164 | "name": "stdout", 165 | "output_type": "stream", 166 | "text": [ 167 | "Total Columns: 15\n", 168 | "Total Rows: 20726160\n", 169 | "root\n", 170 | " |-- marketplace: string (nullable = true)\n", 171 | " |-- customer_id: string (nullable = true)\n", 172 | " |-- review_id: string (nullable = true)\n", 173 | " |-- product_id: string (nullable = true)\n", 174 | " |-- product_parent: string (nullable = true)\n", 175 | " |-- product_title: string (nullable = true)\n", 176 | " |-- star_rating: integer (nullable = true)\n", 177 | " |-- helpful_votes: integer (nullable = true)\n", 178 | " |-- total_votes: integer (nullable = true)\n", 179 | " |-- vine: string (nullable = true)\n", 180 | " |-- verified_purchase: string (nullable = true)\n", 181 | " |-- review_headline: string (nullable = true)\n", 182 | " |-- review_body: string (nullable = true)\n", 183 | " |-- review_date: date (nullable = true)\n", 184 | " |-- year: integer (nullable = true)" 185 | ] 186 | } 187 | ], 188 | "source": [ 189 | "print('Total Columns: %d' % len(data.dtypes))\n", 190 | "print('Total Rows: %d' % data.count())\n", 191 | "data.printSchema()" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "We can 
take a look at (preview) data from all of our columns at once (using `data.show()`), but this is a bit busy on the screen, so let's select a subset of the columns and take a look at the first 20 rows of our data." 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": 4, 204 | "metadata": {}, 205 | "outputs": [ 206 | { 207 | "data": { 208 | "application/vnd.jupyter.widget-view+json": { 209 | "model_id": "a2769ba2f97e4ea9a1da43a4e455dabc", 210 | "version_major": 2, 211 | "version_minor": 0 212 | }, 213 | "text/plain": [ 214 | "VBox()" 215 | ] 216 | }, 217 | "metadata": {}, 218 | "output_type": "display_data" 219 | }, 220 | { 221 | "data": { 222 | "application/vnd.jupyter.widget-view+json": { 223 | "model_id": "", 224 | "version_major": 2, 225 | "version_minor": 0 226 | }, 227 | "text/plain": [ 228 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 229 | ] 230 | }, 231 | "metadata": {}, 232 | "output_type": "display_data" 233 | }, 234 | { 235 | "name": "stdout", 236 | "output_type": "stream", 237 | "text": [ 238 | "+--------------------+-----------+-----------+\n", 239 | "| product_title|total_votes|star_rating|\n", 240 | "+--------------------+-----------+-----------+\n", 241 | "|Standing Qigong f...| 10| 5|\n", 242 | "|A Universe from N...| 7| 4|\n", 243 | "|Hyacinth Girls: A...| 0| 4|\n", 244 | "| Bared to You| 1| 5|\n", 245 | "| Healer: A Novel| 0| 5|\n", 246 | "|The Missionary Po...| 7| 4|\n", 247 | "|I'm Tired of Bein...| 1| 4|\n", 248 | "|Fifty Shades of G...| 7| 1|\n", 249 | "|The Thrill of Vic...| 0| 4|\n", 250 | "|Fifty Shades of G...| 9| 5|\n", 251 | "|Romeo and Juliet ...| 0| 4|\n", 252 | "|Wheat Belly: Lose...| 1| 5|\n", 253 | "|Dangerous Dessert...| 0| 5|\n", 254 | "|Consciousness Bey...| 0| 5|\n", 255 | "|The Catcher in th...| 6| 1|\n", 256 | "|Fearless: The Und...| 0| 4|\n", 257 | "|Best-Ever Big Sister| 0| 5|\n", 258 | "| The Book Thief| 1| 5|\n", 259 | "| Large Print Sudoku| 1| 5|\n", 260 | "|Wild: From Lost t...| 1| 5|\n", 261 | "+--------------------+-----------+-----------+\n", 262 | "only showing top 20 rows" 263 | ] 264 | } 265 | ], 266 | "source": [ 267 | "data[['product_title', 'total_votes', 'star_rating']].show()" 268 | ] 269 | }, 270 | { 271 | "cell_type": "markdown", 272 | "metadata": {}, 273 | "source": [ 274 | "Using PySpark, we can calculate any number of additional summary statistics using standard SQL-like operations on our DataFrame." 
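(As a minimal sketch, not a cell from the original notebook: the same "SQL-like" aggregation can also be written as literal SQL by registering the DataFrame as a temporary view. `createOrReplaceTempView` and `spark.sql` are standard PySpark APIs.)

```
# Register the reviews DataFrame so it can be queried with plain SQL
data.createOrReplaceTempView("reviews")

# Equivalent of the groupBy/sum/sort shown in the next cell
spark.sql("""
    SELECT star_rating,
           SUM(total_votes)   AS sum_total_votes,
           SUM(helpful_votes) AS sum_helpful_votes
    FROM reviews
    GROUP BY star_rating
    ORDER BY star_rating DESC
""").show()
```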
275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": 5, 280 | "metadata": {}, 281 | "outputs": [ 282 | { 283 | "data": { 284 | "application/vnd.jupyter.widget-view+json": { 285 | "model_id": "136da8cfe79b4e12acfdcc4b1f01b7d2", 286 | "version_major": 2, 287 | "version_minor": 0 288 | }, 289 | "text/plain": [ 290 | "VBox()" 291 | ] 292 | }, 293 | "metadata": {}, 294 | "output_type": "display_data" 295 | }, 296 | { 297 | "data": { 298 | "application/vnd.jupyter.widget-view+json": { 299 | "model_id": "", 300 | "version_major": 2, 301 | "version_minor": 0 302 | }, 303 | "text/plain": [ 304 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 305 | ] 306 | }, 307 | "metadata": {}, 308 | "output_type": "display_data" 309 | }, 310 | { 311 | "name": "stdout", 312 | "output_type": "stream", 313 | "text": [ 314 | "+-----------+----------------+------------------+\n", 315 | "|star_rating|sum(total_votes)|sum(helpful_votes)|\n", 316 | "+-----------+----------------+------------------+\n", 317 | "| 5| 54808322| 44825468|\n", 318 | "| 4| 13954363| 11100563|\n", 319 | "| 3| 10117356| 7021927|\n", 320 | "| 2| 9010874| 5581929|\n", 321 | "| 1| 22624009| 10985502|\n", 322 | "+-----------+----------------+------------------+" 323 | ] 324 | } 325 | ], 326 | "source": [ 327 | "stars_votes = (data.groupBy('star_rating')\n", 328 | " .sum('total_votes', 'helpful_votes')\n", 329 | " .sort('star_rating', ascending=False)\n", 330 | " )\n", 331 | "stars_votes.show()" 332 | ] 333 | }, 334 | { 335 | "cell_type": "markdown", 336 | "metadata": {}, 337 | "source": [ 338 | "We can even plot the results of our summarizations using standard Python packages like Pandas and Seaborn. To do so within an EMR notebook, we will need to install these packages. If we list the packages currently available to us in our SparkContext, you can see that these packages are not already installed, so we will need to install them." 
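(A small hedged note, not in the original notebook: `sc.install_pypi_package` also accepts a pinned requirement string if you need a specific version on the cluster. The versions below simply mirror the ones that end up installed in the pip output further down; treat them as illustrative.)

```
# Notebook-scoped installs on the EMR cluster; pinned versions are illustrative only
sc.install_pypi_package("pandas==1.0.3")
sc.install_pypi_package("seaborn==0.10.1")
sc.list_packages()  # confirm the packages are now visible to the Spark session
```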
339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": 6, 344 | "metadata": {}, 345 | "outputs": [ 346 | { 347 | "data": { 348 | "application/vnd.jupyter.widget-view+json": { 349 | "model_id": "e0e469b8a7db453fbf68453f8d83b3a0", 350 | "version_major": 2, 351 | "version_minor": 0 352 | }, 353 | "text/plain": [ 354 | "VBox()" 355 | ] 356 | }, 357 | "metadata": {}, 358 | "output_type": "display_data" 359 | }, 360 | { 361 | "data": { 362 | "application/vnd.jupyter.widget-view+json": { 363 | "model_id": "", 364 | "version_major": 2, 365 | "version_minor": 0 366 | }, 367 | "text/plain": [ 368 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 369 | ] 370 | }, 371 | "metadata": {}, 372 | "output_type": "display_data" 373 | }, 374 | { 375 | "name": "stdout", 376 | "output_type": "stream", 377 | "text": [ 378 | "Package Version\n", 379 | "-------------------------- -------\n", 380 | "beautifulsoup4 4.9.0 \n", 381 | "boto 2.49.0 \n", 382 | "jmespath 0.9.5 \n", 383 | "lxml 4.5.0 \n", 384 | "mysqlclient 1.4.2 \n", 385 | "nltk 3.4.5 \n", 386 | "nose 1.3.4 \n", 387 | "numpy 1.16.5 \n", 388 | "pip 9.0.1 \n", 389 | "py-dateutil 2.2 \n", 390 | "python37-sagemaker-pyspark 1.3.0 \n", 391 | "pytz 2019.3 \n", 392 | "PyYAML 5.3.1 \n", 393 | "setuptools 28.8.0 \n", 394 | "six 1.13.0 \n", 395 | "soupsieve 1.9.5 \n", 396 | "wheel 0.29.0 \n", 397 | "windmill 1.6" 398 | ] 399 | } 400 | ], 401 | "source": [ 402 | "sc.list_packages()" 403 | ] 404 | }, 405 | { 406 | "cell_type": "code", 407 | "execution_count": 7, 408 | "metadata": {}, 409 | "outputs": [ 410 | { 411 | "data": { 412 | "application/vnd.jupyter.widget-view+json": { 413 | "model_id": "d78e1df3176d4a7696e100237e74bb74", 414 | "version_major": 2, 415 | "version_minor": 0 416 | }, 417 | "text/plain": [ 418 | "VBox()" 419 | ] 420 | }, 421 | "metadata": {}, 422 | "output_type": "display_data" 423 | }, 424 | { 425 | "data": { 426 | "application/vnd.jupyter.widget-view+json": { 427 | "model_id": "", 428 | "version_major": 2, 429 | "version_minor": 0 430 | }, 431 | "text/plain": [ 432 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 433 | ] 434 | }, 435 | "metadata": {}, 436 | "output_type": "display_data" 437 | }, 438 | { 439 | "name": "stdout", 440 | "output_type": "stream", 441 | "text": [ 442 | "Collecting seaborn\n", 443 | " Downloading https://files.pythonhosted.org/packages/c7/e6/54aaaafd0b87f51dfba92ba73da94151aa3bc179e5fe88fc5dfb3038e860/seaborn-0.10.1-py3-none-any.whl (215kB)\n", 444 | "Collecting pandas>=0.22.0 (from seaborn)\n", 445 | " Downloading https://files.pythonhosted.org/packages/4a/6a/94b219b8ea0f2d580169e85ed1edc0163743f55aaeca8a44c2e8fc1e344e/pandas-1.0.3-cp37-cp37m-manylinux1_x86_64.whl (10.0MB)\n", 446 | "Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib64/python3.7/site-packages (from seaborn)\n", 447 | "Collecting scipy>=1.0.1 (from seaborn)\n", 448 | " Downloading https://files.pythonhosted.org/packages/dd/82/c1fe128f3526b128cfd185580ba40d01371c5d299fcf7f77968e22dfcc2e/scipy-1.4.1-cp37-cp37m-manylinux1_x86_64.whl (26.1MB)\n", 449 | "Collecting matplotlib>=2.1.2 (from seaborn)\n", 450 | " Downloading https://files.pythonhosted.org/packages/b2/c2/71fcf957710f3ba1f09088b35776a799ba7dd95f7c2b195ec800933b276b/matplotlib-3.2.1-cp37-cp37m-manylinux1_x86_64.whl (12.4MB)\n", 451 | "Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/site-packages (from 
pandas>=0.22.0->seaborn)\n", 452 | "Collecting python-dateutil>=2.6.1 (from pandas>=0.22.0->seaborn)\n", 453 | " Downloading https://files.pythonhosted.org/packages/d4/70/d60450c3dd48ef87586924207ae8907090de0b306af2bce5d134d78615cb/python_dateutil-2.8.1-py2.py3-none-any.whl (227kB)\n", 454 | "Collecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 (from matplotlib>=2.1.2->seaborn)\n", 455 | " Downloading https://files.pythonhosted.org/packages/8a/bb/488841f56197b13700afd5658fc279a2025a39e22449b7cf29864669b15d/pyparsing-2.4.7-py2.py3-none-any.whl (67kB)\n", 456 | "Collecting cycler>=0.10 (from matplotlib>=2.1.2->seaborn)\n", 457 | " Downloading https://files.pythonhosted.org/packages/f7/d2/e07d3ebb2bd7af696440ce7e754c59dd546ffe1bbe732c8ab68b9c834e61/cycler-0.10.0-py2.py3-none-any.whl\n", 458 | "Collecting kiwisolver>=1.0.1 (from matplotlib>=2.1.2->seaborn)\n", 459 | " Downloading https://files.pythonhosted.org/packages/31/b9/6202dcae729998a0ade30e80ac00f616542ef445b088ec970d407dfd41c0/kiwisolver-1.2.0-cp37-cp37m-manylinux1_x86_64.whl (88kB)\n", 460 | "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil>=2.6.1->pandas>=0.22.0->seaborn)\n", 461 | "Installing collected packages: python-dateutil, pandas, scipy, pyparsing, cycler, kiwisolver, matplotlib, seaborn\n", 462 | "Successfully installed cycler-0.10.0 kiwisolver-1.2.0 matplotlib-3.2.1 pandas-1.0.3 pyparsing-2.4.7 python-dateutil-2.8.1 scipy-1.4.1 seaborn-0.10.1\n", 463 | "\n", 464 | "Requirement already satisfied: pandas in /mnt/tmp/1589821122437-0/lib/python3.7/site-packages\n", 465 | "Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/site-packages (from pandas)\n", 466 | "Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib64/python3.7/site-packages (from pandas)\n", 467 | "Requirement already satisfied: python-dateutil>=2.6.1 in /mnt/tmp/1589821122437-0/lib/python3.7/site-packages (from pandas)\n", 468 | "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil>=2.6.1->pandas)" 469 | ] 470 | } 471 | ], 472 | "source": [ 473 | "sc.install_pypi_package(\"seaborn\")\n", 474 | "sc.install_pypi_package(\"pandas\")" 475 | ] 476 | }, 477 | { 478 | "cell_type": "code", 479 | "execution_count": 8, 480 | "metadata": {}, 481 | "outputs": [ 482 | { 483 | "data": { 484 | "application/vnd.jupyter.widget-view+json": { 485 | "model_id": "fa2b9668a2974c4da8d1e69e5c8a4918", 486 | "version_major": 2, 487 | "version_minor": 0 488 | }, 489 | "text/plain": [ 490 | "VBox()" 491 | ] 492 | }, 493 | "metadata": {}, 494 | "output_type": "display_data" 495 | }, 496 | { 497 | "data": { 498 | "application/vnd.jupyter.widget-view+json": { 499 | "model_id": "", 500 | "version_major": 2, 501 | "version_minor": 0 502 | }, 503 | "text/plain": [ 504 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 505 | ] 506 | }, 507 | "metadata": {}, 508 | "output_type": "display_data" 509 | }, 510 | { 511 | "name": "stdout", 512 | "output_type": "stream", 513 | "text": [ 514 | "Package Version\n", 515 | "-------------------------- -------\n", 516 | "beautifulsoup4 4.9.0 \n", 517 | "boto 2.49.0 \n", 518 | "cycler 0.10.0 \n", 519 | "jmespath 0.9.5 \n", 520 | "kiwisolver 1.2.0 \n", 521 | "lxml 4.5.0 \n", 522 | "matplotlib 3.2.1 \n", 523 | "mysqlclient 1.4.2 \n", 524 | "nltk 3.4.5 \n", 525 | "nose 1.3.4 \n", 526 | "numpy 1.16.5 \n", 527 | "pandas 1.0.3 \n", 528 | "pip 9.0.1 \n", 529 | "py-dateutil 
2.2 \n", 530 | "pyparsing 2.4.7 \n", 531 | "python-dateutil 2.8.1 \n", 532 | "python37-sagemaker-pyspark 1.3.0 \n", 533 | "pytz 2019.3 \n", 534 | "PyYAML 5.3.1 \n", 535 | "scipy 1.4.1 \n", 536 | "seaborn 0.10.1 \n", 537 | "setuptools 28.8.0 \n", 538 | "six 1.13.0 \n", 539 | "soupsieve 1.9.5 \n", 540 | "wheel 0.29.0 \n", 541 | "windmill 1.6" 542 | ] 543 | } 544 | ], 545 | "source": [ 546 | "sc.list_packages()" 547 | ] 548 | }, 549 | { 550 | "cell_type": "markdown", 551 | "metadata": {}, 552 | "source": [ 553 | "Now that our packages have been successfully installed, let's plot some of our data. Note, though, that because our data is so large, we do not want to send all of it into a Pandas DataFrame-- we want to take advantage of the distributed nature of our Spark DataFrame to perform any computations and then send only small segments of our data into Pandas and our plotting libraries, so that they will be able to handle it in-memory. Here, we use the `toPandas()` method to only send the small dataframe that summarizes the total number of votes for each star rating -- all of the computation was done on the Spark DataFrame.\n", 554 | "\n", 555 | "It looks like 5 star reviews had a lot more \"helpful\" votes than all of the other star ratings." 556 | ] 557 | }, 558 | { 559 | "cell_type": "code", 560 | "execution_count": 9, 561 | "metadata": {}, 562 | "outputs": [ 563 | { 564 | "data": { 565 | "application/vnd.jupyter.widget-view+json": { 566 | "model_id": "c0dbcdefc8e84b9f9b7d5dbff8f5f061", 567 | "version_major": 2, 568 | "version_minor": 0 569 | }, 570 | "text/plain": [ 571 | "VBox()" 572 | ] 573 | }, 574 | "metadata": {}, 575 | "output_type": "display_data" 576 | }, 577 | { 578 | "data": { 579 | "application/vnd.jupyter.widget-view+json": { 580 | "model_id": "", 581 | "version_major": 2, 582 | "version_minor": 0 583 | }, 584 | "text/plain": [ 585 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 586 | ] 587 | }, 588 | "metadata": {}, 589 | "output_type": "display_data" 590 | }, 591 | { 592 | "data": { 593 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAYAAAA10dzkAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAAPYQAAD2EBqD+naQAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAgAElEQVR4nO3de1TUdeL/8deIAsrNBMQ1EdwUyxAkb5FdvBu2FrZ5WpdWtN12KzWVtTzU5qWjXyxXMy+RXy3RVpfNe79as/Qr2Jr2VczU3MwMhZKiUrloTsDM749+za9ZrFCBz3x8Px/ncE7zmeHjy+a75/vsMzPgcLvdbgEAAMAYTaweAAAAgMZFAAIAABiGAAQAADAMAQgAAGAYAhAAAMAwBCAAAIBhCEAAAADDEIAAAACGIQABAAAMQwACAAAYhgAEAAAwDAEIAABgGAIQAADAMAQgAACAYQhAAAAAwxCAAAAAhiEAAQAADEMAAgAAGIYABAAAMAwBCAAAYBgCEAAAwDAEIAAAgGEIQAAAAMMQgAAAAIYhAAEAAAxDAAIAABiGAAQAADAMAQgAAGAYAhAAAMAwBCAAAIBhCEAAAADDEIAAAACGIQABAAAMQwACAAAYhgAEAAAwDAEIAABgGAIQAADAMAQgAACAYQhAAAAAwxCAAAAAhiEAAQAADEMAAgAAGIYArCc7duzQsGHD1LZtWzkcDm3cuPGivn/69OlyOBy1voKCghpoMQAAMBUBWE/Onj2rxMRELV68+JK+f/LkySopKfH66tKli0aMGFHPSwEAgOkIwHqSkpKimTNnavjw4Re83+l0avLkybr66qsVFBSk3r17Ky8vz3N/cHCw2rRp4/n64osvdPjwYf3+979vpL8BAAAwBQHYSMaNG6ddu3YpNzdXBw4c0IgRI3T77bfr6NGjF3z8smXLFBcXp1tuuaWRlwIAgCsdAdgIioqKtHz5cq1Zs0a33HKLrrnmGk2ePFk333yzli9fXuvx58+f16pVq7j6BwAAGkRTqweY4ODBg6qpqVFcXJzXcafTqfDw8FqP37BhgyoqKpSent5YEwEAgEEIwEZQWVkpPz8/FRQUyM/Pz+u+4ODgWo9ftmyZfvWrXykqKqqxJgIAAIMQgI0gKSlJNTU1Ki0t/dn39BUWFmr79u169dVXG2kdAAAwDQFYTyorK/Xxxx97bhcWFmr//v1q1aqV4uLilJaWplGjRmnu3LlKSkrSl19+qW3btikhIUF33HGH5/teeukl/eIXv1BKSooVfw0AAGAAh9vtdls94kqQl5enfv361Tqenp6unJwcVVVVaebMmVq5cqU+++wzRURE6MYbb9SMGTPUtWtXSZLL5VJMTIxGjRqlWbNmNfZfAQAAGIIABAAAMAw/BgYAAMAwBCAAAIBhCEAAAADD8Cngy+ByuXTy5EmFhITI4XBYPQcAANSB2+1WRUWF2rZtqyZNzLwWRgBehpMnTyo6OtrqGQAA4BIUFxerXbt2Vs+wBAF4GUJCQiR9939AoaGhFq8BAAB1UV5erujoaM//HzcRAXgZvn/ZNzQ0lAAEAMBmTH77lpkvfAMAABiMAAQAADAMAQgAAGAYAhAAAMAwBCAAAIBhCEAAAADDEIAAAACGIQABAAAMQwACAAAYhgAEAAAwDAEIAABgGAIQAADAMAQgAACAYQhAAAAAwzS1egAAAPC26M//x+oJtjVu7jCrJ9gCVwABAAAMQwACAAAYhgAEAAAwDAEIAABgGAIQAADAMAQgAACAYQhAAAAAwxCAAAAAhiEAAQAADEMAAgAAGIYABAAAMAwBCAAAYBgCEAAAwDAEIAAAgGEIQAAAAMMQgAAAAIYhAAEAAAxDAAIAABiGAAQAADAMAQgAAGAYAhAAAMAwBCAAAIBhCEAAAADDEIAAAACGIQABAAAMQwACAAAYhgAEAAAwDAEIAABgGAIQAADAMAQgAACAYQhAAAAAwxCAAAAAhiEAAQAADEMA/j+zZ8+Ww+HQxIkTrZ4CAADQoAhASXv27NGSJUuUkJBg9RQAAIAGZ3wAVlZWKi0tTUuXLtVVV11l9RwAAIAGZ3wAjh07VnfccYcGDhz4s491Op0qLy/3+gIAALCbplYPsFJubq727dunPXv21OnxWVlZmjFjRgOvAgAAaFjGXgEsLi7WhAkTtGrVKgUGBtbpezIzM1VWVub5Ki4ubuCVAAAA9c/YK4AFBQUqLS3VDTfc4DlWU1OjHTt2aNGiRXI6nfLz8/P6noCAAAUEBDT2VAAAgHplbAAOGDBABw8e9Do2ZswYXXvttZoyZUqt+AMAALhSGBuAISEhio+P9zoWFBSk8PDwWscBAACuJMa+BxAAAMBUxl4BvJC8vDyrJwAAADQ4rgACAAAYhgAEAAAwDAEIAABgGAIQAADAMAQgAACAYQhAAAAAwxCAAAAAhiEAAQAADEMAAgAAGIYABAAAMAwBCAAAYBgCEAAAwDAEIAAAgGEIQAAAAMMQgAAAAIYhAAEAAAxDAAIAABiGAAQAADAMAQgAAGAYAhAAAMAwBCAAAIBhCEAAAADDEIAAAACGIQABAAAMQwACAAAYhgAEAAAwDAEIAABgGAIQAADAMAQgAACAYQhAAAAAwxCAAAAAhiEAAQAADEMAAgAAGIYABAAAMAwBCAAAYBgCEAAAwDAEIAAAgGEIQAAAAMMQgAAAAIYhAAEAAAxDAAIAABiGAAQAADAMAQgAAGAYAhAAAMAwBCAAAIBhCEAAAADDEIAAAACGIQABAAAMQwACAAAYhgAEAAAwDAEIAABgGAIQAADAMAQgAACAYQhAAAAAwxCAAAAAhiEAAQAADEMAAgAAGIYABAAAMAwBCAAAYJimVg+4FIWFhXr77bd14sQJnTt3TpGRkUpKSlJycrICAwOtngcAAODTbBWAq1at0nPPPae9e/cqKipKbdu2VfPmzXXq1CkdO3ZMgYGBSktL05QpUxQTE2P1XAAAAJ9kmwBMSkqSv7+/Ro8erXXr1ik6OtrrfqfTqV27dik3N1c9evTQ888/rxEjRli0FgAAwHfZJgBnz56tIUOG/Oj9AQEB6tu3r/r27atZs2bp+PHjjTcOAADARmwTgD8Vf/8pPDxc4eHhDbgGAADAvmz5KeB9+/bp4MGDntubNm1SamqqHn/8cX377bcWLgMAAPB9tgzAP/3pT/roo48kSZ988ol+85vfqEWLFlqzZo0ee+wxi9cBAAD4NlsG4EcffaRu3bpJktasWaNbb71Vq1evVk5OjtatW2fxOgAAAN9mywB0u91yuVySpK1bt2ro0KGSpOjoaH311Vd1Pk92drYSEhIUGhqq0NBQJScna/PmzQ2yGQAAwFfYMgB79OihmTNn6uWXX1Z+fr7uuOMOSd/9gOioqKg6n6ddu3aaPXu2CgoKtHfvXvXv31933XWXPvjgg4aaDgAAYDlbBuD8+fO1b98+jRs3Tk888YQ6duwoSVq7dq1uuummOp9n2LBhGjp0qDp16qS4uDjNmjVLwcHB2r17d0NNBwAAsJ
xtfgzMDyUkJHh9Cvh7c+bMkZ+f3yWds6amRmvWrNHZs2eVnJx8uRMBAAB8li0DUJLOnDmjtWvX6tixY3r00UfVqlUrHT58WFFRUbr66qvrfJ6DBw8qOTlZ58+fV3BwsDZs2KAuXbpc8LFOp1NOp9Nzu7y8/LL/HgAAAI3Nli8BHzhwQJ06ddLTTz+tv/71rzpz5owkaf369crMzLyoc3Xu3Fn79+/Xu+++q4ceekjp6ek6fPjwBR+blZWlsLAwz9d//jo6AAAAO7BlAGZkZGjMmDE6evSoAgMDPceHDh2qHTt2XNS5/P391bFjR3Xv3l1ZWVlKTEzUc889d8HHZmZmqqyszPNVXFx8WX8PAAAAK9jyJeA9e/ZoyZIltY5fffXV+vzzzy/r3C6Xy+tl3h8KCAhQQEDAZZ0fAADAarYMwICAgAu+/+6jjz5SZGRknc+TmZmplJQUtW/fXhUVFVq9erXy8vK0ZcuW+pwLAADgU2z5EvCdd96pp556SlVVVZIkh8OhoqIiTZkyRb/+9a/rfJ7S0lKNGjVKnTt31oABA7Rnzx5t2bJFgwYNaqjpAAAAlrPlFcC5c+fqnnvuUevWrfXNN9/otttu0+eff67k5GTNmjWrzud58cUXG3AlAACAb7JlAIaFhemtt97Szp079f7776uyslI33HCDBg4caPU0AAAAn2fLAFy5cqXuvfde9enTR3369PEc//bbb5Wbm6tRo0ZZuA4AAMC32fI9gGPGjFFZWVmt4xUVFRozZowFiwAAAOzDlgHodrvlcDhqHf/0008VFhZmwSIAAAD7sNVLwElJSXI4HHI4HBowYICaNv3/82tqalRYWKjbb7/dwoUAAAC+z1YBmJqaKknav3+/hgwZouDgYM99/v7+io2NvagfAwMAAGAiWwXgtGnTJEmxsbG69957vX4NHAAAAOrGVgH4vfT0dElSQUGB/v3vf0uSrr/+eiUlJVk5CwAAwBZsGYClpaX6zW9+o7y8PLVs2VKSdObMGfXr10+5ubkX9evgAAAATGPLTwGPHz9eFRUV+uCDD3Tq1CmdOnVKhw4dUnl5uR555BGr5wEAAPg0W14BfOONN7R161Zdd911nmNdunTR4sWLNXjwYAuXAQAA+D5bXgF0uVxq1qxZrePNmjWTy+WyYBEAAIB92DIA+/fvrwkTJujkyZOeY5999pkmTZqkAQMGWLgMAADA99kyABctWqTy8nLFxsbqmmuu0TXXXKMOHTqovLxcCxcutHoeAACAT7PlewCjo6O1b98+bd26VR9++KEk6brrrtPAgQMtXgYAAOD7bBmAxcXFio6O1qBBgzRo0CCr5wAAANiKLV8Cjo2N1W233aalS5fq9OnTVs8BAACwFVsG4N69e9WrVy899dRT+sUvfqHU1FStXbtWTqfT6mkAAAA+z5YBmJSUpDlz5qioqEibN29WZGSk/vjHPyoqKkr333+/1fMAAAB8mi0D8HsOh0P9+vXT0qVLtXXrVnXo0EErVqywehYAAIBPs3UAfvrpp3rmmWfUrVs39erVS8HBwVq8eLHVswAAAHyaLT8FvGTJEq1evVo7d+7Utddeq7S0NG3atEkxMTFWTwMAAPB5tgzAmTNnauTIkVqwYIESExOtngMAAGArtgzAoqIiORyOn33cww8/rKeeekoRERGNsAoAAMAebPkewLrEnyT97W9/U3l5eQOvAQAAsBdbBmBdud1uqycAAAD4nCs6AAEAAFAbAQgAAGAYAhAAAMAwBCAAAIBhrugAvO+++xQaGmr1DAAAAJ9im58DeODAgTo/NiEhQZKUnZ3dUHMAAABsyzYB2K1bNzkcjh/90S7f3+dwOFRTU9PI6wAAAOzDNgFYWFho9QQAAIArgm0CMCYmxuoJAAAAVwTbBOAPrVy58ifvHzVqVCMtAQAAsB9bBuCECRO8bldVVencuXPy9/dXixYtCEAAAICfYMsfA3P69Gmvr8rKSh05ckQ333yz/v73v1s9DwAAwKfZMgAvpFOnTpo9e3atq4MAAADwdsUEoCQ1bdpUJ0+etHoGAACAT7PlewBfffVVr9tut1slJSVatGiR+vTpY9EqAAAAe7BlAKampnrddjgcioyMVP/+/TV37lyLVgEAANiDbQKwvLzc83t9XS6XxWsAAADsyzbvAbzqqqtUWloqSerfv7/OnDlj8SIAAAB7sk0ABgcH6+uvv5Yk5eXlqaqqyuJFAAAA9mSbl4AHDhyofv366brrrpMkDR8+XP7+/hd87P/8z/805jQAAABbsU0A/u1vf9OKFSt07Ngx5efn6/rrr1eLFi2sngUAAGA7tgnA5s2b68EHH5Qk7d27V08//bRatmxp8SoAAAD7sU0A/tD27ds9/+x2uyV996NgAAAA8PNs8yGQ//Tiiy8qPj5egYGBCgwMVHx8vJYtW2b1LAAAAJ9nyyuAU6dO1bx58zR+/HglJydLknbt2qVJkyapqKhITz31lMULAQAAfJctAzA7O1tLly7VyJEjPcfuvPNOJSQkaPz48QQgAADAT7DlS8BVVVXq0aNHrePdu3dXdXW1BYsAAADsw5YB+Lvf/U7Z2dm1jv/3f/+30tLSLFgEAABgH7Z8CVj67kMgb775pm688UZJ0rvvvquioiKNGjVKGRkZnsfNmzfPqokAAAA+yZYBeOjQId1www2SpGPHjkmSIiIiFBERoUOHDnkex4+GAQAAqM2WAfjDnwMIAACAi2PL9wACAADg0tnmCuDdd99d58euX7++AZcAAADYm20CMCwszOoJAAAAVwTbBODy5cutngAAAHBFsO17AKurq7V161YtWbJEFRUVkqSTJ0+qsrLS4mUAAAC+zTZXAH/oxIkTuv3221VUVCSn06lBgwYpJCRETz/9tJxOp1544QWrJwIAAPgsW14BnDBhgnr06KHTp0+refPmnuPDhw/Xtm3bLFwGAADg+2x5BfDtt9/WO++8I39/f6/jsbGx+uyzzyxaBQAAYA+2vALocrlUU1NT6/inn36qkJAQCxYBAADYhy0DcPDgwZo/f77ntsPhUGVlpaZNm6ahQ4dauAwAAMD32fIl4Llz52rIkCHq0qWLzp8/r9/+9rc6evSoIiIi9Pe//93qeQAAAD7NllcA27Vrp/fff1+PP/64Jk2apKSkJM2ePVvvvfeeWrduXefzZGVlqWfPngoJCVHr1q2VmpqqI0eONOByAAAA69nyCqAkNW3aVPfdd99lnSM/P19jx45Vz549VV1drccff1yDBw/W4cOHFRQUVE9LAQAAfIttA/Do0aPavn27SktL5XK5vO6bOnVqnc7xxhtveN3OyclR69atVVBQoFtvvbXetgIAAPgSWwbg0qVL9dBDDykiIkJt2rSRw+Hw3OdwOOocgP+prKxMktSqVasL3u90OuV0Oj23y8vLL+nPAQAAsJItA3DmzJmaNWuWpkyZUm/ndLlcmjhxovr06aP4+PgLPiYrK0szZsyotz8TAADACrb8EMjp06c1YsSIej3n2LFjdejQIeXm5v7oYzIzM1VWVub5Ki4urtcNA
AAAjcGWAThixAi9+eab9Xa+cePG6bXXXtP27dvVrl27H31cQECAQkNDvb4AAADsxjYvAS9YsMDzzx07dtSTTz6p3bt3q2vXrmrWrJnXYx955JE6ndPtdmv8+PHasGGD8vLy1KFDh3rdDAAA4ItsE4DPPvus1+3g4GDl5+crPz/f67jD4ahzAI4dO1arV6/Wpk2bFBISos8//1ySFBYWpubNm9fPcAAAAB9jmwAsLCys93NmZ2dLkvr27et1fPny5Ro9enS9/3kAAAC+wDYB2BDcbrfVEwAAABqdbT4EMnv2bJ07d65Oj3333Xf1+uuvN/AiAAAAe7JNAB4+fFgxMTF6+OGHtXnzZn355Zee+6qrq3XgwAE9//zzuummm3TvvfcqJCTEwrUAAAC+yzYvAa9cuVLvv/++Fi1apN/+9rcqLy+Xn5+fAgICPFcGk5KS9Ic//EGjR49WYGCgxYsBAAB8k20CUJISExO1dOlSLVmyRAcOHNCJEyf0zTffKCIiQt26dVNERITVEwEAAHyerQLwe02aNFG3bt3UrVs3q6cAAADYji0D8HulpaUqLS2Vy+XyOp6QkGDRIgAAAN9nywAsKChQenq6/v3vf9f6US4Oh0M1NTUWLQMAAPB9tgzA+++/X3FxcXrxxRcVFRUlh8Nh9SQAAADbsGUAfvLJJ1q3bp06duxo9RQAAADbsc3PAfyhAQMG6P3337d6BgAAgC3Z8grgsmXLlJ6erkOHDik+Pl7NmjXzuv/OO++0aBkAAIDvs2UA7tq1Szt37tTmzZtr3ceHQAAAAH6aLV8CHj9+vO677z6VlJTI5XJ5fRF/AAAAP82WAfj1119r0qRJioqKsnoKAACA7dgyAO+++25t377d6hkAAAC2ZMv3AMbFxSkzM1P/+te/1LVr11ofAnnkkUcsWgYAAOD7bBmAy5YtU3BwsPLz85Wfn+91n8PhIAABAAB+gi0DsLCw0OoJAAAAtmXL9wACAADg0tnyCuD999//k/e/9NJLjbQEAADAfmwZgKdPn/a6XVVVpUOHDunMmTPq37+/RasAAADswZYBuGHDhlrHXC6XHnroIV1zzTUWLAIAALCPK+Y9gE2aNFFGRoaeffZZq6cAAAD4tCsmACXp2LFjqq6utnoGAACAT7PlS8AZGRlet91ut0pKSvT6668rPT3dolUAAAD2YMsAfO+997xuN2nSRJGRkZo7d+7PfkIYAADAdLYMwNdff11ut1tBQUGSpOPHj2vjxo2KiYlR06a2/CsBAAA0Glu+BzA1NVUvv/yyJOnMmTO68cYbNXfuXKWmpio7O9vidQAAAL7NlgG4b98+3XLLLZKktWvXKioqSidOnNDKlSu1YMECi9cBAAD4NlsG4Llz5xQSEiJJevPNN3X33XerSZMmuvHGG3XixAmL1wEAAPg2WwZgx44dtXHjRhUXF2vLli0aPHiwJKm0tFShoaEWrwMAAPBttgzAqVOnavLkyYqNjVXv3r2VnJws6burgUlJSRavAwAA8G22/MjsPffco5tvvlklJSVKTEz0HB8wYICGDx9u4TIAAADfZ8sAlKQ2bdqoTZs2Xsd69epl0RoAAAD7sOVLwAAAALh0BCAAAIBhCEAAAADDEIAAAACGIQABAAAMQwACAAAYhgAEAAAwDAEIAABgGAIQAADAMAQgAACAYQhAAAAAwxCAAAAAhiEAAQAADEMAAgAAGIYABAAAMAwBCAAAYBgCEAAAwDAEIAAAgGEIQAAAAMMQgAAAAIYhAAEAAAxDAAIAABiGAAQAADAMAQgAAGAYAhAAAMAwBCAAAIBhCEAAAADDEIAAAACGIQABAAAM09TqAQAA35B/621WT7Ct23bkWz0BuCgEYCPp/uhKqyfYVsGcUVZPAADgisJLwAAAAIYhAAEAAAxjdADu2LFDw4YNU9u2beVwOLRx40arJwEAADQ4owPw7NmzSkxM1OLFi62eAgAA0GiM/hBISkqKUlJSrJ4BAADQqIy+AggAAGAio68AXiyn0ymn0+m5XV5ebuEaAACAS8MVwIuQlZWlsLAwz1d0dLTVkwAAAC4aAXgRMjMzVVZW5vkqLi62ehIAAMBF4yXgixAQEKCAgACrZwAAAFwWowOwsrJSH3/8sed2YWGh9u/fr1atWql9+/YWLgMAAGg4Rgfg3r171a9fP8/tjIwMSVJ6erpycnIsWgUAANCwjA7Avn37yu12Wz0DAACgUfEhEAAAAMMQgAAAAIYhAAEAAAxDAAIAABiGAAQAADAMAQgAAGAYAhAAAMAwBCAAAIBhCEAAAADDEIAAAACGIQABAAAMQwACAAAYhgAEAAAwDAEIAABgGAIQAADAMAQgAACAYQhAAAAAwzS1egAAc/VZ2MfqCba2c/xOqycAsCmuAAIAABiGAAQAADAMAQgAAGAYAhAAAMAwBCAAAIBhCEAAAADDEIAAAACGIQABAAAMQwACAAAYhgAEAAAwDL8KDsYpeqqr1RNsq/3Ug1ZPAADUA64AAgAAGIYABAAAMAwBCAAAYBgCEAAAwDAEIAAAgGEIQAAAAMMQgAAAAIYhAAEAAAxDAAIAABiGAAQAADAMAQgAAGAYAhAAAMAwBCAAAIBhCEAAAADDEIAAAACGIQABAAAMQwACAAAYhgAEAAAwDAEIAABgGAIQAADAMAQgAACAYQhAAAAAwxCAAAAAhiEAAQAADEMAAgAAGIYABAAAMAwBCAAAYBgCEAAAwDAEIAAAgGEIQAAAAMMQgAAAAIYhAAEAAAxDAAIAABiGAAQAADAMAQgAAGAYAhAAAMAwBCAAAIBhCEAAAADDEIAAAACGMT4AFy9erNjYWAUGBqp379763//9X6snAQAANCijA/Af//iHMjIyNG3aNO3bt0+JiYkaMmSISktLrZ4GAADQYIwOwHnz5umBBx7QmDFj1KVLF73wwgtq0aKFXnrpJaunAQAANJimVg+wyrfffquCggJlZmZ6jjVp0kQDBw7Url27Lvg9TqdTTqfTc7usrEySVF5e/rN/Xo3zm8tcbK66/Pu9GBXna+r1fCap7+ei+pvqej2faer7+ThbzfNxqer7ufjGea5ez2eSujwX3z/G7XY39ByfZWwAfvXVV6qpqVFUVJTX8aioKH344YcX/J6srCzNmDGj1vHo6OgG2YjvhC180OoJ+F5WmNUL8ANhU3g+fEYYz4WveGxx3R9bUVGhMEOfO2MD8FJkZmYqIyPDc9vlcunUqVMKDw+Xw+GwcNnlKS8vV3R0tIqLixUaGmr1HKPxXPgOngvfwXPhO66U58LtdquiokJt27a1eopljA3AiIgI+fn56YsvvvA6/sUXX6hNmzYX/J6AgAAFBAR4HWvZsmWDbWxsoaGhtv4f9JWE58J38Fz4Dp4L33ElPBemXvn7nrEfAvH391f37t21bds2zzGXy6Vt27YpOTnZwmUAAAANy9grgJKUkZGh9PR09ejRQ7169dL8+fN19uxZjRkzxuppAAAADcZv+vTp060eYZX4+Hi1bNlSs2bN0l//+ldJ0qpVq9S5c2eLlzU+Pz8/
9e3bV02bGv3fBD6B58J38Fz4Dp4L38FzcWVwuE3+DDQAAICBjH0PIAAAgKkIQAAAAMMQgAAAAIYhAAEAAAxDABpsx44dGjZsmNq2bSuHw6GNGzdaPclIWVlZ6tmzp0JCQtS6dWulpqbqyJEjVs8yVnZ2thISEjw/6DY5OVmbN2+2ehYkzZ49Ww6HQxMnTrR6inGmT58uh8Ph9XXttddaPQuXgQA02NmzZ5WYmKjFiy/iFyei3uXn52vs2LHavXu33nrrLVVVVWnw4ME6e/as1dOM1K5dO82ePVsFBQXau3ev+vfvr7vuuksffPCB1dOMtmfPHi1ZskQJCQlWTzHW9ddfr5KSEs/Xv/71L6sn4TLwQ3wMlpKSopSUFKtnGO+NN97wup2Tk6PWrVuroKBAt956q0WrzDVs2DCv27NmzVJ2drZ2796t66+/3qJVZqusrFRaWgmUyrAAAAi6SURBVJqWLl2qmTNnWj3HWE2bNv3RX5UK++EKIOBjysrKJEmtWrWyeAlqamqUm5urs2fP8isiLTR27FjdcccdGjhwoNVTjHb06FG1bdtWv/zlL5WWlqaioiKrJ+EycAUQ8CEul0sTJ05Unz59FB8fb/UcYx08eFDJyck6f/68goODtWHDBnXp0sXqWUbKzc3Vvn37tGfPHqunGK13797KyclR586dVVJSohkzZuiWW27RoUOHFBISYvU8XAICEPAhY8eO1aFDh3hvjcU6d+6s/fv3q6ysTGvXrlV6erry8/OJwEZWXFysCRMm6K233lJgYKDVc4z2w7cLJSQkqHfv3oqJidErr7yi3//+9xYuw6UiAAEfMW7cOL322mvasWOH2rVrZ/Uco/n7+6tjx46SpO7du2vPnj167rnntGTJEouXmaWgoEClpaW64YYbPMdqamq0Y8cOLVq0SE6nU35+fhYuNFfLli0VFxenjz/+2OopuEQEIGAxt9ut8ePHa8OGDcrLy1OHDh2snoT/4HK55HQ6rZ5hnAEDBujgwYNex8aMGaNrr71WU6ZMIf4sVFlZqWPHjul3v/ud1VNwiQhAg1VWVnr911thYaH279+vVq1aqX379hYuM8vYsWO1evVqbdq0SSEhIfr8888lSWFhYWrevLnF68yTmZmplJQUtW/fXhUVFVq9erXy8vK0ZcsWq6cZJyQkpNZ7YYOCghQeHs57ZBvZ5MmTNWzYMMXExOjkyZOaNm2a/Pz8NHLkSKun4RIRgAbbu3ev+vXr57mdkZEhSUpPT1dOTo5Fq8yTnZ0tSerbt6/X8eXLl2v06NGNP8hwpaWlGjVqlEpKShQWFqaEhARt2bJFgwYNsnoaYJlPP/1UI0eO1Ndff63IyEjdfPPN2r17tyIjI62ehkvkcLvdbqtHAAAAoPHwcwABAAAMQwACAAAYhgAEAAAwDAEIAABgGAIQAADAMAQgAACAYQhAAAAAwxCAAAAAhiEAAaABxcbGav78+VbPAAAvBCAAnzd69GilpqZaPeMn5eTkqGXLlrWO79mzR3/84x8tWAQAP44ABGCMb7/9tlG+54ciIyPVokWLyzoHANQ3AhCAz1i7dq26du2q5s2bKzw8XAMHDtSjjz6qFStWaNOmTXI4HHI4HMrLy5MkTZkyRXFxcWrRooV++ctf6sknn1RVVZXnfNOnT1e3bt20bNkydejQQYGBgT+7oW/fvho3bpwmTpyoiIgIDRkyRJI0b948de3aVUFBQYqOjtbDDz+syspKSVJeXp7GjBmjsrIyz8bp06dLqv0SsMPh0LJlyzR8+HC1aNFCnTp10quvvuq14dVXX1WnTp0UGBiofv36acWKFXI4HDpz5szl/OsFAI+mVg8AAEkqKSnRyJEj9cwzz2j48OGqqKjQ22+/rVGjRqmoqEjl5eVavny5JKlVq1aSpJCQEOXk5Kht27Y6ePCgHnjgAYWEhOixxx7znPfjjz/WunXrtH79evn5+dVpy4oVK/TQQw9p586dnmNNmjTRggUL1KFDB33yySd6+OGH9dhjj+n555/XTTfdpPnz52vq1Kk6cuSIJCk4OPhHzz9jxgw988wzmjNnjhYuXKi0tDSdOHFCrVq1UmFhoe655x5NmDBBf/jDH/Tee+9p8uTJF/3vEwB+khsAfEBBQYFbkvv48eO17ktPT3ffddddP3uOOXPmuLt37+65PW3aNHezZs3cpaWldd5x2223uZOSkn72cWvWrHGHh4d7bi9fvtwdFhZW63ExMTHuZ5991nNbkvsvf/mL53ZlZaVbknvz5s1ut9vtnjJlijs+Pt7rHE888YRbkvv06dN1/nsAwE/hCiAAn5CYmKgBAwaoa9euGjJkiAYPHqx77rlHV1111Y9+zz/+8Q8tWLBAx44dU2VlpaqrqxUaGur1mJiYGEVGRl7Ulu7du9c6tnXrVmVlZenDDz9UeXm5qqurdf78eZ07d+6i3+OXkJDg+eegoCCFhoaqtLRUknTkyBH17NnT6/G9evW6qPMDwM/hPYAAfIKfn5/eeustbd68WV26dNHChQvVuXNnFRYWXvDxu3btUlpamoYOHarXXntN7733np544olaH9oICgq66C3/+T3Hjx/Xr371KyUkJGjdunUqKCjQ4sWLJV3ah0SaNWvmddvhcMjlcl30eQDgUnEFEIDPcDgc6tOnj/r06aOpU6cqJiZGGzZskL+/v2pqarwe+8477ygmJkZPPPGE59iJEycaZFdBQYFcLpfmzp2rJk2+++/mV155xesxF9p4KTp37qx//vOfXsf27Nlz2ecFgB/iCiAAn/Duu+/qv/7rv7R3714VFRVp/fr1+vLLL3XdddcpNjZWBw4c0JEjR/TVV1+pqqpKnTp1UlFRkXJzc3Xs2DEtWLBAGzZsaJBtHTt2VFVVlRYuXKhPPvlEL7/8sl544QWvx8TGxqqyslLbtm3TV199pXPnzl3Sn/WnP/1JH374oaZMmaKPPvpIr7zyinJyciR9F8gAUB8IQAA+ITQ0VDt27NDQoUMVFxenv/zlL5o7d65SUlL0wAMPqHPnzurRo4ciIyO1c+dO3XnnnZo0aZLGjRunbt266Z133tGTTz7ZINsSExM1b948Pf3004qPj9eqVauUlZXl9ZibbrpJDz74oO69915FRkbqmWeeuaQ/q0OHDlq7dq3Wr1+vhIQEZWdne65yBgQEXPbfBQAkyeF2u91WjwAA/LhZs2bphRdeUHFxsdVTAFwheA8gAPiY559/Xj179lR4eLh27typOXPmaNy4cVbPAnAFIQABGKOoqEhdunT50fsPHz6s9u3bN+KiCzt69KhmzpypU6dOqX379vrzn/+szMxMq2cBuILwEjAAY1RXV+v48eM/en9sbKyaNuW/iwFc+QhAAAAAw/ApYAAAAMMQgAAAAIYhAAEAAAxDAAIAABiGAAQAADAMAQgAAGAYAhAAAMAw/xdq920e4IJshwAAAABJRU5ErkJggg==\n", 594 | "text/plain": [ 595 | "" 596 | ] 597 | }, 598 | "metadata": {}, 599 | 
"output_type": "display_data" 600 | } 601 | ], 602 | "source": [ 603 | "import seaborn as sns\n", 604 | "import matplotlib.pyplot as plt\n", 605 | "\n", 606 | "df = stars_votes.toPandas()\n", 607 | "\n", 608 | "# Close previous plots; otherwise, will just overwrite and display again\n", 609 | "plt.close()\n", 610 | "\n", 611 | "sns.barplot(x='star_rating', y='sum(helpful_votes)', data=df)\n", 612 | "%matplot plt" 613 | ] 614 | }, 615 | { 616 | "cell_type": "markdown", 617 | "metadata": {}, 618 | "source": [ 619 | "Given the order of magnitude difference in 5-star helpful votes, we might wonder whether there are simply more 5 star reviews than other star counts. It seems like there are indeed a lot more 5-star ratings than anything else." 620 | ] 621 | }, 622 | { 623 | "cell_type": "code", 624 | "execution_count": 10, 625 | "metadata": {}, 626 | "outputs": [ 627 | { 628 | "data": { 629 | "application/vnd.jupyter.widget-view+json": { 630 | "model_id": "33c9769f9c4a4556850430b94a80d872", 631 | "version_major": 2, 632 | "version_minor": 0 633 | }, 634 | "text/plain": [ 635 | "VBox()" 636 | ] 637 | }, 638 | "metadata": {}, 639 | "output_type": "display_data" 640 | }, 641 | { 642 | "data": { 643 | "application/vnd.jupyter.widget-view+json": { 644 | "model_id": "", 645 | "version_major": 2, 646 | "version_minor": 0 647 | }, 648 | "text/plain": [ 649 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 650 | ] 651 | }, 652 | "metadata": {}, 653 | "output_type": "display_data" 654 | }, 655 | { 656 | "name": "stdout", 657 | "output_type": "stream", 658 | "text": [ 659 | "+-----------+--------+\n", 660 | "|star_rating| count|\n", 661 | "+-----------+--------+\n", 662 | "| 5|13662131|\n", 663 | "| 4| 3546319|\n", 664 | "| 3| 1543611|\n", 665 | "| 2| 861867|\n", 666 | "| 1| 1112232|\n", 667 | "+-----------+--------+" 668 | ] 669 | } 670 | ], 671 | "source": [ 672 | "stars = (data.groupBy('star_rating')\n", 673 | " .count()\n", 674 | " .sort('star_rating', ascending=False)\n", 675 | " )\n", 676 | "stars.show()" 677 | ] 678 | }, 679 | { 680 | "cell_type": "markdown", 681 | "metadata": {}, 682 | "source": [ 683 | "We can also take a random sample of our distributed dataset and send that sample to Pandas in order to plot it and get a sense of what it looks like. For instance, we can produce a scatter plot of a limited subset of our data using this strategy." 
684 | ] 685 | }, 686 | { 687 | "cell_type": "code", 688 | "execution_count": 11, 689 | "metadata": {}, 690 | "outputs": [ 691 | { 692 | "data": { 693 | "application/vnd.jupyter.widget-view+json": { 694 | "model_id": "e9529ac4455342ec8e20a23602268f6a", 695 | "version_major": 2, 696 | "version_minor": 0 697 | }, 698 | "text/plain": [ 699 | "VBox()" 700 | ] 701 | }, 702 | "metadata": {}, 703 | "output_type": "display_data" 704 | }, 705 | { 706 | "data": { 707 | "application/vnd.jupyter.widget-view+json": { 708 | "model_id": "", 709 | "version_major": 2, 710 | "version_minor": 0 711 | }, 712 | "text/plain": [ 713 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 714 | ] 715 | }, 716 | "metadata": {}, 717 | "output_type": "display_data" 718 | }, 719 | { 720 | "data": { 721 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAYAAAA10dzkAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAAPYQAAD2EBqD+naQAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAgAElEQVR4nOzdeXxU9b3/8fckJIEEZkLIRiSEIEFAFiMKRpQqpCxSBKFWU6rIpXKlAR4IWEtVRLtgbW1VinLttSD3SnC5uEAVi8giEMIiUQSkQQIJQshmEpJIEpLz+4Nfpg5kheTMTM7r+XjM48Gcc2bO52Qg8+a7HZthGIYAAABgGT7uLgAAAADmIgACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMUQAAEAACyGAAgAAGAxBEAAAACLIQACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMUQAAEAACyGAAgAAGAxBEAAAACLIQACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMUQAAEAACyGAAgAAGAxBEAAAACLIQACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMUQAAEAACyGAAgAAGAxBEAAAACLIQACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMUQAAEAACyGAAgAAGAxBEAAAACLIQACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMUQAAEAACyGAAgAAGAxBEAAAACLIQACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMUQAAEAACyGAAgAAGAx7dxdgDerqanRqVOn1KlTJ9lsNneXAwAAmsAwDJ09e1ZRUVHy8bFmWxgB8AqcOnVK0dHR7i4DAABchuzsbHXr1s3dZbgFAfAKdOrUSdKFv0B2u93N1QAAgKYoKSlRdHS083vcigiAV6C229dutxMAAQDwMlYevmXNjm8AAAALIwACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMUQAAEAACyGAAgAAGAxBEAAAACL4U4gAADA4xzLK9WJwnL16BKk2NAgd5fT5hAAAQCAxygqr9SclHRty8hzbhseF6alSfFyBPq5sbK2hS5gAADgMeakpGvH0XyXbTuO5mt2yn43VdQ2EQABAIBHOJZXqm0Zeao2DJft1YahbRl5yswvc1NlbQ8BEAAAeIQTheUN7j9eQABsKQRAAADgEWJCAhvc36MLk0FaCgEQAAB4hJ5hHTU8Lky+NpvLdl+bTcPjwpgN3IIIgAAAwGMsTYrXsF6hLtuG9QrV0qR4N1XUNrEMDAAA8BiOQD+tmj5EmfllOl5QxjqArYQACAAAPE5sKMGvNdEFDAAAYDEEQAAAAIshAAIAAFgMARAAAMBiCIAAAAAWQwAEAACwGAIgAACAxRAAAQAALIYACAAAYDEEQAAAAIshAAIAAFgMARAAAMBiCIAAAAAW45UB8OWXX9bAgQNlt9tlt9uVkJCgDz/80Ln/3LlzSk5OVpcuXdSxY0dNnjxZZ86ccXmPrKwsjRs3ToGBgQoPD9cjjzyi8+fPm30pAAAApvPKANitWzc988wz2rdvn/bu3asRI0ZowoQJOnjwoCTp4Ycf1rp16/TWW29p69atOnXqlCZNmuR8fXV1tcaNG6fKykrt3LlTr732mlauXKlFixa565IAAABMYzMMw3B3ES0hJCREf/zjH/XjH/9YYWFhWr16tX784x9Lkr766iv17dtXqampuummm/Thhx/qRz/6kU6dOqWIiAhJ0vLly/Xoo48qLy9P/v7+TTpnSUmJHA6HiouLZbfbW+3aAABAy+H720tbAL+vurpaa9asUVlZmRISErRv3z5VVVUpMTHReUyfPn3UvXt3paamSpJSU1M1YMAAZ/iTpNGjR6ukpMTZiliXiooKlZSUuDwAAAC8jdcGwAMHDqhjx44KCAjQQw89pHfeeUf9+vVTTk6O/P39FRwc7HJ8RESEcnJyJEk5OTku4a92f+2++ixZskQOh8P5iI6ObuGrAgAAaH1eGwCvueYapaenKy0tTTNnztTUqVN16NChVj3nwoULVVxc7HxkZ2e36vkAAABaQzt3F3C5/P391atXL0nS4MGDtWfPHr3wwgu65557VFlZqaKiIpdWwDNnzigyMlKSFBkZqd27d7u8X+0s4dpj6hIQEKCAgICWvhQAAABTeW0L4MVqampUUVGhwYMHy8/PT5s2bXLuO3LkiLKyspSQkCBJSkhI0IEDB5Sbm+s8ZuPGjbLb7erXr5/ptQMAAJjJK1sAFy5cqLFjx6p79+46e/asVq9erS1btuijjz6Sw+HQ9OnTNW/ePIWEhMhut2v27NlKSEjQTTfdJEkaNWqU+vXrp/vuu0/PPvuscnJy9Pjjjys5OZkWPgAA0OZ5ZQDMzc3V/fffr9OnT8vhcGjgwIH66KOP9MMf/lCS9Je//EU+Pj6aPHmyKioqNHr0aL300kvO1/v6+mr9+vWaOXOmEhISFBQUpKlTp+rpp5921yUBAACYps2sA+gOrCMEAID34fu7DY0BBAAAQNMQAAEAACyGAAgAAGAxBEAAAACLIQACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMUQAAEAACyGAAgAAGAxBEAAAA
CLIQACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMUQAAEAACyGAAgAAGAxBEAAAACLIQACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMUQAAEAACyGAAgAAGAxBEAAAACLIQACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMUQAAEAACyGAAgAAGAxBEAAAACLIQACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMV4ZQBcsmSJbrzxRnXq1Enh4eGaOHGijhw54nLMbbfdJpvN5vJ46KGHXI7JysrSuHHjFBgYqPDwcD3yyCM6f/68mZcCAABgunbuLuBybN26VcnJybrxxht1/vx5/frXv9aoUaN06NAhBQUFOY978MEH9fTTTzufBwYGOv9cXV2tcePGKTIyUjt37tTp06d1//33y8/PT7///e9NvR4AAAAz2QzDMNxdxJXKy8tTeHi4tm7dquHDh0u60AJ43XXX6fnnn6/zNR9++KF+9KMf6dSpU4qIiJAkLV++XI8++qjy8vLk7+/f6HlLSkrkcDhUXFwsu93echcEAABaDd/fXtoFfLHi4mJJUkhIiMv2119/XaGhoerfv78WLlyo8vJy577U1FQNGDDAGf4kafTo0SopKdHBgwfNKRwAAMANvLIL+Ptqamo0d+5cDRs2TP3793du/+lPf6qYmBhFRUXpiy++0KOPPqojR45o7dq1kqScnByX8CfJ+TwnJ6fOc1VUVKiiosL5vKSkpKUvBwAAoNV5fQBMTk7Wl19+qe3bt7tsnzFjhvPPAwYMUNeuXTVy5Eh9/fXXuvrqqy/rXEuWLNFTTz11RfUCAAC4m1d3Ac+aNUvr16/X5s2b1a1btwaPHTp0qCTp6NGjkqTIyEidOXPG5Zja55GRkXW+x8KFC1VcXOx8ZGdnX+klAAAAmM4rA6BhGJo1a5beeecdffLJJ4qNjW30Nenp6ZKkrl27SpISEhJ04MAB5ebmOo/ZuHGj7Ha7+vXrV+d7BAQEyG63uzwAAAC8jVd2AScnJ2v16tV677331KlTJ+eYPYfDoQ4dOujrr7/W6tWrdccdd6hLly764osv9PDDD2v48OEaOHCgJGnUqFHq16+f7rvvPj377LPKycnR448/ruTkZAUEBLjz8gAAAFqVVy4DY7PZ6ty+YsUKPfDAA8rOztbPfvYzffnllyorK1N0dLTuuusuPf744y6tdidOnNDMmTO1ZcsWBQUFaerUqXrmmWfUrl3TcjHTyAEA8D58f3tpAPQU/AUCAMD78P3tpWMAAQAAcPkIgAAAABZDAAQAALAYr5wFDAAA3O9YXqlOFJarR5cgxYYGubscNAMBEAAANEtReaXmpKRrW0aec9vwuDAtTYqXI9DPjZWhqegCBgAAzTInJV07jua7bNtxNF+zU/a7qSI0FwEQAAA02bG8Um3LyFP1RavIVRuGtmXkKTO/zE2VoTkIgAAAoMlOFJY3uP94AQHQGxAAAQBAk8WEBDa4v0cXJoN4AwIgAABosp5hHTU8Lky+F92W1ddm0/C4MGYDewkCIAAAaJalSfEa1ivUZduwXqFamhTvporQXCwDAwAAmsUR6KdV04coM79MxwvKWAfQCxEAAQDAZYkNJfh5K7qAAQAALIYACAAAYDEEQAAAAIshAAIAAFgMARAAAMBimAUMNNOxvFKdKCxn2QMAgNciAAJNVFReqTkp6dqWkefcNjwuTEuT4uUI9HNjZQAANA9dwEATzUlJ146j+S7bdhzN1+yU/W6qCACAy0MABJrgWF6ptmXkqdowXLZXG4a2ZeQpM7/MTZUBANB8BECgCU4Ulje4/3gBARAA4D0IgEATxIQENri/RxcmgwAAvAcBEGiCnmEdNTwuTL42m8t2X5tNw+PCmA0MAPAqBECgiZYmxWtYr1CXbcN6hWppUrybKgIA4PKwDAzQRI5AP62aPkSZ+WU6XlDGOoAAAK9FAASaKTaU4AcA8G50AQMAAFgMARAAAMBiCIAAAAAWQwAEAACwGAIgAACAxRAAAQAALIYACAAAYDEEQAAAAIshAAIAAFgMARAAAMBiCIAAAAAWQwAEAACwGAIgAACAxXhlAFyyZIluvPFGderUSeHh4Zo4caKOHDnicsy5c+eUnJysLl26qGPHjpo8ebLOnDnjckxWVpbGjRunwMBAhYeH65FHHtH58+fNvBQAAADTeWUA3Lp1q5KTk7Vr1y5t3LhRVVVVGjVqlMrKypzHPPzww1q3bp3eeustbd26VadOndKkSZOc+6urqzVu3DhVVlZq586deu2117Ry5UotWrTIHZcEAABgGpthGIa7i7hSeXl5Cg8P19atWzV8+HAVFxcrLCxMq1ev1o9//GNJ0ldffaW+ffsqNTVVN910kz788EP96Ec/0qlTpxQRESFJWr58uR599FHl5eXJ39+/0fOWlJTI4XCouLhYdru9Va8RAAC0DL6/vbQF8GLFxcWSpJCQEEnSvn37VFVVpcTEROcxffr0Uffu3ZWamipJSk1N1YABA5zhT5JGjx6tkpISHTx40MTqAQAAzNXO3QVcqZqaGs2dO1fDhg1T//79JUk5OTny9/dXcHCwy7ERERHKyclxHvP98Fe7v3ZfXSoqKlRRUeF8XlJS0mLXAQAAYBavbwFMTk7Wl19+qTVr1rT6uZYsWSKHw+F8REdHt/o5AQAAWppXB8BZs2Zp/fr12rx5s7p16+bcHhkZqcrKShUVFbkcf+bMGUVGRjqPuXhWcO3z2mMutnDhQhUXFzsf2dnZLXk5AAAApvDKAGgYhmbNmqV33nlHn3zyiWJjY132Dx48WH5+ftq0aZNz25EjR5SVlaWEhARJUkJCgg4cOKDc3FznMRs3bpTdble/fv3qPG9AQIDsdrvLAwAAwNt45RjA5ORkrV69Wu+99546derkHLPncDjUoUMHORwOTZ8+XfPmzVNISIjsdrtmz56thIQE3XTTTZKkUaNGqV+/frrvvvv07LPPKicnR48//riSk5MVEBDgzssDAABoVW5dBqa6uloHDhxQTEyMOnfu3OTX2Wy2OrevWLFCDzzwgKQLC0HPnz9fKSkpqqio0OjRo/XSSy+5dO+eOHFCM2fO1JYtWxQUFKSpU6fqmWeeUbt2TcvFTCMHAMD78P1tcgCcO3euBgwYoOnTp6u6ulo/+MEPtHPnTgUGBmr9+vW67bbbzCqlRfAXCAAA78P3t8ljAN9++20NGjRIkrRu3TplZmbqq6++0sMPP6zHHnvMzFIAAAAsy9QAmJ+f7+yC/eCDD3T33Xerd+/e+o//+A8dOHDAzFIAAAAsy9QAGBERoUOHDqm6ulobNmzQD3/4Q0lSeXm5fH19zSwFAADAskydBTxt2jT95Cc/UdeuXWWz2Zy3aktLS1OfPn3MLAUAAMCyTA2AixcvVv/+/ZWdna27777budyKr6+vfvWrX5lZCgAAgGW5bRmYc+fOqX379u44dYthFhEAAN6H72+TxwBWV1frN7/5ja666ip17NhRx44dkyQ98cQTevXVV
80sBQAAwLJMDYC/+93vtHLlSj377LPy9/d3bu/fv7/++7//28xSAAAALMvUALhq1Sq98sormjJlisus30GDBumrr74ysxQAAADLMjUAfvPNN+rVq9cl22tqalRVVWVmKQAAAJZlagDs16+fPv3000u2v/3224qPjzezFAAAAMsydRmYRYsWaerUqfrmm29UU1OjtWvX6siRI1q1apXWr19vZikAAACWZWoL4IQJE7Ru3Tp9/PHHCgoK0qJFi3T48GGtW7fOeVcQAAAAtC63rQPYFrCOEAAA3ofvb5NbAHv27KmCgoJLthcVFalnz55mlgIAAGBZpgbA48ePq7q6+pLtFRUV+uabb8wsBQAAwLJMmQTy/vvvO//80UcfyeFwOJ9XV1dr06ZN6tGjhxmlAAAAWJ4pAXDixImSJJvNpqlTp7rs8/PzU48ePfTcc8+ZUQoAAIDlmRIAa2pqJEmxsbHas2ePQkNDzTgtAAAA6mDqOoCZmZlmng4AAAB1MHUSiCRt3bpV48ePV69evdSrVy/deeeddd4dBAAAAK3D1AD4v//7v0pMTFRgYKDmzJmjOXPmqEOHDho5cqRWr15tZikAAACWZepC0H379tWMGTP08MMPu2z/85//rL/97W86fPiwWaW0CBaSBADA+/D9bXIL4LFjxzR+/PhLtt95552MDwQAADCJqQEwOjpamzZtumT7xx9/rOjoaDNLAQAAsCxTZwHPnz9fc+bMUXp6um6++WZJ0o4dO7Ry5Uq98MILZpYCAABgWaYGwJkzZyoyMlLPPfec3nzzTUkXxgW+8cYbmjBhgpmlAAAAWJapk0DaGgaRAgDgffj+NnkM4M9//nNt2bLFzFMCAADgIqYGwLy8PI0ZM0bR0dF65JFHlJ6ebubpAQAAIJMD4HvvvafTp0/riSee0J49ezR48GBde+21+v3vf6/jx4+bWQoAAIBluXUM4MmTJ5WSkqK///3vysjI0Pnz591VymVhDAEAAN6H72+TZwF/X1VVlfbu3au0tDQdP35cERER7ioFAFrMsbxSnSgsV48uQYoNDXJ3OQBQJ9MD4ObNm7V69Wr93//9n2pqajRp0iStX79eI0aMMLsUAGgxReWVmpOSrm0Zec5tw+PCtDQpXo5APzdWBgCXMjUAXnXVVSosLNSYMWP0yiuvaPz48QoICDCzBABoFXNS0rXjaL7Lth1H8zU7Zb9WTR/ipqoAoG6mBsDFixfr7rvvVnBwcIPHnTx5UlFRUfLxMXWOCgBclmN5pS4tf7WqDUPbMvKUmV9GdzAAj2JqwnrwwQcbDX+S1K9fP2YFA/AaJwrLG9x/vKDMpEoAoGncNgmkIdycBFbBhIG2ISYksMH9Pbrw2QLwLB4ZAIG2jgkDbUvPsI4aHhemHUfzVf29/8D62mwa1iuUcA/A4zDIDnCDhiYMwDstTYrXsF6hLtuG9QrV0qR4N1UEAPWjBRAwGRMG2iZHoJ9WTR+izPwyHS8oo1sfgEfzyBZAm83W4P5t27Zp/PjxioqKks1m07vvvuuy/4EHHpDNZnN5jBkzxuWYwsJCTZkyRXa7XcHBwZo+fbpKS0tb/FqAizFhoG2LDQ3S7deEE/4AeDSPDICNTQIpKyvToEGDtGzZsnqPGTNmjE6fPu18pKSkuOyfMmWKDh48qI0bN2r9+vXatm2bZsyY0SL1Aw1hwgAAwN08sgv40KFDioqKqnf/2LFjNXbs2AbfIyAgQJGRkXXuO3z4sDZs2KA9e/bohhtukCQtXbpUd9xxh/70pz81eG7gSjFhAADgbq0eACdNmtTkY9euXStJio6OvuLzbtmyReHh4ercubNGjBih3/72t+rSpYskKTU1VcHBwc7wJ0mJiYny8fFRWlqa7rrrris+P9CQpUnxmp2y32UsIBMGAABmafUA6HA4WvsUlxgzZowmTZqk2NhYff311/r1r3+tsWPHKjU1Vb6+vsrJyVF4eLjLa9q1a6eQkBDl5OTU+74VFRWqqKhwPi8pKWm1a0DbxoQBuBtrUALW1uoBcMWKFa19ikvce++9zj8PGDBAAwcO1NVXX60tW7Zo5MiRl/2+S5Ys0VNPPdUSJQKSLkwY4MsXZmINSgCSh04CaWk9e/ZUaGiojh49KkmKjIxUbm6uyzHnz59XYWFhveMGJWnhwoUqLi52PrKzs1u1bgBoaaxBCUBywySQt99+W2+++aaysrJUWVnpsu+zzz5rlXOePHlSBQUF6tq1qyQpISFBRUVF2rdvnwYPHixJ+uSTT1RTU6OhQ4fW+z4BAQEKCAholRoBoLWxBiWAWqa2AL744ouaNm2aIiIitH//fg0ZMkRdunTRsWPHGp3V+32lpaVKT09Xenq6JCkzM1Pp6enKyspSaWmpHnnkEe3atUvHjx/Xpk2bNGHCBPXq1UujR4+WJPXt21djxozRgw8+qN27d2vHjh2aNWuW7r33XmYAA2izWIMSQC1TA+BLL72kV155RUuXLpW/v79++ctfauPGjZozZ46Ki4ub/D579+5VfHy84uMvzJicN2+e4uPjtWjRIvn6+uqLL77QnXfeqd69e2v69OkaPHiwPv30U5fWu9dff119+vTRyJEjdccdd+iWW27RK6+80uLXDACegjUoAdSyGY2tutyCAgMDdfjwYcXExCg8PFwbN27UoEGDlJGRoZtuukkFBQVmldIiSkpK5HA4VFxcLLvd7u5yAKBR97+6u941KFdNH+LGygDz8P1tcgtgZGSkCgsLJUndu3fXrl27JF3owjUxhwKAZS1NitewXqEu21iDErAeUyeBjBgxQu+//77i4+M1bdo0Pfzww3r77be1d+/eZi0YDQC4PKxBCUAyuQu4pqZGNTU1atfuQu5cs2aNdu7cqbi4OP3nf/6n/P39zSqlRdCEDACA9+H72+QAmJWVpejoaNlsNpfthmEoOztb3bt3N6uUFsFfIAAAvA/f3yaPAYyNjVVe3qVrUBUWFio2NtbMUgAAACzL1ABoGMYlrX/ShXX92rdvb2YpAAAAlmXKJJB58+ZJkmw2m5544gkFBv57Larq6mqlpaXpuuuuM6MUAAAAyzMlAO7ff+Eek4Zh6MCBAy6TPfz9/TVo0CAtWLDAjFIAAAAsz5QAuHnzZknStGnT9MILL1h2wCUAAIAnMHUdwBUrVjj/fPLkSUlSt27dzCwBAADA8kydBFJTU6Onn35aDodDMTExiomJUXBwsH7zm9+opqbGzFIAAAAsy9QWwMcee0yvvvqqnnnmGQ0bNkyStH37di1evFjnzp3T7373OzPLAQAAsCRTF4KOiorS8uXLdeedd7psf++99/SLX/xC33zzjVmltAgWkgQAwPvw/W1yC2BhYaH69OlzyfY+ffqosLDQzFJwhY7llepEYblb7iPqznMDANAWmBoABw0apL/+9a968cUXXbb/9a9/1aBBg8wsBZepqLxSc1LStS3j33d0GR4XpqVJ8XIE+rXZcwMA0JaY2gW8detWjRs3Tt27d1dCQoIkKTU1VdnZ2frggw906623mlVKi7BiE/L9r+7WjqP5
qv7eXxtfm03DeoVq1fQhbfbcAIC2w4rf3xcz/V7A//rXv3TXXXepqKhIRUVFmjRpko4cOaKYmBgzS8FlOJZXqm0ZeS4BTJKqDUPbMvKUmV/WJs8NAEBbY2oXcGxsrE6fPn3JbN+CggJFR0erurrazHLQTCcKyxvcf7ygrNXG5Lnz3AAAtDWmtgDW19tcWlqq9u3bm1kKLkNMSGCD+3t0ab0A5s5zAwDQ1pjSAjhv3jxJks1m06JFixQY+O8v8+rqaqWlpem6664zoxRcgZ5hHTU8LqzecXit2QLnznMDANDWmBIA9+/fL+lCC+CBAwfk7+/v3Ofv769BgwZpwYIFZpSCK7Q0KV6zU/a7zMQd1itUS5Pi2/S5AQBoS0ydBTxt2jS98MILbWbGjZVnEWXml+l4QZlb1uJz57nhPVgvEkB9rPz9XcvUANjW8BcI8DysFwmgMXx/mzwJBABa25yUdO04mu+ybcfRfM1O2e+migDA8xAAATToWF6pNh/J9Yq1FlkvEgCaxtR1AAF4D2/sSmW9SABoGloAAdTJG7tSWS8SAJqGAAiYwJu6USXv7UqtXS/S12arc/+T7x1UcXmVyVUBgOchAAKtqKi8Uve/ulsjntuqaSv26PY/bdH9r+72+BDSlK5UT7U0KV7DeoXWuc/TWzABwCwEQKAVeWM3quTdXamOQD8tvrNfnfs8vQUTAMxCAARaibd2o0r1d6X62mwaHhfm8RMpvLkFEwDMQAAEWom3h5C6ulK95dZ73tyCCQBmYBkYoJV4ewhxBPpp1fQhXnnrvdoWzB1H811aYH1tNg3rFeo11wEArYUWQKCVeHs3aq3Y0CDdfk2419Rby5tbMAGgtXEv4CvAvQTRmOLyKs1O2e9Viym3Nd7YggmgdfH9TQC8IvwFQlMRQgDAc/D9zRhAwBSxoQQ/AIDnYAwgAACAxRAAAQAALIYACAAAYDEEQAAAAIshAAIAAFiMVwbAbdu2afz48YqKipLNZtO7777rst8wDC1atEhdu3ZVhw4dlJiYqIyMDJdjCgsLNWXKFNntdgUHB2v69OkqLS018zIAXORYXqk2H8n16PskA0Bb4JUBsKysTIMGDdKyZcvq3P/ss8/qxRdf1PLly5WWlqagoCCNHj1a586dcx4zZcoUHTx4UBs3btT69eu1bds2zZgxw6xLANqkyw1wReWVuv/V3Rrx3FZNW7FHt/9pi+5/dbeKy6taqVIAsDavXwjaZrPpnXfe0cSJEyVdaP2LiorS/PnztWDBAklScXGxIiIitHLlSt177706fPiw+vXrpz179uiGG6M2xpMAACAASURBVG6QJG3YsEF33HGHTp48qaioqCadm4UkgQuKyis1JyX9su94cv+ru+u9b++q6UNapWYA1sX3t5e2ADYkMzNTOTk5SkxMdG5zOBwaOnSoUlNTJUmpqakKDg52hj9JSkxMlI+Pj9LS0up974qKCpWUlLg8AEhzUtK142i+y7YdR/M1O2V/o689lleqbRl5LuFPkqoNQ9sy8ugOBoBW0OYCYE5OjiQpIiLCZXtERIRzX05OjsLDw132t2vXTiEhIc5j6rJkyRI5HA7nIzo6uoWrB7xPYwFuze6sBkPcicLyBt//eAEBEABaWpsLgK1p4cKFKi4udj6ys7PdXRLgdo0FuF+tPdDgmL6YkMAGX9+jC7fQA4CW1uYCYGRkpCTpzJkzLtvPnDnj3BcZGanc3FyX/efPn1dhYaHzmLoEBATIbre7PACrayzA1aqvS7hnWEcNjwuTr83mst3XZtPwuDDuoQwAraDNBcDY2FhFRkZq06ZNzm0lJSVKS0tTQkKCJCkhIUFFRUXat2+f85hPPvlENTU1Gjp0qOk1A96svgB3sYbG9C1NitewXqEu24b1CtXSpPgWrRUAcEE7dxdwOUpLS3X06FHn88zMTKWnpyskJETdu3fX3Llz9dvf/lZxcXGKjY3VE088oaioKOdM4b59+2rMmDF68MEHtXz5clVVVWnWrFm69957mzwDGMC/LU2K1+yU/S6zgOtzvKDsklY9R6CfVk0fosz8Mh0vKFOPLkG0/AFAK/LKZWC2bNmi22+//ZLtU6dO1cqVK2UYhp588km98sorKioq0i233KKXXnpJvXv3dh5bWFioWbNmad26dfLx8dHkyZP14osvqmPHjk2ug2nkgKvM/DLtOlaghWsP1HvM5gW3Ee4AuBXf314aAD0Ff4GAurGuHwBPxvd3GxwDCMD9GNMHAJ7NK8cAAvBsjOkDAM9GAATQamJDCX4A4InoAgYAALAYAiAAAIDFEAABAAAshgAIAABgMQRAAAAAi2EWMNBGHcsr1YnCcpZgAQBcggAItDFF5ZWak5Lucl/e4XFhWpoUL0egnxsrAwB4CrqAgTZmTkq6dhzNd9m242i+Zqfsd1NFAABPQwAE2pBjeaXalpHncg9eSao2DG3LyFNmfpmbKgMAeBICINCGnCgsb3D/8QICIACAAAi0KTEhgQ3u79GFySAAAAIg0Kb0DOuo4XFh8rXZXLb72mwaHhfGbGAAgCQCINDmLE2K17BeoS7bhvUK1dKkeDdVBADwNCwDA7QxjkA/rZo+RJn5ZTpeUMY6gACASxAALYgFgq0hNpTPFwBQNwKghbBAcOsjXAMAvAEB0EIaWiB41fQhbqqqbWipcE2ABACYgQBoEbULBF/s+wsEe0Lg8NYAdKXhmtZZAICZCIAW0ZQFgt0ZuLw5ALVEuKZ1FgBgJpaBsQhPXyDYm+9fe6V33+D2bQAAsxEALcKTFwj29gB0OeH6WF6pNh/JVWZ+GbdvAwCYji5gC1maFK/ZKftduis9YYFgT++ebkxtuN5xNN8lxPrabBrWK9Sl9rq6um+I6dzg+7u7dRYA0PYQAC3EUxcI9vTu6aZoariuq6t7f1aROgf6qeS7840GSAAAWgIB0II8bYHg5rSgeaqmhOuGJot8W16lG3t01p7j3zq3e0LrLACgbSIAwiN4avd0czUUrhvr6v7F7b3Uo0uQR7XOAgDaJgIgPIKndk+3pKZ0dXta6ywAoG0iAMKjtOUA1Ba6ugEAbQPLwAAmWpoUr2G9Ql22eWNXNwDAu9ECCMsz8/ZzVujqBgB4PgIgWoU33NPXnbefa8td3QAAz0cARIvypnv6cv9dAIBVMQYQLcpb7unbUref23okVy9s+pc+rWN9PwAAPBUtgGgxDS10XBuqPKXb80pvP/fpv3I18/XPVFpR7dzWOdBP7yffouguDS/3AgCAu9ECiBbTlFBlpmN5pdp8JLfO1rzLvf1cUXml7n91t+77+x6X8CdJ35ZX6c5l2y+/YAAATEILIFqMp9zTtynjEC93Tb45KekNdvd+W16lTzPydGtcWAtdDQAALY8WQLSY2lDla7O5bPe12TQ8Lsy07t+mjkNs7pp8tV3cRp17/+2zrG8bOQIAAPeiBRAtyt339G3OOMTmrsnXWBd3reu7d7684gEAMAkBsA3xhLX33L3Q8eVM7jCMxtr0Lmisi1u6MBGE7l8AgKdrswFw8eLFeuqpp1y2XXP
NNfrqq68kSefOndP8+fO1Zs0aVVRUaPTo0XrppZcUERHhjnKviCeuveeuhY6bMg6xNiiHBPrruX/+q8k/t/rGDdaqnQUMAICnsxlNbf7wMosXL9bbb7+tjz/+2LmtXbt2Cg29MOZr5syZ+sc//qGVK1fK4XBo1qxZ8vHx0Y4dO5p8jpKSEjkcDhUXF8tut7f4NTTV/a/urncyQ1tc0Lixls66fh4+kgZFB6tTe786u4hrNfZzKy6vuqSLOyakgxaO7asxA7pe/kUBAEzjKd/f7tRmWwClC4EvMjLyku3FxcV69dVXtXr1ao0YMUKStGLFCvXt21e7du3STTfdZHapl82b1t67Uk1t6axrHGKNpP3ZRY2eo7Gfm7u7uAEAaAltehZwRkaGoqKi1LNnT02ZMkVZWVmSpH379qmqqkqJiYnOY/v06aPu3bsrNTW13verqKhQSUmJy8PdPG3tvdY0JyVd24+6ht26ZvfWhrQbe3SWj+uE5CZr7OcWGxqk268JJ/wBALxSmw2AQ4cO1cqVK7Vhwwa9/PLLyszM1K233qqzZ88qJydH/v7+Cg4OdnlNRESEcnJy6n3PJUuWyOFwOB/R0dGtfRmN8pS191rb59nfaltGnmouGrBQ363bjuWVas/xby85vqnays8NAIC6tNku4LFjxzr/PHDgQA0dOlQxMTF688031aFDh8t6z4ULF2revHnO5yUlJW4PgZe7oLHUvFnD7p5h/Ng7Xza4/+LZvU1dsuViTfm5AQDg7dpsALxYcHCwevfuraNHj+qHP/yhKisrVVRU5NIKeObMmTrHDNYKCAhQQECAGeU2S1PX3vv37Fc/PffPjCbNfvWEGcbH8kr15amGu9svbrFrypItdTFzzUIAANzFMgGwtLRUX3/9te677z4NHjxYfn5+2rRpkyZPnixJOnLkiLKyspSQkODmSpuvsYkJdYW4i9WOpbt49mtDd9Uwa4ZxY615/a+yX9Ji1zOsozoH+unb8qoGXzs8LkwLRvVWQXklEzoAAJbRZgPgggULNH78eMXExOjUqVN68skn5evrq6SkJDkcDk2fPl3z5s1TSEiI7Ha7Zs+erYSEBK+aAXyx+tbeqyvEXayu2a+eMsO4sda8Wbf3kuTaTW0YRoPhb8mkAbqpZxcCHwDAktpsADx58qSSkpJUUFCgsLAw3XLLLdq1a5fCwi7cpeEvf/mLfHx8NHnyZJeFoNua+kJcfb4/lu5y7qpxpeoaa9jYAswP/e9nl7T29b+q4XWdIh3tCX8AAMtqswFwzZo1De5v3769li1bpmXLlplUkXs0dzLE98fSmTnDuLGxhnWNc/y+i1v7DjVzzCAAAFbSZpeBwQVNnQzha7NpeFyYS6tYbcvbxX9J6jr2SjU01lD69zjHVf/RtHGHtcu/mFE7AADeps22AFpBU5Zmaaz7tFZds1+Lyit1vqZGNRcdOyQ25Ipnyl48Xq+xsYaGYejg6RK9+HFGs87TL8ruMoOYWb4AABAAvVJzl2apq/t0eFyYFozurYKy+me//uzVNB38xrUr1ccm+fn6OM/T3PUB66q9sfF6s1d/1ugyMPVZ+tPrJYnbtgEA8D02w2igWQgNctfNpO9/dXe9Cz83tDRLU+9fe6KgTOOXblfJufP1HvPylHil7D55SaicPypOheVV9Z6jrtp9bGrwjh0+0iWtkI1pys8DAGBN7vr+9iS0AHqZK1mapb5lYi42cdmOBsOfJM18ff8l27Zl5DXYKllf7Y3drq254U+iqxcAgIYwCcTLNGVpliux9Uhuo4snN9X3J3Fc6Pq9NDS2tAWjemvzgtu0avoQ0+5UAgCAtyEAepnGZvW+9MlRFV9BgEs/WXTZr73Y91sl56SkN7o0y5XqHOinWSPiGOcHAEAjCIBepHbCxY09OsvXZqvzmM+yipytbpfjum7BjR/UTLuOFWhbRt5ldeU2lb19O72ffEsrngEAgLaDMYBeoK6Zs/b27eocp3clt2krKq/Uq9uPX2m5l6g7qrbce/ePsmvdnFtb8SwAALQtBEAP9f3lVZ587+AliySfbWSSRkO3aatv6ZaG7hnczsem6hpDzZkyXjsTd0hsSDNe1Ty3/v+JJgAAoOkIgB6mrta+ujQWxOq61Vmda/BF2fX7uwaoY/t2DZ5z7cyb9et3DjS4Hl9MSAedKPzO+XxYr1DNH9VbJwrL622xvBzPTBqgCEd71vUDAOAyEQA9TEOtcE1R2+pWVzCq672/PFWiO5ftUP+ohtdBKiiv1ItJ8Rrx3NZ6j6kNf/2vsuuBhB5amXpcE5btaP5FNGJozy4EPwAArgCTQDxI7Tp5Dd2yrTH2Du30u4n9m/3eBxuZodujS5DztnL1TUCp9eU3JVrw9hf68puWnfXLfXwBAGgZBEAP0tgaf03xbXmV5qy5dBZwY+9dGwt9Lsp2taHLMAxtPpKrBaN7a1iv0Cuusy4Xx8p2FxXD4s4AALQMuoA9SGNr/DXV/uwi3f3yTv331BvlCPRTUXmlXtp8tEmvDfT3VWlFtfN5ez+bisorXbp+h8eF6f3kYVqxM1Pv7D/VIjVL0pJJA3RV5w76LOtbXd+9s26NC2vy7esAAEDTEQA9SM+wjuoc6FfnnTg6B/qpW3B7HTh1tknvtffEt5qdsl+rpg/RnJR0fXaiaQs8l1dWuzwvq6zRF98Uu2zbcTRfZ89VtUiL5ffVju27NS7Mua2pt68DAABNRwD0IMfySuu9Ddu35VU6X930sYGG/v+9ef+V1+iMYunCWIAaNX5fXunCWoP7s1vujiGSdPPVTOwAAMAsjAH0II21qJ2taP4yKvuzv23Scf0amQXcmobHhenlKYPddn4AAKyGFkAP0lJjAL8vPrpzg/uXTBqgm3p2kWEYDS7x0hpiQjpoadL1Ghjd8refAwAA9aMF0IPUt8yKr82mAVc1r4WudvZut84dGjyu9kw9wzoqPtrRrHNcCXv7dtr6yxGEPwAA3IAWQA+zNCles1P2u4zbq13+5LY/ba5zjGDHAF9Jcpm9W7se4NH80gbP96u1BySp3sknraFzoJ/eT77FlHMBAIBL2QzjClYdtriSkhI5HA4VFxfLbm/ZMXR1LX+SXVCuO5dtdwlqnQP9FBfeUftOFLks8uxjk66NsuunQ7tr4dovW7S2yxUT0kELx/bVmAFd3V0KAMDCWvP721sQAK+AWX+BjuWV6kRhuTMMvrknWzuP5WvY1aEaHNPZ9LF7zdUxwFev/OwG3RzXOgtIAwDQHARAuoA92ufZ3+qxd77Ul9+7Tdv3u2rf3X9K/Zs5NtAsfSM6aszArs4FnQEAgOcgAHqgovJKPbhqr/Ycv3QJl4vH6R1q5B6+7jA8LkxLk+LlCPRzdykAAKAOBEAPU1Reqdv/tKXJEzKasnBzawvvFKDk269W9y5B3LINAAAvQAD0MA/8fbdps3Fbyhv/mUDoAwDAixAAPcixvFKlnyxu/EAP4WOTbukVRvgDAMDLsBC0B0nLLHR3Cc1yS68LY/0AAIB3oQXQg+SdPefuEprs/eRh3MUDAAAvRQ
D0IGGd2ru7hEb5+kjrkm9Rv6vMu20cAABoWQRADzI0NsTdJdTL39emeT/srYdu6+XuUgAAwBUiAHqQnmEd1cHPR99V1bi7FKeYLoFaOLaPxvTn9m0AALQVBEAPciyv1CPCX6Cfj5b9bDBr+gEA0EYRAD3IicJyd5egzoF+ej/5FkV3CXR3KQAAoJUQAD1ITIh7Q9evx/bRjB9c7dYaAABA6yMAWlw7H+k/hvXQr8dd6+5SAACASQiAHuR/Uk+Ydi4fSU9NuFb3JfQw7ZwAAMAzEAA9yK5j+aac56HhsfrVHf1MORcAAPA8BEAPEuDn26rvf/s1oVoxbWirngMAAHg+AqAHCWqlANg9OEDbfpXYKu8NAAC8j4+7C3C3ZcuWqUePHmrfvr2GDh2q3bt3u62Wr/PKWvT9unby0/FnxhH+AACAC0sHwDfeeEPz5s3Tk08+qc8++0yDBg3S6NGjlZub65Z6cs5WtMj7dAyQjj8zTqmPjWqR9wMAAG2LpQPgn//8Zz344IOaNm2a+vXrp+XLlyswMFB///vf3V3aZTv+zDh9+dQ4d5cBAAA8mGXHAFZWVmrfvn1auHChc5uPj48SExOVmppa52sqKipUUfHvVrqSkpJWr7Opjj9D6AMAAE1j2QCYn5+v6upqRUREuGyPiIjQV199VedrlixZoqeeesqM8pqM4AcAAJrL0l3AzbVw4UIVFxc7H9nZ2W6r5fgz4wh/AADgsli2BTA0NFS+vr46c+aMy/YzZ84oMjKyztcEBAQoICCg1Wo6/sw49fjVPxo9BgAA4EpYtgXQ399fgwcP1qZNm5zbampqtGnTJiUkJLixsvoR/gAAQEuwbAugJM2bN09Tp07VDTfcoCFDhuj5559XWVmZpk2b5raaakPe91sCCX4AAKAlWToA3nPPPcrLy9OiRYuUk5Oj6667Ths2bLhkYog7EPoAAEBrsRmGYbi7CG9VUlIih8Oh4uJi2e12d5cDAACagO9vC48BBAAAsCoCIAAAgMUQAAEAACyGAAgAAGAxBEAAAACLIQACAABYDAEQAADAYgiAAAAAFkMABAAAsBhL3wruStXeRKWkpMTNlQAAgKaq/d628s3QCIBX4OzZs5Kk6OhoN1cCAACa6+zZs3I4HO4uwy24F/AVqKmp0alTp9SpUyfZbLYWfe+SkhJFR0crOzvbsvcpdDc+A/fjM3A/PgP34zNoeYZh6OzZs4qKipKPjzVHw9ECeAV8fHzUrVu3Vj2H3W7nH7yb8Rm4H5+B+/EZuB+fQcuyastfLWvGXgAAAAsjAAIAAFiM7+LFixe7uwjUzdfXV7fddpvataOn3l34DNyPz8D9+Azcj88ALY1JIAAAABZDFzAAAIDFEAABAAAshgAIAABgMQRAAAAAiyEAeqBly5apR48eat++vYYOHardu3e7u6Q2a/HixbLZbC6PPn36OPefO3dOycnJ6tKlizp27KjJkyfrzJkzbqzY+23btk3jx49XVFSUbDab3n33XZf9hmFo0aJF6tq1qzp06KDExERlZGS4HFNYWKgpU6bIbrcrODhY06dPV2lpqZmX4dUa+wweeOCBS/5djBkzxuUYPoMrs2TJEt14443q1KmTwsPDNXHiRB05csTlmKb8/snKytK4ceMUGBio8PBwPfLIIzp//ryZlwIvRQD0MG+88YbmzZunJ598Up999pkGDRqk0aNHKzc3192ltVnXXnutTp8+7Xxs377due/hhx/WunXr9NZbb2nr1q06deqUJk2a5MZqvV9ZWZkGDRqkZcuW1bn/2Wef1Ysvvqjly5crLS1NQUFBGj16tM6dO+c8ZsqUKTp48KA2btyo9evXa9u2bZoxY4ZZl+D1GvsMJGnMmDEu/y5SUlJc9vMZXJmtW7cqOTlZu3bt0saNG1VVVaVRo0aprKzMeUxjv3+qq6s1btw4VVZWaufOnXrttde0cuVKLVq0yB2XBG9jwKMMGTLESE5Odj6vrq42oqKijCVLlrixqrbrySefNAYNGlTnvqKiIsPPz8946623nNsOHz5sSDJSU1PNKrFNk2S88847zuc1NTVGZGSk8cc//tG5raioyAgICDBSUlIMwzCMQ4cOGZKMPXv2OI/58MMPDZvNZnzzzTfmFd9GXPwZGIZhTJ061ZgwYUK9r+EzaHm5ubmGJGPr1q2GYTTt988HH3xg+Pj4GDk5Oc5jXn75ZcNutxsVFRXmXgC8Di2AHqSyslL79u1TYmKic5uPj48SExOVmprqxsratoyMDEVFRalnz56aMmWKsrKyJEn79u1TVVWVy+fRp08fde/enc+jlWRmZionJ8flZ+5wODR06FDnzzw1NVXBwcG64YYbnMckJibKx8dHaWlpptfcVm3ZskXh4eG65pprNHPmTBUUFDj38Rm0vOLiYklSSEiIpKb9/klNTdWAAQMUERHhPGb06NEqKSnRwYMHTawe3ogA6EHy8/NVXV3t8o9ZkiIiIpSTk+Omqtq2oUOHauXKldqwYYNefvllZWZm6tZbb9XZs2eVk5Mjf39/BQcHu7yGz6P11P5cG/o3kJOTo/DwcJf97dq1U0hICJ9LCxkzZoxWrVqlTZs26Q9/+IO2bt2qsWPHqrq6WhKfQUurqanR3LlzNWzYMPXv31+SmvT7Jycnp85/K7X7gIZwTxlY2tixY51/HjhwoIYOHaqYmBi9+eab6tChgxsrA9zn3nvvdf55wIABGjhwoK6++mpt2bJFI0eOdGNlbVNycrK+/PJLl/HHQGujBdCDhIaGytfX95JZXmfOnFFkZKSbqrKW4OBg9e7dW0ePHlVkZKQqKytVVFTkcgyfR+up/bk29G8gMjLykklR58+fV2FhIZ9LK+nZs6dCQ0N19OhRSXwGLWnWrFlav369Nm/erG7dujm3N+X3T2RkZJ3/Vmr3AQ0hAHoQf39/DR48WJs2bXJuq6mp0aZNm5SQkODGyqyjtLRUX3/9tbp27arBgwfLz8/P5fM4cuSIsrKy+DxaSWxsrCIjI11+5iUlJUpLS3P+zBMSElRUVKR9+/Y5j/nkk09UU1OjoUOHml6zFZw8eVIFBQXq2rWrJD6DlmAYhmbNmqV33nlHn3zyiWJjY132N+X3T0JCgg4cOOASxjdu3Ci73a5+/fqZcyHwXu6ehQJXa9asMQICAoyVK1cahw4dMmbMmGEEBwe7zPJCy5k/f76xZcsWIzMz09ixY4eRmJhohIaGGrm5uYZhGMZDDz1kdO/e3fjkk0+MvXv3GgkJCUZCQoKbq/ZuZ8+eNfbv32/s37/fkGT8+c9/Nvbv32+cOHHCMAzDeOaZZ4zg4GDjvffeM7744gtjwoQJRmxsrPHdd98532PMmDFGfHy8kZaWZmzfvt2Ii4szkpKS3HVJXqehz+Ds2bPGggULjNTUVCMzM9P4+OOPjeuvv96Ii4szzp0753wPPoMrM3PmTMPhcBhbtmwxTp8+7XyUl5c7j2ns98/58+eN/v37G6NGjTLS09ONDRs2GGFhYcbChQvdcUnwMgRAD7R06VKje/fuhr+/vzFkyBBj1
65d7i6pzbrnnnuMrl27Gv7+/sZVV11l3HPPPcbRo0ed+7/77jvjF7/4hdG5c2cjMDDQuOuuu4zTp0+7sWLvt3nzZkPSJY+pU6cahnFhKZgnnnjCiIiIMAICAoyRI0caR44ccXmPgoICIykpyejYsaNht9uNadOmGWfPnnXD1Xinhj6D8vJyY9SoUUZYWJjh5+dnxMTEGA8++OAl/wnlM7gydf38JRkrVqxwHtOU3z/Hjx83xo4da3To0MEIDQ015s+fb1RVVZl8NfBGNsMwDLNbHQEAAOA+jAEEAACwGAIgAACAxRAAAQAALIYACAAAYDEEQAAAAIshAAIAAFgMARAAAMBiCIAAAAAWQwAEYIrbbrtNc+fOvezXL168WNddd12zXmMYhmbMmKGQkBDZbDalp6c36XU2m03vvvvu5ZQJAF6BAAigzdqwYYNWrlyp9evX6/Tp0+rfv7+7S7rEypUrFRwc7O4yAFhMO3cXAACt5euvv1bXrl118803u7sUAPAotAACME1NTY1++ctfKiQkRJGRkVq8eLFzX1FRkX7+858rLCxMdrtdI0aM0Oeff17vez3wwAOaOHGinnrqKedrHnroIVVWVjr3z549W1lZWbLZbOrRo4ckqUePHnr++edd3uu6665zqaWpbr75Zj366KMu2/Ly8uTn56dt27ZJkr799lvdf//96ty5swIDAzV27FhlZGRIkrZs2aJp06apuLhYNptNNpvNWUdFRYUWLFigq666SkFBQRo6dKi2bNniPM+JEyc0fvx4de7cWUFBQbr22mv1wQcfNPsaAFgTARCAaV577TUFBQUpLS1Nzz77rJ5++mlt3LhRknT33XcrNzdXH374ofbt26frr79eI0eOVGFhYb3vt2nTJh0+fFhbtmxRSkqK1q5dq6eeekqS9MILL+jpp59Wt27ddPr0ae3Zs6fFr2fKlClas2aNDMNwbnvjjTcUFRWlW2+9VdKFILp37169//77Sk1NlWEYuuOOO1RVVaWbb75Zzz//vOx2u06fPq3Tp09rwYIFkqRZs2YpNTVVa9as0RdffKG7775bY8aMcYbH5ORkVVRUaNu2bTpw4ID+8Ic/qGPHji1+jQDaJrqAAZhm4MCBevLJJyVJcXFx+utf/6pNmzapQ4cO2r17t3JzcxUQECBJ+tOf/qR3331Xb7/9tmbMmFHn+/n7++vvf/+7AgMDde211+rpp5/WI488ot/85jdyOBzq1KmTfH19FRkZ2SrX85Of/ERz587V9u3bnYFv9erVSkpKks1mU0ZGht5//33t2LHD2Q39+uuvKzo6Wu+++67uvvtuORwO2Ww2lxqzsrK0YsUKZWVlKSoqSpK0YMECbdiwQStWrNDvf/97ZWVlafLkyRowYIAkqWfPnq1yjQDaJgIgANMMHDjQ5XnXrl2Vm5urzz//XKWlperSpYvL/u+++05ff/11ve83aNAgBQYGOp8nJCSotLRU2dnZiomJadni6xAWFqZRo0bp9ddf16233qrMzEylpqbqv/7rvyRJhw8fVrt27TR06FDna7p06aJrrrlGhw8frvd9Dxw4oOrqavXu3dtlSs5EAwAAA29JREFUe0VFhfNnNGfOHM2cOVP//Oc/lZiYqMmTJ1/y8wWA+hAAAZjGz8/P5bnNZlNNTY1KS0vVtWtXlzFutVp6hqyPj49Ll60kVVVVXfb7TZkyRXPmzNHSpUu1evVqDRgwwNkqd7lKS0vl6+urffv2ydfX12VfbTfvz3/+c40ePVr/+Mc/9M9//lNLlizRc889p9mzZ1/RuQFYA2MAAbjd9ddfr5ycHLVr1069evVyeYSGhtb7us8//1zfffed8/muXbvUsWNHRUdH1/uasLAwnT592vm8pKREmZmZl137hAkTdO7cOW3YsEGrV6/WlClTnPv69u2r8+fPKy0tzbmtoKBAR44cUb9+/SRd6Maurq52ec/4+HhVV1crNzf3kp/H97uKo6Oj9dBDD2nt2rWaP3++/va3v132dQCwFgIgALdLTExUQkKCJk6cqH/+8586fvy4du7cqccee0x79+6t93WVlZWaPn26Dh06pA8++EBPPvmkZs2aJR+f+n+1jRgxQv/zP/+jTz/9VAcOHNDUqVMvaWVrjqCgIE2cOFFPPPGEDh8+rKSkJOe+uLg4TZgwQQ8++KC2b9+uzz//XD/72c901VVXacKECZIuzEouLS3Vpk2blJ+fr/LycvXu3VtTpkzR/fffr7Vr1yozM1O7d+/WkiVL9I9//EOSNHfuXH300UfKzMzUZ599ps2bN6tv376XfR0ArIUACMDtbDabPvjgAw0fPlzTpk1T7969de+99+rEiROKiIio93UjR45UXFychg8frnvuuUd33nlno8u5LFy4UD/4wQ/+X3t3jKJoDIYB+J1SsLUSbATF2krBXvEv/s5CEEvB0jtYeABbb+ElrLyIeAFxOmGXtZlZZlnyPJAqJHzpXvIFkqqqMp/PU9d1ut3ut+pfLpe5Xq+ZTCbpdDq/zJ1OpwyHw1RVldFolOfzmfP5/GqHj8fjbDabLBaLtFqtHA6H17rVapXdbpd+v5+6rnO5XF77Px6PbLfbDAaDTKfT9Hq9HI/Hb50DKMfH8/fHMAD/gfV6nfv97ss2gC9wAwgAUBgBEOCN/X6fZrP5xzGbzf51eQBfpgUM8Mbtdnv7E0mj0Ui73f7higD+DgEQAKAwWsAAAIURAAEACiMAAgAURgAEACiMAAgAUBgBEACgMAIgAEBhPgE75GKuXMt7WgAAAABJRU5ErkJggg==\n", 722 | "text/plain": [ 723 | "" 724 | ] 725 | }, 726 | "metadata": {}, 727 | "output_type": "display_data" 728 | } 729 | ], 730 | "source": [ 731 | "# Close previous plots; otherwise, will just overwrite and display again\n", 732 | "plt.close()\n", 733 | "\n", 734 | "sampled_df = data.sample(fraction=0.0001).toPandas()\n", 735 | "sampled_df.plot.scatter('helpful_votes', 'total_votes')\n", 736 | "%matplot plt" 737 | ] 738 | }, 739 | { 740 | "cell_type": "markdown", 741 | "metadata": {}, 742 | "source": [ 743 | "## Fitting a Machine Learning Model to Predict whether a Rating will be Good or Bad\n", 744 | "\n", 745 | "Once we've explored our data, let's say that we want to fit a Machine Learning model to predict a review's star rating based on 
other features about the review. Here, let's fit a simple (and extremely naive, for the sake of demonstration) model that predicts whether or not a customer review will have a \"good\" star rating (>= 4) or a \"bad\" star rating (<4) based on the total number of votes a review receives. In your homework, you will improve upon this model, engineering other features from the dataset, as well as doing a better job of balancing the data (see below -- there are a lot more \"good\" star-ratings than \"bad\" ones) and setting up a reproducible machine learning pipeline.\n", 746 | "\n", 747 | "First, let's create a column that indicates whether a review is good or bad. We'll use this column as labels for machine learning. " 748 | ] 749 | }, 750 | { 751 | "cell_type": "code", 752 | "execution_count": 12, 753 | "metadata": {}, 754 | "outputs": [ 755 | { 756 | "data": { 757 | "application/vnd.jupyter.widget-view+json": { 758 | "model_id": "68028de131b8490aadac465419d6dac1", 759 | "version_major": 2, 760 | "version_minor": 0 761 | }, 762 | "text/plain": [ 763 | "VBox()" 764 | ] 765 | }, 766 | "metadata": {}, 767 | "output_type": "display_data" 768 | }, 769 | { 770 | "data": { 771 | "application/vnd.jupyter.widget-view+json": { 772 | "model_id": "", 773 | "version_major": 2, 774 | "version_minor": 0 775 | }, 776 | "text/plain": [ 777 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 778 | ] 779 | }, 780 | "metadata": {}, 781 | "output_type": "display_data" 782 | }, 783 | { 784 | "name": "stdout", 785 | "output_type": "stream", 786 | "text": [ 787 | "+-----------+-----------+\n", 788 | "|star_rating|good_review|\n", 789 | "+-----------+-----------+\n", 790 | "| 5| 1|\n", 791 | "| 4| 1|\n", 792 | "| 4| 1|\n", 793 | "| 5| 1|\n", 794 | "| 5| 1|\n", 795 | "+-----------+-----------+\n", 796 | "only showing top 5 rows\n", 797 | "\n", 798 | "+-----------+--------+\n", 799 | "|good_review| count|\n", 800 | "+-----------+--------+\n", 801 | "| 1|17208450|\n", 802 | "| 0| 3517710|\n", 803 | "+-----------+--------+" 804 | ] 805 | } 806 | ], 807 | "source": [ 808 | "# Good == 1, Bad == 0 (cast as integers so that pyspark.ml can understand them)\n", 809 | "data = data.withColumn('good_review', (data.star_rating >= 4).cast(\"integer\"))\n", 810 | "\n", 811 | "# Check to make sure new column is capturing star_rating correctly\n", 812 | "data[['star_rating', 'good_review']].show(5)\n", 813 | "\n", 814 | "# Take a look at how many good and bad reviews we have, respectively\n", 815 | "(data.groupBy('good_review')\n", 816 | " .count()\n", 817 | " .show()\n", 818 | ")" 819 | ] 820 | }, 821 | { 822 | "cell_type": "markdown", 823 | "metadata": {}, 824 | "source": [ 825 | "Then, let's use the `VectorAssembler` to get our total_votes feature into a form that `pyspark.ml` expects it to be in."
826 | ] 827 | }, 828 | { 829 | "cell_type": "code", 830 | "execution_count": 13, 831 | "metadata": {}, 832 | "outputs": [ 833 | { 834 | "data": { 835 | "application/vnd.jupyter.widget-view+json": { 836 | "model_id": "d501f73cce924fccbb84d89a7854d72f", 837 | "version_major": 2, 838 | "version_minor": 0 839 | }, 840 | "text/plain": [ 841 | "VBox()" 842 | ] 843 | }, 844 | "metadata": {}, 845 | "output_type": "display_data" 846 | }, 847 | { 848 | "data": { 849 | "application/vnd.jupyter.widget-view+json": { 850 | "model_id": "", 851 | "version_major": 2, 852 | "version_minor": 0 853 | }, 854 | "text/plain": [ 855 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 856 | ] 857 | }, 858 | "metadata": {}, 859 | "output_type": "display_data" 860 | }, 861 | { 862 | "name": "stdout", 863 | "output_type": "stream", 864 | "text": [ 865 | "+-----------+-----------+--------+\n", 866 | "|good_review|total_votes|features|\n", 867 | "+-----------+-----------+--------+\n", 868 | "| 1| 10| [10.0]|\n", 869 | "| 1| 7| [7.0]|\n", 870 | "| 1| 0| [0.0]|\n", 871 | "| 1| 1| [1.0]|\n", 872 | "| 1| 0| [0.0]|\n", 873 | "| 1| 7| [7.0]|\n", 874 | "| 1| 1| [1.0]|\n", 875 | "| 0| 7| [7.0]|\n", 876 | "| 1| 0| [0.0]|\n", 877 | "| 1| 9| [9.0]|\n", 878 | "| 1| 0| [0.0]|\n", 879 | "| 1| 1| [1.0]|\n", 880 | "| 1| 0| [0.0]|\n", 881 | "| 1| 0| [0.0]|\n", 882 | "| 0| 6| [6.0]|\n", 883 | "| 1| 0| [0.0]|\n", 884 | "| 1| 0| [0.0]|\n", 885 | "| 1| 1| [1.0]|\n", 886 | "| 1| 1| [1.0]|\n", 887 | "| 1| 1| [1.0]|\n", 888 | "+-----------+-----------+--------+\n", 889 | "only showing top 20 rows" 890 | ] 891 | } 892 | ], 893 | "source": [ 894 | "from pyspark.ml.feature import VectorAssembler\n", 895 | "\n", 896 | "features = ['total_votes']\n", 897 | "assembler = VectorAssembler(inputCols = features, outputCol = 'features')\n", 898 | "\n", 899 | "data = assembler.transform(data)\n", 900 | "data[['good_review', 'total_votes', 'features']].show()" 901 | ] 902 | }, 903 | { 904 | "cell_type": "markdown", 905 | "metadata": {}, 906 | "source": [ 907 | "Then, we split up our data into training and test data and train a logistic regression model on our training data." 908 | ] 909 | }, 910 | { 911 | "cell_type": "code", 912 | "execution_count": 14, 913 | "metadata": {}, 914 | "outputs": [ 915 | { 916 | "data": { 917 | "application/vnd.jupyter.widget-view+json": { 918 | "model_id": "b3349118c86646a388d1f126d2243b4a", 919 | "version_major": 2, 920 | "version_minor": 0 921 | }, 922 | "text/plain": [ 923 | "VBox()" 924 | ] 925 | }, 926 | "metadata": {}, 927 | "output_type": "display_data" 928 | }, 929 | { 930 | "data": { 931 | "application/vnd.jupyter.widget-view+json": { 932 | "model_id": "", 933 | "version_major": 2, 934 | "version_minor": 0 935 | }, 936 | "text/plain": [ 937 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 938 | ] 939 | }, 940 | "metadata": {}, 941 | "output_type": "display_data" 942 | } 943 | ], 944 | "source": [ 945 | "from pyspark.ml.classification import LogisticRegression\n", 946 | "\n", 947 | "train, test = data.randomSplit([0.7, 0.3])\n", 948 | "\n", 949 | "lr = LogisticRegression(featuresCol='features', labelCol='good_review')\n", 950 | "model = lr.fit(train)" 951 | ] 952 | }, 953 | { 954 | "cell_type": "markdown", 955 | "metadata": {}, 956 | "source": [ 957 | "And, finally, we can make predictions for our test data and see how good our model is. 
Note that our predictive accuracy is not bad overall. However, from our \"false positive rate by label\" metric, we can see that our model is essentially just predicting that most reviews will be \"good\" reviews (label 1) -- likely because of our highly unbalanced data. Accordingly, you'll notice that our AUC is only barely above .5 (see ROC Curve below). In Assignment 3, it'll be your job to try to improve this model, so that we can better distinguish good from bad customer reviews." 958 | ] 959 | }, 960 | { 961 | "cell_type": "code", 962 | "execution_count": 15, 963 | "metadata": {}, 964 | "outputs": [ 965 | { 966 | "data": { 967 | "application/vnd.jupyter.widget-view+json": { 968 | "model_id": "189a87f193c941418c73f6bc52bdc950", 969 | "version_major": 2, 970 | "version_minor": 0 971 | }, 972 | "text/plain": [ 973 | "VBox()" 974 | ] 975 | }, 976 | "metadata": {}, 977 | "output_type": "display_data" 978 | }, 979 | { 980 | "data": { 981 | "application/vnd.jupyter.widget-view+json": { 982 | "model_id": "", 983 | "version_major": 2, 984 | "version_minor": 0 985 | }, 986 | "text/plain": [ 987 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 988 | ] 989 | }, 990 | "metadata": {}, 991 | "output_type": "display_data" 992 | } 993 | ], 994 | "source": [ 995 | "# Training Summary Data\n", 996 | "trainingSummary = model.summary\n", 997 | "evaluationSummary = model.evaluate(test)" 998 | ] 999 | }, 1000 | { 1001 | "cell_type": "code", 1002 | "execution_count": null, 1003 | "metadata": {}, 1004 | "outputs": [ 1005 | { 1006 | "data": { 1007 | "application/vnd.jupyter.widget-view+json": { 1008 | "model_id": "3c72d869035144ff8291ba382180fe94", 1009 | "version_major": 2, 1010 | "version_minor": 0 1011 | }, 1012 | "text/plain": [ 1013 | "VBox()" 1014 | ] 1015 | }, 1016 | "metadata": {}, 1017 | "output_type": "display_data" 1018 | }, 1019 | { 1020 | "data": { 1021 | "application/vnd.jupyter.widget-view+json": { 1022 | "model_id": "bae04754455f4c91a3ce17987262680a", 1023 | "version_major": 2, 1024 | "version_minor": 0 1025 | }, 1026 | "text/plain": [ 1027 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 1028 | ] 1029 | }, 1030 | "metadata": {}, 1031 | "output_type": "display_data" 1032 | } 1033 | ], 1034 | "source": [ 1035 | "print(\"Training AUC: \" + str(trainingSummary.areaUnderROC))\n", 1036 | "print(\"Test AUC: \", str(evaluationSummary.areaUnderROC))\n", 1037 | "\n", 1038 | "print(\"\\nFalse positive rate by label (Training):\")\n", 1039 | "for i, rate in enumerate(trainingSummary.falsePositiveRateByLabel):\n", 1040 | " print(\"label %d: %s\" % (i, rate))\n", 1041 | "\n", 1042 | "print(\"\\nTrue positive rate by label (Training):\")\n", 1043 | "for i, rate in enumerate(trainingSummary.truePositiveRateByLabel):\n", 1044 | " print(\"label %d: %s\" % (i, rate))\n", 1045 | " \n", 1046 | "print(\"\\nTraining Accuracy: \" + str(trainingSummary.accuracy))\n", 1047 | "print(\"Test Accuracy: \", str(evaluationSummary.accuracy))" 1048 | ] 1049 | }, 1050 | { 1051 | "cell_type": "code", 1052 | "execution_count": null, 1053 | "metadata": {}, 1054 | "outputs": [], 1055 | "source": [ 1056 | "# Get ROC curve and send it to Pandas so that we can plot it\n", 1057 | "roc_df = evaluationSummary.roc.toPandas()" 1058 | ] 1059 | }, 1060 | { 1061 | "cell_type": "code", 1062 | "execution_count": null, 1063 | "metadata": {}, 1064 | "outputs": [], 1065 | "source": [ 1066 | "# Close previous plots; 
otherwise, will just overwrite and display again\n", 1067 | "plt.close()\n", 1068 | "\n", 1069 | "plt.plot(roc_df.FPR, roc_df.TPR, 'b', label = 'AUC = %0.2f' % evaluationSummary.areaUnderROC)\n", 1070 | "plt.legend(loc = 'lower right')\n", 1071 | "plt.plot([0, 1], [0, 1],'r--')\n", 1072 | "plt.xlim([0, 1])\n", 1073 | "plt.ylim([0, 1])\n", 1074 | "plt.ylabel('True Positive Rate')\n", 1075 | "plt.xlabel('False Positive Rate')\n", 1076 | "plt.title('ROC Curve')\n", 1077 | "plt.show()\n", 1078 | "\n", 1079 | "%matplot plt" 1080 | ] 1081 | } 1082 | ], 1083 | "metadata": { 1084 | "kernelspec": { 1085 | "display_name": "PySpark", 1086 | "language": "", 1087 | "name": "pysparkkernel" 1088 | }, 1089 | "language_info": { 1090 | "codemirror_mode": { 1091 | "name": "python", 1092 | "version": 3 1093 | }, 1094 | "mimetype": "text/x-python", 1095 | "name": "pyspark", 1096 | "pygments_lexer": "python3" 1097 | } 1098 | }, 1099 | "nbformat": 4, 1100 | "nbformat_minor": 4 1101 | } 1102 | -------------------------------------------------------------------------------- /Labs/Lab 6 PySpark EDA and ML in an EMR Notebook/Local_Colab_Spark_Setup.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Local/Colab Spark Setup", 7 | "provenance": [], 8 | "collapsed_sections": [], 9 | "toc_visible": true, 10 | "include_colab_link": true 11 | }, 12 | "kernelspec": { 13 | "display_name": "Python 3", 14 | "name": "python3" 15 | } 16 | }, 17 | "cells": [ 18 | { 19 | "cell_type": "markdown", 20 | "metadata": { 21 | "id": "view-in-github", 22 | "colab_type": "text" 23 | }, 24 | "source": [ 25 | "\"Open" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": { 31 | "id": "qpaNU3ETh0Qc", 32 | "colab_type": "text" 33 | }, 34 | "source": [ 35 | "# Setting up PySpark in a Colab Notebook\n", 36 | "\n", 37 | "You can run Spark both locally and on a cluster. Here, I'll demonstrate how you can set up Spark to run in a Colab notebook for debugging purposes. You can also set up Spark locally in this same way if you want to take advantage of multiple CPU cores on your laptop (the setup will vary slightly, though, depending on your operating system and you'll need to figure out these specifics on your own; however, this setup does also work on WSL for me).\n", 38 | "\n", 39 | "This being said, this local option should be for testing purposes on sample datasets only. 
If you want to run big PySpark jobs, you will want to run these in an EMR notebook (with an EMR cluster as your backend).\n", 40 | "\n", 41 | "First, we need to install Spark and PySpark by running the following commands:" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "metadata": { 47 | "id": "R8f1D7wfaCgF", 48 | "colab_type": "code", 49 | "outputId": "75895338-8205-406b-be0a-c674341d07d5", 50 | "colab": { 51 | "base_uri": "https://localhost:8080/", 52 | "height": 68 53 | } 54 | }, 55 | "source": [ 56 | "%%bash\n", 57 | "apt-get install openjdk-8-jdk-headless -qq > /dev/null\n", 58 | "wget -q https://www-us.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz\n", 59 | "tar -xzf spark-2.4.5-bin-hadoop2.7.tgz\n", 60 | "pip install pyspark findspark" 61 | ], 62 | "execution_count": 0, 63 | "outputs": [ 64 | { 65 | "output_type": "stream", 66 | "text": [ 67 | "Requirement already satisfied: pyspark in /usr/local/lib/python3.6/dist-packages (2.4.5)\n", 68 | "Requirement already satisfied: findspark in /usr/local/lib/python3.6/dist-packages (1.3.0)\n", 69 | "Requirement already satisfied: py4j==0.10.7 in /usr/local/lib/python3.6/dist-packages (from pyspark) (0.10.7)\n" 70 | ], 71 | "name": "stdout" 72 | } 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": { 78 | "id": "fHjZLeId0nvR", 79 | "colab_type": "text" 80 | }, 81 | "source": [ 82 | "OK, now that we have Spark, we need to set a path to it, so PySpark knows where to find it. We do this using the `os` Python library below.\n", 83 | "\n", 84 | "On my machine (WSL, Ubuntu 20.04), where I unpacked Spark in my home directory, this can be achieved with:\n", 85 | "```\n", 86 | "os.environ[\"SPARK_HOME\"] = \"~/spark-2.4.5-bin-hadoop2.7\"\n", 87 | "```\n", 88 | "\n", 89 | "In Colab, it is automatically downloaded to the `/content` directory, so we indicate that as its location here. Then, we run `findspark` to find Spark for us on the machine, and finally start up a SparkSession running on all available cores (`local[4]` means your code will run on 4 threads locally, `local[*]` means that your code will run on as many threads as there are logical cores on your machine).\n" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "metadata": { 95 | "id": "CljIupW0aE06", 96 | "colab_type": "code", 97 | "colab": {} 98 | }, 99 | "source": [ 100 | "# Set path to Spark\n", 101 | "import os\n", 102 | "os.environ[\"SPARK_HOME\"] = \"/content/spark-2.4.5-bin-hadoop2.7\"\n", 103 | "\n", 104 | "# Find Spark so that we can access session within our notebook\n", 105 | "import findspark\n", 106 | "findspark.init()\n", 107 | "\n", 108 | "# Start SparkSession on all available cores\n", 109 | "from pyspark.sql import SparkSession\n", 110 | "spark = SparkSession.builder.master(\"local[*]\").getOrCreate()" 111 | ], 112 | "execution_count": 0, 113 | "outputs": [] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": { 118 | "id": "VNTQOBLthDrC", 119 | "colab_type": "text" 120 | }, 121 | "source": [ 122 | "Now that we've installed everything and set up our paths correctly, we can run (small) Spark jobs both in Colab notebooks and locally (for bigger jobs, you will want to run these jobs on an EMR cluster, though. Remember, for instance, that Google only allocates us one CPU core for free)!\n", 123 | "\n", 124 | "Let's make sure our setup is working by doing a couple of simple things with the pyspark.sql package on the Amazon Customer Review Sample Dataset."
125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "metadata": { 130 | "id": "fbXWBQfSAX8q", 131 | "colab_type": "code", 132 | "outputId": "69af5811-2391-49ba-b101-311ee10adcb5", 133 | "colab": { 134 | "base_uri": "https://localhost:8080/", 135 | "height": 51 136 | } 137 | }, 138 | "source": [ 139 | "! pip install wget\n", 140 | "import wget\n", 141 | "\n", 142 | "wget.download('https://s3.amazonaws.com/amazon-reviews-pds/tsv/sample_us.tsv', 'sample_data/sample_us.tsv')" 143 | ], 144 | "execution_count": 0, 145 | "outputs": [ 146 | { 147 | "output_type": "stream", 148 | "text": [ 149 | "Requirement already satisfied: wget in /usr/local/lib/python3.6/dist-packages (3.2)\n" 150 | ], 151 | "name": "stdout" 152 | }, 153 | { 154 | "output_type": "execute_result", 155 | "data": { 156 | "text/plain": [ 157 | "'sample_data/sample_us.tsv'" 158 | ] 159 | }, 160 | "metadata": { 161 | "tags": [] 162 | }, 163 | "execution_count": 18 164 | } 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "metadata": { 170 | "id": "KrXWEMxjeFx1", 171 | "colab_type": "code", 172 | "colab": {} 173 | }, 174 | "source": [ 175 | "# Read TSV file from default data download directory in Colab\n", 176 | "data = spark.read.csv('sample_data/sample_us.tsv',\n", 177 | " sep=\"\\t\",\n", 178 | " header=True,\n", 179 | " inferSchema=True)" 180 | ], 181 | "execution_count": 0, 182 | "outputs": [] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "metadata": { 187 | "id": "2qvOOIYqeWw9", 188 | "colab_type": "code", 189 | "outputId": "a43569f6-d2bb-4b67-e849-c51a4aa62758", 190 | "colab": { 191 | "base_uri": "https://localhost:8080/", 192 | "height": 306 193 | } 194 | }, 195 | "source": [ 196 | "data.printSchema()" 197 | ], 198 | "execution_count": 0, 199 | "outputs": [ 200 | { 201 | "output_type": "stream", 202 | "text": [ 203 | "root\n", 204 | " |-- marketplace: string (nullable = true)\n", 205 | " |-- customer_id: integer (nullable = true)\n", 206 | " |-- review_id: string (nullable = true)\n", 207 | " |-- product_id: string (nullable = true)\n", 208 | " |-- product_parent: integer (nullable = true)\n", 209 | " |-- product_title: string (nullable = true)\n", 210 | " |-- product_category: string (nullable = true)\n", 211 | " |-- star_rating: integer (nullable = true)\n", 212 | " |-- helpful_votes: integer (nullable = true)\n", 213 | " |-- total_votes: integer (nullable = true)\n", 214 | " |-- vine: string (nullable = true)\n", 215 | " |-- verified_purchase: string (nullable = true)\n", 216 | " |-- review_headline: string (nullable = true)\n", 217 | " |-- review_body: string (nullable = true)\n", 218 | " |-- review_date: timestamp (nullable = true)\n", 219 | "\n" 220 | ], 221 | "name": "stdout" 222 | } 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "metadata": { 228 | "id": "ngb25JINcUNr", 229 | "colab_type": "code", 230 | "outputId": "029777eb-e3b9-4cec-8d30-e0cf55f6fcaf", 231 | "colab": { 232 | "base_uri": "https://localhost:8080/", 233 | "height": 187 234 | } 235 | }, 236 | "source": [ 237 | "(data.groupBy('star_rating')\n", 238 | " .sum('total_votes')\n", 239 | " .sort('star_rating', ascending=False)\n", 240 | " .show()\n", 241 | ")" 242 | ], 243 | "execution_count": 0, 244 | "outputs": [ 245 | { 246 | "output_type": "stream", 247 | "text": [ 248 | "+-----------+----------------+\n", 249 | "|star_rating|sum(total_votes)|\n", 250 | "+-----------+----------------+\n", 251 | "| 5| 13|\n", 252 | "| 4| 3|\n", 253 | "| 3| 8|\n", 254 | "| 2| 2|\n", 255 | "| 1| 8|\n", 256 | "+-----------+----------------+\n", 257 
| "\n" 258 | ], 259 | "name": "stdout" 260 | } 261 | ] 262 | } 263 | ] 264 | } -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Large-Scale Computing for the Social Sciences 2 | ### Spring 2020 - MACS 30123/MAPS 30123/PLSC 30123 3 | 4 | | Instructor Information | **TA Information** | Course Information | 5 | | :------------- | :------------- | :------------ | 6 | | Jon Clindaniel | Dhruval Bhatt | Location: [Online](https://canvas.uchicago.edu/courses/28258) | 7 | | 1155 E. 60th Street, Rm. 215 | | Monday/Wednesday | 8 | | jclindaniel@uchicago.edu | dhruval@uchicago.edu | 9:30-11:20 AM (CDT)| 9 | | **Office Hours:** [Schedule](https://appoint.ly/s/jclindaniel/office-hours)\* | **Office Hours:** [Schedule](https://appoint.ly/s/dhruval/office_hours)\*| **Lab:** Prerecorded, [Online](https://canvas.uchicago.edu/courses/28258)| 10 | 11 | \* Office Hours held via Zoom 12 | 13 | ## Course Description 14 | Computational social scientists increasingly need to grapple with data that is either too big for a local machine and/or code that is too resource intensive to process on a local machine. In this course, students will learn how to effectively scale their computational methods beyond their local machines. The focus of the course will be social scientific applications, ranging from training machine learning models on large economic time series to processing and analyzing social media data in real-time. Students will be introduced to several large-scale computing frameworks such as MPI, MapReduce, Spark, and OpenCL, with a special emphasis on employing these frameworks using cloud resources and the Python programming language. 15 | 16 | *Prerequisites: CAPP 30121 and CAPP 30122, or equivalent.* 17 | 18 | ## Course Structure 19 | This course is structured into several modules, or overarching thematic learning units, focused on teaching students fundamental large-scale computing concepts, as well as giving them the opportunity to apply these concepts to Computational Social Science research problems. Students can access the content in these modules on the [Canvas course site](https://canvas.uchicago.edu/courses/28258). Each module consists of a series of asynchronous lectures, readings, and assignments, which will make up a portion of the instruction time each week. If students have any questions about the asynchronous content, they should post these questions in the Piazza forum for the course, which they can access by clicking the "Piazza" tab on the left side of the screen on the Canvas course site. To see an overall schedule and syllabus for the course, as well as access additional course-related files, please visit the GitHub Course Repository, available here. 20 | 21 | Additionally, students will attend short, interactive Zoom sessions during regular class hours (starting at 9:30 AM CDT) on Mondays and Wednesdays meant to give them the opportunity to discuss and apply the skills they have learned asynchronously and receive live instructor feedback. Attendance to the synchronous class sessions is mandatory and is an important component of the final course grade. Students can access the Zoom sessions by clicking on the "Zoom" tab on the left side of the screen on the Canvas course site. Students should prepare for these classes by reading the assigned readings ahead of every session. All readings are available online and are linked in the course schedule below. 
22 | 23 | Students will also virtually participate in a hands-on Python lab section each week, meant to instill practical large-scale computing skills related to the week’s topic. These labs are accessible on the Canvas course site and related code will be posted here in this GitHub repository. In order to practice these large-scale computing skills and complete the course assignments, students will be given free access to UChicago's [Midway Research Computing Cluster](https://rcc.uchicago.edu/docs/), [Amazon Web Services (AWS)](https://aws.amazon.com/) cloud computing resources, and [DataCamp](https://www.datacamp.com/). More information about accessing these resources will be provided to registered students in the first several weeks of the quarter. 24 | 25 | ## Grading 26 | There will be an assignment due at the end of each unit (3 in total). Each assignment is worth 20% of the overall grade, with all assignments together worth a total of 60%. Additionally, attendance and participation will be worth 10% of the overall grade. Finally, students will complete a final project that is worth 30% of the overall grade (25% for the project itself, and 5% for an end-of-quarter presentation). 27 | 28 | | Course Component | Grade Percentage | 29 | | :------------- | :------------- | 30 | | Assignments (Total: 3) | 60% | 31 | | Attendance/Participation | 10% | 32 | | Final Project | 5% (Presentation) | 33 | | | 25% (Project) | 34 | 35 | Students do have the option of taking this course on a pass/fail basis. Note that in order to earn a "Pass," students must earn at least a B- (80%) in each of the above course components (including participation) and inform the instructor that they would like to be graded on a pass/fail basis before their Final Project due date. 36 | 37 | ## Final Project 38 | For their final project (due 6/11/2020\*), students will write their own large-scale computing code that solves a social science research problem of their choosing. For instance, students might perform a computationally intensive demographic simulation, collect and analyze large-scale streaming social media data, or train a machine learning model on a large-scale economic dataset. Students will additionally record a short video presentation about their project in the final week of the course. Detailed descriptions and grading rubrics for the project and presentation are available [on the Canvas course site.](https://canvas.uchicago.edu/courses/28258) 39 | 40 | \* Due 6/5/2020 for students graduating in June 41 | 42 | ## Late Assignments/Projects 43 | Unexcused Late Assignment/Project Submissions will be penalized 10 percentage points for every hour they are late. For example, if an assignment is due on Wednesday at 2:00pm, the following percentage points will be deducted based on the time stamp of the last commit. 44 | 45 | | Example last commit | Percentage points deducted | 46 | | ---- | ---- | 47 | | 2:01pm to 3:00pm | -10 percentage points | 48 | | 3:01pm to 4:00pm |-20 percentage points | 49 | | 4:01pm to 5:00pm | -30 percentage points | 50 | | 5:01pm to 6:00pm | -40 percentage points | 51 | | ... | ... | 52 | | 11:01pm and beyond | -100 percentage points (no credit) | 53 | 54 | ## Plagiarism on Assignments/Projects 55 | Academic honesty is an extremely important principle in academia and at the University of Chicago. 56 | * Writing assignments must quote and cite any excerpts taken from another work. 
57 | * If the cited work is the particular paper referenced in the Assignment, no works cited or references are necessary at the end of the composition. 58 | * If the cited work is not the particular paper referenced in the Assignment, you MUST include a works cited or references section at the end of the composition. 59 | * Any copying of other students' work will result in a zero grade and potential further academic discipline. 60 | If you have any questions about citations and references, consult with your instructor. 61 | 62 | ## Statement of Diversity and Inclusion 63 | The University of Chicago is committed to diversity and rigorous inquiry from multiple perspectives. The MAPSS, CIR, and Computation programs share this commitment and seek to foster productive learning environments based upon inclusion, open communication, and mutual respect for a diverse range of identities, experiences, and positions. 64 | 65 | Any suggestions for how we might further such objectives both in and outside the classroom are appreciated and will be given serious consideration. Please share your suggestions or concerns with your instructor, your preceptor, or your program’s Diversity and Inclusion representatives: Darcy Heuring (MAPSS), Matthias Staisch (CIR), and Chad Cyrenne (Computation). You are also welcome and encouraged to contact the Faculty Director of your program. 66 | 67 | This course is open to all students who meet the academic requirements for participation. Any student who has a documented need for accommodation should contact Student Disability Services (773-702-6000 or disabilities@uchicago.edu) and the instructor as soon as possible. 68 | 69 | ## Course Schedule 70 | | Unit | Week | Day | Topic | Readings | Assignment | 71 | | --- | --- | --- | --- | --- | --- | 72 | | Fundamentals of Large-Scale Computing | Week 1: Introduction to Large-Scale Computing for the Social Sciences | 4/6/2020 | Introduction to the course and course goals | | | 73 | | | | 4/8/2020 | General Considerations for Large-Scale Computing | [Robey and Zamora 2020 (Chapter 1)](https://livebook.manning.com/book/parallel-and-high-performance-computing/chapter-1) | | 74 | | | Week 2: On-Premise Large-Scale CPU-computing with MPI | 4/13/2020 | An Introduction to CPUs and Research Computing Clusters | [Pacheco 2011](https://canvas.uchicago.edu/files/3391260/download?download_frd=1) (Ch. 1-2), [Midway RCC User Guide](https://rcc.uchicago.edu/docs/) | | 75 | | | | 4/15/2020 | Cluster Computing via Message Passing Interface (MPI) for Python | [Pacheco 2011](https://canvas.uchicago.edu/files/3391260/download?download_frd=1) (Ch. 3), [Dalcín et al. 2008](https://www-sciencedirect-com.proxy.uchicago.edu/science/article/pii/S0743731507001712?via%3Dihub) | | 76 | |||| Lab: Hands-on Introduction to UChicago's Midway Computing Cluster and mpi4py ||| 77 | | | Week 3: On-Premise GPU-computing with OpenCL | 4/20/2020 | An Introduction to GPUs and Heterogenous Computing with OpenCL | [Scarpino 2012](https://canvas.uchicago.edu/files/3391262/download?download_frd=1) (Read Ch. 1, Skim Ch. 2-5,9) | | 78 | | | | 4/22/2020 | Harnessing GPUs with PyOpenCL | [Klöckner et al. 
2012](https://arxiv.org/pdf/0911.3456.pdf) | | 79 | |||| Lab: Introduction to PyOpenCL and GPU Computing on UChicago's Midway Computing Cluster ||| 80 | | Architecting Computational Social Science Data Solutions in the Cloud | Week 4: An Introduction to Cloud Computing and Cloud HPC Architectures | 4/27/2020 | An Introduction to the Cloud Computing Landscape and AWS | [Jorissen and Bouffler 2017](https://canvas.uchicago.edu/files/3391263/download?download_frd=1) (Read Ch. 1, Skim Ch. 4-7), [Armbrust et al. 2009](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf), [Jonas et al. 2019](https://arxiv.org/pdf/1902.03383.pdf) | | 81 | | | | 4/29/2020 | Architectures for Large-Scale Computation in the Cloud | [Introduction to HPC on AWS](https://d1.awsstatic.com/whitepapers/Intro_to_HPC_on_AWS.pdf), [HPC Architectural Best Practices](https://d1.awsstatic.com/whitepapers/architecture/AWS-HPC-Lens.pdf) | Due: Assignment 1 (11:59 PM) | 82 | |||| Lab: Running "Serverless" HPC Jobs in the AWS Cloud ||| 83 | | | Week 5: Architecting Large-Scale Data Solutions in the Cloud | 5/4/2020 | "Data Lake" Architectures | [Data Lakes and Analytics on AWS](https://aws.amazon.com/big-data/datalakes-and-analytics/), [AWS Data Lake Whitepaper](https://d1.awsstatic.com/whitepapers/Storage/data-lake-on-aws.pdf), [*Introduction to AWS Boto in Python*](https://campus.datacamp.com/courses/introduction-to-aws-boto-in-python) (DataCamp Course; Practice working with S3 Data Lake in Python) | | 84 | | | | 5/6/2020 | Architectures for Large-Scale Data Structuring and Storage | General, Open Source: ["What is a Database?" (YouTube),](https://www.youtube.com/watch?v=Tk1t3WKK-ZY) ["How to Choose the Right Database?" (YouTube),](https://www.youtube.com/watch?v=v5e_PasMdXc) [“NoSQL Explained,”](https://www.mongodb.com/nosql-explained) AWS-specific solutions: ["Which Database to Use When?" (YouTube),](https://youtu.be/KWOSGVtHWqA) Optional: [Data Warehousing on AWS Whitepaper](https://d0.awsstatic.com/whitepapers/enterprise-data-warehousing-on-aws.pdf), [AWS Big Data Whitepaper](https://d1.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS.pdf) | | 85 | |||| Lab: Exploring Large Data Sources in an S3 Data Lake with the AWS Python SDK, Boto ||| 86 | | | Week 6: Large-Scale Data Ingestion and Processing | 5/11/2020 | Stream Ingestion and Processing with Apache Kafka, AWS Kinesis | [Narkhede et al. 2017](https://canvas.uchicago.edu/files/3391266/download?download_frd=1) (Read Ch. 1, Skim 3-6,11), [Dean and Crettaz 2019 (Ch. 4)](https://livebook.manning.com/book/event-streams-in-action/chapter-4/), [AWS Kinesis Whitepaper](https://d0.awsstatic.com/whitepapers/whitepaper-streaming-data-solutions-on-aws-with-amazon-kinesis.pdf) || 87 | | | | 5/13/2020 | Batch Processing with Apache Hadoop and MapReduce | [White 2015](https://canvas.uchicago.edu/files/3391265/download?download_frd=1) (read Ch. 
1-2, Skim 3-4), [Dean and Ghemawat 2004](https://www.usenix.org/legacy/publications/library/proceedings/osdi04/tech/full_papers/dean/dean.pdf), ["What is Amazon EMR?"](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html), Running MapReduce Jobs with Python’s “mrjob” package on EMR ([Fundamentals](https://mrjob.readthedocs.io/en/latest/guides/quickstart.html) and [Elastic MapReduce Quickstart](https://mrjob.readthedocs.io/en/latest/guides/emr-quickstart.html)) | | 88 | |||| Lab: Ingesting and Processing Batch Data with MapReduce on AWS EMR and Streaming Data with AWS Kinesis | || 89 | | Large-Scale Data Analysis and Prediction | Week 7: An Introduction to High-Level Large-Scale Computing Paradigms for Data Analysis and Prediction | 5/18/2020 | Large-Scale Data Processing and Analysis with Spark and Dask | [Karau et al. 2015](https://canvas.uchicago.edu/files/3391268/download?download_frd=1) (Read Ch. 1-4, Skim 9-11), [*Introduction to PySpark*](https://learn.datacamp.com/courses/introduction-to-pyspark) (DataCamp Course), [“Why Dask,”](https://docs.dask.org/en/latest/why.html) [Dask Slide Deck](https://dask.org/slides.html), Optional: [*Parallel Programming with Dask*](https://learn.datacamp.com/courses/parallel-programming-with-dask-in-python) (DataCamp Course) | | 90 | | | | 5/20/2020 | A Deeper Dive into Spark (with Social Science Machine Learning Applications) | [*Machine Learning with PySpark*](https://campus.datacamp.com/courses/machine-learning-with-pyspark) (DataCamp Course), Optional: [*Feature Engineering with PySpark*](https://learn.datacamp.com/courses/feature-engineering-with-pyspark) (DataCamp Course), Videos about accelerating Spark with GPUs (via [Horovod](https://www.youtube.com/watch?v=D1By2hy4Ecw) for deep learning, and the RAPIDS Library for general operations [Part 1](https://www.youtube.com/watch?v=Qw-TB6EHmR8) and [Part 2](https://www.youtube.com/watch?v=ApI2EZIJU_Q)) | Due: Assignment 2 (11:59 PM) | 91 | |||| Lab: Running PySpark in an AWS EMR Notebook for Large-Scale Data Analysis and Prediction ||| 92 | | | Week 8: Large-Scale Network Analysis | 5/25/2020 | Memorial Day (No Class) | | | 93 | | | | 5/27/2020 | Processing and Analyzing Large-Scale Graphs with Spark | [Guller 2015](https://canvas.uchicago.edu/files/3391270/download?download_frd=1), [Hunter 2017](https://www.youtube.com/watch?v=NmbKst7ny5Q) (Spark Summit Talk), [GraphFrames Documentation for Python](https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-python.html) | | 94 | |||| Lab: Network Analysis with PySpark ||| 95 | | Student Projects and Presentations | Week 9: Final Project Presentations | 6/1/2020 | View and Comment on Student Presentations | | Due: Final Project Presentation Video (9:30 AM) | 96 | ||| 6/3/2020 | View and Comment on Student Presentations | | Due: Assignment 3 (11:59 PM) | 97 | ||| 6/5/2020 ||| Due: Final Project (11:59 PM; For June 2020 Graduates) 98 | || Week 10: Final Projects | 6/11/2020 ||| Due: Final Project (11:59 PM; For all other students) | 99 | 100 | ## Works Cited 101 | 102 | Armbrust, Michael, Fox, Armando, Griffith, Rean, Joseph, Anthony D., Katz, Randy H., Konwinski, Andrew, Lee, Gunho, Patterson, David A., Rabkin, Ariel, Stoica, Ion, and Matei Zaharia. 2009. "Above the Clouds: A Berkeley View of Cloud Computing." Technical report, EECS Department, University of California, Berkeley. 103 | 104 | ["AWS Big Data Analytics Options on AWS." 
December 2018.](https://d1.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS.pdf) AWS Whitepaper. 105 | 106 | ["Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility." July 2017.](https://d1.awsstatic.com/whitepapers/Storage/data-lake-on-aws.pdf) 107 | 108 | Dalcín, Lisandro, Paz, Rodrigo, Storti, Mario, and Jorge D'Elía. 2008. "MPI for Python: Performance improvements and MPI-2 extensions." *J. Parallel Distrib. Comput.* 68: 655-662. 109 | 110 | ["Data Warehousing on AWS." March 2016.](https://d0.awsstatic.com/whitepapers/enterprise-data-warehousing-on-aws.pdf) AWS Whitepaper. 111 | 112 | Dean, Alexander, and Valentin Crettaz. 2019. *Event Streams in Action: Real-time event systems with Kafka and Kinesis*. Shelter Island, NY: Manning. 113 | 114 | Dean, Jeffrey, and Sanjay Ghemawat. 2004. "MapReduce: Simplified data processing on large clusters." In *Proceedings of Operating Systems Design and Implementation (OSDI)*. San Francisco, CA. 137-150. 115 | 116 | *Feature Engineering with PySpark*. https://learn.datacamp.com/courses/feature-engineering-with-pyspark. Accessed 3/2020. 117 | 118 | "GraphFrames user guide - Python." https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-python.html. Accessed 3/2020. 119 | 120 | Guller, Mohammed. 2015. "Graph Processing with Spark." In *Big Data Analytics with Spark*. New York: Apress. 121 | 122 | ["High Performance Computing Lens AWS Well-Architected Framework." December 2019.](https://d1.awsstatic.com/whitepapers/architecture/AWS-HPC-Lens.pdf) AWS Whitepaper. 123 | 124 | Hunter, Tim. October 26, 2017. "GraphFrames: Scaling Web-Scale Graph Analytics with Apache Spark." https://www.youtube.com/watch?v=NmbKst7ny5Q. 125 | 126 | *Introduction to AWS Boto in Python*. https://campus.datacamp.com/courses/introduction-to-aws-boto-in-python. Accessed 3/2020. 127 | 128 | ["Introduction to HPC on AWS." n.d.](https://d1.awsstatic.com/whitepapers/Intro_to_HPC_on_AWS.pdf) AWS Whitepaper. 129 | 130 | *Introduction to PySpark*. https://learn.datacamp.com/courses/introduction-to-pyspark. Accessed 3/2020. 131 | 132 | Jonas, Eric, Schleier-Smith, Johann, Sreekanti, Vikram, and Chia-Che Tsai. 2019. "Cloud Programming Simplified: A Berkeley View on Serverless Computing." Technical report, EECS Department, University of California, Berkeley. 133 | 134 | Jorissen, Kevin, and Brendan Bouffler. 2017. *AWS Research Cloud Program: Researcher's Handbook*. Amazon Web Services. 135 | 136 | Kane, Frank. March 23, 2018. ["How to Choose the Right Database? - MongoDB, Cassandra, MySQL, HBase"](https://www.youtube.com/watch?v=v5e_PasMdXc). https://www.youtube.com/watch?v=v5e_PasMdXc 137 | 138 | Karau, Holden, Konwinski, Andy, Wendell, Patrick, and Matei Zaharia. 2015. *Learning Spark*. Sebastopol, CA: O'Reilly. 139 | 140 | Klöckner, Andreas, Pinto, Nicolas, Lee, Yunsup, Catanzaro, Bryan, Ivanov, Paul, and Ahmed Fasih. 2012. "PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation." *Parallel Computing* 38(3): 157-174. 141 | 142 | Linux Academy. July 10, 2019. ["What is a Database?"](https://www.youtube.com/watch?v=Tk1t3WKK-ZY). https://www.youtube.com/watch?v=Tk1t3WKK-ZY 143 | 144 | *Machine Learning with PySpark*. https://campus.datacamp.com/courses/machine-learning-with-pyspark. Accessed 3/2020. 145 | 146 | Martinez, Miguel, and Thomas Graves. "Accelerating Apache Spark by Several Orders of Magnitude with GPUs." https://www.youtube.com/watch?v=Qw-TB6EHmR8. 147 | 148 | "mrjob v0.7.1 documentation."
https://mrjob.readthedocs.io/en/latest/index.html. Accessed 3/2020. 149 | 150 | Narkhede, Neha, Shapira, Gwen, and Todd Palino. 2017. *Kafka: The Definitive Guide*. Sebastopol, CA: O'Reilly. 151 | 152 | “NoSQL Explained.” https://www.mongodb.com/nosql-explained. Accessed 3/2020. 153 | 154 | Pacheco, Peter. 2011. *An Introduction to Parallel Programming*. Burlington, MA: Morgan Kaufmann. 155 | 156 | *Parallel Programming with Dask*. https://learn.datacamp.com/courses/parallel-programming-with-dask-in-python. Accessed 3/2020. 157 | 158 | Petrossian, Tony, and Ian Meyers. November 30, 2017. "Which Database to Use When?" https://youtu.be/KWOSGVtHWqA. AWS re:Invent 2017. 159 | 160 | "RCC User Guide." rcc.uchicago.edu/docs/. Accessed March 2020. 161 | 162 | Robey, Robert and Yuliana Zamora. 2020. *Parallel and High Performance Computing*. Manning Early Access Program. 163 | 164 | Scarpino, Matthew. 2012. *OpenCL in Action*. Shelter Island, NY: Manning. 165 | 166 | Sergeev, Alex. March 28, 2019. "Distributed Deep Learning with Horovod." https://www.youtube.com/watch?v=D1By2hy4Ecw. 167 | 168 | ["Streaming Data Solutions on AWS with Amazon Kinesis." July 2017.](https://d0.awsstatic.com/whitepapers/whitepaper-streaming-data-solutions-on-aws-with-amazon-kinesis.pdf) AWS Whitepaper. 169 | 170 | "What is Amazon EMR." https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html. Accessed 3/2020. 171 | 172 | White, Tom. 2015. *Hadoop: The Definitive Guide*. Sebastopol, CA: O'Reilly. 173 | 174 | "Why Dask." https://docs.dask.org/en/latest/why.html. Accessed 3/2020. 175 | --------------------------------------------------------------------------------