├── .gitignore ├── LICENSE ├── Labs ├── Lab 1 Midway RCC and mpi4py │ ├── midway_cheat_sheet.md │ ├── mpi.sbatch │ ├── mpi_multi_job.sbatch │ └── mpi_rand_walk.py ├── Lab 2 PyOpenCL │ ├── Lab_2_PyOpenCL_Random_Walk_Tutorial.ipynb │ ├── gpu.sbatch │ ├── gpu_rand_walk.py │ └── print_gpu_info.py ├── Lab 3 AWS EC2 and PyWren │ ├── Lab_3_PyWren.ipynb │ └── pywren_workflow.png ├── Lab 4 Accessing Large-Scale Data in S3 │ └── Lab 4 Working with Large Data Sources in S3.ipynb ├── Lab 5 Ingesting and Processing Large-Scale Data │ ├── Part I MapReduce │ │ ├── .mrjob.conf │ │ ├── mapreduce_lab5.py │ │ ├── mrjob_cheatsheet.md │ │ └── sample_us.tsv │ └── Part II Kinesis │ │ ├── Lab 5 Kinesis.ipynb │ │ ├── consumer.py │ │ ├── consumer_feed.png │ │ ├── producer.py │ │ └── simple_kinesis_architecture.png ├── Lab 6 PySpark EDA and ML in an EMR Notebook │ ├── Lab_6.ipynb │ └── Local_Colab_Spark_Setup.ipynb └── Lab 7 Large-Scale Graph Processing with PySpark │ ├── Lab_7_GraphFrames.ipynb │ ├── edges.csv │ └── nodes.csv └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | build/* 3 | dist/* 4 | *.aux 5 | *.bbl 6 | *.blg 7 | *.fdb_latexmk 8 | *.idx 9 | *.ilg 10 | *.ind 11 | *.lof 12 | *.log 13 | *.lot 14 | *.out 15 | *.pdfsync 16 | *.synctex.gz 17 | *.toc 18 | *.swp 19 | *.asv 20 | *.nav 21 | *.snm 22 | *.gz 23 | *.bib.bak 24 | *.fls 25 | *.m~ 26 | *.sublime* 27 | .DS_Store 28 | *.dta 29 | *.ipynb_checkpoints* 30 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Jon Clindaniel 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /Labs/Lab 1 Midway RCC and mpi4py/midway_cheat_sheet.md: -------------------------------------------------------------------------------- 1 | # Logging in and Copying Files to Midway 2 | See [RCC user guide](https://rcc.uchicago.edu/docs/) for login details specific to your system and additional options. The instructions for copying files via `scp` and logging in via `ssh` below assume a Unix-style command line interface. 
3 | 4 | ### SSH into Cluster with cNetID and Password 5 | ``` 6 | ssh your_cnet_id@midway2.rcc.uchicago.edu 7 | ``` 8 | Note that you'll now be able to move around your home directory using standard unix commands (`cd`, `pwd`, `ls`, etc.). If you are on a Windows machine, I (Jon) recommend enabling [Windows Subsystem for Linux and installing Ubuntu 18.04](https://docs.microsoft.com/en-us/windows/wsl/install-win10) to complete all of these tasks. This is what I do and I find this makes my life a lot easier than having to mentally transition between DOS and Unix commands (or needing to use an additional third-party tool). 9 | 10 | ### SCP files to/from local directory 11 | If you need to transfer files to and from Midway's storage, I find it easiest to just copy files via the `scp` command. Here, I copy a local file `local_file` in my current directory to my home directory on Midway. Then, I copy a file `remote_file` from my Midway home directory back to my current directory on my local machine. If you prefer not to use this command line approach, there are tutorials in the Midway documentation with [alternative approaches](https://rcc.uchicago.edu/docs/data-transfer/index.html). 12 | 13 | ``` 14 | scp local_file your_cnet_id@midway2.rcc.uchicago.edu: 15 | ``` 16 | ``` 17 | scp your_cnet_id@midway2.rcc.uchicago.edu:remote_file . 18 | ``` 19 | 20 | To copy an entire directory, use the `-r` flag: 21 | ``` 22 | scp -r your_cnet_id@midway2.rcc.uchicago.edu:remote_directory . 23 | ``` 24 | 25 | ### Clone GitHub Repository 26 | Another good option is to clone a GitHub repository (for instance this public course repository) to your home directory on Midway and access files/code this way. 27 | 28 | ``` 29 | git clone https://github.com/jonclindaniel/LargeScaleComputing_S20.git 30 | ``` 31 | 32 | ### Adjustments to text on the Command Line 33 | If need to make any adjustments to text files before/after running them (or create new ones) on Midway, can do so on the command line with text editing tools such as nano, vim, etc. 34 | ``` 35 | nano mpi.sbatch 36 | ``` 37 | 38 | # Loading Modules and installing programs 39 | 40 | ### Load appropriate modules for MPI CPU* and OpenCL GPU dev 41 | ``` 42 | module load cuda 43 | module load mpi4py/3.0.1a0_py3 44 | ``` 45 | 46 | ### Install additional Python packages to local user directory from login node 47 | To install packages that are not already included in a module (such as upgrading matplotlib, and installing PyOpenCL, which we will be using in this class), you can use `pip` to install these packages in your local user directory from the login node on Midway. Make sure you have loaded the cuda and mpi4py modules above before you try to run the below commands. 48 | 49 | ``` 50 | pip install -U --user matplotlib 51 | pip install --user pyopencl 52 | ``` 53 | 54 | # Running jobs 55 | 56 | Resource sharing and job scheduling is handled by Slurm software on the Midway RCC. You can see detailed information in the Midway documentation, but Slurm allows you to see which partitions are available at any given time via the `sinfo` command, submit batch jobs via the `sbatch` command (which will allow you to request specific node/interconnect resources for a period of time and run your code on those resources), and schedule interactive sessions via the `sinteractive` command. Listed below are some of the fundamental commands you will need to know how to perform for this class. 
57 | 58 | ### Run Batch jobs with Slurm scripts 59 | You will use Slurm scripts to request computational resources for a period of time and run your code. These scripts can be run with the `sbatch` commands, as demonstrated below. You can check out sample Slurm scripts in our GitHub course repository, for more detail on how these scripts are structured. 60 | 61 | ``` 62 | sbatch mpi.sbatch 63 | sbatch gpu.sbatch 64 | ``` 65 | 66 | Note, if you wrote your sbatch file on a Windows machine, you might need to convert the text into a Unix format for it to run properly on the Midway RCC. To do so, you can install `dos2unix` on WSL via `apt-get install dos2unix` and then run the converter on your sbatch file. 67 | 68 | ``` 69 | dos2unix gpu.sbatch 70 | ``` 71 | 72 | You can monitor the progress of your job with (the sbatch command will print out your job ID): 73 | ``` 74 | squeue -j your_job_id 75 | ``` 76 | 77 | You can also cancel jobs with: 78 | ``` 79 | scancel your_job_id 80 | ``` 81 | 82 | ### Check results of your batch job (assuming write to `.out` file) 83 | In your Slurm script, you will specify a `.out` file, where the output of your program will be written. You can download the file to your local machine to look at the results (via `scp`), or you can check the results on the Midway command line with the standard Unix `cat` command. 84 | 85 | ``` 86 | cat gpu.out 87 | ``` 88 | 89 | ### Run interactive jobs 90 | You should not perform intensive computational tasks on the login nodes. Use the `sinteractive` command to set up an interactive session on other Midway nodes if you want to have an interactive command line experience (you can specify exactly which nodes you would like to access; see the documentation for syntax). Here we request 4 cores for 15 minutes. Additionally, you can use the interactive mode to run Jupyter notebooks, which you can view in your browser (see documentation for more details). 91 | 92 | ``` 93 | sinteractive --time=00:15:00 --ntasks=4 94 | ``` 95 | 96 | \* Note: this setup allows for use of mpi4py, PyOpenCL, as well as PySpark (which is included in the mpi4py module) 97 | -------------------------------------------------------------------------------- /Labs/Lab 1 Midway RCC and mpi4py/mpi.sbatch: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #SBATCH --job-name=mpi 4 | #SBATCH --output=mpi.out 5 | #SBATCH --ntasks=4 6 | #SBATCH --partition=broadwl 7 | #SBATCH --constraint=fdr 8 | 9 | # Load the default mpi4py/Anaconda module. 10 | module load mpi4py/3.0.1a0_py3 11 | 12 | # Run the python program with mpirun. The -n flag is not required; 13 | # mpirun will automatically figure out the best configuration from the 14 | # Slurm environment variables. 15 | mpirun python ./mpi_rand_walk.py 16 | -------------------------------------------------------------------------------- /Labs/Lab 1 Midway RCC and mpi4py/mpi_multi_job.sbatch: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #SBATCH --job-name=mpi_multi_job 4 | #SBATCH --ntasks=11 5 | #SBATCH --partition=broadwl 6 | #SBATCH --constraint=fdr 7 | 8 | # Load the default mpi4py/Anaconda module. 
9 | module load mpi4py/3.0.1a0_py3 10 | 11 | # Run the python program with mpirun, using & to run jobs at the same time 12 | mpirun -n 1 python ./mpi_rand_walk.py > ./mpi_nprocs1.out & 13 | mpirun -n 10 python ./mpi_rand_walk.py > ./mpi_nprocs10.out & 14 | 15 | # Wait until all simultaneous mpiruns are done 16 | wait 17 | -------------------------------------------------------------------------------- /Labs/Lab 1 Midway RCC and mpi4py/mpi_rand_walk.py: -------------------------------------------------------------------------------- 1 | from mpi4py import MPI 2 | import matplotlib.pyplot as plt 3 | import numpy as np 4 | import time 5 | 6 | def sim_rand_walks_parallel(n_runs): 7 | # Get rank of process and overall size of communicator: 8 | comm = MPI.COMM_WORLD 9 | rank = comm.Get_rank() 10 | size = comm.Get_size() 11 | 12 | # Start time: 13 | t0 = time.time() 14 | 15 | # Evenly distribute number of simulation runs across processes 16 | N = int(n_runs/size) 17 | 18 | # Simulate N random walks and specify as a NumPy Array 19 | r_walks = [] 20 | for i in range(N): 21 | steps = np.random.normal(loc=0, scale=1, size=100) 22 | steps[0] = 0 23 | r_walks.append(100 + np.cumsum(steps)) 24 | r_walks_array = np.array(r_walks) 25 | 26 | # Gather all simulation arrays to buffer of expected size/dtype on rank 0 27 | r_walks_all = None 28 | if rank == 0: 29 | r_walks_all = np.empty([N*size, 100], dtype='float') 30 | comm.Gather(sendbuf = r_walks_array, recvbuf = r_walks_all, root=0) 31 | 32 | # Print/plot simulation results on rank 0 33 | if rank == 0: 34 | # Calculate time elapsed after computing mean and std 35 | average_finish = np.mean(r_walks_all[:,-1]) 36 | std_finish = np.std(r_walks_all[:,-1]) 37 | time_elapsed = time.time() - t0 38 | 39 | # Print time elapsed + simulation results 40 | print("Simulated %d Random Walks in: %f seconds on %d MPI processes" 41 | % (n_runs, time_elapsed, size)) 42 | print("Average final position: %f, Standard Deviation: %f" 43 | % (average_finish, std_finish)) 44 | 45 | # Plot Simulations and save to file 46 | plt.plot(r_walks_all.transpose()) 47 | plt.savefig("r_walk_nprocs%d_nruns%d.png" % (size, n_runs)) 48 | 49 | return 50 | 51 | def main(): 52 | sim_rand_walks_parallel(n_runs = 10000) 53 | 54 | if __name__ == '__main__': 55 | main() 56 | -------------------------------------------------------------------------------- /Labs/Lab 2 PyOpenCL/gpu.sbatch: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=gpu # job name 3 | #SBATCH --output=gpu.out # output log file 4 | #SBATCH --error=gpu.err # error file 5 | #SBATCH --time=00:05:00 # 5 minutes of wall time 6 | #SBATCH --nodes=1 # 1 GPU node 7 | #SBATCH --partition=gpu2 # GPU2 partition 8 | #SBATCH --ntasks=1 # 1 CPU core to drive GPU 9 | #SBATCH --gres=gpu:1 # Request 1 GPU 10 | 11 | module load cuda 12 | module load mpi4py/3.0.1a0_py3 13 | 14 | python ./print_gpu_info.py 15 | # python ./gpu_rand_walk.py 16 | -------------------------------------------------------------------------------- /Labs/Lab 2 PyOpenCL/gpu_rand_walk.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pyopencl as cl 3 | import pyopencl.array as cl_array 4 | import pyopencl.clrandom as clrand 5 | import pyopencl.tools as cltools 6 | from pyopencl.scan import GenericScanKernel 7 | import matplotlib.pyplot as plt 8 | import time 9 | 10 | def sim_rand_walks(n_runs): 11 | # Set up context and command queue 12 | ctx 
= cl.create_some_context() 13 | queue = cl.CommandQueue(ctx) 14 | 15 | # mem_pool = cltools.MemoryPool(cltools.ImmediateAllocator(queue)) 16 | 17 | t0 = time.time() 18 | 19 | # Generate an array of Normal Random Numbers on GPU of length n_sims*n_steps 20 | n_steps = 100 21 | rand_gen = clrand.PhiloxGenerator(ctx) 22 | ran = rand_gen.normal(queue, (n_runs*n_steps), np.float32, mu=0, sigma=1) 23 | 24 | # Establish boundaries for each simulated walk (i.e. start and end) 25 | # Necessary so that we perform scan only within rand walks and not between 26 | seg_boundaries = [1] + [0]*(n_steps-1) 27 | seg_boundaries = np.array(seg_boundaries, dtype=np.uint8) 28 | seg_boundary_flags = np.tile(seg_boundaries, int(n_runs)) 29 | seg_boundary_flags = cl_array.to_device(queue, seg_boundary_flags) 30 | 31 | # GPU: Define Segmented Scan Kernel, scanning simulations: f(n-1) + f(n) 32 | prefix_sum = GenericScanKernel(ctx, np.float32, 33 | arguments="__global float *ary, __global char *segflags, " 34 | "__global float *out", 35 | input_expr="ary[i]", 36 | scan_expr="across_seg_boundary ? b : (a+b)", neutral="0", 37 | is_segment_start_expr="segflags[i]", 38 | output_statement="out[i] = item + 100", 39 | options=[]) 40 | 41 | # Allocate space for result of kernel on device 42 | ''' 43 | Note: use a Memory Pool (commented out above and below) if you're invoking 44 | multiple times to avoid wasting time creating brand new memory areas each 45 | time you invoke the kernel: https://documen.tician.de/pyopencl/tools.html 46 | ''' 47 | # dev_result = cl_array.arange(queue, len(ran), dtype=np.float32, 48 | # allocator=mem_pool) 49 | dev_result = cl_array.empty_like(ran) 50 | 51 | # Enqueue and Run Scan Kernel 52 | prefix_sum(ran, seg_boundary_flags, dev_result) 53 | 54 | # Get results back on CPU to plot and do final calcs, just as in Lab 1 55 | r_walks_all = (dev_result.get() 56 | .reshape(n_runs, n_steps) 57 | .transpose() 58 | ) 59 | 60 | average_finish = np.mean(r_walks_all[-1]) 61 | std_finish = np.std(r_walks_all[-1]) 62 | final_time = time.time() 63 | time_elapsed = final_time - t0 64 | 65 | print("Simulated %d Random Walks in: %f seconds" 66 | % (n_runs, time_elapsed)) 67 | print("Average final position: %f, Standard Deviation: %f" 68 | % (average_finish, std_finish)) 69 | 70 | # Plot Random Walk Paths 71 | ''' 72 | Note: Scan already only starts scanning at the second entry, but for the 73 | sake of the plot, let's set all of our random walk starting positions to 100 74 | and then plot the random walk paths. 
75 | ''' 76 | r_walks_all[0] = [100]*n_runs 77 | plt.plot(r_walks_all) 78 | plt.savefig("r_walk_nruns%d_gpu.png" % n_runs) 79 | 80 | return 81 | 82 | def main(): 83 | sim_rand_walks(n_runs = 10000) 84 | 85 | if __name__ == '__main__': 86 | main() 87 | -------------------------------------------------------------------------------- /Labs/Lab 2 PyOpenCL/print_gpu_info.py: -------------------------------------------------------------------------------- 1 | import pyopencl as cl 2 | 3 | def print_device_info() : 4 | print('\n' + '=' * 60 + '\nOpenCL Platforms and Devices') 5 | for platform in cl.get_platforms(): 6 | print('=' * 60) 7 | print('Platform - Name: ' + platform.name) 8 | print('Platform - Vendor: ' + platform.vendor) 9 | print('Platform - Version: ' + platform.version) 10 | print('Platform - Profile: ' + platform.profile) 11 | for device in platform.get_devices(): 12 | print(' ' + '-' * 56) 13 | print(' Device - Name: ' \ 14 | + device.name) 15 | print(' Device - Type: ' \ 16 | + cl.device_type.to_string(device.type)) 17 | print(' Device - Max Clock Speed: {0} Mhz'\ 18 | .format(device.max_clock_frequency)) 19 | print(' Device - Compute Units: {0}'\ 20 | .format(device.max_compute_units)) 21 | print(' Device - Local Memory: {0:.0f} KB'\ 22 | .format(device.local_mem_size/1024.0)) 23 | print(' Device - Constant Memory: {0:.0f} KB'\ 24 | .format(device.max_constant_buffer_size/1024.0)) 25 | print(' Device - Global Memory: {0:.0f} GB'\ 26 | .format(device.global_mem_size/1073741824.0)) 27 | print(' Device - Max Buffer/Image Size: {0:.0f} MB'\ 28 | .format(device.max_mem_alloc_size/1048576.0)) 29 | print(' Device - Max Work Group Size: {0:.0f}'\ 30 | .format(device.max_work_group_size)) 31 | print('\n') 32 | 33 | if __name__ == "__main__": 34 | print_device_info() 35 | -------------------------------------------------------------------------------- /Labs/Lab 3 AWS EC2 and PyWren/pywren_workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jonclindaniel/LargeScaleComputing_S20/b1474da32282eb72fdc8206ad16a76f23e2e5c4a/Labs/Lab 3 AWS EC2 and PyWren/pywren_workflow.png -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part I MapReduce/.mrjob.conf: -------------------------------------------------------------------------------- 1 | runners: 2 | emr: 3 | # Specify a pem key to start up an EMR cluster on your behalf 4 | ec2_key_pair: MACS_30123 5 | ec2_key_pair_file: ~/MACS_30123.pem 6 | 7 | # Specify type/# of EC2 instances you want your code to run on 8 | core_instance_type: m5.xlarge 9 | num_core_instances: 3 10 | region: us-east-1 11 | 12 | # to read from/write to S3; note colons instead of "=": 13 | aws_access_key_id: 14 | aws_secret_access_key: 15 | aws_session_token: 16 | -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part I MapReduce/mapreduce_lab5.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Lab 5 (Part I): Batch Processing Data with MapReduce 3 | 4 | What's the most used word in 5-star customer reviews on Amazon? 5 | 6 | We can answer this question using the mrjob package to investigate customer 7 | reviews available as a part of Amazon's public S3 customer reviews dataset. 
8 | 9 | For this demo, we'll use a small sample of this 100m+ review dataset that 10 | Amazon provides (s3://amazon-reviews-pds/tsv/sample_us.tsv). 11 | 12 | In order to run the code below, be sure to `pip install mrjob` if you have not 13 | done so already. 14 | ''' 15 | 16 | from mrjob.job import MRJob 17 | from mrjob.step import MRStep 18 | import re 19 | 20 | WORD_RE = re.compile(r"[\w']+") 21 | 22 | class MRMostUsedWord(MRJob): 23 | 24 | def mapper_get_words(self, _, row): 25 | ''' 26 | If a review's star rating is 5, yield all of the words in the review 27 | ''' 28 | data = row.split('\t') 29 | if data[7] == '5': 30 | for word in WORD_RE.findall(data[13]): 31 | yield (word.lower(), 1) 32 | 33 | def combiner_count_words(self, word, counts): 34 | ''' 35 | Sum all of the words available so far 36 | ''' 37 | yield (word, sum(counts)) 38 | 39 | def reducer_count_words(self, word, counts): 40 | ''' 41 | Arrive at a total count for each word in the 5 star reviews 42 | ''' 43 | yield None, (sum(counts), word) 44 | 45 | # discard the key; it is just None 46 | def reducer_find_max_word(self, _, word_count_pairs): 47 | ''' 48 | Yield the word that occurs the most in the 5 star reviews 49 | ''' 50 | yield max(word_count_pairs) 51 | 52 | def steps(self): 53 | return [ 54 | MRStep(mapper=self.mapper_get_words, 55 | combiner=self.combiner_count_words, 56 | reducer=self.reducer_count_words), 57 | MRStep(reducer=self.reducer_find_max_word) 58 | ] 59 | 60 | if __name__ == '__main__': 61 | MRMostUsedWord.run() 62 | -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part I MapReduce/mrjob_cheatsheet.md: -------------------------------------------------------------------------------- 1 | # mrjob Cheat Sheet 2 | 3 | To run/debug `mrjob` code locally from your command line: 4 | 5 | ``` 6 | python mapreduce_lab5.py sample_us.tsv 7 | ``` 8 | 9 | To run your `mrjob` code on an AWS EMR cluster, you should first ensure that your configuration file is set with your EC2 pem file name and file location, as well as your current credentials from AWS Education/Vocareum. Note that the credentials are listed with ":" here and not "=" as they are in your `credentials` file. `mrjob` assumes that this (`.mrjob.conf`) file will be located in your home directory (at `~/.mrjob.conf`), so you will need to put the file there. Otherwise, you will need to designate your configuration [as a command line option](https://mrjob.readthedocs.io/en/latest/cmd.html#create-cluster) when you start your `mrjob` job using the `-c` flag. 10 | 11 | ~/.mrjob.conf 12 | ``` 13 | runners: 14 | emr: 15 | # Specify a pem key to start up an EMR cluster on your behalf 16 | ec2_key_pair: MACS_30123 17 | ec2_key_pair_file: ~/MACS_30123.pem 18 | 19 | # Specify type/# of EC2 instances you want your code to run on 20 | core_instance_type: m5.xlarge 21 | num_core_instances: 3 22 | region: us-east-1 23 | 24 | # to read from/write to S3; note ":" instead of "=" from `credentials`: 25 | aws_access_key_id: 26 | aws_secret_access_key: 27 | aws_session_token: 28 | ``` 29 | 30 | To run your `mrjob` code on an EMR cluster (of the size and type specified in your configuration file), you can run the following command on the command line (Note: before running this code, ensure that you have created the default IAM "EMR roles" for your cluster via the AWS console by following the instructions in the lab video for today of starting up and terminating an EMR cluster). 
Note that this command will start up a cluster, run your job, and then automatically terminate the cluster for you. 31 | 32 | ``` 33 | python mapreduce_lab5.py -r emr sample_us.tsv 34 | ``` 35 | 36 | To create a stand-alone cluster that you can use to run multiple jobs on, you can run: 37 | ``` 38 | mrjob create-cluster 39 | ``` 40 | 41 | When you create a cluster, `mrjob` will print out the ID number for the cluster ("j" followed by a bunch of numbers and characters). If you copy this job-id number and specify it after the `--cluster-id` flag on the command line, your code will be run on this already-running cluster. Here, we additionally write the results of our job out to a text file (`mr.out`) via the `> mr.out` addendum at the end of the line. 42 | 43 | ``` 44 | python mapreduce_lab5.py -r emr sample_us.tsv --cluster-id= > mr.out 45 | ``` 46 | 47 | When you're done running jobs on your cluster, you can terminate the cluster with the following command so that you don't have to pay for it any longer than you need to: 48 | 49 | ``` 50 | mrjob terminate-cluster 51 | ``` 52 | 53 | For additional configuration options, consult the [`mrjob` documentation](https://mrjob.readthedocs.io/en/latest/index.html). 54 | -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part I MapReduce/sample_us.tsv: -------------------------------------------------------------------------------- 1 | marketplace customer_id review_id product_id product_parent product_title product_category star_rating helpful_votes total_votes vine verified_purchase review_headline review_body review_date 2 | US 18778586 RDIJS7QYB6XNR B00EDBY7X8 122952789 Monopoly Junior Board Game Toys 5 0 0 N Y Five Stars Excellent!!! 2015-08-31 3 | US 24769659 R36ED1U38IELG8 B00D7JFOPC 952062646 56 Pieces of Wooden Train Track Compatible with All Major Train Brands Toys 5 0 0 N Y Good quality track at excellent price Great quality wooden track (better than some others we have tried). Perfect match to the various vintages of Thomas track that we already have. There is enough track here to have fun and get creative incorporating your key pieces with track splits, loops and bends. 2015-08-31 4 | US 44331596 R1UE3RPRGCOLD B002LHA74O 818126353 Super Jumbo Playing Cards by S&S Worldwide Toys 2 1 1 N Y Two Stars Cards are not as big as pictured. 2015-08-31 5 | US 23310293 R298788GS6I901 B00ARPLCGY 261944918 Barbie Doll and Fashions Barbie Gift Set Toys 5 0 0 N Y my daughter loved it and i liked the price and it came ... my daughter loved it and i liked the price and it came to me rather than shopping with a ton of people around me. Amazon is the Best way to shop! 2015-08-31 6 | US 38745832 RNX4EXOBBPN5 B00UZOPOFW 717410439 Emazing Lights eLite Flow Glow Sticks - Spinning Light LED Toy Toys 1 1 1 N Y DONT BUY THESE! Do not buy these! They break very fast I spun then for 15 minutes and the end flew off don't waste your money. They are made from cheap plastic and have cracks in them. Buy the poi balls they work a lot better if you only have limited funds. 2015-08-31 7 | US 13394189 R3BPETL222LMIM B009B7F6CA 873028700 Melissa & Doug Water Wow Coloring Book - Vehicles Toys 5 0 0 N Y Five Stars Great item. Pictures pop thru and add detail as "painted." Pictures dry and it can be repainted. 
2015-08-31 8 | US 2749569 R3SORMPJZO3F2J B0101EHRSM 723424342 Big Bang Cosmic Pegasus (Pegasis) Metal 4D High Performance Generic Battling Top BB-105 Toys 3 2 2 N Y Three Stars To keep together, had to use crazy glue. 2015-08-31 9 | US 41137196 R2RDOJQ0WBZCF6 B00407S11Y 383363775 Fun Express Insect Finger Puppets 12ct Toy Toys 5 0 0 N Y Five Stars I was pleased with the product. 2015-08-31 10 | US 433677 R2B8VBEPB4YEZ7 B00FGPU7U2 780517568 Fisher-Price Octonauts Shellington's On-The-Go Pod Toy Toys 5 0 0 N Y Five Stars Children like it 2015-08-31 11 | US 1297934 R1CB783I7B0U52 B0013OY0S0 269360126 Claw Climber Goliath/ Disney's Gargoyles Toys 1 0 1 N Y Shame on the seller !!! Showed up not how it's shown . Was someone's old toy. with paint on it. 2015-08-31 12 | US 52006292 R2D90RQQ3V8LH B00519PJTW 493486387 100 Foot Multicolor Pennant Banner Toys 5 0 0 N Y Five Stars Really liked these. They were a little larger than I thought, but still fun. 2015-08-31 13 | US 32071052 R1Y4ZOUGFMJ327 B001TCY2DO 459122467 Pig Jumbo Foil Balloon Toys 5 0 0 N Y Nice huge balloon Nice huge balloon! Had my local grocery store fill it up for a very small fee, it was totally worth it! 2015-08-31 14 | US 7360347 R2BUV9QJI2A00X B00DOQCWF8 226984155 Minecraft Animal Toy (6-Pack) Toys 5 0 1 N Y Five Stars Great deal 2015-08-31 15 | US 11613707 RSUHRJFJIRB3Z B004C04I4I 375659886 Disney Baby: Eeyore Large Plush Toys 4 0 0 N Y Four Stars As Advertised 2015-08-31 16 | US 13545982 R1T96CG98BBA15 B00NWGEKBY 933734136 Team Losi 8IGHT-E RTR AVC Electric 4WD Buggy Vehicle (1/8 Scale) Toys 3 2 4 N Y ... servo so expect to spend 150 more on a good servo immediately be the stock one breaks right Comes w a 15$ servo so expect to spend 150 more on a good servo immediately be the stock one breaks right away 2015-08-31 17 | US 43880421 R2ATXF4QQ30YW B00000JS5S 341842639 Hot Wheels 48- Car storage Case With Easy Grip Carrying Case Toys 5 0 0 N Y Five Stars awesome ! Thanks! 2015-08-31 18 | US 1662075 R1YS3DS218NNMD B00XPWXYDK 210135375 ZuZo 2.4GHz 4 CH 6 Axis Gyro RC Quadcopter Drone with Camera & LED Lights, 38 x 38 x 7cm Toys 5 4 4 N N The closest relevance I have to items like these is while in the army I was trained ... I got this item for me and my son to play around with. The closest relevance I have to items like these is while in the army I was trained in the camera rc bots. This thing is awesome we tested the range and got somewhere close to 50 yards without an issue. Getting the controls is a bit tricky at first but after about twenty minutes you get the feel for it. The drone comes just about fly ready you just have to sync the controller. I am definitely a fan of the drones now. Only concern I have is maybe a little more silent but other than that great buy.

*Disclaimer I received this product at a discount for my unbiased review. 2015-08-31 19 | US 18461411 R2SDXLTLF92O0H B00VPXX92W 705054378 Teenage Mutant Ninja Turtles T-Machines Tiger Claw in Safari Truck Diecast Vehicle Toys 5 0 0 N Y Five Stars It was a birthday present for my grandson and he LOVES IT!! 2015-08-31 20 | US 27225859 R4R337CCDWLNG B00YRA3H4U 223420727 Franklin Sports MLB Fold Away Batting Tee Toys 3 0 1 Y N Got wrong product in the shipment Got a wrong product from Amazon Vine and unable to provide a good review. We received a pair of cute girls gloves and a baseball ball instead, while we were expecting a boys batting tee. The gloves are cute, however made for at least 6+ yrs or above...more likely 8-9 yrs old girls.

Can't provide a fair review as we were not able to use the product. 2015-08-31 21 | US 20494593 R32Z6UA4S5Q630 B009T8BSQY 787701676 Alien Frontiers: Factions Toys 1 0 0 N Y Overpriced. You need expansion packs 3-5 if you want access to the player aids for the Factions expansion. The base game of Alien Frontiers just plays so much smoother than adding Factions with the expansion packs. All this will do is pigeonhole you into a certain path to victory. 2015-08-31 22 | US 6762003 R1H1HOVB44808I B00PXWS1CY 996611871 Holy Stone F180C Mini RC Quadcopter Drone with Camera 2.4GHz 6-Axis Gyro Bonus Battery and 8 Blades Toys 5 1 1 N N Five Stars Awesome customer service and a cool little drone! Especially for the price! 2015-08-31 23 | US 25402244 R4UVQIRZ5T1FM 1591749352 741582499 Klutz Sticker Design Studio: Create Your Own Custom Stickers Craft Kit Toys 4 1 2 N Y Great product for little girls! I got these for my daughters for plane trip. I liked that the zipper pouch was attached for markers. However, that pouch fell off. But the girls have loved coloring their own stickers. Would def buy this again. 2015-08-31 24 | US 32910511 R226K8IJLRPTIR B00V5DM3RE 587799706 Yoga Joes - Green Army Men Toys Toys 5 0 1 N Y Creative and fun! My girlfriend and I are both into yoga and I gave her a set of the Yoga Joes for her new home yoga room. When she saw them, she was impressed that I had found little green army men like her brother used to play with. Then she realized they were doing yoga and she almost exploded with delight. You should have seen the look on her face. Needless to say, the gift was a huge hit. They are absolutely brilliant! 2015-08-31 25 | US 18206299 R3031Q42BKAN7J B00UMSVHD4 135383196 Lalaloopsy Girls Basic Doll- Prairie Dusty Trails Toys 4 1 1 N N i like it but i absoloutely hate that some dolls don't ... i like it but i absoloutely hate that some dolls don't have pets like this one so I'm not stoked and i really would have liked to see her pet 2015-08-31 26 | US 26599182 R44NP0QG6E98W B00JLKI69W 375626298 WOW Toys Town Advent Calendar Toys 3 1 1 N Y We love how well they are made We have MANY Wow toys in our home. We love how well they are made. The advent calendar is an exception. The plastic is thinner than our other wow toys and the barn animals won't even stand up on their own due to the head weighing more than the body and uneven base of the toy. Very disappointing when we love the concept. The story to read along with everyday is great. I would have preferred quality (and would have paid more for it) instead of a cheap "knock-off" of the rest of their toy line. It is very obvious by the weight of the toys sloppy paint jobs which items in our "Wow Toys" bin are part advent calendar vs. the rest of the toys we have. Also there is a lot of overlap between the wow town advent calendar & winter wonderland calendar, which makes me want to look for alternatives for this year's advent calendar options. 2015-08-31 27 | US 128540 R24VKWVWUMV3M3 B004S8F7QM 829220659 Cards Against Humanity Toys 5 0 0 N Y Five Stars Tons of fun 2015-08-31 28 | US 125518 R2MW3TEBPWKENS B00MZ6BR3Q 145562057 Monster High Haunted Student Spirits Porter Geiss Doll Toys 5 0 0 N Y Five Stars Love it great add to collecton 2015-08-31 29 | US 15048896 R3N01IESYEYW01 B001CTOC8O 278247652 Star Wars Clone Wars Clone Trooper Child's Deluxe Captain Rex Costume Toys 5 0 1 N Y Five Stars Exactly as described. Fits my 7-yr old well! 
2015-08-31 30 | US 12191231 RKLAK7EPEG5S6 B00BMKL5WY 906199996 LEGO Creative Tower Building Kit XXL 1600 Pieces 10664 Toys 5 1 2 N Y The children LOVED them and best part was that it helped the ... Purchased these Lego's to help aid me with teaching my Sunday School class. The children LOVED them and best part was that it helped the children remember past lessons. The true Lego brand seem to work/snap together/fit together better than the generic brands. (Plus I had some Lego snobs in my class that only would use the real Lego brand and shunned the generic brand). Only wish Lego's were a little cheaper but you get what you pay for and I would recommend for quality to purchase this set. 2015-08-31 31 | US 18409006 R1HOJ5GOA2JWM0 B00L71H0F4 692305292 Barkology Princess the Poodle Hand Puppet Toys 2 1 1 N Y My little dog can bite my hand right through the puupet. IT's OK, but not as good as the old Bite Meez puppets. This puppet is very thin. My little yorkie often bites my hands right through the stuffing. It not durable enough to really play with. 2015-08-31 32 | US 42523709 RO5VL1EAPX6O3 B004CLZRM4 59085350 Intex Mesh Lounge (Colors May Vary) Toys 1 0 0 N Y Save your money...don't buy! This was to be a gift for my husband for our new pool. Did not receive the color I ordered but most of all after only one month of use (not continuously) the mesh pulled away from the material and the inflatable side. Completely shredded and no longer of use. It was stored properly and was not kept outside or in the pool. Poorly made, better off going to W**-M*** and getting something on clearance. 2015-08-31 33 | US 45601416 R3OSJU70OIBWVE B000PEOMC8 895316207 Intex River Run I Sport Lounge, Inflatable Water Float, 53 in Diameter Toys 5 0 0 N Y but I've bought one in the past and loved Ended up sending this guy back because I didnt need it, but I've bought one in the past and loved it 2015-08-31 34 | US 47546726 R3NFZZCJSROBT4 B008W1BPWQ 397107238 Peppa Pig 7 Wood Puzzles In Wooden Storage Box (styles will vary) Toys 3 0 0 N Y Three Stars The product is good, but the box was broken. 2015-08-31 35 | US 21448082 R47XBGQFP039N B00FZX71BI 480992295 Paraboard - Parallel Charging Board for Lipos with EC5 Connectors Toys 5 0 0 N Y CAN SAVE TME IF YOU UNDESTAND HOW IT WORKS . Works well ,quality product but this style of board will charge multiple batteries at the same time SAFELY ( IF ) ALL BATTERIES ARE OF THE SAME CELL COUNT , THE SAME BATTERY COMPOSITION ( LIPO, NIMH-etc ) AND THEY MUST HAVE INDIVIDUAL CELL VOLTAGES THAT ARE VERY CLOSE AND EQUAL TO EACH BATTERY CONNECTED AT THE SAME TIME . When board is connected to most if not all chargers it can only read total and individual cell voltage of ONE OF THE BATTERIES AND MAY OVER OR UNDER CHARGE THE OTHERS TO SOME DEGREE , TOTAL RATE OF CHARGE IS DIVIDED EQUALLY BETWEEN BATTERIES CONNECTED AT THE SAME TIME . Close monitoring is a must when using like all high discharge batteies . I have only personally expeienced one lipo battery meltdown and it is a very SHORT IF NOT NON EXISTANT WINDOW OF OPPERUNITY TO PREVENT OR MINIMISE THE COLATTERAL DAMAGE ONCE THE PROCESS STARTS . Read and understand all charging and battery instructions . 2015-08-31 36 | US 12612039 R1JS8G26X4RM2G B00D4NJSJE 408940178 The Game of Life Money and Asset Board Game, Fame Edition Toys 5 0 1 N Y Five Stars Great gift! 2015-08-31 37 | US 44928701 R1ORWPFQ9EDYA0 B000HZZT7W 967346376 LCR Dice Game (Red Chips) Toys 5 0 0 N Y Five Stars We play this game quite a bit with friends. 
2015-08-31 38 | US 43173394 R1YIX4SO32U0GT B002G54DAA 57447684 BCW - Deluxe Currency Slab - Regular Bill - Dollar / Currency Collecting Supplies Toys 5 0 1 N Y BCW - Deluxe Currency Slab Fits my $20 bill perfectly. 2015-08-31 39 | US 11210951 R1W3QQZ8JKECCI B003JT0L4Y 876626440 Ocean Life Stamps Birthday Party Supplies Loot Bag Accessories 24 Pieces per Unit Toys 5 0 0 N Y Fun for birthday party favor I ordered these for my 3 year old son's birthday party as party favors. They were a huge hit & the perfect fit for a 3 year old! 2015-08-31 40 | US 12918717 RZX17JIYIPAR B00KQUNNZ8 644368246 New Age Scare Halloween Party Pumpkin and Bat Hanging Round Lantern Decoration, Paper, 9" Pack of 3 Toys 5 0 0 N N Love the prints! These paper lanterns are adorable! The colors are bright, the patterns are fun & trendy & they're a good size! They came well packaged & are easy to assemble, they're also really easy to take apart & flatten back down! We'll get to use them for a few Halloweens I'm sure! They even came with the string needed to hang! I'm glad I grabbed them, they're really cute, my kids are excited to add to the Halloween decor & are already asking to hang them!
I received this item at a discount for my unbiased review. 2015-08-31 41 | US 47781982 RIDVQ4P3WJR42 B00WTGGGRO 162262449 Pokemon - Double Dragon Energy (97/108) - XY Roaring Skies Toys 5 1 1 N Y Five Stars My Grandson loves these cards. Thank you 2015-08-31 42 | US 34874898 R1WQ3ME3JAG2O1 B00WAKEQLW 824555589 Whiffer Sniffers Mystery Pack 1 Scented Backpack Clip Toys 1 0 6 N Y One Star Received a pineapple rather than the advertised s'more 2015-08-31 43 | US 20962528 RNTPOUDQIICBF B00M5AT30G 548190970 AmiGami Fox and Owl Figure 2-Pack Toys 4 0 0 N Y Four Stars Christmas gift for 6yr 2015-08-31 44 | US 47781982 R3AHZWWOL0IAV0 B00GNDY40U 438056479 Pokemon - Gyarados (31/113) - Legendary Treasures Toys 5 0 0 N Y Five Stars My Grandson loves these cards. Thank you 2015-08-31 45 | US 13328687 R3PDXKS9O2Z20B B00WJ1OPMW 120071056 LeapFrog LeapTV Letter Factory Adventures Educational, Active Video Game Toys 5 0 0 N N they LOVE this game Even though both of my kids are at the top of this age recommendation level, they LOVE this game! I love how it caters to the kinesthetic learner by asking them to move their bodies into the shape of the letters. It even takes teamwork as sometimes two people are required to finish the letter. My kids know all of their letter sounds and shapes, but this didn't stop them from playing the game over and over. 2015-08-31 46 | US 16245463 R23URALWA7IHWL B00IGXV9UI 765869385 Disney Planes: Fire & Rescue Scoop & Spray Firefighter Dusty Toys 5 0 0 N Y Five Stars My 5 year old son loves this. 2015-08-31 47 | US 11916403 R36L8VKT9ZSUY6 B00JVY9J1M 771795950 Winston Zeddmore & Ecto-1: Funko POP! Rides x Ghostbusters Vinyl Figure Toys 5 0 0 N Y Five Stars love it 2015-08-31 48 | US 5543658 R23JRQR6VMY4TV B008AL15M8 211944547 Yu-Gi-Oh! - Solemn Judgment (GLD5-EN045) - Gold Series: Haunted Mine - Limited Edition - Ghost/Gold Hybrid Rare Toys 5 0 0 N Y Absolutely one of the best traps in the game Absolutely one of the best traps in the game. It is never a dead and always live since you can always pay half your lifepoints for its cost. It's main power is that it can stop any card. Hopefully this card comes off the Forbidden/Limited list soon. 2015-08-31 49 | US 41168357 R3T73PQZZ9F6GT B00CAEEDC0 72805974 Seat Pets Car Seat Toy Toys 5 0 0 N Y Five Stars really soft and cute 2015-08-31 50 | US 32866903 R300I65NW30Y19 B000TFLAZA 149264874 Baby Einstein Octoplush Toys 5 0 0 N Y Five Stars baby loved it - so attractive and very nice 2015-08-31 51 | -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/Lab 5 Kinesis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Lab 5 (Part II): Ingesting Streaming Data with Kinesis\n", 8 | "### MACS 30123: Large-Scale Computing for the Social Sciences (Spring 2020)\n", 9 | "\n", 10 | "In this second part of the lab, we'll explore how we can use Kinesis to ingest streaming text data, of the sort we might encounter on Twitter.\n", 11 | "\n", 12 | "To avoid requiring you to set up Twitter API access, we will create Twitter-like text and metadata using the `testdata` package to perform this demonstration. It should be easy enough to plug your streaming Twitter feed into this workflow if you desire to do so as an individual exercise (for instance, as a part of a final project!). 
Additionally, once you have this pipeline running, you can scale it up even further to include many more producers and consumers, if you would like, as discussed in lecture and the readings.\n", 13 | "\n", 14 | "Recall from the lecture and readings that in a Kinesis workflow, \"producers\" send data into a Kinesis stream and \"consumers\" draw data out of that stream to perform operations on it (i.e. real-time processing, archiving raw data, etc.). To make this a bit more concrete, we are going to implement a simplified version of this workflow in this lab, in which we spin up Producer and Consumer (t2.nano) EC2 Instances and create a Kinesis stream. Our Producer instance will run a producer script (which writes our Twitter-like text data into a Kinesis stream) and our Consumer instance will run a consumer script (which reads the Twitter-like data and calculates a simple demo statistic -- the average unique word count per tweet, as a real-time running average).\n", 15 | "\n", 16 | "You can visualize this data pipeline, like so:\n", 17 | "\n", 18 | "\n" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "To begin implementing this pipeline, let's import `boto3` and initialize the AWS services we'll be using in this lab (EC2 and Kinesis)." 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 1, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "import boto3\n", 35 | "import time\n", 36 | "\n", 37 | "session = boto3.Session()\n", 38 | "\n", 39 | "kinesis = session.client('kinesis')\n", 40 | "ec2 = session.resource('ec2')\n", 41 | "ec2_client = session.client('ec2')" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "Then, we need to create the Kinesis stream that our Producer EC2 instance will write streaming tweets to. Because we're only setting this up to handle traffic from one consumer and one producer, we'll just use one shard, but we could increase our throughput capacity by increasing the ShardCount if we wanted to do so." 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 2, 54 | "metadata": {}, 55 | "outputs": [], 56 | "source": [ 57 | "response = kinesis.create_stream(StreamName = 'test_stream',\n", 58 | " ShardCount = 1\n", 59 | " )\n", 60 | "\n", 61 | "# Is the stream active and ready to be written to/read from? Wait until it exists before moving on:\n", 62 | "waiter = kinesis.get_waiter('stream_exists')\n", 63 | "waiter.wait(StreamName='test_stream')" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "OK, now we're ready to set up our producer and consumer EC2 instances that will write to and read from this Kinesis stream. Let's spin up our two EC2 instances (specified by the `MaxCount` parameter) using one of the Amazon Linux AMIs. Notice here that you will need to specify your `.pem` file for the `KeyName` parameter, as well as create a custom security group/group ID. Designating a security group is necessary because, by default, AWS does not allow inbound ssh traffic into EC2 instances (they create custom ssh-friendly security groups each time you run the GUI wizard in the console). Thus, if you don't set this parameter, you will not be able to ssh into the EC2 instances that you create here with `boto3`. 
You can follow along in the lab video for further instructions on how you can set up one of these security groups.\n", 71 | "\n", 72 | "Also, we need to specify an IAM Instance Profile so that our EC2 instances will have the permissions necessary to interact with other AWS services on our behalf. Here, I'm using one of the profiles we create in Part I of Lab 5 (a default AWS profile for launching EC2 instances within an EMR cluster), as this gives us all of the necessary permissions" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 3, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "instances = ec2.create_instances(ImageId='ami-0915e09cc7ceee3ab',\n", 82 | " MinCount=1,\n", 83 | " MaxCount=2,\n", 84 | " InstanceType='t2.micro',\n", 85 | " KeyName='MACS_30123',\n", 86 | " SecurityGroupIds=['sg-02248fb2c9eac8bef'],\n", 87 | " SecurityGroups=['MACS_30123'],\n", 88 | " IamInstanceProfile=\n", 89 | " {'Name': 'EMR_EC2_DefaultRole'},\n", 90 | " )\n", 91 | "\n", 92 | "# Wait until EC2 instances are running before moving on\n", 93 | "waiter = ec2_client.get_waiter('instance_running')\n", 94 | "waiter.wait(InstanceIds=[instance.id for instance in instances])" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "While we wait for these instances to start running, let's set up the Python scripts that we want to run on each instance. First of all, we have to define a script for our Producer instance, which continuously produces Twitter-like data using the `testdata` package and puts that data into our Kinesis stream." 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 4, 107 | "metadata": {}, 108 | "outputs": [ 109 | { 110 | "name": "stdout", 111 | "output_type": "stream", 112 | "text": [ 113 | "Overwriting producer.py\n" 114 | ] 115 | } 116 | ], 117 | "source": [ 118 | "%%file producer.py\n", 119 | "\n", 120 | "import boto3\n", 121 | "import testdata\n", 122 | "import json\n", 123 | "\n", 124 | "kinesis = boto3.client('kinesis', region_name='us-east-1')\n", 125 | "\n", 126 | "# Continously write Twitter-like data into Kinesis stream\n", 127 | "while 1 == 1:\n", 128 | " test_tweet = {'username': testdata.get_username(),\n", 129 | " 'tweet': testdata.get_ascii_words(280)\n", 130 | " }\n", 131 | " kinesis.put_record(StreamName = \"test_stream\",\n", 132 | " Data = json.dumps(test_tweet),\n", 133 | " PartitionKey = \"partitionkey\"\n", 134 | " )" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "Then, we can define a script for our Consumer instance that gets the latest tweet out of the stream, one at a time. After processing each tweet, we then print out the average unique word count per processed tweet as a running average, before jumping on to the next indexed tweet in our Kinesis stream shard to do the same thing for as long as our program is running." 
142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 5, 147 | "metadata": {}, 148 | "outputs": [ 149 | { 150 | "name": "stdout", 151 | "output_type": "stream", 152 | "text": [ 153 | "Overwriting consumer.py\n" 154 | ] 155 | } 156 | ], 157 | "source": [ 158 | "%%file consumer.py\n", 159 | "\n", 160 | "import boto3\n", 161 | "import time\n", 162 | "import json\n", 163 | "\n", 164 | "kinesis = boto3.client('kinesis', region_name='us-east-1')\n", 165 | "\n", 166 | "shard_it = kinesis.get_shard_iterator(StreamName = \"test_stream\",\n", 167 | " ShardId = 'shardId-000000000000',\n", 168 | " ShardIteratorType = 'LATEST'\n", 169 | " )[\"ShardIterator\"]\n", 170 | "\n", 171 | "i = 0\n", 172 | "s = 0\n", 173 | "while 1==1:\n", 174 | " out = kinesis.get_records(ShardIterator = shard_it,\n", 175 | " Limit = 1\n", 176 | " )\n", 177 | " for o in out['Records']:\n", 178 | " jdat = json.loads(o['Data'])\n", 179 | " s = s + len(set(jdat['tweet'].split()))\n", 180 | " i = i + 1\n", 181 | "\n", 182 | " if i != 0:\n", 183 | " print(\"Average Unique Word Count Per Tweet: \" + str(s/i))\n", 184 | " print(\"Sample of Current Tweet: \" + jdat['tweet'][:20])\n", 185 | " print(\"\\n\")\n", 186 | " \n", 187 | " shard_it = out['NextShardIterator']\n", 188 | " time.sleep(0.2)" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "As our final preparation step, we'll grab all of the public DNS names of our instances (web addresses that you normally copy from the GUI console to manually ssh into and record the names of our code files, so that we can easily ssh/scp into the instances and pass them our Python scripts to run." 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 6, 201 | "metadata": {}, 202 | "outputs": [], 203 | "source": [ 204 | "instance_dns = [instance.public_dns_name \n", 205 | " for instance in ec2.instances.all() \n", 206 | " if instance.state['Name'] == 'running'\n", 207 | " ]\n", 208 | "\n", 209 | "code = ['producer.py', 'consumer.py']" 210 | ] 211 | }, 212 | { 213 | "cell_type": "markdown", 214 | "metadata": {}, 215 | "source": [ 216 | "To copy our files over to our instances and programmatically run commands on them, we can use Python's `scp` and `paramiko` packages. You'll need to install these via `pip install paramiko scp` if you have not already done so." 
217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": 7, 222 | "metadata": {}, 223 | "outputs": [ 224 | { 225 | "name": "stdout", 226 | "output_type": "stream", 227 | "text": [ 228 | "Requirement already satisfied: paramiko in /home/jclindaniel/anaconda3/lib/python3.7/site-packages (2.7.1)\n", 229 | "Requirement already satisfied: scp in /home/jclindaniel/anaconda3/lib/python3.7/site-packages (0.13.2)\n", 230 | "Requirement already satisfied: bcrypt>=3.1.3 in /home/jclindaniel/anaconda3/lib/python3.7/site-packages (from paramiko) (3.1.7)\n", 231 | "Requirement already satisfied: pynacl>=1.0.1 in /home/jclindaniel/anaconda3/lib/python3.7/site-packages (from paramiko) (1.3.0)\n", 232 | "Requirement already satisfied: cryptography>=2.5 in /home/jclindaniel/anaconda3/lib/python3.7/site-packages (from paramiko) (2.8)\n", 233 | "Requirement already satisfied: six>=1.4.1 in /home/jclindaniel/anaconda3/lib/python3.7/site-packages (from bcrypt>=3.1.3->paramiko) (1.14.0)\n", 234 | "Requirement already satisfied: cffi>=1.1 in /home/jclindaniel/anaconda3/lib/python3.7/site-packages (from bcrypt>=3.1.3->paramiko) (1.14.0)\n", 235 | "Requirement already satisfied: pycparser in /home/jclindaniel/anaconda3/lib/python3.7/site-packages (from cffi>=1.1->bcrypt>=3.1.3->paramiko) (2.19)\n" 236 | ] 237 | } 238 | ], 239 | "source": [ 240 | "! pip install paramiko scp" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "Once we have `scp` and `paramiko` installed, we can copy our producer and consumer Python scripts over to the EC2 instances (designating our first EC2 instance in `instance_dns` as the producer and second EC2 instance as the consumer instance).\n", 248 | "\n", 249 | "Note that, on each instance, we install `boto3` (so that we can access Kinesis through our scripts) and then copy our producer/consumer Python code over to our producer/consumer EC2 instance via `scp`. After we've done this, we install the `testdata` package on the producer instance (which it needs in order to create fake tweets) and instruct it to run our Python producer script. This will write tweets into our Kinesis stream until we stop the script and terminate the producer EC2 instance.\n", 250 | "\n", 251 | "We could also instruct our consumer to get tweets from the stream immediately after this command and this would automatically collect and process the tweets according to the consumer.py script. For the purposes of this demonstration, though, we'll manually ssh into that instance and run the code from the terminal so that we can see the real-time consumption a bit more easily." 
252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": 7, 257 | "metadata": {}, 258 | "outputs": [ 259 | { 260 | "name": "stdout", 261 | "output_type": "stream", 262 | "text": [ 263 | "Producer Instance is Running producer.py\n", 264 | ".........................................\n", 265 | "Connect to Consumer Instance by running: ssh -i \"MACS_30123.pem\" ec2-user@ec2-52-207-178-178.compute-1.amazonaws.com\n" 266 | ] 267 | } 268 | ], 269 | "source": [ 270 | "import paramiko\n", 271 | "from scp import SCPClient\n", 272 | "ssh_producer, ssh_consumer = paramiko.SSHClient(), paramiko.SSHClient()\n", 273 | "\n", 274 | "# Initialization of SSH tunnels takes a bit of time; otherwise get connection error on first attempt\n", 275 | "time.sleep(5)\n", 276 | "\n", 277 | "# Install boto3 on each EC2 instance and Copy our producer/consumer code onto producer/consumer EC2 instances\n", 278 | "instance = 0\n", 279 | "stdin, stdout, stderr = [[None, None] for i in range(3)]\n", 280 | "for ssh in [ssh_producer, ssh_consumer]:\n", 281 | " ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())\n", 282 | " ssh.connect(instance_dns[instance],\n", 283 | " username = 'ec2-user',\n", 284 | " key_filename='/home/jclindaniel/MACS_30123.pem')\n", 285 | " \n", 286 | " with SCPClient(ssh.get_transport()) as scp:\n", 287 | " scp.put(code[instance])\n", 288 | " \n", 289 | " if instance == 0:\n", 290 | " stdin[instance], stdout[instance], stderr[instance] = \\\n", 291 | " ssh.exec_command(\"sudo pip install boto3 testdata\")\n", 292 | " else:\n", 293 | " stdin[instance], stdout[instance], stderr[instance] = \\\n", 294 | " ssh.exec_command(\"sudo pip install boto3\")\n", 295 | "\n", 296 | " instance += 1\n", 297 | "\n", 298 | "# Block until Producer has installed boto3 and testdata, then start running Producer script:\n", 299 | "producer_exit_status = stdout[0].channel.recv_exit_status() \n", 300 | "if producer_exit_status == 0:\n", 301 | " ssh_producer.exec_command(\"python %s\" % code[0])\n", 302 | " print(\"Producer Instance is Running producer.py\\n.........................................\")\n", 303 | "else:\n", 304 | " print(\"Error\", producer_exit_status)\n", 305 | "\n", 306 | "# Close ssh and show connection instructions for manual access to Consumer Instance\n", 307 | "ssh_consumer.close; ssh_producer.close()\n", 308 | "\n", 309 | "print(\"Connect to Consumer Instance by running: ssh -i \\\"MACS_30123.pem\\\" ec2-user@%s\" % instance_dns[1])" 310 | ] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "metadata": {}, 315 | "source": [ 316 | "If you run the command above (with the correct path to your actual `.pem` file), you should be inside your Consumer EC2 instance. If you run `python consumer.py`, you should also see a real-time count of the average number of unique words per tweet (along with a sample of the text in the most recent tweet), as in the screenshot:\n", 317 | "\n", 318 | "![](consumer_feed.png)\n", 319 | "\n", 320 | "Cool! Now we can scale this basic architecture up to perform any number of real-time data analyses, if we so desire. Also, if we execute our consumer code remotely via paramiko as well, the process will be entirely remote, so we don't need to keep any local resources running in order to keep streaming/processing real-time data.\n", 321 | "\n", 322 | "As a final note, when you are finished observing the real-time feed from your consumer instance, **be sure to terminate your EC2 instances and delete your Kinesis stream**. 
You don't want to be paying for these to run continuously! You can do so programmatically by running the following `boto3` code:" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": 8, 328 | "metadata": {}, 329 | "outputs": [ 330 | { 331 | "name": "stdout", 332 | "output_type": "stream", 333 | "text": [ 334 | "EC2 Instances Successfully Terminated\n", 335 | "Kinesis Stream Successfully Deleted\n" 336 | ] 337 | } 338 | ], 339 | "source": [ 340 | "# Terminate EC2 Instances:\n", 341 | "ec2_client.terminate_instances(InstanceIds=[instance.id for instance in instances])\n", 342 | "\n", 343 | "# Confirm that EC2 instances were terminated:\n", 344 | "waiter = ec2_client.get_waiter('instance_terminated')\n", 345 | "waiter.wait(InstanceIds=[instance.id for instance in instances])\n", 346 | "print(\"EC2 Instances Successfully Terminated\")\n", 347 | "\n", 348 | "# Delete Kinesis Stream (if it currently exists):\n", 349 | "try:\n", 350 | " response = kinesis.delete_stream(StreamName='test_stream')\n", 351 | "except kinesis.exceptions.ResourceNotFoundException:\n", 352 | " pass\n", 353 | "\n", 354 | "# Confirm that Kinesis Stream was deleted:\n", 355 | "waiter = kinesis.get_waiter('stream_not_exists')\n", 356 | "waiter.wait(StreamName='test_stream')\n", 357 | "print(\"Kinesis Stream Successfully Deleted\")" 358 | ] 359 | } 360 | ], 361 | "metadata": { 362 | "kernelspec": { 363 | "display_name": "Python 3", 364 | "language": "python", 365 | "name": "python3" 366 | }, 367 | "language_info": { 368 | "codemirror_mode": { 369 | "name": "ipython", 370 | "version": 3 371 | }, 372 | "file_extension": ".py", 373 | "mimetype": "text/x-python", 374 | "name": "python", 375 | "nbconvert_exporter": "python", 376 | "pygments_lexer": "ipython3", 377 | "version": "3.7.3" 378 | } 379 | }, 380 | "nbformat": 4, 381 | "nbformat_minor": 4 382 | } 383 | -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/consumer.py: -------------------------------------------------------------------------------- 1 | 2 | import boto3 3 | import time 4 | import json 5 | 6 | kinesis = boto3.client('kinesis', region_name='us-east-1') 7 | 8 | shard_it = kinesis.get_shard_iterator(StreamName = "test_stream", 9 | ShardId = 'shardId-000000000000', 10 | ShardIteratorType = 'LATEST' 11 | )["ShardIterator"] 12 | 13 | i = 0 14 | s = 0 15 | while 1==1: 16 | out = kinesis.get_records(ShardIterator = shard_it, 17 | Limit = 1 18 | ) 19 | for o in out['Records']: 20 | jdat = json.loads(o['Data']) 21 | s = s + len(set(jdat['tweet'].split())) 22 | i = i + 1 23 | 24 | if i != 0: 25 | print("Average Unique Word Count Per Tweet: " + str(s/i)) 26 | print("Sample of Current Tweet: " + jdat['tweet'][:20]) 27 | print("\n") 28 | 29 | shard_it = out['NextShardIterator'] 30 | time.sleep(0.2) 31 | -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/consumer_feed.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jonclindaniel/LargeScaleComputing_S20/b1474da32282eb72fdc8206ad16a76f23e2e5c4a/Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/consumer_feed.png -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/producer.py: 
-------------------------------------------------------------------------------- 1 | 2 | import boto3 3 | import testdata 4 | import json 5 | 6 | kinesis = boto3.client('kinesis', region_name='us-east-1') 7 | 8 | # Continously write Twitter-like data into Kinesis stream 9 | while 1 == 1: 10 | test_tweet = {'username': testdata.get_username(), 11 | 'tweet': testdata.get_ascii_words(280) 12 | } 13 | kinesis.put_record(StreamName = "test_stream", 14 | Data = json.dumps(test_tweet), 15 | PartitionKey = "partitionkey" 16 | ) 17 | -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/simple_kinesis_architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jonclindaniel/LargeScaleComputing_S20/b1474da32282eb72fdc8206ad16a76f23e2e5c4a/Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/simple_kinesis_architecture.png -------------------------------------------------------------------------------- /Labs/Lab 6 PySpark EDA and ML in an EMR Notebook/Lab_6.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Lab 6: Exploratory Data Analysis and Machine Learning in an EMR Notebook\n", 8 | "\n", 9 | "Today, we'll explore how you can use your PySpark coding skills in an [EMR Notebook](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks.html) on AWS. Specifically, we'll be working with all of the customer reviews for books in [Amazon's large customer reviews dataset on S3](https://s3.amazonaws.com/amazon-reviews-pds/readme.html). Note that this notebook is meant to be run on an EMR cluster, using a PySpark kernel, as demonstrated in the Lab video.\n", 10 | "\n", 11 | "First, let's load the customer book reviews data from S3 into our Spark session. The books data is spread across multiple parquet files (as we can see via the AWS CLI below), so we use the wildcard (\\*) to indicate that we want the data from all of these files to be included within our dataframe, spread out over our EMR cluster in partitions." 
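(A hedged aside, not a cell from the original notebook: one quick way to see the "spread out over partitions" point in practice is to ask Spark how many partitions it created once the DataFrame is loaded. `rdd.getNumPartitions()` is a standard PySpark call; the S3 path is the same public Books prefix used in this lab, and the variable name `data` matches the cell where the notebook actually loads the DataFrame.)

```
# Sketch only -- assumes the EMR PySpark kernel's pre-built `spark` session, as in this notebook
data = spark.read.parquet('s3://amazon-reviews-pds/parquet/product_category=Books/*.parquet')

# Each partition is processed independently by executors across the EMR cluster
print('Number of partitions:', data.rdd.getNumPartitions())
```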
12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 1, 17 | "metadata": {}, 18 | "outputs": [ 19 | { 20 | "name": "stdout", 21 | "output_type": "stream", 22 | "text": [ 23 | "2018-04-09 06:35:58 1.0 GiB part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet\n", 24 | "2018-04-09 06:35:59 1.0 GiB part-00001-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet\n", 25 | "2018-04-09 06:36:00 1.0 GiB part-00002-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet\n", 26 | "2018-04-09 06:36:00 1.0 GiB part-00003-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet\n", 27 | "2018-04-09 06:36:00 1.0 GiB part-00004-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet\n", 28 | "2018-04-09 06:36:33 1.0 GiB part-00005-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet\n", 29 | "2018-04-09 06:36:35 1.0 GiB part-00006-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet\n", 30 | "2018-04-09 06:36:35 1.0 GiB part-00007-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet\n", 31 | "2018-04-09 06:36:35 1.0 GiB part-00008-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet\n", 32 | "2018-04-09 06:36:35 1.0 GiB part-00009-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet\n", 33 | "\n", 34 | "Total Objects: 10\n", 35 | " Total Size: 10.2 GiB\n" 36 | ] 37 | } 38 | ], 39 | "source": [ 40 | "%%bash \n", 41 | "aws s3 ls s3://amazon-reviews-pds/parquet/product_category=Books/ --summarize --human-readable" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 2, 47 | "metadata": {}, 48 | "outputs": [ 49 | { 50 | "data": { 51 | "application/vnd.jupyter.widget-view+json": { 52 | "model_id": "6fc7ede6213243089f51b4bd4cd1ce5f", 53 | "version_major": 2, 54 | "version_minor": 0 55 | }, 56 | "text/plain": [ 57 | "VBox()" 58 | ] 59 | }, 60 | "metadata": {}, 61 | "output_type": "display_data" 62 | }, 63 | { 64 | "name": "stdout", 65 | "output_type": "stream", 66 | "text": [ 67 | "Starting Spark application\n" 68 | ] 69 | }, 70 | { 71 | "data": { 72 | "text/html": [ 73 | "\n", 74 | "
ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session?
0 | application_1589820935408_0001 | pyspark | idle | Link | Link |
" 75 | ], 76 | "text/plain": [ 77 | "" 78 | ] 79 | }, 80 | "metadata": {}, 81 | "output_type": "display_data" 82 | }, 83 | { 84 | "data": { 85 | "application/vnd.jupyter.widget-view+json": { 86 | "model_id": "", 87 | "version_major": 2, 88 | "version_minor": 0 89 | }, 90 | "text/plain": [ 91 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 92 | ] 93 | }, 94 | "metadata": {}, 95 | "output_type": "display_data" 96 | }, 97 | { 98 | "name": "stdout", 99 | "output_type": "stream", 100 | "text": [ 101 | "SparkSession available as 'spark'.\n" 102 | ] 103 | }, 104 | { 105 | "data": { 106 | "application/vnd.jupyter.widget-view+json": { 107 | "model_id": "", 108 | "version_major": 2, 109 | "version_minor": 0 110 | }, 111 | "text/plain": [ 112 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 113 | ] 114 | }, 115 | "metadata": {}, 116 | "output_type": "display_data" 117 | } 118 | ], 119 | "source": [ 120 | "data = spark.read.parquet('s3://amazon-reviews-pds/parquet/product_category=Books/*.parquet')" 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": {}, 126 | "source": [ 127 | "Now that we have our data loaded in a DataFrame, let's take a look at its structure and contents. We can see that there is a lot of data here (10 GB worth!), even in this small subset of the overall (~50 GB) Amazon Customer Reviews dataset." 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 3, 133 | "metadata": {}, 134 | "outputs": [ 135 | { 136 | "data": { 137 | "application/vnd.jupyter.widget-view+json": { 138 | "model_id": "b571022b0d504d04baef37ff63f163b6", 139 | "version_major": 2, 140 | "version_minor": 0 141 | }, 142 | "text/plain": [ 143 | "VBox()" 144 | ] 145 | }, 146 | "metadata": {}, 147 | "output_type": "display_data" 148 | }, 149 | { 150 | "data": { 151 | "application/vnd.jupyter.widget-view+json": { 152 | "model_id": "", 153 | "version_major": 2, 154 | "version_minor": 0 155 | }, 156 | "text/plain": [ 157 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 158 | ] 159 | }, 160 | "metadata": {}, 161 | "output_type": "display_data" 162 | }, 163 | { 164 | "name": "stdout", 165 | "output_type": "stream", 166 | "text": [ 167 | "Total Columns: 15\n", 168 | "Total Rows: 20726160\n", 169 | "root\n", 170 | " |-- marketplace: string (nullable = true)\n", 171 | " |-- customer_id: string (nullable = true)\n", 172 | " |-- review_id: string (nullable = true)\n", 173 | " |-- product_id: string (nullable = true)\n", 174 | " |-- product_parent: string (nullable = true)\n", 175 | " |-- product_title: string (nullable = true)\n", 176 | " |-- star_rating: integer (nullable = true)\n", 177 | " |-- helpful_votes: integer (nullable = true)\n", 178 | " |-- total_votes: integer (nullable = true)\n", 179 | " |-- vine: string (nullable = true)\n", 180 | " |-- verified_purchase: string (nullable = true)\n", 181 | " |-- review_headline: string (nullable = true)\n", 182 | " |-- review_body: string (nullable = true)\n", 183 | " |-- review_date: date (nullable = true)\n", 184 | " |-- year: integer (nullable = true)" 185 | ] 186 | } 187 | ], 188 | "source": [ 189 | "print('Total Columns: %d' % len(data.dtypes))\n", 190 | "print('Total Rows: %d' % data.count())\n", 191 | "data.printSchema()" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "We can 
take a look at (preview) data from all of our columns at once (using `data.show()`), but this is a bit busy on the screen, so let's select a subset of the columns and take a look at the first 20 rows of our data." 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": 4, 204 | "metadata": {}, 205 | "outputs": [ 206 | { 207 | "data": { 208 | "application/vnd.jupyter.widget-view+json": { 209 | "model_id": "a2769ba2f97e4ea9a1da43a4e455dabc", 210 | "version_major": 2, 211 | "version_minor": 0 212 | }, 213 | "text/plain": [ 214 | "VBox()" 215 | ] 216 | }, 217 | "metadata": {}, 218 | "output_type": "display_data" 219 | }, 220 | { 221 | "data": { 222 | "application/vnd.jupyter.widget-view+json": { 223 | "model_id": "", 224 | "version_major": 2, 225 | "version_minor": 0 226 | }, 227 | "text/plain": [ 228 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 229 | ] 230 | }, 231 | "metadata": {}, 232 | "output_type": "display_data" 233 | }, 234 | { 235 | "name": "stdout", 236 | "output_type": "stream", 237 | "text": [ 238 | "+--------------------+-----------+-----------+\n", 239 | "| product_title|total_votes|star_rating|\n", 240 | "+--------------------+-----------+-----------+\n", 241 | "|Standing Qigong f...| 10| 5|\n", 242 | "|A Universe from N...| 7| 4|\n", 243 | "|Hyacinth Girls: A...| 0| 4|\n", 244 | "| Bared to You| 1| 5|\n", 245 | "| Healer: A Novel| 0| 5|\n", 246 | "|The Missionary Po...| 7| 4|\n", 247 | "|I'm Tired of Bein...| 1| 4|\n", 248 | "|Fifty Shades of G...| 7| 1|\n", 249 | "|The Thrill of Vic...| 0| 4|\n", 250 | "|Fifty Shades of G...| 9| 5|\n", 251 | "|Romeo and Juliet ...| 0| 4|\n", 252 | "|Wheat Belly: Lose...| 1| 5|\n", 253 | "|Dangerous Dessert...| 0| 5|\n", 254 | "|Consciousness Bey...| 0| 5|\n", 255 | "|The Catcher in th...| 6| 1|\n", 256 | "|Fearless: The Und...| 0| 4|\n", 257 | "|Best-Ever Big Sister| 0| 5|\n", 258 | "| The Book Thief| 1| 5|\n", 259 | "| Large Print Sudoku| 1| 5|\n", 260 | "|Wild: From Lost t...| 1| 5|\n", 261 | "+--------------------+-----------+-----------+\n", 262 | "only showing top 20 rows" 263 | ] 264 | } 265 | ], 266 | "source": [ 267 | "data[['product_title', 'total_votes', 'star_rating']].show()" 268 | ] 269 | }, 270 | { 271 | "cell_type": "markdown", 272 | "metadata": {}, 273 | "source": [ 274 | "Using PySpark, we can calculate any number of additional summary statistics using standard SQL-like operations on our DataFrame." 
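(As a minimal sketch, not a cell from the original notebook: the same "SQL-like" aggregation can also be written as literal SQL by registering the DataFrame as a temporary view. `createOrReplaceTempView` and `spark.sql` are standard PySpark APIs.)

```
# Register the reviews DataFrame so it can be queried with plain SQL
data.createOrReplaceTempView("reviews")

# Equivalent of the groupBy/sum/sort shown in the next cell
spark.sql("""
    SELECT star_rating,
           SUM(total_votes)   AS sum_total_votes,
           SUM(helpful_votes) AS sum_helpful_votes
    FROM reviews
    GROUP BY star_rating
    ORDER BY star_rating DESC
""").show()
```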
275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": 5, 280 | "metadata": {}, 281 | "outputs": [ 282 | { 283 | "data": { 284 | "application/vnd.jupyter.widget-view+json": { 285 | "model_id": "136da8cfe79b4e12acfdcc4b1f01b7d2", 286 | "version_major": 2, 287 | "version_minor": 0 288 | }, 289 | "text/plain": [ 290 | "VBox()" 291 | ] 292 | }, 293 | "metadata": {}, 294 | "output_type": "display_data" 295 | }, 296 | { 297 | "data": { 298 | "application/vnd.jupyter.widget-view+json": { 299 | "model_id": "", 300 | "version_major": 2, 301 | "version_minor": 0 302 | }, 303 | "text/plain": [ 304 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 305 | ] 306 | }, 307 | "metadata": {}, 308 | "output_type": "display_data" 309 | }, 310 | { 311 | "name": "stdout", 312 | "output_type": "stream", 313 | "text": [ 314 | "+-----------+----------------+------------------+\n", 315 | "|star_rating|sum(total_votes)|sum(helpful_votes)|\n", 316 | "+-----------+----------------+------------------+\n", 317 | "| 5| 54808322| 44825468|\n", 318 | "| 4| 13954363| 11100563|\n", 319 | "| 3| 10117356| 7021927|\n", 320 | "| 2| 9010874| 5581929|\n", 321 | "| 1| 22624009| 10985502|\n", 322 | "+-----------+----------------+------------------+" 323 | ] 324 | } 325 | ], 326 | "source": [ 327 | "stars_votes = (data.groupBy('star_rating')\n", 328 | " .sum('total_votes', 'helpful_votes')\n", 329 | " .sort('star_rating', ascending=False)\n", 330 | " )\n", 331 | "stars_votes.show()" 332 | ] 333 | }, 334 | { 335 | "cell_type": "markdown", 336 | "metadata": {}, 337 | "source": [ 338 | "We can even plot the results of our summarizations using standard Python packages like Pandas and Seaborn. To do so within an EMR notebook, we will need to install these packages. If we list the packages currently available to us in our SparkContext, you can see that these packages are not already installed, so we will need to install them." 
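(A small hedged note, not in the original notebook: `sc.install_pypi_package` also accepts a pinned requirement string if you need a specific version on the cluster. The versions below simply mirror the ones that end up installed in the pip output further down; treat them as illustrative.)

```
# Notebook-scoped installs on the EMR cluster; pinned versions are illustrative only
sc.install_pypi_package("pandas==1.0.3")
sc.install_pypi_package("seaborn==0.10.1")
sc.list_packages()  # confirm the packages are now visible to the Spark session
```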
339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": 6, 344 | "metadata": {}, 345 | "outputs": [ 346 | { 347 | "data": { 348 | "application/vnd.jupyter.widget-view+json": { 349 | "model_id": "e0e469b8a7db453fbf68453f8d83b3a0", 350 | "version_major": 2, 351 | "version_minor": 0 352 | }, 353 | "text/plain": [ 354 | "VBox()" 355 | ] 356 | }, 357 | "metadata": {}, 358 | "output_type": "display_data" 359 | }, 360 | { 361 | "data": { 362 | "application/vnd.jupyter.widget-view+json": { 363 | "model_id": "", 364 | "version_major": 2, 365 | "version_minor": 0 366 | }, 367 | "text/plain": [ 368 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 369 | ] 370 | }, 371 | "metadata": {}, 372 | "output_type": "display_data" 373 | }, 374 | { 375 | "name": "stdout", 376 | "output_type": "stream", 377 | "text": [ 378 | "Package Version\n", 379 | "-------------------------- -------\n", 380 | "beautifulsoup4 4.9.0 \n", 381 | "boto 2.49.0 \n", 382 | "jmespath 0.9.5 \n", 383 | "lxml 4.5.0 \n", 384 | "mysqlclient 1.4.2 \n", 385 | "nltk 3.4.5 \n", 386 | "nose 1.3.4 \n", 387 | "numpy 1.16.5 \n", 388 | "pip 9.0.1 \n", 389 | "py-dateutil 2.2 \n", 390 | "python37-sagemaker-pyspark 1.3.0 \n", 391 | "pytz 2019.3 \n", 392 | "PyYAML 5.3.1 \n", 393 | "setuptools 28.8.0 \n", 394 | "six 1.13.0 \n", 395 | "soupsieve 1.9.5 \n", 396 | "wheel 0.29.0 \n", 397 | "windmill 1.6" 398 | ] 399 | } 400 | ], 401 | "source": [ 402 | "sc.list_packages()" 403 | ] 404 | }, 405 | { 406 | "cell_type": "code", 407 | "execution_count": 7, 408 | "metadata": {}, 409 | "outputs": [ 410 | { 411 | "data": { 412 | "application/vnd.jupyter.widget-view+json": { 413 | "model_id": "d78e1df3176d4a7696e100237e74bb74", 414 | "version_major": 2, 415 | "version_minor": 0 416 | }, 417 | "text/plain": [ 418 | "VBox()" 419 | ] 420 | }, 421 | "metadata": {}, 422 | "output_type": "display_data" 423 | }, 424 | { 425 | "data": { 426 | "application/vnd.jupyter.widget-view+json": { 427 | "model_id": "", 428 | "version_major": 2, 429 | "version_minor": 0 430 | }, 431 | "text/plain": [ 432 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 433 | ] 434 | }, 435 | "metadata": {}, 436 | "output_type": "display_data" 437 | }, 438 | { 439 | "name": "stdout", 440 | "output_type": "stream", 441 | "text": [ 442 | "Collecting seaborn\n", 443 | " Downloading https://files.pythonhosted.org/packages/c7/e6/54aaaafd0b87f51dfba92ba73da94151aa3bc179e5fe88fc5dfb3038e860/seaborn-0.10.1-py3-none-any.whl (215kB)\n", 444 | "Collecting pandas>=0.22.0 (from seaborn)\n", 445 | " Downloading https://files.pythonhosted.org/packages/4a/6a/94b219b8ea0f2d580169e85ed1edc0163743f55aaeca8a44c2e8fc1e344e/pandas-1.0.3-cp37-cp37m-manylinux1_x86_64.whl (10.0MB)\n", 446 | "Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib64/python3.7/site-packages (from seaborn)\n", 447 | "Collecting scipy>=1.0.1 (from seaborn)\n", 448 | " Downloading https://files.pythonhosted.org/packages/dd/82/c1fe128f3526b128cfd185580ba40d01371c5d299fcf7f77968e22dfcc2e/scipy-1.4.1-cp37-cp37m-manylinux1_x86_64.whl (26.1MB)\n", 449 | "Collecting matplotlib>=2.1.2 (from seaborn)\n", 450 | " Downloading https://files.pythonhosted.org/packages/b2/c2/71fcf957710f3ba1f09088b35776a799ba7dd95f7c2b195ec800933b276b/matplotlib-3.2.1-cp37-cp37m-manylinux1_x86_64.whl (12.4MB)\n", 451 | "Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/site-packages (from 
pandas>=0.22.0->seaborn)\n", 452 | "Collecting python-dateutil>=2.6.1 (from pandas>=0.22.0->seaborn)\n", 453 | " Downloading https://files.pythonhosted.org/packages/d4/70/d60450c3dd48ef87586924207ae8907090de0b306af2bce5d134d78615cb/python_dateutil-2.8.1-py2.py3-none-any.whl (227kB)\n", 454 | "Collecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 (from matplotlib>=2.1.2->seaborn)\n", 455 | " Downloading https://files.pythonhosted.org/packages/8a/bb/488841f56197b13700afd5658fc279a2025a39e22449b7cf29864669b15d/pyparsing-2.4.7-py2.py3-none-any.whl (67kB)\n", 456 | "Collecting cycler>=0.10 (from matplotlib>=2.1.2->seaborn)\n", 457 | " Downloading https://files.pythonhosted.org/packages/f7/d2/e07d3ebb2bd7af696440ce7e754c59dd546ffe1bbe732c8ab68b9c834e61/cycler-0.10.0-py2.py3-none-any.whl\n", 458 | "Collecting kiwisolver>=1.0.1 (from matplotlib>=2.1.2->seaborn)\n", 459 | " Downloading https://files.pythonhosted.org/packages/31/b9/6202dcae729998a0ade30e80ac00f616542ef445b088ec970d407dfd41c0/kiwisolver-1.2.0-cp37-cp37m-manylinux1_x86_64.whl (88kB)\n", 460 | "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil>=2.6.1->pandas>=0.22.0->seaborn)\n", 461 | "Installing collected packages: python-dateutil, pandas, scipy, pyparsing, cycler, kiwisolver, matplotlib, seaborn\n", 462 | "Successfully installed cycler-0.10.0 kiwisolver-1.2.0 matplotlib-3.2.1 pandas-1.0.3 pyparsing-2.4.7 python-dateutil-2.8.1 scipy-1.4.1 seaborn-0.10.1\n", 463 | "\n", 464 | "Requirement already satisfied: pandas in /mnt/tmp/1589821122437-0/lib/python3.7/site-packages\n", 465 | "Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/site-packages (from pandas)\n", 466 | "Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib64/python3.7/site-packages (from pandas)\n", 467 | "Requirement already satisfied: python-dateutil>=2.6.1 in /mnt/tmp/1589821122437-0/lib/python3.7/site-packages (from pandas)\n", 468 | "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil>=2.6.1->pandas)" 469 | ] 470 | } 471 | ], 472 | "source": [ 473 | "sc.install_pypi_package(\"seaborn\")\n", 474 | "sc.install_pypi_package(\"pandas\")" 475 | ] 476 | }, 477 | { 478 | "cell_type": "code", 479 | "execution_count": 8, 480 | "metadata": {}, 481 | "outputs": [ 482 | { 483 | "data": { 484 | "application/vnd.jupyter.widget-view+json": { 485 | "model_id": "fa2b9668a2974c4da8d1e69e5c8a4918", 486 | "version_major": 2, 487 | "version_minor": 0 488 | }, 489 | "text/plain": [ 490 | "VBox()" 491 | ] 492 | }, 493 | "metadata": {}, 494 | "output_type": "display_data" 495 | }, 496 | { 497 | "data": { 498 | "application/vnd.jupyter.widget-view+json": { 499 | "model_id": "", 500 | "version_major": 2, 501 | "version_minor": 0 502 | }, 503 | "text/plain": [ 504 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 505 | ] 506 | }, 507 | "metadata": {}, 508 | "output_type": "display_data" 509 | }, 510 | { 511 | "name": "stdout", 512 | "output_type": "stream", 513 | "text": [ 514 | "Package Version\n", 515 | "-------------------------- -------\n", 516 | "beautifulsoup4 4.9.0 \n", 517 | "boto 2.49.0 \n", 518 | "cycler 0.10.0 \n", 519 | "jmespath 0.9.5 \n", 520 | "kiwisolver 1.2.0 \n", 521 | "lxml 4.5.0 \n", 522 | "matplotlib 3.2.1 \n", 523 | "mysqlclient 1.4.2 \n", 524 | "nltk 3.4.5 \n", 525 | "nose 1.3.4 \n", 526 | "numpy 1.16.5 \n", 527 | "pandas 1.0.3 \n", 528 | "pip 9.0.1 \n", 529 | "py-dateutil 
2.2 \n", 530 | "pyparsing 2.4.7 \n", 531 | "python-dateutil 2.8.1 \n", 532 | "python37-sagemaker-pyspark 1.3.0 \n", 533 | "pytz 2019.3 \n", 534 | "PyYAML 5.3.1 \n", 535 | "scipy 1.4.1 \n", 536 | "seaborn 0.10.1 \n", 537 | "setuptools 28.8.0 \n", 538 | "six 1.13.0 \n", 539 | "soupsieve 1.9.5 \n", 540 | "wheel 0.29.0 \n", 541 | "windmill 1.6" 542 | ] 543 | } 544 | ], 545 | "source": [ 546 | "sc.list_packages()" 547 | ] 548 | }, 549 | { 550 | "cell_type": "markdown", 551 | "metadata": {}, 552 | "source": [ 553 | "Now that our packages have been successfully installed, let's plot some of our data. Note, though, that because our data is so large, we do not want to send all of it into a Pandas DataFrame-- we want to take advantage of the distributed nature of our Spark DataFrame to perform any computations and then send only small segments of our data into Pandas and our plotting libraries, so that they will be able to handle it in-memory. Here, we use the `toPandas()` method to only send the small dataframe that summarizes the total number of votes for each star rating -- all of the computation was done on the Spark DataFrame.\n", 554 | "\n", 555 | "It looks like 5 star reviews had a lot more \"helpful\" votes than all of the other star ratings." 556 | ] 557 | }, 558 | { 559 | "cell_type": "code", 560 | "execution_count": 9, 561 | "metadata": {}, 562 | "outputs": [ 563 | { 564 | "data": { 565 | "application/vnd.jupyter.widget-view+json": { 566 | "model_id": "c0dbcdefc8e84b9f9b7d5dbff8f5f061", 567 | "version_major": 2, 568 | "version_minor": 0 569 | }, 570 | "text/plain": [ 571 | "VBox()" 572 | ] 573 | }, 574 | "metadata": {}, 575 | "output_type": "display_data" 576 | }, 577 | { 578 | "data": { 579 | "application/vnd.jupyter.widget-view+json": { 580 | "model_id": "", 581 | "version_major": 2, 582 | "version_minor": 0 583 | }, 584 | "text/plain": [ 585 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 586 | ] 587 | }, 588 | "metadata": {}, 589 | "output_type": "display_data" 590 | }, 591 | { 592 | "data": { 593 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAYAAAA10dzkAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAAPYQAAD2EBqD+naQAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAgAElEQVR4nO3de1TUdeL/8deIAsrNBMQ1EdwUyxAkb5FdvBu2FrZ5WpdWtN12KzWVtTzU5qWjXyxXMy+RXy3RVpfNe79as/Qr2Jr2VczU3MwMhZKiUrloTsDM749+za9ZrFCBz3x8Px/ncE7zmeHjy+a75/vsMzPgcLvdbgEAAMAYTaweAAAAgMZFAAIAABiGAAQAADAMAQgAAGAYAhAAAMAwBCAAAIBhCEAAAADDEIAAAACGIQABAAAMQwACAAAYhgAEAAAwDAEIAABgGAIQAADAMAQgAACAYQhAAAAAwxCAAAAAhiEAAQAADEMAAgAAGIYABAAAMAwBCAAAYBgCEAAAwDAEIAAAgGEIQAAAAMMQgAAAAIYhAAEAAAxDAAIAABiGAAQAADAMAQgAAGAYAhAAAMAwBCAAAIBhCEAAAADDEIAAAACGIQABAAAMQwACAAAYhgAEAAAwDAEIAABgGAIQAADAMAQgAACAYQhAAAAAwxCAAAAAhiEAAQAADEMAAgAAGIYArCc7duzQsGHD1LZtWzkcDm3cuPGivn/69OlyOBy1voKCghpoMQAAMBUBWE/Onj2rxMRELV68+JK+f/LkySopKfH66tKli0aMGFHPSwEAgOkIwHqSkpKimTNnavjw4Re83+l0avLkybr66qsVFBSk3r17Ky8vz3N/cHCw2rRp4/n64osvdPjwYf3+979vpL8BAAAwBQHYSMaNG6ddu3YpNzdXBw4c0IgRI3T77bfr6NGjF3z8smXLFBcXp1tuuaWRlwIAgCsdAdgIioqKtHz5cq1Zs0a33HKLrrnmGk2ePFk333yzli9fXuvx58+f16pVq7j6BwAAGkRTqweY4ODBg6qpqVFcXJzXcafTqfDw8FqP37BhgyoqKpSent5YEwEAgEEIwEZQWVkpPz8/FRQUyM/Pz+u+4ODgWo9ftmyZfvWrXykqKqqxJgIAAIMQgI0gKSlJNTU1Ki0t/dn39BUWFmr79u169dVXG2kdAAAwDQFYTyorK/Xxxx97bhcWFmr//v1q1aqV4uLilJaWplGjRmnu3LlKSkrSl19+qW3btikhIUF33HGH5/teeukl/eIXv1BKSooVfw0AAGAAh9vtdls94kqQl5enfv361Tqenp6unJwcVVVVaebMmVq5cqU+++wzRURE6MYbb9SMGTPUtWtXSZLL5VJMTIxGjRqlWbNmNfZfAQAAGIIABAAAMAw/BgYAAMAwBCAAAIBhCEAAAADD8Cngy+ByuXTy5EmFhITI4XBYPQcAANSB2+1WRUWF2rZtqyZNzLwWRgBehpMnTyo6OtrqGQAA4BIUFxerXbt2Vs+wBAF4GUJCQiR9939AoaGhFq8BAAB1UV5erujoaM//HzcRAXgZvn/ZNzQ0lAAEAMBmTH77lpkvfAMAABiMAAQAADAMAQgAAGAYAhAAAMAwBCAAAIBhCEAAAADDEIAAAACGIQABAAAMQwACAAAYhgAEAAAwDAEIAABgGAIQAADAMAQgAACAYQhAAAAAwzS1egAAAPC26M//x+oJtjVu7jCrJ9gCVwABAAAMQwACAAAYhgAEAAAwDAEIAABgGAIQAADAMAQgAACAYQhAAAAAwxCAAAAAhiEAAQAADEMAAgAAGIYABAAAMAwBCAAAYBgCEAAAwDAEIAAAgGEIQAAAAMMQgAAAAIYhAAEAAAxDAAIAABiGAAQAADAMAQgAAGAYAhAAAMAwBCAAAIBhCEAAAADDEIAAAACGIQABAAAMQwACAAAYhgAEAAAwDAEIAABgGAIQAADAMAQgAACAYQhAAAAAwxCAAAAAhiEAAQAADEMA/j+zZ8+Ww+HQxIkTrZ4CAADQoAhASXv27NGSJUuUkJBg9RQAAIAGZ3wAVlZWKi0tTUuXLtVVV11l9RwAAIAGZ3wAjh07VnfccYcGDhz4s491Op0qLy/3+gIAALCbplYPsFJubq727dunPXv21OnxWVlZmjFjRgOvAgAAaFjGXgEsLi7WhAkTtGrVKgUGBtbpezIzM1VWVub5Ki4ubuCVAAAA9c/YK4AFBQUqLS3VDTfc4DlWU1OjHTt2aNGiRXI6nfLz8/P6noCAAAUEBDT2VAAAgHplbAAOGDBABw8e9Do2ZswYXXvttZoyZUqt+AMAALhSGBuAISEhio+P9zoWFBSk8PDwWscBAACuJMa+BxAAAMBUxl4BvJC8vDyrJwAAADQ4rgACAAAYhgAEAAAwDAEIAABgGAIQAADAMAQgAACAYQhAAAAAwxCAAAAAhiEAAQAADEMAAgAAGIYABAAAMAwBCAAAYBgCEAAAwDAEIAAAgGEIQAAAAMMQgAAAAIYhAAEAAAxDAAIAABiGAAQAADAMAQgAAGAYAhAAAMAwBCAAAIBhCEAAAADDEIAAAACGIQABAAAMQwACAAAYhgAEAAAwDAEIAABgGAIQAADAMAQgAACAYQhAAAAAwxCAAAAAhiEAAQAADEMAAgAAGIYABAAAMAwBCAAAYBgCEAAAwDAEIAAAgGEIQAAAAMMQgAAAAIYhAAEAAAxDAAIAABiGAAQAADAMAQgAAGAYAhAAAMAwBCAAAIBhCEAAAADDEIAAAACGIQABAAAMQwACAAAYhgAEAAAwDAEIAABgGAIQAADAMAQgAACAYQhAAAAAwxCAAAAAhiEAAQAADEMAAgAAGIYABAAAMAwBCAAAYJimVg+4FIWFhXr77bd14sQJnTt3TpGRkUpKSlJycrICAwOtngcAAODTbBWAq1at0nPPPae9e/cqKipKbdu2VfPmzXXq1CkdO3ZMgYGBSktL05QpUxQTE2P1XAAAAJ9kmwBMSkqSv7+/Ro8erXXr1ik6OtrrfqfTqV27dik3N1c9evTQ888/rxEjRli0FgAAwHfZJgBnz56tIUOG/Oj9AQEB6tu3r/r27atZs2bp+PHjjTcOAADARmwTgD8Vf/8pPDxc4eHhDbgGAADAvmz5KeB9+/bp4MGDntubNm1SamqqHn/8cX377bcWLgMAAPB9tgzAP/3pT/roo48kSZ988ol+85vfqEWLFlqzZo0ee+wxi9cBAAD4NlsG4EcffaRu3bpJktasWaNbb71Vq1evVk5OjtatW2fxOgAAAN9mywB0u91yuVySpK1bt2ro0KGSpOjoaH311Vd1Pk92drYSEhIUGhqq0NBQJScna/PmzQ2yGQAAwFfYMgB79OihmTNn6uWXX1Z+fr7uuOMOSd/9gOioqKg6n6ddu3aaPXu2CgoKtHfvXvXv31933XWXPvjgg4aaDgAAYDlbBuD8+fO1b98+jRs3Tk888YQ6duwoSVq7dq1uuummOp9n2LBhGjp0qDp16qS4uDjNmjVLwcHB2r17d0NNBwAAsJ
xtfgzMDyUkJHh9Cvh7c+bMkZ+f3yWds6amRmvWrNHZs2eVnJx8uRMBAAB8li0DUJLOnDmjtWvX6tixY3r00UfVqlUrHT58WFFRUbr66qvrfJ6DBw8qOTlZ58+fV3BwsDZs2KAuXbpc8LFOp1NOp9Nzu7y8/LL/HgAAAI3Nli8BHzhwQJ06ddLTTz+tv/71rzpz5owkaf369crMzLyoc3Xu3Fn79+/Xu+++q4ceekjp6ek6fPjwBR+blZWlsLAwz9d//jo6AAAAO7BlAGZkZGjMmDE6evSoAgMDPceHDh2qHTt2XNS5/P391bFjR3Xv3l1ZWVlKTEzUc889d8HHZmZmqqyszPNVXFx8WX8PAAAAK9jyJeA9e/ZoyZIltY5fffXV+vzzzy/r3C6Xy+tl3h8KCAhQQEDAZZ0fAADAarYMwICAgAu+/+6jjz5SZGRknc+TmZmplJQUtW/fXhUVFVq9erXy8vK0ZcuW+pwLAADgU2z5EvCdd96pp556SlVVVZIkh8OhoqIiTZkyRb/+9a/rfJ7S0lKNGjVKnTt31oABA7Rnzx5t2bJFgwYNaqjpAAAAlrPlFcC5c+fqnnvuUevWrfXNN9/otttu0+eff67k5GTNmjWrzud58cUXG3AlAACAb7JlAIaFhemtt97Szp079f7776uyslI33HCDBg4caPU0AAAAn2fLAFy5cqXuvfde9enTR3369PEc//bbb5Wbm6tRo0ZZuA4AAMC32fI9gGPGjFFZWVmt4xUVFRozZowFiwAAAOzDlgHodrvlcDhqHf/0008VFhZmwSIAAAD7sNVLwElJSXI4HHI4HBowYICaNv3/82tqalRYWKjbb7/dwoUAAAC+z1YBmJqaKknav3+/hgwZouDgYM99/v7+io2NvagfAwMAAGAiWwXgtGnTJEmxsbG69957vX4NHAAAAOrGVgH4vfT0dElSQUGB/v3vf0uSrr/+eiUlJVk5CwAAwBZsGYClpaX6zW9+o7y8PLVs2VKSdObMGfXr10+5ubkX9evgAAAATGPLTwGPHz9eFRUV+uCDD3Tq1CmdOnVKhw4dUnl5uR555BGr5wEAAPg0W14BfOONN7R161Zdd911nmNdunTR4sWLNXjwYAuXAQAA+D5bXgF0uVxq1qxZrePNmjWTy+WyYBEAAIB92DIA+/fvrwkTJujkyZOeY5999pkmTZqkAQMGWLgMAADA99kyABctWqTy8nLFxsbqmmuu0TXXXKMOHTqovLxcCxcutHoeAACAT7PlewCjo6O1b98+bd26VR9++KEk6brrrtPAgQMtXgYAAOD7bBmAxcXFio6O1qBBgzRo0CCr5wAAANiKLV8Cjo2N1W233aalS5fq9OnTVs8BAACwFVsG4N69e9WrVy899dRT+sUvfqHU1FStXbtWTqfT6mkAAAA+z5YBmJSUpDlz5qioqEibN29WZGSk/vjHPyoqKkr333+/1fMAAAB8mi0D8HsOh0P9+vXT0qVLtXXrVnXo0EErVqywehYAAIBPs3UAfvrpp3rmmWfUrVs39erVS8HBwVq8eLHVswAAAHyaLT8FvGTJEq1evVo7d+7Utddeq7S0NG3atEkxMTFWTwMAAPB5tgzAmTNnauTIkVqwYIESExOtngMAAGArtgzAoqIiORyOn33cww8/rKeeekoRERGNsAoAAMAebPkewLrEnyT97W9/U3l5eQOvAQAAsBdbBmBdud1uqycAAAD4nCs6AAEAAFAbAQgAAGAYAhAAAMAwBCAAAIBhrugAvO+++xQaGmr1DAAAAJ9im58DeODAgTo/NiEhQZKUnZ3dUHMAAABsyzYB2K1bNzkcjh/90S7f3+dwOFRTU9PI6wAAAOzDNgFYWFho9QQAAIArgm0CMCYmxuoJAAAAVwTbBOAPrVy58ifvHzVqVCMtAQAAsB9bBuCECRO8bldVVencuXPy9/dXixYtCEAAAICfYMsfA3P69Gmvr8rKSh05ckQ333yz/v73v1s9DwAAwKfZMgAvpFOnTpo9e3atq4MAAADwdsUEoCQ1bdpUJ0+etHoGAACAT7PlewBfffVVr9tut1slJSVatGiR+vTpY9EqAAAAe7BlAKampnrddjgcioyMVP/+/TV37lyLVgEAANiDbQKwvLzc83t9XS6XxWsAAADsyzbvAbzqqqtUWloqSerfv7/OnDlj8SIAAAB7sk0ABgcH6+uvv5Yk5eXlqaqqyuJFAAAA9mSbl4AHDhyofv366brrrpMkDR8+XP7+/hd87P/8z/805jQAAABbsU0A/u1vf9OKFSt07Ngx5efn6/rrr1eLFi2sngUAAGA7tgnA5s2b68EHH5Qk7d27V08//bRatmxp8SoAAAD7sU0A/tD27ds9/+x2uyV996NgAAAA8PNs8yGQ//Tiiy8qPj5egYGBCgwMVHx8vJYtW2b1LAAAAJ9nyyuAU6dO1bx58zR+/HglJydLknbt2qVJkyapqKhITz31lMULAQAAfJctAzA7O1tLly7VyJEjPcfuvPNOJSQkaPz48QQgAADAT7DlS8BVVVXq0aNHrePdu3dXdXW1BYsAAADsw5YB+Lvf/U7Z2dm1jv/3f/+30tLSLFgEAABgH7Z8CVj67kMgb775pm688UZJ0rvvvquioiKNGjVKGRkZnsfNmzfPqokAAAA+yZYBeOjQId1www2SpGPHjkmSIiIiFBERoUOHDnkex4+GAQAAqM2WAfjDnwMIAACAi2PL9wACAADg0tnmCuDdd99d58euX7++AZcAAADYm20CMCwszOoJAAAAVwTbBODy5cutngAAAHBFsO17AKurq7V161YtWbJEFRUVkqSTJ0+qsrLS4mUAAAC+zTZXAH/oxIkTuv3221VUVCSn06lBgwYpJCRETz/9tJxOp1544QWrJwIAAPgsW14BnDBhgnr06KHTp0+refPmnuPDhw/Xtm3bLFwGAADg+2x5BfDtt9/WO++8I39/f6/jsbGx+uyzzyxaBQAAYA+2vALocrlUU1NT6/inn36qkJAQCxYBAADYhy0DcPDgwZo/f77ntsPhUGVlpaZNm6ahQ4dauAwAAMD32fIl4Llz52rIkCHq0qWLzp8/r9/+9rc6evSoIiIi9Pe//93qeQAAAD7NllcA27Vrp/fff1+PP/64Jk2apKSkJM2ePVvvvfeeWrduXefzZGVlqWfPngoJCVHr1q2VmpqqI0eONOByAAAA69nyCqAkNW3aVPfdd99lnSM/P19jx45Vz549VV1drccff1yDBw/W4cOHFRQUVE9LAQAAfIttA/Do0aPavn27SktL5XK5vO6bOnVqnc7xxhtveN3OyclR69atVVBQoFtvvbXetgIAAPgSWwbg0qVL9dBDDykiIkJt2rSRw+Hw3OdwOOocgP+prKxMktSqVasL3u90OuV0Oj23y8vLL+nPAQAAsJItA3DmzJmaNWuWpkyZUm/ndLlcmjhxovr06aP4+PgLPiYrK0szZsyotz8TAADACrb8EMjp06c1YsSIej3n2LFjdejQIeXm5v7oYzIzM1VWVub5Ki4urtcNA
AAAjcGWAThixAi9+eab9Xa+cePG6bXXXtP27dvVrl27H31cQECAQkNDvb4AAADsxjYvAS9YsMDzzx07dtSTTz6p3bt3q2vXrmrWrJnXYx955JE6ndPtdmv8+PHasGGD8vLy1KFDh3rdDAAA4ItsE4DPPvus1+3g4GDl5+crPz/f67jD4ahzAI4dO1arV6/Wpk2bFBISos8//1ySFBYWpubNm9fPcAAAAB9jmwAsLCys93NmZ2dLkvr27et1fPny5Ro9enS9/3kAAAC+wDYB2BDcbrfVEwAAABqdbT4EMnv2bJ07d65Oj3333Xf1+uuvN/AiAAAAe7JNAB4+fFgxMTF6+OGHtXnzZn355Zee+6qrq3XgwAE9//zzuummm3TvvfcqJCTEwrUAAAC+yzYvAa9cuVLvv/++Fi1apN/+9rcqLy+Xn5+fAgICPFcGk5KS9Ic//EGjR49WYGCgxYsBAAB8k20CUJISExO1dOlSLVmyRAcOHNCJEyf0zTffKCIiQt26dVNERITVEwEAAHyerQLwe02aNFG3bt3UrVs3q6cAAADYji0D8HulpaUqLS2Vy+XyOp6QkGDRIgAAAN9nywAsKChQenq6/v3vf9f6US4Oh0M1NTUWLQMAAPB9tgzA+++/X3FxcXrxxRcVFRUlh8Nh9SQAAADbsGUAfvLJJ1q3bp06duxo9RQAAADbsc3PAfyhAQMG6P3337d6BgAAgC3Z8grgsmXLlJ6erkOHDik+Pl7NmjXzuv/OO++0aBkAAIDvs2UA7tq1Szt37tTmzZtr3ceHQAAAAH6aLV8CHj9+vO677z6VlJTI5XJ5fRF/AAAAP82WAfj1119r0qRJioqKsnoKAACA7dgyAO+++25t377d6hkAAAC2ZMv3AMbFxSkzM1P/+te/1LVr11ofAnnkkUcsWgYAAOD7bBmAy5YtU3BwsPLz85Wfn+91n8PhIAABAAB+gi0DsLCw0OoJAAAAtmXL9wACAADg0tnyCuD999//k/e/9NJLjbQEAADAfmwZgKdPn/a6XVVVpUOHDunMmTPq37+/RasAAADswZYBuGHDhlrHXC6XHnroIV1zzTUWLAIAALCPK+Y9gE2aNFFGRoaeffZZq6cAAAD4tCsmACXp2LFjqq6utnoGAACAT7PlS8AZGRlet91ut0pKSvT6668rPT3dolUAAAD2YMsAfO+997xuN2nSRJGRkZo7d+7PfkIYAADAdLYMwNdff11ut1tBQUGSpOPHj2vjxo2KiYlR06a2/CsBAAA0Glu+BzA1NVUvv/yyJOnMmTO68cYbNXfuXKWmpio7O9vidQAAAL7NlgG4b98+3XLLLZKktWvXKioqSidOnNDKlSu1YMECi9cBAAD4NlsG4Llz5xQSEiJJevPNN3X33XerSZMmuvHGG3XixAmL1wEAAPg2WwZgx44dtXHjRhUXF2vLli0aPHiwJKm0tFShoaEWrwMAAPBttgzAqVOnavLkyYqNjVXv3r2VnJws6burgUlJSRavAwAA8G22/MjsPffco5tvvlklJSVKTEz0HB8wYICGDx9u4TIAAADfZ8sAlKQ2bdqoTZs2Xsd69epl0RoAAAD7sOVLwAAAALh0BCAAAIBhCEAAAADDEIAAAACGIQABAAAMQwACAAAYhgAEAAAwDAEIAABgGAIQAADAMAQgAACAYQhAAAAAwxCAAAAAhiEAAQAADEMAAgAAGIYABAAAMAwBCAAAYBgCEAAAwDAEIAAAgGEIQAAAAMMQgAAAAIYhAAEAAAxDAAIAABiGAAQAADAMAQgAAGAYAhAAAMAwBCAAAIBhCEAAAADDEIAAAACGIQABAAAM09TqAQAA35B/621WT7Ct23bkWz0BuCgEYCPp/uhKqyfYVsGcUVZPAADgisJLwAAAAIYhAAEAAAxjdADu2LFDw4YNU9u2beVwOLRx40arJwEAADQ4owPw7NmzSkxM1OLFi62eAgAA0GiM/hBISkqKUlJSrJ4BAADQqIy+AggAAGAio68AXiyn0ymn0+m5XV5ebuEaAACAS8MVwIuQlZWlsLAwz1d0dLTVkwAAAC4aAXgRMjMzVVZW5vkqLi62ehIAAMBF4yXgixAQEKCAgACrZwAAAFwWowOwsrJSH3/8sed2YWGh9u/fr1atWql9+/YWLgMAAGg4Rgfg3r171a9fP8/tjIwMSVJ6erpycnIsWgUAANCwjA7Avn37yu12Wz0DAACgUfEhEAAAAMMQgAAAAIYhAAEAAAxDAAIAABiGAAQAADAMAQgAAGAYAhAAAMAwBCAAAIBhCEAAAADDEIAAAACGIQABAAAMQwACAAAYhgAEAAAwDAEIAABgGAIQAADAMAQgAACAYQhAAAAAwzS1egAAc/VZ2MfqCba2c/xOqycAsCmuAAIAABiGAAQAADAMAQgAAGAYAhAAAMAwBCAAAIBhCEAAAADDEIAAAACGIQABAAAMQwACAAAYhgAEAAAwDL8KDsYpeqqr1RNsq/3Ug1ZPAADUA64AAgAAGIYABAAAMAwBCAAAYBgCEAAAwDAEIAAAgGEIQAAAAMMQgAAAAIYhAAEAAAxDAAIAABiGAAQAADAMAQgAAGAYAhAAAMAwBCAAAIBhCEAAAADDEIAAAACGIQABAAAMQwACAAAYhgAEAAAwDAEIAABgGAIQAADAMAQgAACAYQhAAAAAwxCAAAAAhiEAAQAADEMAAgAAGIYABAAAMAwBCAAAYBgCEAAAwDAEIAAAgGEIQAAAAMMQgAAAAIYhAAEAAAxDAAIAABiGAAQAADAMAQgAAGAYAhAAAMAwBCAAAIBhCEAAAADDEIAAAACGMT4AFy9erNjYWAUGBqp379763//9X6snAQAANCijA/Af//iHMjIyNG3aNO3bt0+JiYkaMmSISktLrZ4GAADQYIwOwHnz5umBBx7QmDFj1KVLF73wwgtq0aKFXnrpJaunAQAANJimVg+wyrfffquCggJlZmZ6jjVp0kQDBw7Url27Lvg9TqdTTqfTc7usrEySVF5e/rN/Xo3zm8tcbK66/Pu9GBXna+r1fCap7+ei+pvqej2faer7+ThbzfNxqer7ufjGea5ez2eSujwX3z/G7XY39ByfZWwAfvXVV6qpqVFUVJTX8aioKH344YcX/J6srCzNmDGj1vHo6OgG2YjvhC180OoJ+F5WmNUL8ANhU3g+fEYYz4WveGxx3R9bUVGhMEOfO2MD8FJkZmYqIyPDc9vlcunUqVMKDw+Xw+GwcNnlKS8vV3R0tIqLixUaGmr1HKPxXPgOngvfwXPhO66U58LtdquiokJt27a1eopljA3AiIgI+fn56YsvvvA6/sUXX6hNmzYX/J6AgAAFBAR4HWvZsmWDbWxsoaGhtv4f9JWE58J38Fz4Dp4L33ElPBemXvn7nrEfAvH391f37t21bds2zzGXy6Vt27YpOTnZwmUAAAANy9grgJKUkZGh9PR09ejRQ7169dL8+fN19uxZjRkzxuppAAAADcZv+vTp060eYZX4+Hi1bNlSs2bN0l//+ldJ0qpVq9S5c2eLlzU+Pz8/
9e3bV02bGv3fBD6B58J38Fz4Dp4L38FzcWVwuE3+DDQAAICBjH0PIAAAgKkIQAAAAMMQgAAAAIYhAAEAAAxDABpsx44dGjZsmNq2bSuHw6GNGzdaPclIWVlZ6tmzp0JCQtS6dWulpqbqyJEjVs8yVnZ2thISEjw/6DY5OVmbN2+2ehYkzZ49Ww6HQxMnTrR6inGmT58uh8Ph9XXttddaPQuXgQA02NmzZ5WYmKjFiy/iFyei3uXn52vs2LHavXu33nrrLVVVVWnw4ME6e/as1dOM1K5dO82ePVsFBQXau3ev+vfvr7vuuksffPCB1dOMtmfPHi1ZskQJCQlWTzHW9ddfr5KSEs/Xv/71L6sn4TLwQ3wMlpKSopSUFKtnGO+NN97wup2Tk6PWrVuroKBAt956q0WrzDVs2DCv27NmzVJ2drZ2796t66+/3qJVZqusrFRaWgmUyrAAAAi6SURBVJqWLl2qmTNnWj3HWE2bNv3RX5UK++EKIOBjysrKJEmtWrWyeAlqamqUm5urs2fP8isiLTR27FjdcccdGjhwoNVTjHb06FG1bdtWv/zlL5WWlqaioiKrJ+EycAUQ8CEul0sTJ05Unz59FB8fb/UcYx08eFDJyck6f/68goODtWHDBnXp0sXqWUbKzc3Vvn37tGfPHqunGK13797KyclR586dVVJSohkzZuiWW27RoUOHFBISYvU8XAICEPAhY8eO1aFDh3hvjcU6d+6s/fv3q6ysTGvXrlV6erry8/OJwEZWXFysCRMm6K233lJgYKDVc4z2w7cLJSQkqHfv3oqJidErr7yi3//+9xYuw6UiAAEfMW7cOL322mvasWOH2rVrZ/Uco/n7+6tjx46SpO7du2vPnj167rnntGTJEouXmaWgoEClpaW64YYbPMdqamq0Y8cOLVq0SE6nU35+fhYuNFfLli0VFxenjz/+2OopuEQEIGAxt9ut8ePHa8OGDcrLy1OHDh2snoT/4HK55HQ6rZ5hnAEDBujgwYNex8aMGaNrr71WU6ZMIf4sVFlZqWPHjul3v/ud1VNwiQhAg1VWVnr911thYaH279+vVq1aqX379hYuM8vYsWO1evVqbdq0SSEhIfr8888lSWFhYWrevLnF68yTmZmplJQUtW/fXhUVFVq9erXy8vK0ZcsWq6cZJyQkpNZ7YYOCghQeHs57ZBvZ5MmTNWzYMMXExOjkyZOaNm2a/Pz8NHLkSKun4RIRgAbbu3ev+vXr57mdkZEhSUpPT1dOTo5Fq8yTnZ0tSerbt6/X8eXLl2v06NGNP8hwpaWlGjVqlEpKShQWFqaEhARt2bJFgwYNsnoaYJlPP/1UI0eO1Ndff63IyEjdfPPN2r17tyIjI62ehkvkcLvdbqtHAAAAoPHwcwABAAAMQwACAAAYhgAEAAAwDAEIAABgGAIQAADAMAQgAACAYQhAAAAAwxCAAAAAhiEAAaABxcbGav78+VbPAAAvBCAAnzd69GilpqZaPeMn5eTkqGXLlrWO79mzR3/84x8tWAQAP44ABGCMb7/9tlG+54ciIyPVokWLyzoHANQ3AhCAz1i7dq26du2q5s2bKzw8XAMHDtSjjz6qFStWaNOmTXI4HHI4HMrLy5MkTZkyRXFxcWrRooV++ctf6sknn1RVVZXnfNOnT1e3bt20bNkydejQQYGBgT+7oW/fvho3bpwmTpyoiIgIDRkyRJI0b948de3aVUFBQYqOjtbDDz+syspKSVJeXp7GjBmjsrIyz8bp06dLqv0SsMPh0LJlyzR8+HC1aNFCnTp10quvvuq14dVXX1WnTp0UGBiofv36acWKFXI4HDpz5szl/OsFAI+mVg8AAEkqKSnRyJEj9cwzz2j48OGqqKjQ22+/rVGjRqmoqEjl5eVavny5JKlVq1aSpJCQEOXk5Kht27Y6ePCgHnjgAYWEhOixxx7znPfjjz/WunXrtH79evn5+dVpy4oVK/TQQw9p586dnmNNmjTRggUL1KFDB33yySd6+OGH9dhjj+n555/XTTfdpPnz52vq1Kk6cuSIJCk4OPhHzz9jxgw988wzmjNnjhYuXKi0tDSdOHFCrVq1UmFhoe655x5NmDBBf/jDH/Tee+9p8uTJF/3vEwB+khsAfEBBQYFbkvv48eO17ktPT3ffddddP3uOOXPmuLt37+65PW3aNHezZs3cpaWldd5x2223uZOSkn72cWvWrHGHh4d7bi9fvtwdFhZW63ExMTHuZ5991nNbkvsvf/mL53ZlZaVbknvz5s1ut9vtnjJlijs+Pt7rHE888YRbkvv06dN1/nsAwE/hCiAAn5CYmKgBAwaoa9euGjJkiAYPHqx77rlHV1111Y9+zz/+8Q8tWLBAx44dU2VlpaqrqxUaGur1mJiYGEVGRl7Ulu7du9c6tnXrVmVlZenDDz9UeXm5qqurdf78eZ07d+6i3+OXkJDg+eegoCCFhoaqtLRUknTkyBH17NnT6/G9evW6qPMDwM/hPYAAfIKfn5/eeustbd68WV26dNHChQvVuXNnFRYWXvDxu3btUlpamoYOHarXXntN7733np544olaH9oICgq66C3/+T3Hjx/Xr371KyUkJGjdunUqKCjQ4sWLJV3ah0SaNWvmddvhcMjlcl30eQDgUnEFEIDPcDgc6tOnj/r06aOpU6cqJiZGGzZskL+/v2pqarwe+8477ygmJkZPPPGE59iJEycaZFdBQYFcLpfmzp2rJk2+++/mV155xesxF9p4KTp37qx//vOfXsf27Nlz2ecFgB/iCiAAn/Duu+/qv/7rv7R3714VFRVp/fr1+vLLL3XdddcpNjZWBw4c0JEjR/TVV1+pqqpKnTp1UlFRkXJzc3Xs2DEtWLBAGzZsaJBtHTt2VFVVlRYuXKhPPvlEL7/8sl544QWvx8TGxqqyslLbtm3TV199pXPnzl3Sn/WnP/1JH374oaZMmaKPPvpIr7zyinJyciR9F8gAUB8IQAA+ITQ0VDt27NDQoUMVFxenv/zlL5o7d65SUlL0wAMPqHPnzurRo4ciIyO1c+dO3XnnnZo0aZLGjRunbt266Z133tGTTz7ZINsSExM1b948Pf3004qPj9eqVauUlZXl9ZibbrpJDz74oO69915FRkbqmWeeuaQ/q0OHDlq7dq3Wr1+vhIQEZWdne65yBgQEXPbfBQAkyeF2u91WjwAA/LhZs2bphRdeUHFxsdVTAFwheA8gAPiY559/Xj179lR4eLh27typOXPmaNy4cVbPAnAFIQABGKOoqEhdunT50fsPHz6s9u3bN+KiCzt69KhmzpypU6dOqX379vrzn/+szMxMq2cBuILwEjAAY1RXV+v48eM/en9sbKyaNuW/iwFc+QhAAAAAw/ApYAAAAMMQgAAAAIYhAAEAAAxDAAIAABiGAAQAADAMAQgAAGAYAhAAAMAw/xdq920e4IJshwAAAABJRU5ErkJggg==\n", 594 | "text/plain": [ 595 | "" 596 | ] 597 | }, 598 | "metadata": {}, 599 | 
"output_type": "display_data" 600 | } 601 | ], 602 | "source": [ 603 | "import seaborn as sns\n", 604 | "import matplotlib.pyplot as plt\n", 605 | "\n", 606 | "df = stars_votes.toPandas()\n", 607 | "\n", 608 | "# Close previous plots; otherwise, will just overwrite and display again\n", 609 | "plt.close()\n", 610 | "\n", 611 | "sns.barplot(x='star_rating', y='sum(helpful_votes)', data=df)\n", 612 | "%matplot plt" 613 | ] 614 | }, 615 | { 616 | "cell_type": "markdown", 617 | "metadata": {}, 618 | "source": [ 619 | "Given the order of magnitude difference in 5-star helpful votes, we might wonder whether there are simply more 5 star reviews than other star counts. It seems like there are indeed a lot more 5-star ratings than anything else." 620 | ] 621 | }, 622 | { 623 | "cell_type": "code", 624 | "execution_count": 10, 625 | "metadata": {}, 626 | "outputs": [ 627 | { 628 | "data": { 629 | "application/vnd.jupyter.widget-view+json": { 630 | "model_id": "33c9769f9c4a4556850430b94a80d872", 631 | "version_major": 2, 632 | "version_minor": 0 633 | }, 634 | "text/plain": [ 635 | "VBox()" 636 | ] 637 | }, 638 | "metadata": {}, 639 | "output_type": "display_data" 640 | }, 641 | { 642 | "data": { 643 | "application/vnd.jupyter.widget-view+json": { 644 | "model_id": "", 645 | "version_major": 2, 646 | "version_minor": 0 647 | }, 648 | "text/plain": [ 649 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 650 | ] 651 | }, 652 | "metadata": {}, 653 | "output_type": "display_data" 654 | }, 655 | { 656 | "name": "stdout", 657 | "output_type": "stream", 658 | "text": [ 659 | "+-----------+--------+\n", 660 | "|star_rating| count|\n", 661 | "+-----------+--------+\n", 662 | "| 5|13662131|\n", 663 | "| 4| 3546319|\n", 664 | "| 3| 1543611|\n", 665 | "| 2| 861867|\n", 666 | "| 1| 1112232|\n", 667 | "+-----------+--------+" 668 | ] 669 | } 670 | ], 671 | "source": [ 672 | "stars = (data.groupBy('star_rating')\n", 673 | " .count()\n", 674 | " .sort('star_rating', ascending=False)\n", 675 | " )\n", 676 | "stars.show()" 677 | ] 678 | }, 679 | { 680 | "cell_type": "markdown", 681 | "metadata": {}, 682 | "source": [ 683 | "We can also take a random sample of our distributed dataset and send that sample to Pandas in order to plot it and get a sense of what it looks like. For instance, we can produce a scatter plot of a limited subset of our data using this strategy." 
684 | ] 685 | }, 686 | { 687 | "cell_type": "code", 688 | "execution_count": 11, 689 | "metadata": {}, 690 | "outputs": [ 691 | { 692 | "data": { 693 | "application/vnd.jupyter.widget-view+json": { 694 | "model_id": "e9529ac4455342ec8e20a23602268f6a", 695 | "version_major": 2, 696 | "version_minor": 0 697 | }, 698 | "text/plain": [ 699 | "VBox()" 700 | ] 701 | }, 702 | "metadata": {}, 703 | "output_type": "display_data" 704 | }, 705 | { 706 | "data": { 707 | "application/vnd.jupyter.widget-view+json": { 708 | "model_id": "", 709 | "version_major": 2, 710 | "version_minor": 0 711 | }, 712 | "text/plain": [ 713 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 714 | ] 715 | }, 716 | "metadata": {}, 717 | "output_type": "display_data" 718 | }, 719 | { 720 | "data": { 721 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAYAAAA10dzkAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAAPYQAAD2EBqD+naQAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAgAElEQVR4nOzdeXxU9b3/8fckJIEEZkLIRiSEIEFAFiMKRpQqpCxSBKFWU6rIpXKlAR4IWEtVRLtgbW1VinLttSD3SnC5uEAVi8giEMIiUQSkQQIJQshmEpJIEpLz+4Nfpg5kheTMTM7r+XjM48Gcc2bO52Qg8+a7HZthGIYAAABgGT7uLgAAAADmIgACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMUQAAEAACyGAAgAAGAxBEAAAACLIQACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMUQAAEAACyGAAgAAGAxBEAAAACLIQACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMUQAAEAACyGAAgAAGAxBEAAAACLIQACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMUQAAEAACyGAAgAAGAxBEAAAACLIQACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMUQAAEAACyGAAgAAGAxBEAAAACLIQACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMUQAAEAACyGAAgAAGAxBEAAAACLIQACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMUQAAEAACyGAAgAAGAx7dxdgDerqanRqVOn1KlTJ9lsNneXAwAAmsAwDJ09e1ZRUVHy8bFmWxgB8AqcOnVK0dHR7i4DAABchuzsbHXr1s3dZbgFAfAKdOrUSdKFv0B2u93N1QAAgKYoKSlRdHS083vcigiAV6C229dutxMAAQDwMlYevmXNjm8AAAALIwACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMUQAAEAACyGAAgAAGAxBEAAAACL4U4gAADA4xzLK9WJwnL16BKk2NAgd5fT5hAAAQCAxygqr9SclHRty8hzbhseF6alSfFyBPq5sbK2hS5gAADgMeakpGvH0XyXbTuO5mt2yn43VdQ2EQABAIBHOJZXqm0Zeao2DJft1YahbRl5yswvc1NlbQ8BEAAAeIQTheUN7j9eQABsKQRAAADgEWJCAhvc36MLk0FaCgEQAAB4hJ5hHTU8Lky+NpvLdl+bTcPjwpgN3IIIgAAAwGMsTYrXsF6hLtuG9QrV0qR4N1XUNrEMDAAA8BiOQD+tmj5EmfllOl5QxjqArYQACAAAPE5sKMGvNdEFDAAAYDEEQAAAAIshAAIAAFgMARAAAMBiCIAAAAAWQwAEAACwGAIgAACAxRAAAQAALIYACAAAYDEEQAAAAIshAAIAAFgMARAAAMBiCIAAAAAW45UB8OWXX9bAgQNlt9tlt9uVkJCgDz/80Ln/3LlzSk5OVpcuXdSxY0dNnjxZZ86ccXmPrKwsjRs3ToGBgQoPD9cjjzyi8+fPm30pAAAApvPKANitWzc988wz2rdvn/bu3asRI0ZowoQJOnjwoCTp4Ycf1rp16/TWW29p69atOnXqlCZNmuR8fXV1tcaNG6fKykrt3LlTr732mlauXKlFixa565IAAABMYzMMw3B3ES0hJCREf/zjH/XjH/9YYWFhWr16tX784x9Lkr766iv17dtXqampuummm/Thhx/qRz/6kU6dOqWIiAhJ0vLly/Xoo48qLy9P/v7+TTpnSUmJHA6HiouLZbfbW+3aAABAy+H720tbAL+vurpaa9asUVlZmRISErRv3z5VVVUpMTHReUyfPn3UvXt3paamSpJSU1M1YMAAZ/iTpNGjR6ukpMTZiliXiooKlZSUuDwAAAC8jdcGwAMHDqhjx44KCAjQQw89pHfeeUf9+vVTTk6O/P39FRwc7HJ8RESEcnJyJEk5OTku4a92f+2++ixZskQOh8P5iI6ObuGrAgAAaH1eGwCvueYapaenKy0tTTNnztTUqVN16NChVj3nwoULVVxc7HxkZ2e36vkAAABaQzt3F3C5/P391atXL0nS4MGDtWfPHr3wwgu65557VFlZqaKiIpdWwDNnzigyMlKSFBkZqd27d7u8X+0s4dpj6hIQEKCAgICWvhQAAABTeW0L4MVqampUUVGhwYMHy8/PT5s2bXLuO3LkiLKyspSQkCBJSkhI0IEDB5Sbm+s8ZuPGjbLb7erXr5/ptQMAAJjJK1sAFy5cqLFjx6p79+46e/asVq9erS1btuijjz6Sw+HQ9OnTNW/ePIWEhMhut2v27NlKSEjQTTfdJEkaNWqU+vXrp/vuu0/PPvuscnJy9Pjjjys5OZkWPgAA0OZ5ZQDMzc3V/fffr9OnT8vhcGjgwIH66KOP9MMf/lCS9Je//EU+Pj6aPHmyKioqNHr0aL300kvO1/v6+mr9+vWaOXOmEhISFBQUpKlTp+rpp5921yUBAACYps2sA+gOrCMEAID34fu7DY0BBAAAQNMQAAEAACyGAAgAAGAxBEAAAACLIQACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMUQAAEAACyGAAgAAGAxBEAAAA
CLIQACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMUQAAEAACyGAAgAAGAxBEAAAACLIQACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMUQAAEAACyGAAgAAGAxBEAAAACLIQACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMUQAAEAACyGAAgAAGAxBEAAAACLIQACAABYDAEQAADAYgiAAAAAFkMABAAAsBgCIAAAgMV4ZQBcsmSJbrzxRnXq1Enh4eGaOHGijhw54nLMbbfdJpvN5vJ46KGHXI7JysrSuHHjFBgYqPDwcD3yyCM6f/68mZcCAABgunbuLuBybN26VcnJybrxxht1/vx5/frXv9aoUaN06NAhBQUFOY978MEH9fTTTzufBwYGOv9cXV2tcePGKTIyUjt37tTp06d1//33y8/PT7///e9NvR4AAAAz2QzDMNxdxJXKy8tTeHi4tm7dquHDh0u60AJ43XXX6fnnn6/zNR9++KF+9KMf6dSpU4qIiJAkLV++XI8++qjy8vLk7+/f6HlLSkrkcDhUXFwsu93echcEAABaDd/fXtoFfLHi4mJJUkhIiMv2119/XaGhoerfv78WLlyo8vJy577U1FQNGDDAGf4kafTo0SopKdHBgwfNKRwAAMANvLIL+Ptqamo0d+5cDRs2TP3793du/+lPf6qYmBhFRUXpiy++0KOPPqojR45o7dq1kqScnByX8CfJ+TwnJ6fOc1VUVKiiosL5vKSkpKUvBwAAoNV5fQBMTk7Wl19+qe3bt7tsnzFjhvPPAwYMUNeuXTVy5Eh9/fXXuvrqqy/rXEuWLNFTTz11RfUCAAC4m1d3Ac+aNUvr16/X5s2b1a1btwaPHTp0qCTp6NGjkqTIyEidOXPG5Zja55GRkXW+x8KFC1VcXOx8ZGdnX+klAAAAmM4rA6BhGJo1a5beeecdffLJJ4qNjW30Nenp6ZKkrl27SpISEhJ04MAB5ebmOo/ZuHGj7Ha7+vXrV+d7BAQEyG63uzwAAAC8jVd2AScnJ2v16tV677331KlTJ+eYPYfDoQ4dOujrr7/W6tWrdccdd6hLly764osv9PDDD2v48OEaOHCgJGnUqFHq16+f7rvvPj377LPKycnR448/ruTkZAUEBLjz8gAAAFqVVy4DY7PZ6ty+YsUKPfDAA8rOztbPfvYzffnllyorK1N0dLTuuusuPf744y6tdidOnNDMmTO1ZcsWBQUFaerUqXrmmWfUrl3TcjHTyAEA8D58f3tpAPQU/AUCAMD78P3tpWMAAQAAcPkIgAAAABZDAAQAALAYr5wFDAAA3O9YXqlOFJarR5cgxYYGubscNAMBEAAANEtReaXmpKRrW0aec9vwuDAtTYqXI9DPjZWhqegCBgAAzTInJV07jua7bNtxNF+zU/a7qSI0FwEQAAA02bG8Um3LyFP1RavIVRuGtmXkKTO/zE2VoTkIgAAAoMlOFJY3uP94AQHQGxAAAQBAk8WEBDa4v0cXJoN4AwIgAABosp5hHTU8Lky+F92W1ddm0/C4MGYDewkCIAAAaJalSfEa1ivUZduwXqFamhTvporQXCwDAwAAmsUR6KdV04coM79MxwvKWAfQCxEAAQDAZYkNJfh5K7qAAQAALIYACAAAYDEEQAAAAIshAAIAAFgMARAAAMBimAUMNNOxvFKdKCxn2QMAgNciAAJNVFReqTkp6dqWkefcNjwuTEuT4uUI9HNjZQAANA9dwEATzUlJ146j+S7bdhzN1+yU/W6qCACAy0MABJrgWF6ptmXkqdowXLZXG4a2ZeQpM7/MTZUBANB8BECgCU4Ulje4/3gBARAA4D0IgEATxIQENri/RxcmgwAAvAcBEGiCnmEdNTwuTL42m8t2X5tNw+PCmA0MAPAqBECgiZYmxWtYr1CXbcN6hWppUrybKgIA4PKwDAzQRI5AP62aPkSZ+WU6XlDGOoAAAK9FAASaKTaU4AcA8G50AQMAAFgMARAAAMBiCIAAAAAWQwAEAACwGAIgAACAxRAAAQAALIYACAAAYDEEQAAAAIshAAIAAFgMARAAAMBiCIAAAAAWQwAEAACwGAIgAACAxXhlAFyyZIluvPFGderUSeHh4Zo4caKOHDnicsy5c+eUnJysLl26qGPHjpo8ebLOnDnjckxWVpbGjRunwMBAhYeH65FHHtH58+fNvBQAAADTeWUA3Lp1q5KTk7Vr1y5t3LhRVVVVGjVqlMrKypzHPPzww1q3bp3eeustbd26VadOndKkSZOc+6urqzVu3DhVVlZq586deu2117Ry5UotWrTIHZcEAABgGpthGIa7i7hSeXl5Cg8P19atWzV8+HAVFxcrLCxMq1ev1o9//GNJ0ldffaW+ffsqNTVVN910kz788EP96Ec/0qlTpxQRESFJWr58uR599FHl5eXJ39+/0fOWlJTI4XCouLhYdru9Va8RAAC0DL6/vbQF8GLFxcWSpJCQEEnSvn37VFVVpcTEROcxffr0Uffu3ZWamipJSk1N1YABA5zhT5JGjx6tkpISHTx40MTqAQAAzNXO3QVcqZqaGs2dO1fDhg1T//79JUk5OTny9/dXcHCwy7ERERHKyclxHvP98Fe7v3ZfXSoqKlRRUeF8XlJS0mLXAQAAYBavbwFMTk7Wl19+qTVr1rT6uZYsWSKHw+F8REdHt/o5AQAAWppXB8BZs2Zp/fr12rx5s7p16+bcHhkZqcrKShUVFbkcf+bMGUVGRjqPuXhWcO3z2mMutnDhQhUXFzsf2dnZLXk5AAAApvDKAGgYhmbNmqV33nlHn3zyiWJjY132Dx48WH5+ftq0aZNz25EjR5SVlaWEhARJUkJCgg4cOKDc3FznMRs3bpTdble/fv3qPG9AQIDsdrvLAwAAwNt45RjA5ORkrV69Wu+99546derkHLPncDjUoUMHORwOTZ8+XfPmzVNISIjsdrtmz56thIQE3XTTTZKkUaNGqV+/frrvvvv07LPPKicnR48//riSk5MVEBDgzssDAABoVW5dBqa6uloHDhxQTEyMOnfu3OTX2Wy2OrevWLFCDzzwgKQLC0HPnz9fKSkpqqio0OjRo/XSSy+5dO+eOHFCM2fO1JYtWxQUFKSpU6fqmWeeUbt2TcvFTCMHAMD78P1tcgCcO3euBgwYoOnTp6u6ulo/+MEPtHPnTgUGBmr9+vW67bbbzCqlRfAXCAAA78P3t8ljAN9++20NGjRIkrRu3TplZmbqq6++0sMPP6zHHnvMzFIAAAAsy9QAmJ+f7+yC/eCDD3T33Xerd+/e+o//+A8dOHDAzFIAAAAsy9QAGBERoUOHDqm6ulobNmzQD3/4Q0lSeXm5fH19zSwFAADAskydBTxt2jT95Cc/UdeuXWWz2Zy3aktLS1OfPn3MLAUAAMCyTA2AixcvVv/+/ZWdna27777budyKr6+vfvWrX5lZCgAAgGW5bRmYc+fOqX379u44dYthFhEAAN6H72+TxwBWV1frN7/5ja666ip17NhRx44dkyQ98cQTevXVV
80sBQAAwLJMDYC/+93vtHLlSj377LPy9/d3bu/fv7/++7//28xSAAAALMvUALhq1Sq98sormjJlisus30GDBumrr74ysxQAAADLMjUAfvPNN+rVq9cl22tqalRVVWVmKQAAAJZlagDs16+fPv3000u2v/3224qPjzezFAAAAMsydRmYRYsWaerUqfrmm29UU1OjtWvX6siRI1q1apXWr19vZikAAACWZWoL4IQJE7Ru3Tp9/PHHCgoK0qJFi3T48GGtW7fOeVcQAAAAtC63rQPYFrCOEAAA3ofvb5NbAHv27KmCgoJLthcVFalnz55mlgIAAGBZpgbA48ePq7q6+pLtFRUV+uabb8wsBQAAwLJMmQTy/vvvO//80UcfyeFwOJ9XV1dr06ZN6tGjhxmlAAAAWJ4pAXDixImSJJvNpqlTp7rs8/PzU48ePfTcc8+ZUQoAAIDlmRIAa2pqJEmxsbHas2ePQkNDzTgtAAAA6mDqOoCZmZlmng4AAAB1MHUSiCRt3bpV48ePV69evdSrVy/deeeddd4dBAAAAK3D1AD4v//7v0pMTFRgYKDmzJmjOXPmqEOHDho5cqRWr15tZikAAACWZepC0H379tWMGTP08MMPu2z/85//rL/97W86fPiwWaW0CBaSBADA+/D9bXIL4LFjxzR+/PhLtt95552MDwQAADCJqQEwOjpamzZtumT7xx9/rOjoaDNLAQAAsCxTZwHPnz9fc+bMUXp6um6++WZJ0o4dO7Ry5Uq98MILZpYCAABgWaYGwJkzZyoyMlLPPfec3nzzTUkXxgW+8cYbmjBhgpmlAAAAWJapk0DaGgaRAgDgffj+NnkM4M9//nNt2bLFzFMCAADgIqYGwLy8PI0ZM0bR0dF65JFHlJ6ebubpAQAAIJMD4HvvvafTp0/riSee0J49ezR48GBde+21+v3vf6/jx4+bWQoAAIBluXUM4MmTJ5WSkqK///3vysjI0Pnz591VymVhDAEAAN6H72+TZwF/X1VVlfbu3au0tDQdP35cERER7ioFAFrMsbxSnSgsV48uQYoNDXJ3OQBQJ9MD4ObNm7V69Wr93//9n2pqajRp0iStX79eI0aMMLsUAGgxReWVmpOSrm0Zec5tw+PCtDQpXo5APzdWBgCXMjUAXnXVVSosLNSYMWP0yiuvaPz48QoICDCzBABoFXNS0rXjaL7Lth1H8zU7Zb9WTR/ipqoAoG6mBsDFixfr7rvvVnBwcIPHnTx5UlFRUfLxMXWOCgBclmN5pS4tf7WqDUPbMvKUmV9GdzAAj2JqwnrwwQcbDX+S1K9fP2YFA/AaJwrLG9x/vKDMpEoAoGncNgmkIdycBFbBhIG2ISYksMH9Pbrw2QLwLB4ZAIG2jgkDbUvPsI4aHhemHUfzVf29/8D62mwa1iuUcA/A4zDIDnCDhiYMwDstTYrXsF6hLtuG9QrV0qR4N1UEAPWjBRAwGRMG2iZHoJ9WTR+izPwyHS8oo1sfgEfzyBZAm83W4P5t27Zp/PjxioqKks1m07vvvuuy/4EHHpDNZnN5jBkzxuWYwsJCTZkyRXa7XcHBwZo+fbpKS0tb/FqAizFhoG2LDQ3S7deEE/4AeDSPDICNTQIpKyvToEGDtGzZsnqPGTNmjE6fPu18pKSkuOyfMmWKDh48qI0bN2r9+vXatm2bZsyY0SL1Aw1hwgAAwN08sgv40KFDioqKqnf/2LFjNXbs2AbfIyAgQJGRkXXuO3z4sDZs2KA9e/bohhtukCQtXbpUd9xxh/70pz81eG7gSjFhAADgbq0eACdNmtTkY9euXStJio6OvuLzbtmyReHh4ercubNGjBih3/72t+rSpYskKTU1VcHBwc7wJ0mJiYny8fFRWlqa7rrrris+P9CQpUnxmp2y32UsIBMGAABmafUA6HA4WvsUlxgzZowmTZqk2NhYff311/r1r3+tsWPHKjU1Vb6+vsrJyVF4eLjLa9q1a6eQkBDl5OTU+74VFRWqqKhwPi8pKWm1a0DbxoQBuBtrUALW1uoBcMWKFa19ikvce++9zj8PGDBAAwcO1NVXX60tW7Zo5MiRl/2+S5Ys0VNPPdUSJQKSLkwY4MsXZmINSgCSh04CaWk9e/ZUaGiojh49KkmKjIxUbm6uyzHnz59XYWFhveMGJWnhwoUqLi52PrKzs1u1bgBoaaxBCUBywySQt99+W2+++aaysrJUWVnpsu+zzz5rlXOePHlSBQUF6tq1qyQpISFBRUVF2rdvnwYPHixJ+uSTT1RTU6OhQ4fW+z4BAQEKCAholRoBoLWxBiWAWqa2AL744ouaNm2aIiIitH//fg0ZMkRdunTRsWPHGp3V+32lpaVKT09Xenq6JCkzM1Pp6enKyspSaWmpHnnkEe3atUvHjx/Xpk2bNGHCBPXq1UujR4+WJPXt21djxozRgw8+qN27d2vHjh2aNWuW7r33XmYAA2izWIMSQC1TA+BLL72kV155RUuXLpW/v79++ctfauPGjZozZ46Ki4ub/D579+5VfHy84uMvzJicN2+e4uPjtWjRIvn6+uqLL77QnXfeqd69e2v69OkaPHiwPv30U5fWu9dff119+vTRyJEjdccdd+iWW27RK6+80uLXDACegjUoAdSyGY2tutyCAgMDdfjwYcXExCg8PFwbN27UoEGDlJGRoZtuukkFBQVmldIiSkpK5HA4VFxcLLvd7u5yAKBR97+6u941KFdNH+LGygDz8P1tcgtgZGSkCgsLJUndu3fXrl27JF3owjUxhwKAZS1NitewXqEu21iDErAeUyeBjBgxQu+//77i4+M1bdo0Pfzww3r77be1d+/eZi0YDQC4PKxBCUAyuQu4pqZGNTU1atfuQu5cs2aNdu7cqbi4OP3nf/6n/P39zSqlRdCEDACA9+H72+QAmJWVpejoaNlsNpfthmEoOztb3bt3N6uUFsFfIAAAvA/f3yaPAYyNjVVe3qVrUBUWFio2NtbMUgAAACzL1ABoGMYlrX/ShXX92rdvb2YpAAAAlmXKJJB58+ZJkmw2m5544gkFBv57Larq6mqlpaXpuuuuM6MUAAAAyzMlAO7ff+Eek4Zh6MCBAy6TPfz9/TVo0CAtWLDAjFIAAAAsz5QAuHnzZknStGnT9MILL1h2wCUAAIAnMHUdwBUrVjj/fPLkSUlSt27dzCwBAADA8kydBFJTU6Onn35aDodDMTExiomJUXBwsH7zm9+opqbGzFIAAAAsy9QWwMcee0yvvvqqnnnmGQ0bNkyStH37di1evFjnzp3T7373OzPLAQAAsCRTF4KOiorS8uXLdeedd7psf++99/SLX/xC33zzjVmltAgWkgQAwPvw/W1yC2BhYaH69OlzyfY+ffqosLDQzFJwhY7llepEYblb7iPqznMDANAWmBoABw0apL/+9a968cUXXbb/9a9/1aBBg8wsBZepqLxSc1LStS3j33d0GR4XpqVJ8XIE+rXZcwMA0JaY2gW8detWjRs3Tt27d1dCQoIkKTU1VdnZ2frggw906623mlVKi7BiE/L9r+7WjqP5
qv7eXxtfm03DeoVq1fQhbfbcAIC2w4rf3xcz/V7A//rXv3TXXXepqKhIRUVFmjRpko4cOaKYmBgzS8FlOJZXqm0ZeS4BTJKqDUPbMvKUmV/WJs8NAEBbY2oXcGxsrE6fPn3JbN+CggJFR0erurrazHLQTCcKyxvcf7ygrNXG5Lnz3AAAtDWmtgDW19tcWlqq9u3bm1kKLkNMSGCD+3t0ab0A5s5zAwDQ1pjSAjhv3jxJks1m06JFixQY+O8v8+rqaqWlpem6664zoxRcgZ5hHTU8LqzecXit2QLnznMDANDWmBIA9+/fL+lCC+CBAwfk7+/v3Ofv769BgwZpwYIFZpSCK7Q0KV6zU/a7zMQd1itUS5Pi2/S5AQBoS0ydBTxt2jS98MILbWbGjZVnEWXml+l4QZlb1uJz57nhPVgvEkB9rPz9XcvUANjW8BcI8DysFwmgMXx/mzwJBABa25yUdO04mu+ybcfRfM1O2e+migDA8xAAATToWF6pNh/J9Yq1FlkvEgCaxtR1AAF4D2/sSmW9SABoGloAAdTJG7tSWS8SAJqGAAiYwJu6USXv7UqtXS/S12arc/+T7x1UcXmVyVUBgOchAAKtqKi8Uve/ulsjntuqaSv26PY/bdH9r+72+BDSlK5UT7U0KV7DeoXWuc/TWzABwCwEQKAVeWM3quTdXamOQD8tvrNfnfs8vQUTAMxCAARaibd2o0r1d6X62mwaHhfm8RMpvLkFEwDMQAAEWom3h5C6ulK95dZ73tyCCQBmYBkYoJV4ewhxBPpp1fQhXnnrvdoWzB1H811aYH1tNg3rFeo11wEArYUWQKCVeHs3aq3Y0CDdfk2419Rby5tbMAGgtXEv4CvAvQTRmOLyKs1O2e9Viym3Nd7YggmgdfH9TQC8IvwFQlMRQgDAc/D9zRhAwBSxoQQ/AIDnYAwgAACAxRAAAQAALIYACAAAYDEEQAAAAIshAAIAAFiMVwbAbdu2afz48YqKipLNZtO7777rst8wDC1atEhdu3ZVhw4dlJiYqIyMDJdjCgsLNWXKFNntdgUHB2v69OkqLS018zIAXORYXqk2H8n16PskA0Bb4JUBsKysTIMGDdKyZcvq3P/ss8/qxRdf1PLly5WWlqagoCCNHj1a586dcx4zZcoUHTx4UBs3btT69eu1bds2zZgxw6xLANqkyw1wReWVuv/V3Rrx3FZNW7FHt/9pi+5/dbeKy6taqVIAsDavXwjaZrPpnXfe0cSJEyVdaP2LiorS/PnztWDBAklScXGxIiIitHLlSt177706fPiw+vXrpz179uiGG6M2xpMAACAASURBVG6QJG3YsEF33HGHTp48qaioqCadm4UkgQuKyis1JyX9su94cv+ru+u9b++q6UNapWYA1sX3t5e2ADYkMzNTOTk5SkxMdG5zOBwaOnSoUlNTJUmpqakKDg52hj9JSkxMlI+Pj9LS0up974qKCpWUlLg8AEhzUtK142i+y7YdR/M1O2V/o689lleqbRl5LuFPkqoNQ9sy8ugOBoBW0OYCYE5OjiQpIiLCZXtERIRzX05OjsLDw132t2vXTiEhIc5j6rJkyRI5HA7nIzo6uoWrB7xPYwFuze6sBkPcicLyBt//eAEBEABaWpsLgK1p4cKFKi4udj6ys7PdXRLgdo0FuF+tPdDgmL6YkMAGX9+jC7fQA4CW1uYCYGRkpCTpzJkzLtvPnDnj3BcZGanc3FyX/efPn1dhYaHzmLoEBATIbre7PACrayzA1aqvS7hnWEcNjwuTr83mst3XZtPwuDDuoQwAraDNBcDY2FhFRkZq06ZNzm0lJSVKS0tTQkKCJCkhIUFFRUXat2+f85hPPvlENTU1Gjp0qOk1A96svgB3sYbG9C1NitewXqEu24b1CtXSpPgWrRUAcEE7dxdwOUpLS3X06FHn88zMTKWnpyskJETdu3fX3Llz9dvf/lZxcXGKjY3VE088oaioKOdM4b59+2rMmDF68MEHtXz5clVVVWnWrFm69957mzwDGMC/LU2K1+yU/S6zgOtzvKDsklY9R6CfVk0fosz8Mh0vKFOPLkG0/AFAK/LKZWC2bNmi22+//ZLtU6dO1cqVK2UYhp588km98sorKioq0i233KKXXnpJvXv3dh5bWFioWbNmad26dfLx8dHkyZP14osvqmPHjk2ug2nkgKvM/DLtOlaghWsP1HvM5gW3Ee4AuBXf314aAD0Ff4GAurGuHwBPxvd3GxwDCMD9GNMHAJ7NK8cAAvBsjOkDAM9GAATQamJDCX4A4InoAgYAALAYAiAAAIDFEAABAAAshgAIAABgMQRAAAAAi2EWMNBGHcsr1YnCcpZgAQBcggAItDFF5ZWak5Lucl/e4XFhWpoUL0egnxsrAwB4CrqAgTZmTkq6dhzNd9m242i+Zqfsd1NFAABPQwAE2pBjeaXalpHncg9eSao2DG3LyFNmfpmbKgMAeBICINCGnCgsb3D/8QICIACAAAi0KTEhgQ3u79GFySAAAAIg0Kb0DOuo4XFh8rXZXLb72mwaHhfGbGAAgCQCINDmLE2K17BeoS7bhvUK1dKkeDdVBADwNCwDA7QxjkA/rZo+RJn5ZTpeUMY6gACASxAALYgFgq0hNpTPFwBQNwKghbBAcOsjXAMAvAEB0EIaWiB41fQhbqqqbWipcE2ABACYgQBoEbULBF/s+wsEe0Lg8NYAdKXhmtZZAICZCIAW0ZQFgt0ZuLw5ALVEuKZ1FgBgJpaBsQhPXyDYm+9fe6V33+D2bQAAsxEALcKTFwj29gB0OeH6WF6pNh/JVWZ+GbdvAwCYji5gC1maFK/ZKftduis9YYFgT++ebkxtuN5xNN8lxPrabBrWK9Sl9rq6um+I6dzg+7u7dRYA0PYQAC3EUxcI9vTu6aZoariuq6t7f1aROgf6qeS7840GSAAAWgIB0II8bYHg5rSgeaqmhOuGJot8W16lG3t01p7j3zq3e0LrLACgbSIAwiN4avd0czUUrhvr6v7F7b3Uo0uQR7XOAgDaJgIgPIKndk+3pKZ0dXta6ywAoG0iAMKjtOUA1Ba6ugEAbQPLwAAmWpoUr2G9Ql22eWNXNwDAu9ECCMsz8/ZzVujqBgB4PgIgWoU33NPXnbefa8td3QAAz0cARIvypnv6cv9dAIBVMQYQLcpb7unbUref23okVy9s+pc+rWN9PwAAPBUtgGgxDS10XBuqPKXb80pvP/fpv3I18/XPVFpR7dzWOdBP7yffouguDS/3AgCAu9ECiBbTlFBlpmN5pdp8JLfO1rzLvf1cUXml7n91t+77+x6X8CdJ35ZX6c5l2y+/YAAATEILIFqMp9zTtynjEC93Tb45KekNdvd+W16lTzPydGtcWAtdDQAALY8WQLSY2lDla7O5bPe12TQ8Lsy07t+mjkNs7pp8tV3cRp17/+2zrG8bOQIAAPeiBRAtyt339G3OOMTmrsnXWBd3reu7d7684gEAMAkBsA3xhLX33L3Q8eVM7jCMxtr0Lmisi1u6MBGE7l8AgKdrswFw8eLFeuqpp1y2XXP
NNfrqq68kSefOndP8+fO1Zs0aVVRUaPTo0XrppZcUERHhjnKviCeuveeuhY6bMg6xNiiHBPrruX/+q8k/t/rGDdaqnQUMAICnsxlNbf7wMosXL9bbb7+tjz/+2LmtXbt2Cg29MOZr5syZ+sc//qGVK1fK4XBo1qxZ8vHx0Y4dO5p8jpKSEjkcDhUXF8tut7f4NTTV/a/urncyQ1tc0Lixls66fh4+kgZFB6tTe786u4hrNfZzKy6vuqSLOyakgxaO7asxA7pe/kUBAEzjKd/f7tRmWwClC4EvMjLyku3FxcV69dVXtXr1ao0YMUKStGLFCvXt21e7du3STTfdZHapl82b1t67Uk1t6axrHGKNpP3ZRY2eo7Gfm7u7uAEAaAltehZwRkaGoqKi1LNnT02ZMkVZWVmSpH379qmqqkqJiYnOY/v06aPu3bsrNTW13verqKhQSUmJy8PdPG3tvdY0JyVd24+6ht26ZvfWhrQbe3SWj+uE5CZr7OcWGxqk268JJ/wBALxSmw2AQ4cO1cqVK7Vhwwa9/PLLyszM1K233qqzZ88qJydH/v7+Cg4OdnlNRESEcnJy6n3PJUuWyOFwOB/R0dGtfRmN8pS191rb59nfaltGnmouGrBQ363bjuWVas/xby85vqnays8NAIC6tNku4LFjxzr/PHDgQA0dOlQxMTF688031aFDh8t6z4ULF2revHnO5yUlJW4PgZe7oLHUvFnD7p5h/Ng7Xza4/+LZvU1dsuViTfm5AQDg7dpsALxYcHCwevfuraNHj+qHP/yhKisrVVRU5NIKeObMmTrHDNYKCAhQQECAGeU2S1PX3vv37Fc/PffPjCbNfvWEGcbH8kr15amGu9svbrFrypItdTFzzUIAANzFMgGwtLRUX3/9te677z4NHjxYfn5+2rRpkyZPnixJOnLkiLKyspSQkODmSpuvsYkJdYW4i9WOpbt49mtDd9Uwa4ZxY615/a+yX9Ji1zOsozoH+unb8qoGXzs8LkwLRvVWQXklEzoAAJbRZgPgggULNH78eMXExOjUqVN68skn5evrq6SkJDkcDk2fPl3z5s1TSEiI7Ha7Zs+erYSEBK+aAXyx+tbeqyvEXayu2a+eMsO4sda8Wbf3kuTaTW0YRoPhb8mkAbqpZxcCHwDAktpsADx58qSSkpJUUFCgsLAw3XLLLdq1a5fCwi7cpeEvf/mLfHx8NHnyZJeFoNua+kJcfb4/lu5y7qpxpeoaa9jYAswP/e9nl7T29b+q4XWdIh3tCX8AAMtqswFwzZo1De5v3769li1bpmXLlplUkXs0dzLE98fSmTnDuLGxhnWNc/y+i1v7DjVzzCAAAFbSZpeBwQVNnQzha7NpeFyYS6tYbcvbxX9J6jr2SjU01lD69zjHVf/RtHGHtcu/mFE7AADeps22AFpBU5Zmaaz7tFZds1+Lyit1vqZGNRcdOyQ25Ipnyl48Xq+xsYaGYejg6RK9+HFGs87TL8ruMoOYWb4AABAAvVJzl2apq/t0eFyYFozurYKy+me//uzVNB38xrUr1ccm+fn6OM/T3PUB66q9sfF6s1d/1ugyMPVZ+tPrJYnbtgEA8D02w2igWQgNctfNpO9/dXe9Cz83tDRLU+9fe6KgTOOXblfJufP1HvPylHil7D55SaicPypOheVV9Z6jrtp9bGrwjh0+0iWtkI1pys8DAGBN7vr+9iS0AHqZK1mapb5lYi42cdmOBsOfJM18ff8l27Zl5DXYKllf7Y3drq254U+iqxcAgIYwCcTLNGVpliux9Uhuo4snN9X3J3Fc6Pq9NDS2tAWjemvzgtu0avoQ0+5UAgCAtyEAepnGZvW+9MlRFV9BgEs/WXTZr73Y91sl56SkN7o0y5XqHOinWSPiGOcHAEAjCIBepHbCxY09OsvXZqvzmM+yipytbpfjum7BjR/UTLuOFWhbRt5ldeU2lb19O72ffEsrngEAgLaDMYBeoK6Zs/b27eocp3clt2krKq/Uq9uPX2m5l6g7qrbce/ePsmvdnFtb8SwAALQtBEAP9f3lVZ587+AliySfbWSSRkO3aatv6ZaG7hnczsem6hpDzZkyXjsTd0hsSDNe1Ty3/v+JJgAAoOkIgB6mrta+ujQWxOq61Vmda/BF2fX7uwaoY/t2DZ5z7cyb9et3DjS4Hl9MSAedKPzO+XxYr1DNH9VbJwrL622xvBzPTBqgCEd71vUDAOAyEQA9TEOtcE1R2+pWVzCq672/PFWiO5ftUP+ohtdBKiiv1ItJ8Rrx3NZ6j6kNf/2vsuuBhB5amXpcE5btaP5FNGJozy4EPwAArgCTQDxI7Tp5Dd2yrTH2Du30u4n9m/3eBxuZodujS5DztnL1TUCp9eU3JVrw9hf68puWnfXLfXwBAGgZBEAP0tgaf03xbXmV5qy5dBZwY+9dGwt9Lsp2taHLMAxtPpKrBaN7a1iv0Cuusy4Xx8p2FxXD4s4AALQMuoA9SGNr/DXV/uwi3f3yTv331BvlCPRTUXmlXtp8tEmvDfT3VWlFtfN5ez+bisorXbp+h8eF6f3kYVqxM1Pv7D/VIjVL0pJJA3RV5w76LOtbXd+9s26NC2vy7esAAEDTEQA9SM+wjuoc6FfnnTg6B/qpW3B7HTh1tknvtffEt5qdsl+rpg/RnJR0fXaiaQs8l1dWuzwvq6zRF98Uu2zbcTRfZ89VtUiL5ffVju27NS7Mua2pt68DAABNRwD0IMfySuu9Ddu35VU6X930sYGG/v+9ef+V1+iMYunCWIAaNX5fXunCWoP7s1vujiGSdPPVTOwAAMAsjAH0II21qJ2taP4yKvuzv23Scf0amQXcmobHhenlKYPddn4AAKyGFkAP0lJjAL8vPrpzg/uXTBqgm3p2kWEYDS7x0hpiQjpoadL1Ghjd8refAwAA9aMF0IPUt8yKr82mAVc1r4WudvZut84dGjyu9kw9wzoqPtrRrHNcCXv7dtr6yxGEPwAA3IAWQA+zNCles1P2u4zbq13+5LY/ba5zjGDHAF9Jcpm9W7se4NH80gbP96u1BySp3sknraFzoJ/eT77FlHMBAIBL2QzjClYdtriSkhI5HA4VFxfLbm/ZMXR1LX+SXVCuO5dtdwlqnQP9FBfeUftOFLks8uxjk66NsuunQ7tr4dovW7S2yxUT0kELx/bVmAFd3V0KAMDCWvP721sQAK+AWX+BjuWV6kRhuTMMvrknWzuP5WvY1aEaHNPZ9LF7zdUxwFev/OwG3RzXOgtIAwDQHARAuoA92ufZ3+qxd77Ul9+7Tdv3u2rf3X9K/Zs5NtAsfSM6aszArs4FnQEAgOcgAHqgovJKPbhqr/Ycv3QJl4vH6R1q5B6+7jA8LkxLk+LlCPRzdykAAKAOBEAPU1Reqdv/tKXJEzKasnBzawvvFKDk269W9y5B3LINAAAvQAD0MA/8fbdps3Fbyhv/mUDoAwDAixAAPcixvFKlnyxu/EAP4WOTbukVRvgDAMDLsBC0B0nLLHR3Cc1yS68LY/0AAIB3oQXQg+SdPefuEprs/eRh3MUDAAAvRQ
D0IGGd2ru7hEb5+kjrkm9Rv6vMu20cAABoWQRADzI0NsTdJdTL39emeT/srYdu6+XuUgAAwBUiAHqQnmEd1cHPR99V1bi7FKeYLoFaOLaPxvTn9m0AALQVBEAPciyv1CPCX6Cfj5b9bDBr+gEA0EYRAD3IicJyd5egzoF+ej/5FkV3CXR3KQAAoJUQAD1ITIh7Q9evx/bRjB9c7dYaAABA6yMAWlw7H+k/hvXQr8dd6+5SAACASQiAHuR/Uk+Ydi4fSU9NuFb3JfQw7ZwAAMAzEAA9yK5j+aac56HhsfrVHf1MORcAAPA8BEAPEuDn26rvf/s1oVoxbWirngMAAHg+AqAHCWqlANg9OEDbfpXYKu8NAAC8j4+7C3C3ZcuWqUePHmrfvr2GDh2q3bt3u62Wr/PKWvT9unby0/FnxhH+AACAC0sHwDfeeEPz5s3Tk08+qc8++0yDBg3S6NGjlZub65Z6cs5WtMj7dAyQjj8zTqmPjWqR9wMAAG2LpQPgn//8Zz344IOaNm2a+vXrp+XLlyswMFB///vf3V3aZTv+zDh9+dQ4d5cBAAA8mGXHAFZWVmrfvn1auHChc5uPj48SExOVmppa52sqKipUUfHvVrqSkpJWr7Opjj9D6AMAAE1j2QCYn5+v6upqRUREuGyPiIjQV199VedrlixZoqeeesqM8pqM4AcAAJrL0l3AzbVw4UIVFxc7H9nZ2W6r5fgz4wh/AADgsli2BTA0NFS+vr46c+aMy/YzZ84oMjKyztcEBAQoICCg1Wo6/sw49fjVPxo9BgAA4EpYtgXQ399fgwcP1qZNm5zbampqtGnTJiUkJLixsvoR/gAAQEuwbAugJM2bN09Tp07VDTfcoCFDhuj5559XWVmZpk2b5raaakPe91sCCX4AAKAlWToA3nPPPcrLy9OiRYuUk5Oj6667Ths2bLhkYog7EPoAAEBrsRmGYbi7CG9VUlIih8Oh4uJi2e12d5cDAACagO9vC48BBAAAsCoCIAAAgMUQAAEAACyGAAgAAGAxBEAAAACLIQACAABYDAEQAADAYgiAAAAAFkMABAAAsBhL3wruStXeRKWkpMTNlQAAgKaq/d628s3QCIBX4OzZs5Kk6OhoN1cCAACa6+zZs3I4HO4uwy24F/AVqKmp0alTp9SpUyfZbLYWfe+SkhJFR0crOzvbsvcpdDc+A/fjM3A/PgP34zNoeYZh6OzZs4qKipKPjzVHw9ECeAV8fHzUrVu3Vj2H3W7nH7yb8Rm4H5+B+/EZuB+fQcuyastfLWvGXgAAAAsjAAIAAFiM7+LFixe7uwjUzdfXV7fddpvataOn3l34DNyPz8D9+Azcj88ALY1JIAAAABZDFzAAAIDFEAABAAAshgAIAABgMQRAAAAAiyEAeqBly5apR48eat++vYYOHardu3e7u6Q2a/HixbLZbC6PPn36OPefO3dOycnJ6tKlizp27KjJkyfrzJkzbqzY+23btk3jx49XVFSUbDab3n33XZf9hmFo0aJF6tq1qzp06KDExERlZGS4HFNYWKgpU6bIbrcrODhY06dPV2lpqZmX4dUa+wweeOCBS/5djBkzxuUYPoMrs2TJEt14443q1KmTwsPDNXHiRB05csTlmKb8/snKytK4ceMUGBio8PBwPfLIIzp//ryZlwIvRQD0MG+88YbmzZunJ598Up999pkGDRqk0aNHKzc3192ltVnXXnutTp8+7Xxs377due/hhx/WunXr9NZbb2nr1q06deqUJk2a5MZqvV9ZWZkGDRqkZcuW1bn/2Wef1Ysvvqjly5crLS1NQUFBGj16tM6dO+c8ZsqUKTp48KA2btyo9evXa9u2bZoxY4ZZl+D1GvsMJGnMmDEu/y5SUlJc9vMZXJmtW7cqOTlZu3bt0saNG1VVVaVRo0aprKzMeUxjv3+qq6s1btw4VVZWaufOnXrttde0cuVKLVq0yB2XBG9jwKMMGTLESE5Odj6vrq42oqKijCVLlrixqrbrySefNAYNGlTnvqKiIsPPz8946623nNsOHz5sSDJSU1PNKrFNk2S88847zuc1NTVGZGSk8cc//tG5raioyAgICDBSUlIMwzCMQ4cOGZKMPXv2OI/58MMPDZvNZnzzzTfmFd9GXPwZGIZhTJ061ZgwYUK9r+EzaHm5ubmGJGPr1q2GYTTt988HH3xg+Pj4GDk5Oc5jXn75ZcNutxsVFRXmXgC8Di2AHqSyslL79u1TYmKic5uPj48SExOVmprqxsratoyMDEVFRalnz56aMmWKsrKyJEn79u1TVVWVy+fRp08fde/enc+jlWRmZionJ8flZ+5wODR06FDnzzw1NVXBwcG64YYbnMckJibKx8dHaWlpptfcVm3ZskXh4eG65pprNHPmTBUUFDj38Rm0vOLiYklSSEiIpKb9/klNTdWAAQMUERHhPGb06NEqKSnRwYMHTawe3ogA6EHy8/NVXV3t8o9ZkiIiIpSTk+Omqtq2oUOHauXKldqwYYNefvllZWZm6tZbb9XZs2eVk5Mjf39/BQcHu7yGz6P11P5cG/o3kJOTo/DwcJf97dq1U0hICJ9LCxkzZoxWrVqlTZs26Q9/+IO2bt2qsWPHqrq6WhKfQUurqanR3LlzNWzYMPXv31+SmvT7Jycnp85/K7X7gIZwTxlY2tixY51/HjhwoIYOHaqYmBi9+eab6tChgxsrA9zn3nvvdf55wIABGjhwoK6++mpt2bJFI0eOdGNlbVNycrK+/PJLl/HHQGujBdCDhIaGytfX95JZXmfOnFFkZKSbqrKW4OBg9e7dW0ePHlVkZKQqKytVVFTkcgyfR+up/bk29G8gMjLykklR58+fV2FhIZ9LK+nZs6dCQ0N19OhRSXwGLWnWrFlav369Nm/erG7dujm3N+X3T2RkZJ3/Vmr3AQ0hAHoQf39/DR48WJs2bXJuq6mp0aZNm5SQkODGyqyjtLRUX3/9tbp27arBgwfLz8/P5fM4cuSIsrKy+DxaSWxsrCIjI11+5iUlJUpLS3P+zBMSElRUVKR9+/Y5j/nkk09UU1OjoUOHml6zFZw8eVIFBQXq2rWrJD6DlmAYhmbNmqV33nlHn3zyiWJjY132N+X3T0JCgg4cOOASxjdu3Ci73a5+/fqZcyHwXu6ehQJXa9asMQICAoyVK1cahw4dMmbMmGEEBwe7zPJCy5k/f76xZcsWIzMz09ixY4eRmJhohIaGGrm5uYZhGMZDDz1kdO/e3fjkk0+MvXv3GgkJCUZCQoKbq/ZuZ8+eNfbv32/s37/fkGT8+c9/Nvbv32+cOHHCMAzDeOaZZ4zg4GDjvffeM7744gtjwoQJRmxsrPHdd98532PMmDFGfHy8kZaWZmzfvt2Ii4szkpKS3HVJXqehz+Ds2bPGggULjNTUVCMzM9P4+OOPjeuvv96Ii4szzp0753wPPoMrM3PmTMPhcBhbtmwxTp8+7XyUl5c7j2ns98/58+eN/v37G6NGjTLS09ONDRs2GGFhYcbChQvdcUnwMgRAD7R06VKje/fuhr+/vzFkyBBj1
65d7i6pzbrnnnuMrl27Gv7+/sZVV11l3HPPPcbRo0ed+7/77jvjF7/4hdG5c2cjMDDQuOuuu4zTp0+7sWLvt3nzZkPSJY+pU6cahnFhKZgnnnjCiIiIMAICAoyRI0caR44ccXmPgoICIykpyejYsaNht9uNadOmGWfPnnXD1Xinhj6D8vJyY9SoUUZYWJjh5+dnxMTEGA8++OAl/wnlM7gydf38JRkrVqxwHtOU3z/Hjx83xo4da3To0MEIDQ015s+fb1RVVZl8NfBGNsMwDLNbHQEAAOA+jAEEAACwGAIgAACAxRAAAQAALIYACAAAYDEEQAAAAIshAAIAAFgMARAAAMBiCIAAAAAWQwAEYIrbbrtNc+fOvezXL168WNddd12zXmMYhmbMmKGQkBDZbDalp6c36XU2m03vvvvu5ZQJAF6BAAigzdqwYYNWrlyp9evX6/Tp0+rfv7+7S7rEypUrFRwc7O4yAFhMO3cXAACt5euvv1bXrl118803u7sUAPAotAACME1NTY1++ctfKiQkRJGRkVq8eLFzX1FRkX7+858rLCxMdrtdI0aM0Oeff17vez3wwAOaOHGinnrqKedrHnroIVVWVjr3z549W1lZWbLZbOrRo4ckqUePHnr++edd3uu6665zqaWpbr75Zj366KMu2/Ly8uTn56dt27ZJkr799lvdf//96ty5swIDAzV27FhlZGRIkrZs2aJp06apuLhYNptNNpvNWUdFRYUWLFigq666SkFBQRo6dKi2bNniPM+JEyc0fvx4de7cWUFBQbr22mv1wQcfNPsaAFgTARCAaV577TUFBQUpLS1Nzz77rJ5++mlt3LhRknT33XcrNzdXH374ofbt26frr79eI0eOVGFhYb3vt2nTJh0+fFhbtmxRSkqK1q5dq6eeekqS9MILL+jpp59Wt27ddPr0ae3Zs6fFr2fKlClas2aNDMNwbnvjjTcUFRWlW2+9VdKFILp37169//77Sk1NlWEYuuOOO1RVVaWbb75Zzz//vOx2u06fPq3Tp09rwYIFkqRZs2YpNTVVa9as0RdffKG7775bY8aMcYbH5ORkVVRUaNu2bTpw4ID+8Ic/qGPHji1+jQDaJrqAAZhm4MCBevLJJyVJcXFx+utf/6pNmzapQ4cO2r17t3JzcxUQECBJ+tOf/qR3331Xb7/9tmbMmFHn+/n7++vvf/+7AgMDde211+rpp5/WI488ot/85jdyOBzq1KmTfH19FRkZ2SrX85Of/ERz587V9u3bnYFv9erVSkpKks1mU0ZGht5//33t2LHD2Q39+uuvKzo6Wu+++67uvvtuORwO2Ww2lxqzsrK0YsUKZWVlKSoqSpK0YMECbdiwQStWrNDvf/97ZWVlafLkyRowYIAkqWfPnq1yjQDaJgIgANMMHDjQ5XnXrl2Vm5urzz//XKWlperSpYvL/u+++05ff/11ve83aNAgBQYGOp8nJCSotLRU2dnZiomJadni6xAWFqZRo0bp9ddf16233qrMzEylpqbqv/7rvyRJhw8fVrt27TR06FDna7p06aJrrrlGhw8frvd9Dxw4oOrqavXu3dtlSs5EAwAAA29JREFUe0VFhfNnNGfOHM2cOVP//Oc/lZiYqMmTJ1/y8wWA+hAAAZjGz8/P5bnNZlNNTY1KS0vVtWtXlzFutVp6hqyPj49Ll60kVVVVXfb7TZkyRXPmzNHSpUu1evVqDRgwwNkqd7lKS0vl6+urffv2ydfX12VfbTfvz3/+c40ePVr/+Mc/9M9//lNLlizRc889p9mzZ1/RuQFYA2MAAbjd9ddfr5ycHLVr1069evVyeYSGhtb7us8//1zfffed8/muXbvUsWNHRUdH1/uasLAwnT592vm8pKREmZmZl137hAkTdO7cOW3YsEGrV6/WlClTnPv69u2r8+fPKy0tzbmtoKBAR44cUb9+/SRd6Maurq52ec/4+HhVV1crNzf3kp/H97uKo6Oj9dBDD2nt2rWaP3++/va3v132dQCwFgIgALdLTExUQkKCJk6cqH/+8586fvy4du7cqccee0x79+6t93WVlZWaPn26Dh06pA8++EBPPvmkZs2aJR+f+n+1jRgxQv/zP/+jTz/9VAcOHNDUqVMvaWVrjqCgIE2cOFFPPPGEDh8+rKSkJOe+uLg4TZgwQQ8++KC2b9+uzz//XD/72c901VVXacKECZIuzEouLS3Vpk2blJ+fr/LycvXu3VtTpkzR/fffr7Vr1yozM1O7d+/WkiVL9I9//EOSNHfuXH300UfKzMzUZ599ps2bN6tv376XfR0ArIUACMDtbDabPvjgAw0fPlzTpk1T7969de+99+rEiROKiIio93UjR45UXFychg8frnvuuUd33nlno8u5LFy4UD/4wQ/+X3t3jKJoDIYB+J1SsLUSbATF2krBXvEv/s5CEEvB0jtYeABbb+ElrLyIeAFxOmGXtZlZZlnyPJAqJHzpXvIFkqqqMp/PU9d1ut3ut+pfLpe5Xq+ZTCbpdDq/zJ1OpwyHw1RVldFolOfzmfP5/GqHj8fjbDabLBaLtFqtHA6H17rVapXdbpd+v5+6rnO5XF77Px6PbLfbDAaDTKfT9Hq9HI/Hb50DKMfH8/fHMAD/gfV6nfv97ss2gC9wAwgAUBgBEOCN/X6fZrP5xzGbzf51eQBfpgUM8Mbtdnv7E0mj0Ui73f7higD+DgEQAKAwWsAAAIURAAEACiMAAgAURgAEACiMAAgAUBgBEACgMAIgAEBhPgE75GKuXMt7WgAAAABJRU5ErkJggg==\n", 722 | "text/plain": [ 723 | "" 724 | ] 725 | }, 726 | "metadata": {}, 727 | "output_type": "display_data" 728 | } 729 | ], 730 | "source": [ 731 | "# Close previous plots; otherwise, will just overwrite and display again\n", 732 | "plt.close()\n", 733 | "\n", 734 | "sampled_df = data.sample(fraction=0.0001).toPandas()\n", 735 | "sampled_df.plot.scatter('helpful_votes', 'total_votes')\n", 736 | "%matplot plt" 737 | ] 738 | }, 739 | { 740 | "cell_type": "markdown", 741 | "metadata": {}, 742 | "source": [ 743 | "## Fitting a Machine Learning Model to Predict whether a Rating will be Good or Bad\n", 744 | "\n", 745 | "Once we've explored our data, let's say that we want to fit a Machine Learning model to predict a review's star rating based on 
other features about the review. Here, let's fit a simple (and extremely naive, for the sake of demonstration) model that predicts whether or not a customer review will have a \"good\" star rating (>= 4) or a \"bad\" star rating (<4) based on the total number of votes a review receives. In your homework, you will improve upon this model, engineering other features from the dataset, as well as doing a better job of balancing the data (see below -- there are a lot more \"good\" star-ratings than \"bad\" ones) and setting up a reproducible machine learning pipeline.\n", 746 | "\n", 747 | "First, let's create a column that indicates whether a review is good or bad. We'll use this column as labels for machine learning. " 748 | ] 749 | }, 750 | { 751 | "cell_type": "code", 752 | "execution_count": 12, 753 | "metadata": {}, 754 | "outputs": [ 755 | { 756 | "data": { 757 | "application/vnd.jupyter.widget-view+json": { 758 | "model_id": "68028de131b8490aadac465419d6dac1", 759 | "version_major": 2, 760 | "version_minor": 0 761 | }, 762 | "text/plain": [ 763 | "VBox()" 764 | ] 765 | }, 766 | "metadata": {}, 767 | "output_type": "display_data" 768 | }, 769 | { 770 | "data": { 771 | "application/vnd.jupyter.widget-view+json": { 772 | "model_id": "", 773 | "version_major": 2, 774 | "version_minor": 0 775 | }, 776 | "text/plain": [ 777 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 778 | ] 779 | }, 780 | "metadata": {}, 781 | "output_type": "display_data" 782 | }, 783 | { 784 | "name": "stdout", 785 | "output_type": "stream", 786 | "text": [ 787 | "+-----------+-----------+\n", 788 | "|star_rating|good_review|\n", 789 | "+-----------+-----------+\n", 790 | "| 5| 1|\n", 791 | "| 4| 1|\n", 792 | "| 4| 1|\n", 793 | "| 5| 1|\n", 794 | "| 5| 1|\n", 795 | "+-----------+-----------+\n", 796 | "only showing top 5 rows\n", 797 | "\n", 798 | "+-----------+--------+\n", 799 | "|good_review| count|\n", 800 | "+-----------+--------+\n", 801 | "| 1|17208450|\n", 802 | "| 0| 3517710|\n", 803 | "+-----------+--------+" 804 | ] 805 | } 806 | ], 807 | "source": [ 808 | "# Good == 1, Bad == 0 (cast as integers so that pyspark.ml can understand them)\n", 809 | "data = data.withColumn('good_review', (data.star_rating >= 4).cast(\"integer\"))\n", 810 | "\n", 811 | "# Check to make sure new column is capturing star_rating correctly\n", 812 | "data[['star_rating', 'good_review']].show(5)\n", 813 | "\n", 814 | "# Take a look at how many good and bad reviews we have, respectively\n", 815 | "(data.groupBy('good_review')\n", 816 | " .count()\n", 817 | " .show()\n", 818 | ")" 819 | ] 820 | }, 821 | { 822 | "cell_type": "markdown", 823 | "metadata": {}, 824 | "source": [ 825 | "Then, let's use the `VectorAssembler` to get our total_votes feature into a form that `pyspark.ml` expects it to be in."
826 | ] 827 | }, 828 | { 829 | "cell_type": "code", 830 | "execution_count": 13, 831 | "metadata": {}, 832 | "outputs": [ 833 | { 834 | "data": { 835 | "application/vnd.jupyter.widget-view+json": { 836 | "model_id": "d501f73cce924fccbb84d89a7854d72f", 837 | "version_major": 2, 838 | "version_minor": 0 839 | }, 840 | "text/plain": [ 841 | "VBox()" 842 | ] 843 | }, 844 | "metadata": {}, 845 | "output_type": "display_data" 846 | }, 847 | { 848 | "data": { 849 | "application/vnd.jupyter.widget-view+json": { 850 | "model_id": "", 851 | "version_major": 2, 852 | "version_minor": 0 853 | }, 854 | "text/plain": [ 855 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 856 | ] 857 | }, 858 | "metadata": {}, 859 | "output_type": "display_data" 860 | }, 861 | { 862 | "name": "stdout", 863 | "output_type": "stream", 864 | "text": [ 865 | "+-----------+-----------+--------+\n", 866 | "|good_review|total_votes|features|\n", 867 | "+-----------+-----------+--------+\n", 868 | "| 1| 10| [10.0]|\n", 869 | "| 1| 7| [7.0]|\n", 870 | "| 1| 0| [0.0]|\n", 871 | "| 1| 1| [1.0]|\n", 872 | "| 1| 0| [0.0]|\n", 873 | "| 1| 7| [7.0]|\n", 874 | "| 1| 1| [1.0]|\n", 875 | "| 0| 7| [7.0]|\n", 876 | "| 1| 0| [0.0]|\n", 877 | "| 1| 9| [9.0]|\n", 878 | "| 1| 0| [0.0]|\n", 879 | "| 1| 1| [1.0]|\n", 880 | "| 1| 0| [0.0]|\n", 881 | "| 1| 0| [0.0]|\n", 882 | "| 0| 6| [6.0]|\n", 883 | "| 1| 0| [0.0]|\n", 884 | "| 1| 0| [0.0]|\n", 885 | "| 1| 1| [1.0]|\n", 886 | "| 1| 1| [1.0]|\n", 887 | "| 1| 1| [1.0]|\n", 888 | "+-----------+-----------+--------+\n", 889 | "only showing top 20 rows" 890 | ] 891 | } 892 | ], 893 | "source": [ 894 | "from pyspark.ml.feature import VectorAssembler\n", 895 | "\n", 896 | "features = ['total_votes']\n", 897 | "assembler = VectorAssembler(inputCols = features, outputCol = 'features')\n", 898 | "\n", 899 | "data = assembler.transform(data)\n", 900 | "data[['good_review', 'total_votes', 'features']].show()" 901 | ] 902 | }, 903 | { 904 | "cell_type": "markdown", 905 | "metadata": {}, 906 | "source": [ 907 | "Then, we split up our data into training and test data and train a logistic regression model on our training data." 908 | ] 909 | }, 910 | { 911 | "cell_type": "code", 912 | "execution_count": 14, 913 | "metadata": {}, 914 | "outputs": [ 915 | { 916 | "data": { 917 | "application/vnd.jupyter.widget-view+json": { 918 | "model_id": "b3349118c86646a388d1f126d2243b4a", 919 | "version_major": 2, 920 | "version_minor": 0 921 | }, 922 | "text/plain": [ 923 | "VBox()" 924 | ] 925 | }, 926 | "metadata": {}, 927 | "output_type": "display_data" 928 | }, 929 | { 930 | "data": { 931 | "application/vnd.jupyter.widget-view+json": { 932 | "model_id": "", 933 | "version_major": 2, 934 | "version_minor": 0 935 | }, 936 | "text/plain": [ 937 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 938 | ] 939 | }, 940 | "metadata": {}, 941 | "output_type": "display_data" 942 | } 943 | ], 944 | "source": [ 945 | "from pyspark.ml.classification import LogisticRegression\n", 946 | "\n", 947 | "train, test = data.randomSplit([0.7, 0.3])\n", 948 | "\n", 949 | "lr = LogisticRegression(featuresCol='features', labelCol='good_review')\n", 950 | "model = lr.fit(train)" 951 | ] 952 | }, 953 | { 954 | "cell_type": "markdown", 955 | "metadata": {}, 956 | "source": [ 957 | "And, finally, we can make predictions for our test data and see how good our model is. 
Note that our predictive accuracy is not bad overall. However, from our \"false positive rate by label\" metric, we can see that our model is essentially just predicting that most reviews will be \"good\" reviews (label 1) -- likely because of our highly unbalanced data. Accordingly, you'll notice that our AUC is only barely above .5 (see ROC Curve below). In Assignment 3, it'll be your job to try to improve this model, so that we can better distinguish good from bad customer reviews." 958 | ] 959 | }, 960 | { 961 | "cell_type": "code", 962 | "execution_count": 15, 963 | "metadata": {}, 964 | "outputs": [ 965 | { 966 | "data": { 967 | "application/vnd.jupyter.widget-view+json": { 968 | "model_id": "189a87f193c941418c73f6bc52bdc950", 969 | "version_major": 2, 970 | "version_minor": 0 971 | }, 972 | "text/plain": [ 973 | "VBox()" 974 | ] 975 | }, 976 | "metadata": {}, 977 | "output_type": "display_data" 978 | }, 979 | { 980 | "data": { 981 | "application/vnd.jupyter.widget-view+json": { 982 | "model_id": "", 983 | "version_major": 2, 984 | "version_minor": 0 985 | }, 986 | "text/plain": [ 987 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 988 | ] 989 | }, 990 | "metadata": {}, 991 | "output_type": "display_data" 992 | } 993 | ], 994 | "source": [ 995 | "# Training Summary Data\n", 996 | "trainingSummary = model.summary\n", 997 | "evaluationSummary = model.evaluate(test)" 998 | ] 999 | }, 1000 | { 1001 | "cell_type": "code", 1002 | "execution_count": null, 1003 | "metadata": {}, 1004 | "outputs": [ 1005 | { 1006 | "data": { 1007 | "application/vnd.jupyter.widget-view+json": { 1008 | "model_id": "3c72d869035144ff8291ba382180fe94", 1009 | "version_major": 2, 1010 | "version_minor": 0 1011 | }, 1012 | "text/plain": [ 1013 | "VBox()" 1014 | ] 1015 | }, 1016 | "metadata": {}, 1017 | "output_type": "display_data" 1018 | }, 1019 | { 1020 | "data": { 1021 | "application/vnd.jupyter.widget-view+json": { 1022 | "model_id": "bae04754455f4c91a3ce17987262680a", 1023 | "version_major": 2, 1024 | "version_minor": 0 1025 | }, 1026 | "text/plain": [ 1027 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 1028 | ] 1029 | }, 1030 | "metadata": {}, 1031 | "output_type": "display_data" 1032 | } 1033 | ], 1034 | "source": [ 1035 | "print(\"Training AUC: \" + str(trainingSummary.areaUnderROC))\n", 1036 | "print(\"Test AUC: \", str(evaluationSummary.areaUnderROC))\n", 1037 | "\n", 1038 | "print(\"\\nFalse positive rate by label (Training):\")\n", 1039 | "for i, rate in enumerate(trainingSummary.falsePositiveRateByLabel):\n", 1040 | " print(\"label %d: %s\" % (i, rate))\n", 1041 | "\n", 1042 | "print(\"\\nTrue positive rate by label (Training):\")\n", 1043 | "for i, rate in enumerate(trainingSummary.truePositiveRateByLabel):\n", 1044 | " print(\"label %d: %s\" % (i, rate))\n", 1045 | " \n", 1046 | "print(\"\\nTraining Accuracy: \" + str(trainingSummary.accuracy))\n", 1047 | "print(\"Test Accuracy: \", str(evaluationSummary.accuracy))" 1048 | ] 1049 | }, 1050 | { 1051 | "cell_type": "code", 1052 | "execution_count": null, 1053 | "metadata": {}, 1054 | "outputs": [], 1055 | "source": [ 1056 | "# Get ROC curve and send it to Pandas so that we can plot it\n", 1057 | "roc_df = evaluationSummary.roc.toPandas()" 1058 | ] 1059 | }, 1060 | { 1061 | "cell_type": "code", 1062 | "execution_count": null, 1063 | "metadata": {}, 1064 | "outputs": [], 1065 | "source": [ 1066 | "# Close previous plots; 
otherwise, will just overwrite and display again\n", 1067 | "plt.close()\n", 1068 | "\n", 1069 | "plt.plot(roc_df.FPR, roc_df.TPR, 'b', label = 'AUC = %0.2f' % evaluationSummary.areaUnderROC)\n", 1070 | "plt.legend(loc = 'lower right')\n", 1071 | "plt.plot([0, 1], [0, 1],'r--')\n", 1072 | "plt.xlim([0, 1])\n", 1073 | "plt.ylim([0, 1])\n", 1074 | "plt.ylabel('True Positive Rate')\n", 1075 | "plt.xlabel('False Positive Rate')\n", 1076 | "plt.title('ROC Curve')\n", 1077 | "plt.show()\n", 1078 | "\n", 1079 | "%matplot plt" 1080 | ] 1081 | } 1082 | ], 1083 | "metadata": { 1084 | "kernelspec": { 1085 | "display_name": "PySpark", 1086 | "language": "", 1087 | "name": "pysparkkernel" 1088 | }, 1089 | "language_info": { 1090 | "codemirror_mode": { 1091 | "name": "python", 1092 | "version": 3 1093 | }, 1094 | "mimetype": "text/x-python", 1095 | "name": "pyspark", 1096 | "pygments_lexer": "python3" 1097 | } 1098 | }, 1099 | "nbformat": 4, 1100 | "nbformat_minor": 4 1101 | } 1102 | -------------------------------------------------------------------------------- /Labs/Lab 6 PySpark EDA and ML in an EMR Notebook/Local_Colab_Spark_Setup.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Local/Colab Spark Setup", 7 | "provenance": [], 8 | "collapsed_sections": [], 9 | "toc_visible": true, 10 | "include_colab_link": true 11 | }, 12 | "kernelspec": { 13 | "display_name": "Python 3", 14 | "name": "python3" 15 | } 16 | }, 17 | "cells": [ 18 | { 19 | "cell_type": "markdown", 20 | "metadata": { 21 | "id": "view-in-github", 22 | "colab_type": "text" 23 | }, 24 | "source": [ 25 | "\"Open" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": { 31 | "id": "qpaNU3ETh0Qc", 32 | "colab_type": "text" 33 | }, 34 | "source": [ 35 | "# Setting up PySpark in a Colab Notebook\n", 36 | "\n", 37 | "You can run Spark both locally and on a cluster. Here, I'll demonstrate how you can set up Spark to run in a Colab notebook for debugging purposes. You can also set up Spark locally in this same way if you want to take advantage of multiple CPU cores on your laptop (the setup will vary slightly, though, depending on your operating system and you'll need to figure out these specifics on your own; however, this setup does also work on WSL for me).\n", 38 | "\n", 39 | "This being said, this local option should be for testing purposes on sample datasets only. 
If you want to run big PySpark jobs, you will want to run these in an EMR notebook (with an EMR cluster as your backend).\n", 40 | "\n", 41 | "First, we need to install Spark and PySpark by running the following commands:" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "metadata": { 47 | "id": "R8f1D7wfaCgF", 48 | "colab_type": "code", 49 | "outputId": "75895338-8205-406b-be0a-c674341d07d5", 50 | "colab": { 51 | "base_uri": "https://localhost:8080/", 52 | "height": 68 53 | } 54 | }, 55 | "source": [ 56 | "%%bash\n", 57 | "apt-get install openjdk-8-jdk-headless -qq > /dev/null\n", 58 | "wget -q https://www-us.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz\n", 59 | "tar -xzf spark-2.4.5-bin-hadoop2.7.tgz\n", 60 | "pip install pyspark findspark" 61 | ], 62 | "execution_count": 0, 63 | "outputs": [ 64 | { 65 | "output_type": "stream", 66 | "text": [ 67 | "Requirement already satisfied: pyspark in /usr/local/lib/python3.6/dist-packages (2.4.5)\n", 68 | "Requirement already satisfied: findspark in /usr/local/lib/python3.6/dist-packages (1.3.0)\n", 69 | "Requirement already satisfied: py4j==0.10.7 in /usr/local/lib/python3.6/dist-packages (from pyspark) (0.10.7)\n" 70 | ], 71 | "name": "stdout" 72 | } 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": { 78 | "id": "fHjZLeId0nvR", 79 | "colab_type": "text" 80 | }, 81 | "source": [ 82 | "OK, now that we have Spark, we need to set a path to it, so PySpark knows where to find it. We do this using the `os` Python library below.\n", 83 | "\n", 84 | "On my machine (WSL, Ubuntu 20.04), where I unpacked Spark in my home directory, this can be achieved with:\n", 85 | "```\n", 86 | "os.environ[\"SPARK_HOME\"] = \"~/spark-2.4.5-bin-hadoop2.7\"\n", 87 | "```\n", 88 | "\n", 89 | "In Colab, it is automatically downloaded to the `/content` directory, so we indicate that as its location here. Then, we run `findspark` to find Spark for us on the machine, and finally start up a SparkSession running on all available cores (`local[4]` means your code will run on 4 threads locally, `local[*]` means that your code will run on as many threads as there are logical cores on your machine).\n" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "metadata": { 95 | "id": "CljIupW0aE06", 96 | "colab_type": "code", 97 | "colab": {} 98 | }, 99 | "source": [ 100 | "# Set path to Spark\n", 101 | "import os\n", 102 | "os.environ[\"SPARK_HOME\"] = \"/content/spark-2.4.5-bin-hadoop2.7\"\n", 103 | "\n", 104 | "# Find Spark so that we can access session within our notebook\n", 105 | "import findspark\n", 106 | "findspark.init()\n", 107 | "\n", 108 | "# Start SparkSession on all available cores\n", 109 | "from pyspark.sql import SparkSession\n", 110 | "spark = SparkSession.builder.master(\"local[*]\").getOrCreate()" 111 | ], 112 | "execution_count": 0, 113 | "outputs": [] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": { 118 | "id": "VNTQOBLthDrC", 119 | "colab_type": "text" 120 | }, 121 | "source": [ 122 | "Now that we've installed everything and set up our paths correctly, we can run (small) Spark jobs both in Colab notebooks and locally (for bigger jobs, you will want to run these jobs on an EMR cluster, though. Remember, for instance, that Google only allocates us one CPU core for free)!\n", 123 | "\n", 124 | "Let's make sure our setup is working by doing a couple of simple things with the pyspark.sql package on the Amazon Customer Review Sample Dataset."
125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "metadata": { 130 | "id": "fbXWBQfSAX8q", 131 | "colab_type": "code", 132 | "outputId": "69af5811-2391-49ba-b101-311ee10adcb5", 133 | "colab": { 134 | "base_uri": "https://localhost:8080/", 135 | "height": 51 136 | } 137 | }, 138 | "source": [ 139 | "! pip install wget\n", 140 | "import wget\n", 141 | "\n", 142 | "wget.download('https://s3.amazonaws.com/amazon-reviews-pds/tsv/sample_us.tsv', 'sample_data/sample_us.tsv')" 143 | ], 144 | "execution_count": 0, 145 | "outputs": [ 146 | { 147 | "output_type": "stream", 148 | "text": [ 149 | "Requirement already satisfied: wget in /usr/local/lib/python3.6/dist-packages (3.2)\n" 150 | ], 151 | "name": "stdout" 152 | }, 153 | { 154 | "output_type": "execute_result", 155 | "data": { 156 | "text/plain": [ 157 | "'sample_data/sample_us.tsv'" 158 | ] 159 | }, 160 | "metadata": { 161 | "tags": [] 162 | }, 163 | "execution_count": 18 164 | } 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "metadata": { 170 | "id": "KrXWEMxjeFx1", 171 | "colab_type": "code", 172 | "colab": {} 173 | }, 174 | "source": [ 175 | "# Read TSV file from default data download directory in Colab\n", 176 | "data = spark.read.csv('sample_data/sample_us.tsv',\n", 177 | " sep=\"\\t\",\n", 178 | " header=True,\n", 179 | " inferSchema=True)" 180 | ], 181 | "execution_count": 0, 182 | "outputs": [] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "metadata": { 187 | "id": "2qvOOIYqeWw9", 188 | "colab_type": "code", 189 | "outputId": "a43569f6-d2bb-4b67-e849-c51a4aa62758", 190 | "colab": { 191 | "base_uri": "https://localhost:8080/", 192 | "height": 306 193 | } 194 | }, 195 | "source": [ 196 | "data.printSchema()" 197 | ], 198 | "execution_count": 0, 199 | "outputs": [ 200 | { 201 | "output_type": "stream", 202 | "text": [ 203 | "root\n", 204 | " |-- marketplace: string (nullable = true)\n", 205 | " |-- customer_id: integer (nullable = true)\n", 206 | " |-- review_id: string (nullable = true)\n", 207 | " |-- product_id: string (nullable = true)\n", 208 | " |-- product_parent: integer (nullable = true)\n", 209 | " |-- product_title: string (nullable = true)\n", 210 | " |-- product_category: string (nullable = true)\n", 211 | " |-- star_rating: integer (nullable = true)\n", 212 | " |-- helpful_votes: integer (nullable = true)\n", 213 | " |-- total_votes: integer (nullable = true)\n", 214 | " |-- vine: string (nullable = true)\n", 215 | " |-- verified_purchase: string (nullable = true)\n", 216 | " |-- review_headline: string (nullable = true)\n", 217 | " |-- review_body: string (nullable = true)\n", 218 | " |-- review_date: timestamp (nullable = true)\n", 219 | "\n" 220 | ], 221 | "name": "stdout" 222 | } 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "metadata": { 228 | "id": "ngb25JINcUNr", 229 | "colab_type": "code", 230 | "outputId": "029777eb-e3b9-4cec-8d30-e0cf55f6fcaf", 231 | "colab": { 232 | "base_uri": "https://localhost:8080/", 233 | "height": 187 234 | } 235 | }, 236 | "source": [ 237 | "(data.groupBy('star_rating')\n", 238 | " .sum('total_votes')\n", 239 | " .sort('star_rating', ascending=False)\n", 240 | " .show()\n", 241 | ")" 242 | ], 243 | "execution_count": 0, 244 | "outputs": [ 245 | { 246 | "output_type": "stream", 247 | "text": [ 248 | "+-----------+----------------+\n", 249 | "|star_rating|sum(total_votes)|\n", 250 | "+-----------+----------------+\n", 251 | "| 5| 13|\n", 252 | "| 4| 3|\n", 253 | "| 3| 8|\n", 254 | "| 2| 2|\n", 255 | "| 1| 8|\n", 256 | "+-----------+----------------+\n", 257 
| "\n" 258 | ], 259 | "name": "stdout" 260 | } 261 | ] 262 | } 263 | ] 264 | } -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Large-Scale Computing for the Social Sciences 2 | ### Spring 2020 - MACS 30123/MAPS 30123/PLSC 30123 3 | 4 | | Instructor Information | **TA Information** | Course Information | 5 | | :------------- | :------------- | :------------ | 6 | | Jon Clindaniel | Dhruval Bhatt | Location: [Online](https://canvas.uchicago.edu/courses/28258) | 7 | | 1155 E. 60th Street, Rm. 215 | | Monday/Wednesday | 8 | | jclindaniel@uchicago.edu | dhruval@uchicago.edu | 9:30-11:20 AM (CDT)| 9 | | **Office Hours:** [Schedule](https://appoint.ly/s/jclindaniel/office-hours)\* | **Office Hours:** [Schedule](https://appoint.ly/s/dhruval/office_hours)\*| **Lab:** Prerecorded, [Online](https://canvas.uchicago.edu/courses/28258)| 10 | 11 | \* Office Hours held via Zoom 12 | 13 | ## Course Description 14 | Computational social scientists increasingly need to grapple with data that is either too big for a local machine and/or code that is too resource intensive to process on a local machine. In this course, students will learn how to effectively scale their computational methods beyond their local machines. The focus of the course will be social scientific applications, ranging from training machine learning models on large economic time series to processing and analyzing social media data in real-time. Students will be introduced to several large-scale computing frameworks such as MPI, MapReduce, Spark, and OpenCL, with a special emphasis on employing these frameworks using cloud resources and the Python programming language. 15 | 16 | *Prerequisites: CAPP 30121 and CAPP 30122, or equivalent.* 17 | 18 | ## Course Structure 19 | This course is structured into several modules, or overarching thematic learning units, focused on teaching students fundamental large-scale computing concepts, as well as giving them the opportunity to apply these concepts to Computational Social Science research problems. Students can access the content in these modules on the [Canvas course site](https://canvas.uchicago.edu/courses/28258). Each module consists of a series of asynchronous lectures, readings, and assignments, which will make up a portion of the instruction time each week. If students have any questions about the asynchronous content, they should post these questions in the Piazza forum for the course, which they can access by clicking the "Piazza" tab on the left side of the screen on the Canvas course site. To see an overall schedule and syllabus for the course, as well as access additional course-related files, please visit the GitHub Course Repository, available here. 20 | 21 | Additionally, students will attend short, interactive Zoom sessions during regular class hours (starting at 9:30 AM CDT) on Mondays and Wednesdays meant to give them the opportunity to discuss and apply the skills they have learned asynchronously and receive live instructor feedback. Attendance to the synchronous class sessions is mandatory and is an important component of the final course grade. Students can access the Zoom sessions by clicking on the "Zoom" tab on the left side of the screen on the Canvas course site. Students should prepare for these classes by reading the assigned readings ahead of every session. All readings are available online and are linked in the course schedule below. 
22 | 23 | Students will also virtually participate in a hands-on Python lab section each week, meant to instill practical large-scale computing skills related to the week’s topic. These labs are accessible on the Canvas course site and related code will be posted here in this GitHub repository. In order to practice these large-scale computing skills and complete the course assignments, students will be given free access to UChicago's [Midway Research Computing Cluster](https://rcc.uchicago.edu/docs/), [Amazon Web Services (AWS)](https://aws.amazon.com/) cloud computing resources, and [DataCamp](https://www.datacamp.com/). More information about accessing these resources will be provided to registered students in the first several weeks of the quarter. 24 | 25 | ## Grading 26 | There will be an assignment due at the end of each unit (3 in total). Each assignment is worth 20% of the overall grade, with all assignments together worth a total of 60%. Additionally, attendance and participation will be worth 10% of the overall grade. Finally, students will complete a final project that is worth 30% of the overall grade (25% for the project itself, and 5% for an end-of-quarter presentation). 27 | 28 | | Course Component | Grade Percentage | 29 | | :------------- | :------------- | 30 | | Assignments (Total: 3) | 60% | 31 | | Attendance/Participation | 10% | 32 | | Final Project | 5% (Presentation) | 33 | | | 25% (Project) | 34 | 35 | Students do have the option of taking this course on a pass/fail basis. Note that in order to earn a "Pass," students must earn at least a B- (80%) in each of the above course components (including participation) and inform the instructor that they would like to be graded on a pass/fail basis before their Final Project due date. 36 | 37 | ## Final Project 38 | For their final project (due 6/11/2020\*), students will write their own large-scale computing code that solves a social science research problem of their choosing. For instance, students might perform a computationally intensive demographic simulation, collect and analyze large-scale streaming social media data, or train a machine learning model on a large-scale economic dataset. Students will additionally record a short video presentation about their project in the final week of the course. Detailed descriptions and grading rubrics for the project and presentation are available [on the Canvas course site.](https://canvas.uchicago.edu/courses/28258) 39 | 40 | \* Due 6/5/2020 for students graduating in June 41 | 42 | ## Late Assignments/Projects 43 | Unexcused Late Assignment/Project Submissions will be penalized 10 percentage points for every hour they are late. For example, if an assignment is due on Wednesday at 2:00pm, the following percentage points will be deducted based on the time stamp of the last commit. 44 | 45 | | Example last commit | Percentage points deducted | 46 | | ---- | ---- | 47 | | 2:01pm to 3:00pm | -10 percentage points | 48 | | 3:01pm to 4:00pm |-20 percentage points | 49 | | 4:01pm to 5:00pm | -30 percentage points | 50 | | 5:01pm to 6:00pm | -40 percentage points | 51 | | ... | ... | 52 | | 11:01pm and beyond | -100 percentage points (no credit) | 53 | 54 | ## Plagiarism on Assignments/Projects 55 | Academic honesty is an extremely important principle in academia and at the University of Chicago. 56 | * Writing assignments must quote and cite any excerpts taken from another work. 
57 | * If the cited work is the particular paper referenced in the Assignment, no works cited or references are necessary at the end of the composition. 58 | * If the cited work is not the particular paper referenced in the Assignment, you MUST include a works cited or references section at the end of the composition. 59 | * Any copying of other students' work will result in a zero grade and potential further academic discipline. 60 | If you have any questions about citations and references, consult with your instructor. 61 | 62 | ## Statement of Diversity and Inclusion 63 | The University of Chicago is committed to diversity and rigorous inquiry from multiple perspectives. The MAPSS, CIR, and Computation programs share this commitment and seek to foster productive learning environments based upon inclusion, open communication, and mutual respect for a diverse range of identities, experiences, and positions. 64 | 65 | Any suggestions for how we might further such objectives both in and outside the classroom are appreciated and will be given serious consideration. Please share your suggestions or concerns with your instructor, your preceptor, or your program’s Diversity and Inclusion representatives: Darcy Heuring (MAPSS), Matthias Staisch (CIR), and Chad Cyrenne (Computation). You are also welcome and encouraged to contact the Faculty Director of your program. 66 | 67 | This course is open to all students who meet the academic requirements for participation. Any student who has a documented need for accommodation should contact Student Disability Services (773-702-6000 or disabilities@uchicago.edu) and the instructor as soon as possible. 68 | 69 | ## Course Schedule 70 | | Unit | Week | Day | Topic | Readings | Assignment | 71 | | --- | --- | --- | --- | --- | --- | 72 | | Fundamentals of Large-Scale Computing | Week 1: Introduction to Large-Scale Computing for the Social Sciences | 4/6/2020 | Introduction to the course and course goals | | | 73 | | | | 4/8/2020 | General Considerations for Large-Scale Computing | [Robey and Zamora 2020 (Chapter 1)](https://livebook.manning.com/book/parallel-and-high-performance-computing/chapter-1) | | 74 | | | Week 2: On-Premise Large-Scale CPU-computing with MPI | 4/13/2020 | An Introduction to CPUs and Research Computing Clusters | [Pacheco 2011](https://canvas.uchicago.edu/files/3391260/download?download_frd=1) (Ch. 1-2), [Midway RCC User Guide](https://rcc.uchicago.edu/docs/) | | 75 | | | | 4/15/2020 | Cluster Computing via Message Passing Interface (MPI) for Python | [Pacheco 2011](https://canvas.uchicago.edu/files/3391260/download?download_frd=1) (Ch. 3), [Dalcín et al. 2008](https://www-sciencedirect-com.proxy.uchicago.edu/science/article/pii/S0743731507001712?via%3Dihub) | | 76 | |||| Lab: Hands-on Introduction to UChicago's Midway Computing Cluster and mpi4py ||| 77 | | | Week 3: On-Premise GPU-computing with OpenCL | 4/20/2020 | An Introduction to GPUs and Heterogenous Computing with OpenCL | [Scarpino 2012](https://canvas.uchicago.edu/files/3391262/download?download_frd=1) (Read Ch. 1, Skim Ch. 2-5,9) | | 78 | | | | 4/22/2020 | Harnessing GPUs with PyOpenCL | [Klöckner et al. 
2012](https://arxiv.org/pdf/0911.3456.pdf) | | 79 | |||| Lab: Introduction to PyOpenCL and GPU Computing on UChicago's Midway Computing Cluster ||| 80 | | Architecting Computational Social Science Data Solutions in the Cloud | Week 4: An Introduction to Cloud Computing and Cloud HPC Architectures | 4/27/2020 | An Introduction to the Cloud Computing Landscape and AWS | [Jorissen and Bouffler 2017](https://canvas.uchicago.edu/files/3391263/download?download_frd=1) (Read Ch. 1, Skim Ch. 4-7), [Armbrust et al. 2009](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf), [Jonas et al. 2019](https://arxiv.org/pdf/1902.03383.pdf) | | 81 | | | | 4/29/2020 | Architectures for Large-Scale Computation in the Cloud | [Introduction to HPC on AWS](https://d1.awsstatic.com/whitepapers/Intro_to_HPC_on_AWS.pdf), [HPC Architectural Best Practices](https://d1.awsstatic.com/whitepapers/architecture/AWS-HPC-Lens.pdf) | Due: Assignment 1 (11:59 PM) | 82 | |||| Lab: Running "Serverless" HPC Jobs in the AWS Cloud ||| 83 | | | Week 5: Architecting Large-Scale Data Solutions in the Cloud | 5/4/2020 | "Data Lake" Architectures | [Data Lakes and Analytics on AWS](https://aws.amazon.com/big-data/datalakes-and-analytics/), [AWS Data Lake Whitepaper](https://d1.awsstatic.com/whitepapers/Storage/data-lake-on-aws.pdf), [*Introduction to AWS Boto in Python*](https://campus.datacamp.com/courses/introduction-to-aws-boto-in-python) (DataCamp Course; Practice working with S3 Data Lake in Python) | | 84 | | | | 5/6/2020 | Architectures for Large-Scale Data Structuring and Storage | General, Open Source: ["What is a Database?" (YouTube),](https://www.youtube.com/watch?v=Tk1t3WKK-ZY) ["How to Choose the Right Database?" (YouTube),](https://www.youtube.com/watch?v=v5e_PasMdXc) [“NoSQL Explained,”](https://www.mongodb.com/nosql-explained) AWS-specific solutions: ["Which Database to Use When?" (YouTube),](https://youtu.be/KWOSGVtHWqA) Optional: [Data Warehousing on AWS Whitepaper](https://d0.awsstatic.com/whitepapers/enterprise-data-warehousing-on-aws.pdf), [AWS Big Data Whitepaper](https://d1.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS.pdf) | | 85 | |||| Lab: Exploring Large Data Sources in an S3 Data Lake with the AWS Python SDK, Boto ||| 86 | | | Week 6: Large-Scale Data Ingestion and Processing | 5/11/2020 | Stream Ingestion and Processing with Apache Kafka, AWS Kinesis | [Narkhede et al. 2017](https://canvas.uchicago.edu/files/3391266/download?download_frd=1) (Read Ch. 1, Skim 3-6,11), [Dean and Crettaz 2019 (Ch. 4)](https://livebook.manning.com/book/event-streams-in-action/chapter-4/), [AWS Kinesis Whitepaper](https://d0.awsstatic.com/whitepapers/whitepaper-streaming-data-solutions-on-aws-with-amazon-kinesis.pdf) || 87 | | | | 5/13/2020 | Batch Processing with Apache Hadoop and MapReduce | [White 2015](https://canvas.uchicago.edu/files/3391265/download?download_frd=1) (read Ch. 
1-2, Skim 3-4), [Dean and Ghemawat 2004](https://www.usenix.org/legacy/publications/library/proceedings/osdi04/tech/full_papers/dean/dean.pdf), ["What is Amazon EMR?"](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html), Running MapReduce Jobs with Python’s “mrjob” package on EMR ([Fundamentals](https://mrjob.readthedocs.io/en/latest/guides/quickstart.html) and [Elastic MapReduce Quickstart](https://mrjob.readthedocs.io/en/latest/guides/emr-quickstart.html)) | | 88 | |||| Lab: Ingesting and Processing Batch Data with MapReduce on AWS EMR and Streaming Data with AWS Kinesis | || 89 | | Large-Scale Data Analysis and Prediction | Week 7: An Introduction to High-Level Large-Scale Computing Paradigms for Data Analysis and Prediction | 5/18/2020 | Large-Scale Data Processing and Analysis with Spark and Dask | [Karau et al. 2015](https://canvas.uchicago.edu/files/3391268/download?download_frd=1) (Read Ch. 1-4, Skim 9-11), [*Introduction to PySpark*](https://learn.datacamp.com/courses/introduction-to-pyspark) (DataCamp Course), [“Why Dask,”](https://docs.dask.org/en/latest/why.html) [Dask Slide Deck](https://dask.org/slides.html), Optional: [*Parallel Programming with Dask*](https://learn.datacamp.com/courses/parallel-programming-with-dask-in-python) (DataCamp Course) | | 90 | | | | 5/20/2020 | A Deeper Dive into Spark (with Social Science Machine Learning Applications) | [*Machine Learning with PySpark*](https://campus.datacamp.com/courses/machine-learning-with-pyspark) (DataCamp Course), Optional: [*Feature Engineering with PySpark*](https://learn.datacamp.com/courses/feature-engineering-with-pyspark) (DataCamp Course), Videos about accelerating Spark with GPUs (via [Horovod](https://www.youtube.com/watch?v=D1By2hy4Ecw) for deep learning, and the RAPIDS Library for general operations [Part 1](https://www.youtube.com/watch?v=Qw-TB6EHmR8) and [Part 2](https://www.youtube.com/watch?v=ApI2EZIJU_Q)) | Due: Assignment 2 (11:59 PM) | 91 | |||| Lab: Running PySpark in an AWS EMR Notebook for Large-Scale Data Analysis and Prediction ||| 92 | | | Week 8: Large-Scale Network Analysis | 5/25/2020 | Memorial Day (No Class) | | | 93 | | | | 5/27/2020 | Processing and Analyzing Large-Scale Graphs with Spark | [Guller 2015](https://canvas.uchicago.edu/files/3391270/download?download_frd=1), [Hunter 2017](https://www.youtube.com/watch?v=NmbKst7ny5Q) (Spark Summit Talk), [GraphFrames Documentation for Python](https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-python.html) | | 94 | |||| Lab: Network Analysis with PySpark ||| 95 | | Student Projects and Presentations | Week 9: Final Project Presentations | 6/1/2020 | View and Comment on Student Presentations | | Due: Final Project Presentation Video (9:30 AM) | 96 | ||| 6/3/2020 | View and Comment on Student Presentations | | Due: Assignment 3 (11:59 PM) | 97 | ||| 6/5/2020 ||| Due: Final Project (11:59 PM; For June 2020 Graduates) 98 | || Week 10: Final Projects | 6/11/2020 ||| Due: Final Project (11:59 PM; For all other students) | 99 | 100 | ## Works Cited 101 | 102 | Armbrust, Michael, Fox, Armando, Griffith, Rean, Joseph, Anthony D., Katz, Randy H., Konwinski, Andrew, Lee, Gunho, Patterson, David A., Rabkin, Ariel, Stoica, Ion, and Matei Zaharia. 2009. "Above the Clouds: A Berkeley View of Cloud Computing." Technical report, EECS Department, University of California, Berkeley. 103 | 104 | ["AWS Big Data Analytics Options on AWS." 
December 2018.](https://d1.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS.pdf) AWS Whitepaper. 105 | 106 | ["Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility." July 2017.](https://d1.awsstatic.com/whitepapers/Storage/data-lake-on-aws.pdf) 107 | 108 | Dalcín, Lisandro, Paz, Rodrigo, Storti, Mario, and Jorge D'Elía. 2008. "MPI for Python: Performance improvements and MPI-2 extensions." *J. Parallel Distrib. Comput.* 68: 655-662. 109 | 110 | ["Data Warehousing on AWS." March 2016.](https://d0.awsstatic.com/whitepapers/enterprise-data-warehousing-on-aws.pdf) AWS Whitepaper. 111 | 112 | Dean, Alexander, and Valentin Crettaz. 2019. *Event Streams in Action: Real-time event systems with Kafka and Kinesis*. Shelter Island, NY: Manning. 113 | 114 | Dean, Jeffrey, and Sanjay Ghemawat. 2004. "MapReduce: Simplified data processing on large clusters." In *Proceedings of Operating Systems Design and Implementation (OSDI)*. San Francisco, CA. 137-150. 115 | 116 | *Feature Engineering with PySpark*. https://learn.datacamp.com/courses/feature-engineering-with-pyspark. Accessed 3/2020. 117 | 118 | "GraphFrames user guide - Python." https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-python.html. Accessed 3/2020. 119 | 120 | Guller, Mohammed. 2015. "Graph Processing with Spark." In *Big Data Analytics with Spark*. New York: Apress. 121 | 122 | ["High Performance Computing Lens AWS Well-Architected Framework." December 2019.](https://d1.awsstatic.com/whitepapers/architecture/AWS-HPC-Lens.pdf) AWS Whitepaper. 123 | 124 | Hunter, Tim. October 26, 2017. "GraphFrames: Scaling Web-Scale Graph Analytics with Apache Spark." https://www.youtube.com/watch?v=NmbKst7ny5Q. 125 | 126 | *Introduction to AWS Boto in Python*. https://campus.datacamp.com/courses/introduction-to-aws-boto-in-python. Accessed 3/2020. 127 | 128 | ["Introduction to HPC on AWS." n.d.](https://d1.awsstatic.com/whitepapers/Intro_to_HPC_on_AWS.pdf) AWS Whitepaper. 129 | 130 | *Introduction to PySpark*. https://learn.datacamp.com/courses/introduction-to-pyspark. Accessed 3/2020. 131 | 132 | Jonas, Eric, Schleier-Smith, Johann, Sreekanti, Vikram, and Chia-Che Tsai. 2019. "Cloud Programming Simplified: A Berkeley View on Serverless Computing." Technical report, EECS Department, University of California, Berkeley. 133 | 134 | Jorissen, Kevin, and Brendan Bouffler. 2017. *AWS Research Cloud Program: Researcher's Handbook*. Amazon Web Services. 135 | 136 | Kane, Frank. March 23, 2018. ["How to Choose the Right Database? - MongoDB, Cassandra, MySQL, HBase"](https://www.youtube.com/watch?v=v5e_PasMdXc). https://www.youtube.com/watch?v=v5e_PasMdXc 137 | 138 | Karau, Holden, Konwinski, Andy, Wendell, Patrick, and Matei Zaharia. 2015. *Learning Spark*. Sebastopol, CA: O'Reilly. 139 | 140 | Klöckner, Andreas, Pinto, Nicolas, Lee, Yunsup, Catanzaro, Bryan, Ivanov, Paul, and Ahmed Fasih. 2012. "PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation." *Parallel Computing* 38(3): 157-174. 141 | 142 | Linux Academy. July 10, 2019. ["What is a Database?"](https://www.youtube.com/watch?v=Tk1t3WKK-ZY). https://www.youtube.com/watch?v=Tk1t3WKK-ZY 143 | 144 | *Machine Learning with PySpark*. https://campus.datacamp.com/courses/machine-learning-with-pyspark. Accessed 3/2020. 145 | 146 | Martinez, Miguel, and Thomas Graves. "Accelerating Apache Spark by Several Orders of Magnitude with GPUs." https://www.youtube.com/watch?v=Qw-TB6EHmR8. 147 | 148 | "mrjob v0.7.1 documentation."
https://mrjob.readthedocs.io/en/latest/index.html. Accessed 3/2020. 149 | 150 | Narkhede, Neha, Shapira, Gwen, and Todd Palino. 2017. *Kafka: The Definitive Guide*. Sebastopol, CA: O'Reilly. 151 | 152 | “NoSQL Explained.” https://www.mongodb.com/nosql-explained. Accessed 3/2020. 153 | 154 | Pacheco, Peter. 2011. *An Introduction to Parallel Programming*. Burlington, MA: Morgan Kaufmann. 155 | 156 | *Parallel Programming with Dask*. https://learn.datacamp.com/courses/parallel-programming-with-dask-in-python. Accessed 3/2020. 157 | 158 | Petrossian, Tony, and Ian Meyers. November 30, 2017. "Which Database to Use When?" https://youtu.be/KWOSGVtHWqA. AWS re:Invent 2017. 159 | 160 | "RCC User Guide." rcc.uchicago.edu/docs/. Accessed March 2020. 161 | 162 | Robey, Robert and Yuliana Zamora. 2020. *Parallel and High Performance Computing*. Manning Early Access Program. 163 | 164 | Scarpino, Matthew. 2012. *OpenCL in Action*. Shelter Island, NY: Manning. 165 | 166 | Sergeev, Alex. March 28, 2019. "Distributed Deep Learning with Horovod." https://www.youtube.com/watch?v=D1By2hy4Ecw. 167 | 168 | ["Streaming Data Solutions on AWS with Amazon Kinesis." July 2017.](https://d0.awsstatic.com/whitepapers/whitepaper-streaming-data-solutions-on-aws-with-amazon-kinesis.pdf) AWS Whitepaper. 169 | 170 | "What is Amazon EMR." https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html. Accessed 3/2020. 171 | 172 | White, Tom. 2015. *Hadoop: The Definitive Guide*. Sebastopol, CA: O'Reilly. 173 | 174 | "Why Dask." https://docs.dask.org/en/latest/why.html. Accessed 3/2020. 175 | --------------------------------------------------------------------------------