├── .gitignore ├── LICENSE ├── Labs ├── Lab 1 Midway RCC and mpi4py │ ├── midway_cheat_sheet.md │ ├── mpi.sbatch │ ├── mpi_multi_job.sbatch │ └── mpi_rand_walk.py ├── Lab 10 Launching a Scalable API │ ├── api_demo │ │ ├── api.zip │ │ ├── api │ │ │ ├── application.py │ │ │ ├── requirements.txt │ │ │ └── templates │ │ │ │ └── index.html │ │ ├── launch_db.py │ │ └── terminate_db.py │ └── flask_basics │ │ ├── app.py │ │ └── templates │ │ ├── base.html │ │ └── scores.html ├── Lab 11 Visualizing Large Data │ └── 9W_VisualizingLargeData.ipynb ├── Lab 2 PyOpenCL │ ├── Lab_2_PyOpenCL_Random_Walk_Tutorial.ipynb │ ├── gpu.sbatch │ ├── gpu_rand_walk.py │ └── print_gpu_info.py ├── Lab 3 AWS EC2 and PyWren │ ├── Lab_3_PyWren.ipynb │ ├── isbn.txt │ └── pywren_workflow.png ├── Lab 4 Storing and Structuring Large Data │ ├── 5W_large_scale_databases.ipynb │ └── Lab 4 Working with Large Data Sources in S3.ipynb ├── Lab 5 Ingesting and Processing Large-Scale Data │ ├── Part I MapReduce │ │ ├── .mrjob.conf │ │ ├── mapreduce_lab5.py │ │ ├── mrjob_cheatsheet.md │ │ └── sample_us.tsv │ └── Part II Kinesis │ │ ├── Lab 5 Kinesis.ipynb │ │ ├── consumer.py │ │ ├── consumer_feed.png │ │ ├── producer.py │ │ └── simple_kinesis_architecture.png ├── Lab 6 PySpark EDA and ML │ ├── 7M_PySpark_Midway.ipynb │ ├── 7M_PySpark_Midway.py │ ├── Lab_6.ipynb │ ├── Local_Colab_Spark_Setup.ipynb │ ├── ec2_limit_increase.png │ ├── gpu_cluster_instructions.md │ ├── my-bootstrap-action.sh │ ├── my-configurations.json │ └── spark.sbatch ├── Lab 7 Exploring the Larger Spark Ecosystem │ ├── Lab_7_GraphFrames.ipynb │ ├── bootstrap │ ├── edges.csv │ ├── nodes.csv │ ├── spark_nlp.ipynb │ └── spark_streaming_emr.ipynb ├── Lab 8 Parallel Computing with Dask │ ├── Part I - Dask on EMR │ │ ├── Dask on an EMR Cluster.ipynb │ │ ├── bootstrap-dask │ │ └── dask_bootstrap_workflow.md │ └── Part II - Dask on Midway │ │ └── Dask ML on Midway.ipynb └── Lab 9 Accelerating Dask │ ├── 8W_Dask_Numba.ipynb │ ├── 8W_Dask_Rapids.ipynb │ ├── listings_bos.csv │ ├── listings_chi.csv │ ├── listings_sf.csv │ └── test_listings.csv └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | build/* 3 | dist/* 4 | *.aux 5 | *.bbl 6 | *.blg 7 | *.fdb_latexmk 8 | *.idx 9 | *.ilg 10 | *.ind 11 | *.lof 12 | *.log 13 | *.lot 14 | *.out 15 | *.pdfsync 16 | *.synctex.gz 17 | *.toc 18 | *.swp 19 | *.asv 20 | *.nav 21 | *.snm 22 | *.gz 23 | *.bib.bak 24 | *.fls 25 | *.m~ 26 | *.sublime* 27 | .DS_Store 28 | *.dta 29 | *.ipynb_checkpoints* 30 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Jon Clindaniel 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /Labs/Lab 1 Midway RCC and mpi4py/midway_cheat_sheet.md: -------------------------------------------------------------------------------- 1 | # Login, Configuration, and Copying Files to the Midway RCC 2 | See [Midway RCC user guide](https://rcc.uchicago.edu/docs/) for login details specific to your system and additional options. The instructions below assume a Unix-style command line interface. 3 | 4 | ### SSH into Cluster with CNetID and Password 5 | ``` 6 | ssh your_cnet_id@midway2.rcc.uchicago.edu 7 | ``` 8 | Note that you'll now be able to move around your home directory using standard Unix commands (`cd`, `pwd`, `ls`, etc.). If you are on a Windows machine, I (Jon) recommend enabling [Windows Subsystem for Linux and installing Ubuntu 18.04](https://docs.microsoft.com/en-us/windows/wsl/install-win10) to complete all of these tasks. This is what I do, and I find it makes my life a lot easier than having to mentally transition between DOS and Unix commands (or needing to use an additional third-party tool). 9 | 10 | ### SCP files to/from local directory 11 | If you need to transfer files to and from Midway's storage, I find it easiest to just copy files via the `scp` command. Here, I copy a local file `local_file` in my current directory to my home directory on Midway. Then, I copy a file `remote_file` from my Midway home directory back to my current directory on my local machine. If you prefer not to use this command line approach, there are tutorials in the Midway documentation with [alternative approaches](https://rcc.uchicago.edu/docs/data-transfer/index.html). 12 | 13 | ``` 14 | scp local_file your_cnet_id@midway2.rcc.uchicago.edu: 15 | ``` 16 | ``` 17 | scp your_cnet_id@midway2.rcc.uchicago.edu:remote_file . 18 | ``` 19 | 20 | To copy an entire directory, use the `-r` flag: 21 | ``` 22 | scp -r your_cnet_id@midway2.rcc.uchicago.edu:remote_directory . 23 | ``` 24 | 25 | ### Clone GitHub Repository 26 | Another good option is to clone a GitHub repository (for instance, this public course repository, or your personal fork of it) to your home directory on Midway and access files/code this way. You can then pull updates from the course repository as new material is added. 27 | 28 | ``` 29 | git clone https://github.com/jonclindaniel/LargeScaleComputing_S21.git 30 | ``` 31 | 32 | ### Edit Text Files on the Command Line 33 | If you need to make any adjustments to text files before/after running them (or create new ones) on Midway, you can do so on the command line with text editing tools such as `nano`, `vim`, etc. 
34 | ``` 35 | nano mpi.sbatch 36 | ``` 37 | 38 | # Loading Modules and Installing Programs 39 | 40 | ### Load appropriate modules for Python, MPI, and OpenCL development 41 | ``` 42 | module load cuda 43 | module load python/anaconda-2019.03 44 | module load intelmpi/2018.2.199+intel-18.0 45 | ``` 46 | 47 | ### Install additional Python packages to local user directory from login node 48 | To install packages that are not already included in a module (such as mpi4py and PyOpenCL, which we will be using in this class, along with Mako, which is used in PyOpenCL programs), you can use `pip` to install these packages in your local user directory from the login node on Midway. Make sure you have loaded the modules above before you run the commands below, to ensure you install the correct versions of the packages. 49 | 50 | ``` 51 | pip install --user mpi4py 52 | pip install --user pyopencl 53 | pip install --user mako 54 | ``` 55 | 56 | # Running jobs 57 | 58 | Resource sharing and job scheduling are handled by the Slurm workload manager on the Midway RCC. You can see detailed information in the Midway documentation, but Slurm allows you to see which partitions are available at any given time via the `sinfo` command, submit batch jobs via the `sbatch` command (which will allow you to request specific node/interconnect resources for a period of time and run your code on those resources), and schedule interactive sessions via the `sinteractive` command. Listed below are some of the fundamental commands you will need to know for this class. 59 | 60 | ### Run Batch jobs with Slurm scripts 61 | You will use Slurm scripts to request computational resources for a period of time and run your code. These scripts can be run with the `sbatch` command, as demonstrated below. You can check out the sample Slurm scripts in our GitHub course repository for more detail on how these scripts are structured. 62 | 63 | ``` 64 | sbatch mpi.sbatch 65 | sbatch gpu.sbatch 66 | ``` 67 | 68 | Note that if you wrote your sbatch file on a Windows machine, you might need to convert the text into a Unix format for it to run properly on the Midway RCC. To do so, you can install `dos2unix` on WSL via `apt-get install dos2unix` and then run the converter on your sbatch file. 69 | 70 | ``` 71 | dos2unix gpu.sbatch 72 | ``` 73 | 74 | You can monitor the progress of your job (the `sbatch` command will print out your job ID) with: 75 | ``` 76 | squeue --job your_job_id 77 | ``` 78 | 79 | You can also cancel jobs with: 80 | ``` 81 | scancel your_job_id 82 | ``` 83 | 84 | ### Check results of your batch job (assuming output is written to a `.out` file) 85 | In your Slurm script, you will specify a `.out` file where the output of your program will be written. You can download the file to your local machine to look at the results (via `scp`), or you can check the results on the Midway command line with the standard Unix `cat` command. 86 | 87 | ``` 88 | cat gpu.out 89 | ``` 90 | 91 | ### Run interactive jobs 92 | You should not perform intensive computational tasks on the login nodes. Use the `sinteractive` command to set up an interactive session on other Midway nodes if you want to have an interactive command line experience (you can specify exactly which nodes you would like to access; see the documentation for syntax). Here we request 4 cores for 15 minutes. Additionally, you can use the interactive mode to run Jupyter notebooks, which you can view in your browser (see documentation for more details). 
93 | 94 | ``` 95 | sinteractive --time=00:15:00 --ntasks=4 96 | ``` 97 | -------------------------------------------------------------------------------- /Labs/Lab 1 Midway RCC and mpi4py/mpi.sbatch: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #SBATCH --job-name=mpi 4 | #SBATCH --output=mpi.out 5 | #SBATCH --ntasks=4 6 | #SBATCH --partition=broadwl 7 | #SBATCH --constraint=fdr 8 | 9 | # Load Python and MPI modules 10 | module load python/anaconda-2019.03 11 | module load intelmpi/2018.2.199+intel-18.0 12 | 13 | # Run the python program with mpirun. The -n flag is not required; 14 | # mpirun will automatically figure out the best configuration from the 15 | # Slurm environment variables. 16 | mpirun python3 ./mpi_rand_walk.py 17 | -------------------------------------------------------------------------------- /Labs/Lab 1 Midway RCC and mpi4py/mpi_multi_job.sbatch: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #SBATCH --job-name=mpi_multi_job 4 | #SBATCH --ntasks=11 5 | #SBATCH --partition=broadwl 6 | #SBATCH --constraint=fdr 7 | 8 | # Load Python and MPI modules 9 | module load python/anaconda-2019.03 10 | module load intelmpi/2018.2.199+intel-18.0 11 | 12 | # Run the python program with mpirun, using & to run jobs at the same time 13 | mpirun -n 1 python3 ./mpi_rand_walk.py > ./mpi_nprocs1.out & 14 | mpirun -n 10 python3 ./mpi_rand_walk.py > ./mpi_nprocs10.out & 15 | 16 | # Wait until all simultaneous mpiruns are done 17 | wait 18 | -------------------------------------------------------------------------------- /Labs/Lab 1 Midway RCC and mpi4py/mpi_rand_walk.py: -------------------------------------------------------------------------------- 1 | from mpi4py import MPI 2 | import matplotlib.pyplot as plt 3 | import numpy as np 4 | import time 5 | 6 | def sim_rand_walks_parallel(n_runs): 7 | # Get rank of process and overall size of communicator: 8 | comm = MPI.COMM_WORLD 9 | rank = comm.Get_rank() 10 | size = comm.Get_size() 11 | 12 | # Start time: 13 | t0 = time.time() 14 | 15 | # Evenly distribute number of simulation runs across processes 16 | N = int(n_runs / size) 17 | 18 | # Simulate N random walks on each MPI Process and specify as a NumPy Array 19 | r_walks = [] 20 | for i in range(N): 21 | steps = np.random.normal(loc = 0, scale = 1, size = 100) 22 | steps[0] = 0 23 | r_walks.append(100 + np.cumsum(steps)) 24 | r_walks_array = np.array(r_walks) 25 | 26 | # Gather all simulation arrays to buffer of expected size/dtype on rank 0 27 | r_walks_all = None 28 | if rank == 0: 29 | r_walks_all = np.empty([N * size, 100], dtype = 'float') 30 | comm.Gather(sendbuf = r_walks_array, recvbuf = r_walks_all, root = 0) 31 | 32 | # Print/plot simulation results on rank 0 33 | if rank == 0: 34 | # Calculate time elapsed after computing mean and std 35 | average_finish = np.mean(r_walks_all[:, -1]) 36 | std_finish = np.std(r_walks_all[:, -1]) 37 | time_elapsed = time.time() - t0 38 | 39 | # Print time elapsed + simulation results 40 | print("Simulated %d Random Walks in: %f seconds on %d MPI processes" 41 | % (n_runs, time_elapsed, size)) 42 | print("Average final position: %f, Standard Deviation: %f" 43 | % (average_finish, std_finish)) 44 | 45 | # Plot Simulations and save to file 46 | plt.plot(r_walks_all.transpose()) 47 | plt.savefig("r_walk_nprocs%d_nruns%d.png" % (size, n_runs)) 48 | 49 | return 50 | 51 | def main(): 52 | sim_rand_walks_parallel(n_runs = 10000) 53 | 54 | if 
__name__ == '__main__': 54 | main() 55 | 56 | -------------------------------------------------------------------------------- /Labs/Lab 10 Launching a Scalable API/api_demo/api.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jonclindaniel/LargeScaleComputing_S21/1416f4c405dbe386ed8a0238f0eccaa7b087b1ca/Labs/Lab 10 Launching a Scalable API/api_demo/api.zip -------------------------------------------------------------------------------- /Labs/Lab 10 Launching a Scalable API/api_demo/api/application.py: -------------------------------------------------------------------------------- 1 | from flask import Flask, render_template, jsonify 2 | import boto3 3 | 4 | # Create an instance of Flask class (represents our application) 5 | # Pass in name of application's module (__name__ evaluates to current module name) 6 | app = Flask(__name__) 7 | application = app # AWS EB requires it to be called "application" 8 | 9 | # On EC2, boto3 needs to know the region name as well (no local config file) 10 | dynamodb = boto3.resource('dynamodb', region_name='us-east-1') 11 | table = dynamodb.Table('books') 12 | 13 | # Provide a landing page with some documentation on how to use API 14 | @app.route("/") 15 | def home(): 16 | return render_template('index.html') 17 | 18 | # Get items from DynamoDB `books` table 19 | # Can provide additional API functionality with more complicated SQL queries 20 | @app.route("/api/isbn:<isbn>") 21 | def isbn(isbn): 22 | response = table.get_item(Key={'isbn': str(isbn)}) 23 | return jsonify(response['Item']) 24 | 25 | if __name__ == "__main__": 26 | application.run() 27 | -------------------------------------------------------------------------------- /Labs/Lab 10 Launching a Scalable API/api_demo/api/requirements.txt: -------------------------------------------------------------------------------- 1 | flask 2 | boto3 3 | -------------------------------------------------------------------------------- /Labs/Lab 10 Launching a Scalable API/api_demo/api/templates/index.html: -------------------------------------------------------------------------------- 1 | 4 | 5 | 12 | 13 | 14 | 15 |
16 | My Books API 17 | 18 | 19 | This is my books API and here's some documentation on how to use it: 20 | 21 | 22 | Example Query: URL_BASE/api/isbn:043591010 23 |
24 | 25 | -------------------------------------------------------------------------------- /Labs/Lab 10 Launching a Scalable API/api_demo/launch_db.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | 3 | def create_books_table(): 4 | dynamodb = boto3.resource('dynamodb') 5 | 6 | table = dynamodb.create_table( 7 | TableName='books', 8 | KeySchema=[ 9 | { 10 | 'AttributeName': 'isbn', 11 | 'KeyType': 'HASH' 12 | } 13 | ], 14 | AttributeDefinitions=[ 15 | { 16 | 'AttributeName': 'isbn', 17 | 'AttributeType': 'S' 18 | } 19 | ], 20 | ProvisionedThroughput={ 21 | 'ReadCapacityUnits': 1, 22 | 'WriteCapacityUnits': 1 23 | } 24 | ) 25 | 26 | # Wait until AWS confirms that table exists before moving on 27 | table.meta.client.get_waiter('table_exists').wait(TableName='books') 28 | 29 | # Put some data into the table 30 | table.put_item( 31 | Item={ 32 | 'isbn': '043591010', 33 | 'title': "Opening Spaces: An Anthology of Contemporary African Women's Writing", 34 | 'author': 'Yvonne Vera', 35 | 'year': '1999' 36 | } 37 | ) 38 | 39 | table.put_item( 40 | Item={ 41 | 'isbn': '0394721179', 42 | 'title': "African Folktales: Traditional Stories of the Black World", 43 | 'author': 'Roger D. Abrahams', 44 | 'year': '1983' 45 | } 46 | ) 47 | 48 | if __name__ == "__main__": 49 | create_books_table() 50 | print("DynamoDB 'books' table has been created and populated") 51 | -------------------------------------------------------------------------------- /Labs/Lab 10 Launching a Scalable API/api_demo/terminate_db.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | 3 | def delete_books_table(): 4 | dynamodb = boto3.resource('dynamodb') 5 | table = dynamodb.Table('books') 6 | table.delete() 7 | 8 | if __name__ == "__main__": 9 | delete_books_table() 10 | print("DynamoDB 'books' table has been terminated") 11 | -------------------------------------------------------------------------------- /Labs/Lab 10 Launching a Scalable API/flask_basics/app.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Run application with `python app.py` or `flask run` command 3 | in terminal window 4 | ''' 5 | from flask import Flask, render_template 6 | import numpy as np 7 | 8 | # Create an instance of Flask class (represents our application) 9 | # Pass in name of application's module (__name__ evaluates to current module name) 10 | app = Flask(__name__) 11 | 12 | # Define Python functions that will be triggered if we go to defined URL paths 13 | # Anything we `return` is rendered in our browser as HTML by default 14 | @app.route("/") 15 | def hello_world(): 16 | return "
Hello, World!" 17 | 18 | # We're not limited to short HTML phrases or a single URL path: 19 | @app.route("/scores") 20 | def scores(): 21 | student = {'name': 'Jon'} 22 | assignments = [ 23 | { 24 | 'name': 'A1', 25 | 'score': 89 26 | }, 27 | { 28 | 'name': 'A2', 29 | 'score': 82 30 | }, 31 | { 32 | 'name': 'A3', 33 | 'score': 95 34 | } 35 | ] 36 | 37 | # Compute the average score: 38 | avg = np.mean([assignment['score'] for assignment in assignments]) 39 | 40 | # Render results using HTML template (Flask automatically looks 41 | # for HTML templates in app/templates/ directory) 42 | return render_template('scores.html', 43 | student=student, 44 | assignments=assignments, 45 | avg=avg) 46 | 47 | # We can also run arbitrary Python code and return the results 48 | # of our computations: 49 | @app.route("/random") 50 | def random(): 51 | ran = np.random.randint(1, 10) 52 | return "Your random number is {}".format(ran) 53 | 54 | # And return results of computations based on user-defined input: 55 | @app.route("/num_chars/<word>") 56 | def num_chars(word): 57 | return "There are {} characters in {}".format(len(word), word) 58 | 59 | if __name__ == "__main__": 60 | app.run() # allows us to run app via `python app.py` 61 | -------------------------------------------------------------------------------- /Labs/Lab 10 Launching a Scalable API/flask_basics/templates/base.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | {% if student['name'] %} 4 | {{ student['name'] }}'s Scores 5 | {% else %} 6 | MACS 30123 Scores 7 | {% endif %} 8 | 9 | 10 | Home 11 |
12 | {% block content %}{% endblock %} 13 | 14 | 15 | -------------------------------------------------------------------------------- /Labs/Lab 10 Launching a Scalable API/flask_basics/templates/scores.html: -------------------------------------------------------------------------------- 1 | {% extends "base.html" %} 2 | 3 | {% block content %} 4 |
MACS 30123 Scores 5 | 6 | 7 | Assignment 8 | Score 9 | 10 | {% for assignment in assignments %} 11 | 12 | {{ assignment['name'] }} 13 | {{ assignment['score'] }} 14 | 15 | {% endfor %} 16 | 17 | Your average score is {{ avg }}%
18 | {% endblock %} 19 | -------------------------------------------------------------------------------- /Labs/Lab 2 PyOpenCL/gpu.sbatch: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=gpu # job name 3 | #SBATCH --output=gpu.out # output log file 4 | #SBATCH --error=gpu.err # error file 5 | #SBATCH --time=00:05:00 # 5 minutes of wall time 6 | #SBATCH --nodes=1 # 1 GPU node 7 | #SBATCH --partition=gpu2 # GPU2 partition 8 | #SBATCH --ntasks=1 # 1 CPU core to drive GPU 9 | #SBATCH --gres=gpu:1 # Request 1 GPU 10 | 11 | module load cuda 12 | module load python/anaconda-2019.03 13 | 14 | python3 ./print_gpu_info.py 15 | # python3 ./gpu_rand_walk.py 16 | -------------------------------------------------------------------------------- /Labs/Lab 2 PyOpenCL/gpu_rand_walk.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pyopencl as cl 3 | import pyopencl.array as cl_array 4 | import pyopencl.clrandom as clrand 5 | import pyopencl.tools as cltools 6 | from pyopencl.scan import GenericScanKernel 7 | import matplotlib.pyplot as plt 8 | import time 9 | 10 | def sim_rand_walks(n_runs): 11 | # Set up context and command queue 12 | ctx = cl.create_some_context() 13 | queue = cl.CommandQueue(ctx) 14 | 15 | # mem_pool = cltools.MemoryPool(cltools.ImmediateAllocator(queue)) 16 | 17 | t0 = time.time() 18 | 19 | # Generate an array of Normal Random Numbers on GPU of length n_sims*n_steps 20 | n_steps = 100 21 | rand_gen = clrand.PhiloxGenerator(ctx) 22 | ran = rand_gen.normal(queue, (n_runs*n_steps), np.float32, mu=0, sigma=1) 23 | 24 | # Establish boundaries for each simulated walk (i.e. start and end) 25 | # Necessary so that we perform scan only within rand walks and not between 26 | seg_boundaries = [1] + [0]*(n_steps-1) 27 | seg_boundaries = np.array(seg_boundaries, dtype=np.uint8) 28 | seg_boundary_flags = np.tile(seg_boundaries, int(n_runs)) 29 | seg_boundary_flags = cl_array.to_device(queue, seg_boundary_flags) 30 | 31 | # GPU: Define Segmented Scan Kernel, scanning simulations: f(n-1) + f(n) 32 | prefix_sum = GenericScanKernel(ctx, np.float32, 33 | arguments="__global float *ary, __global char *segflags, " 34 | "__global float *out", 35 | input_expr="ary[i]", 36 | scan_expr="across_seg_boundary ? 
b : (a+b)", neutral="0", 37 | is_segment_start_expr="segflags[i]", 38 | output_statement="out[i] = item + 100", 39 | options=[]) 40 | 41 | # Allocate space for result of kernel on device 42 | ''' 43 | Note: use a Memory Pool (commented out above and below) if you're invoking 44 | multiple times to avoid wasting time creating brand new memory areas each 45 | time you invoke the kernel: https://documen.tician.de/pyopencl/tools.html 46 | ''' 47 | # dev_result = cl_array.arange(queue, len(ran), dtype=np.float32, 48 | # allocator=mem_pool) 49 | dev_result = cl_array.empty_like(ran) 50 | 51 | # Enqueue and Run Scan Kernel 52 | prefix_sum(ran, seg_boundary_flags, dev_result) 53 | 54 | # Get results back on CPU to plot and do final calcs, just as in Lab 1 55 | r_walks_all = (dev_result.get() 56 | .reshape(n_runs, n_steps) 57 | .transpose() 58 | ) 59 | 60 | average_finish = np.mean(r_walks_all[-1]) 61 | std_finish = np.std(r_walks_all[-1]) 62 | final_time = time.time() 63 | time_elapsed = final_time - t0 64 | 65 | print("Simulated %d Random Walks in: %f seconds" 66 | % (n_runs, time_elapsed)) 67 | print("Average final position: %f, Standard Deviation: %f" 68 | % (average_finish, std_finish)) 69 | 70 | # Plot Random Walk Paths 71 | ''' 72 | Note: Scan already only starts scanning at the second entry, but for the 73 | sake of the plot, let's set all of our random walk starting positions to 100 74 | and then plot the random walk paths. 75 | ''' 76 | r_walks_all[0] = [100]*n_runs 77 | plt.plot(r_walks_all) 78 | plt.savefig("r_walk_nruns%d_gpu.png" % n_runs) 79 | 80 | return 81 | 82 | def main(): 83 | sim_rand_walks(n_runs = 10000) 84 | 85 | if __name__ == '__main__': 86 | main() 87 | -------------------------------------------------------------------------------- /Labs/Lab 2 PyOpenCL/print_gpu_info.py: -------------------------------------------------------------------------------- 1 | import pyopencl as cl 2 | 3 | def print_device_info() : 4 | print('\n' + '=' * 60 + '\nOpenCL Platforms and Devices') 5 | for platform in cl.get_platforms(): 6 | print('=' * 60) 7 | print('Platform - Name: ' + platform.name) 8 | print('Platform - Vendor: ' + platform.vendor) 9 | print('Platform - Version: ' + platform.version) 10 | print('Platform - Profile: ' + platform.profile) 11 | for device in platform.get_devices(): 12 | print(' ' + '-' * 56) 13 | print(' Device - Name: ' \ 14 | + device.name) 15 | print(' Device - Type: ' \ 16 | + cl.device_type.to_string(device.type)) 17 | print(' Device - Max Clock Speed: {0} Mhz'\ 18 | .format(device.max_clock_frequency)) 19 | print(' Device - Compute Units: {0}'\ 20 | .format(device.max_compute_units)) 21 | print(' Device - Local Memory: {0:.0f} KB'\ 22 | .format(device.local_mem_size/1024.0)) 23 | print(' Device - Constant Memory: {0:.0f} KB'\ 24 | .format(device.max_constant_buffer_size/1024.0)) 25 | print(' Device - Global Memory: {0:.0f} GB'\ 26 | .format(device.global_mem_size/1073741824.0)) 27 | print(' Device - Max Buffer/Image Size: {0:.0f} MB'\ 28 | .format(device.max_mem_alloc_size/1048576.0)) 29 | print(' Device - Max Work Group Size: {0:.0f}'\ 30 | .format(device.max_work_group_size)) 31 | print('\n') 32 | 33 | if __name__ == "__main__": 34 | print_device_info() 35 | -------------------------------------------------------------------------------- /Labs/Lab 3 AWS EC2 and PyWren/isbn.txt: -------------------------------------------------------------------------------- 1 | 0435910108 2 | 1906523371 3 | 0394721179 4 | 0813190762 5 | 1558615008 6 | 190652324X 7 | 
1592211518 8 | 0803211023 9 | 0253211107 10 | 0912469099 11 | 0140100040 12 | 1558615342 13 | 0795701845 14 | 0471380601 15 | 0814781438 16 | 1906523142 17 | 0844212024 18 | 186914001X 19 | 1770091459 20 | 1558614060 21 | 0739105620 22 | 0325070253 23 | 1588264912 24 | 1594606471 25 | 1904456731 26 | 0795701063 27 | 0869809180 28 | 1563976986 29 | 1919931236 30 | 0325002118 31 | 0739105639 32 | 0814781446 33 | 0939691027 34 | 044657922X 35 | 0547241631 36 | 1607886154 37 | 0486285537 38 | 0393930572 39 | 161552164X 40 | 0195092627 41 | 0618982728 42 | 0393929949 43 | 1400500281 44 | 0486296040 45 | 006059537X 46 | 0393979210 47 | 1439181454 48 | 0451528247 49 | 0547241607 50 | 0393326659 51 | 0618155872 52 | 0385000197 53 | 0393927393 54 | 0393979202 55 | 0393977781 56 | 0312463197 57 | 0393061817 58 | 0142437530 59 | 1557837600 60 | 0425234282 61 | 1435114760 62 | 1400030935 63 | 0205655106 64 | 0061626392 65 | 1598530208 66 | 0060540427 67 | 0061728942 68 | 1565841476 69 | 0486457575 70 | 1402221118 71 | 0679745254 72 | 0393930157 73 | 0135145287 74 | 0553262637 75 | 0385044011 76 | 0312421001 77 | 0806645776 78 | 0073384895 79 | 0312452837 80 | 0393927415 81 | 1598530593 82 | 014023778X 83 | 0806627158 84 | 0205763103 85 | 054720180X 86 | 0140170367 87 | 0321484894 88 | 0486272842 89 | 0321012690 90 | 0395765285 91 | 0820334316 92 | 0132216477 93 | 048627165X 94 | 0802150322 95 | 0393930149 96 | 1598530216 97 | 1616836865 98 | 1616823763 99 | 0743299779 100 | 039333645X 101 | 0689808690 102 | 0451529154 103 | 0060882506 104 | 0142001945 105 | 0060927631 106 | 0641974124 107 | 0393927423 108 | 0312474911 109 | 1558850031 110 | 0521540283 111 | 0822220555 112 | 0077239040 113 | 0307268349 114 | 1598530690 115 | 0609808400 116 | 1560255501 117 | 1934633941 118 | 0618197338 119 | 044990508X 120 | 1571312692 121 | 1598530410 122 | 193108209X 123 | 0896086267 124 | 142360699X 125 | 0806639628 126 | 0375761276 127 | 023108028X 128 | 1598530313 129 | 0061174076 130 | 0321113411 131 | 0205779395 132 | 054720194X 133 | 0375415041 134 | 0679767991 135 | 0684823071 136 | 0451529634 137 | 081297509X 138 | 188184708X 139 | 0618197354 140 | 0814775721 141 | 0451527828 142 | 1400077184 143 | 155783749X 144 | 039331698X 145 | 0813538629 146 | 0486294501 147 | 0618751181 148 | 1440151512 149 | 0822220768 150 | 0896086992 151 | 1581825234 152 | 1598530518 153 | 0618379029 154 | 0618427058 155 | 1590170385 156 | 0077239059 157 | 0195144562 158 | 1569472181 159 | 0312412088 160 | 0393318737 161 | 1893996387 162 | 0312115520 163 | 0618983228 164 | 0743299752 165 | 1598530429 166 | 0806628448 167 | 0393040011 168 | 1615580190 169 | 0811844943 170 | 0826217796 171 | 0321116240 172 | 1558490418 173 | 0963290622 174 | 1573833320 175 | 0060924209 176 | 019516251X 177 | 014310506X 178 | 0345395026 179 | 1595580557 180 | 0231109091 181 | 0385423098 182 | 0440378648 183 | 1557837481 184 | 0805211144 185 | 0684868695 186 | 0688171613 187 | 1596911484 188 | 037575881X 189 | 0807131512 190 | 032101149X 191 | 0618256636 192 | 0140390871 193 | 0767918460 194 | 0393930130 195 | 0375755020 196 | 1428262504 197 | 0131937928 198 | 0679736336 199 | 0813545757 200 | 0451528735 201 | 0674027639 202 | 0385491182 203 | 0393318281 204 | 0393927431 205 | 0321088999 206 | 1573241385 207 | 1558853995 208 | 0963290614 209 | 0813533309 210 | 032101006X 211 | 0981519423 212 | 0395926882 213 | 0231112599 214 | 0618902813 215 | 1575257629 216 | 0375703004 217 | 0385422431 218 | 1575255898 219 | 1604330376 220 | 0553375180 221 | 0965834441 
222 | 0375414584 223 | 0807028231 224 | 0874177596 225 | 0822220423 226 | 0451527445 227 | 1557836965 228 | 0321093836 229 | 1555837476 230 | 0195122712 231 | 0786869186 232 | 0877970602 233 | 0618532986 234 | 0791419045 235 | 1572331321 236 | 1555839975 237 | 1575252716 238 | 0814727476 239 | 0415924448 240 | 0374525218 241 | 1405188626 242 | 1604732148 243 | 1880000768 244 | 0313308608 245 | 0804111677 246 | 0345383176 247 | 0345382226 248 | 1931082561 249 | 0140390146 250 | 0820321044 251 | 0918949254 252 | 0393310906 253 | 1400033217 254 | 159853033X 255 | 1589801768 256 | 0321116232 257 | 0631204490 258 | 0813531640 259 | 0816513848 260 | 0786436840 261 | 0312596227 262 | 0072564024 263 | 0812979923 264 | 0231081227 265 | 1575256185 266 | 0253206375 267 | 1575255596 268 | 0295969741 269 | 0881508594 270 | 0140273050 271 | 1570758468 272 | 0060874112 273 | 156584744X 274 | 0811807843 275 | 019955465X 276 | 0982338201 277 | 1934633194 278 | 0940450674 279 | 0813529301 280 | 0673469565 281 | 0140435875 282 | 1590304551 283 | 0375413324 284 | 1594482829 285 | 157525591X 286 | 1416537465 287 | 0060923520 288 | 0140296395 289 | 1579127738 290 | 097662964X 291 | 1887366768 292 | 0743299736 293 | 0743257596 294 | 0814757219 295 | 0743243501 296 | 0520222121 297 | 1598530054 298 | 1597140015 299 | 156368019X 300 | 1887178988 301 | 1558853804 302 | 0816527938 303 | 0691133247 304 | 1931082332 305 | 0195093607 306 | 091478370X 307 | 080712320X 308 | 0688175864 309 | 0393316858 310 | 1439181470 311 | 0395986052 312 | 0078454816 313 | 0195122135 314 | 0618131736 315 | 0914783599 316 | 0230620655 317 | 0809329549 318 | 0486410986 319 | 1551117282 320 | 094045078X 321 | 1563681277 322 | 0312191766 323 | 0874519187 324 | 0385003587 325 | 061890283X 326 | 0809015641 327 | 0470237694 328 | 0743476972 329 | 1559362758 330 | 159853047X 331 | 0802132790 332 | 1929642377 333 | 0195125630 334 | 0786713879 335 | 1593761945 336 | 0820332259 337 | 1598530151 338 | 1416906355 339 | 1560373792 340 | 0945353820 341 | 159714049X 342 | 0226322742 343 | 0195395921 344 | 962634248X 345 | 0816524939 346 | 0595362974 347 | 0813538866 348 | 0870498746 349 | 1931082278 350 | 0976876027 351 | 1400076757 352 | 0879054700 353 | 1555536778 354 | 1880399989 355 | 1933859865 356 | 0631206523 357 | 0743299760 358 | 0813529662 359 | 0920661599 360 | 0743203887 361 | 0967023432 362 | 0761131655 363 | 1557837473 364 | 0816645612 365 | 0395980755 366 | 0932379192 367 | 0072400196 368 | 0979485223 369 | 1879960583 370 | 0813527260 371 | 1931010471 372 | 0813520185 373 | 0073221538 374 | 0618570489 375 | 0873514289 376 | 0673469778 377 | 0446676918 378 | 0879103663 379 | 0061175358 380 | 0130266876 381 | 039304677X 382 | 157525249X 383 | 089672638X 384 | 0813531624 385 | 1403979995 386 | 0978625110 387 | 0872863131 388 | 0815320809 389 | 1590212282 390 | 0230606326 391 | 0873384660 392 | 1559363320 393 | 0979485215 394 | 0395875099 395 | 0321198433 396 | 0573695393 397 | 0842028293 398 | 0936839228 399 | 1573441880 400 | 1558853316 401 | 1932663177 402 | 0393316718 403 | 023105419X 404 | 0940450666 405 | 061853301X 406 | 0060749539 407 | 029598824X 408 | 0674030915 409 | 1555533019 410 | 158648009X 411 | 0807136360 412 | 0070113963 413 | 0415234689 414 | 0252064100 415 | 0874807905 416 | 157525297X 417 | 1575254247 418 | 0814776558 419 | 0684802406 420 | 080213694X 421 | 1598530488 422 | 0873515420 423 | 1888160381 424 | 1555838278 425 | 0813122791 426 | 1932183701 427 | 1575254468 428 | 0671047566 429 | 1598875787 430 | 0910043078 
431 | 1597140376 432 | 189251401X 433 | 0195020588 434 | 0806513497 435 | 0875652530 436 | 0809327635 437 | 0879058323 438 | 1567920780 439 | 0813007267 440 | 1587298716 441 | 0820327751 442 | 1583220224 443 | 0813190665 444 | 155936291X 445 | 188736675X 446 | 0873514378 447 | 1931082766 448 | 0156135396 449 | 1883011760 450 | 089886710X 451 | 0743257588 452 | 0872865010 453 | 0872865282 454 | 0415936829 455 | 0201846705 456 | 1888160438 457 | 082632889X 458 | 0826348181 459 | 0875653421 460 | 1557837120 461 | 1558853464 462 | 0271027215 463 | 0452011744 464 | 0870498762 465 | 0804011079 466 | 0674006038 467 | 0618249338 468 | 0806133678 469 | 0140230262 470 | 0673990176 471 | 1558491112 472 | 0573697558 473 | 1931082499 474 | 0945575939 475 | 1565849868 476 | 1580051588 477 | 0268041342 478 | 0881461830 479 | 0253214920 480 | 0060953225 481 | 1575253860 482 | 1585360953 483 | 0316347108 484 | 1566890012 485 | 1892514117 486 | 1566891418 487 | 0944350615 488 | 1931082529 489 | 0199273154 490 | 0814321429 491 | 0295977469 492 | 0822955601 493 | 0878054790 494 | 0978848934 495 | 0520216849 496 | 0275935671 497 | 0877457956 498 | 159179188X 499 | 1573240370 500 | 1560253789 501 | -------------------------------------------------------------------------------- /Labs/Lab 3 AWS EC2 and PyWren/pywren_workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jonclindaniel/LargeScaleComputing_S21/1416f4c405dbe386ed8a0238f0eccaa7b087b1ca/Labs/Lab 3 AWS EC2 and PyWren/pywren_workflow.png -------------------------------------------------------------------------------- /Labs/Lab 4 Storing and Structuring Large Data/5W_large_scale_databases.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Implementing Large-Scale Database Solutions\n", 8 | "## DynamoDB\n", 9 | "First, let's create a DynamoDB table. Let's say that we're collecting and storing streaming Twitter data in our database. We'll use Twitter 'username' as our primary key here, since this will be unique to each user and will make for a good input for DynamoDB's hash function (you can also specify a sort key if you would like, though). We'll also set our Read and Write Capacity down to the minimum for this demo, but you can [scale this up](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadWriteCapacityMode.html) if you need more throughput for your application." 
10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "outputs": [ 17 | { 18 | "name": "stdout", 19 | "output_type": "stream", 20 | "text": [ 21 | "0\n", 22 | "2021-04-24 12:55:12.129000-05:00\n" 23 | ] 24 | } 25 | ], 26 | "source": [ 27 | "import boto3\n", 28 | "\n", 29 | "dynamodb = boto3.resource('dynamodb')\n", 30 | "\n", 31 | "table = dynamodb.create_table(\n", 32 | " TableName='twitter',\n", 33 | " KeySchema=[\n", 34 | " {\n", 35 | " 'AttributeName': 'username',\n", 36 | " 'KeyType': 'HASH'\n", 37 | " }\n", 38 | " ],\n", 39 | " AttributeDefinitions=[\n", 40 | " {\n", 41 | " 'AttributeName': 'username',\n", 42 | " 'AttributeType': 'S'\n", 43 | " }\n", 44 | " ],\n", 45 | " ProvisionedThroughput={\n", 46 | " 'ReadCapacityUnits': 1,\n", 47 | " 'WriteCapacityUnits': 1\n", 48 | " } \n", 49 | ")\n", 50 | "\n", 51 | "# Wait until AWS confirms that table exists before moving on\n", 52 | "table.meta.client.get_waiter('table_exists').wait(TableName='twitter')\n", 53 | "\n", 54 | "# get data about table (should currently be no items in table)\n", 55 | "print(table.item_count)\n", 56 | "print(table.creation_date_time)" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "OK, so we currently have an empty DynamoDB table. Let's actually put some items into our table:" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 2, 69 | "metadata": {}, 70 | "outputs": [ 71 | { 72 | "data": { 73 | "text/plain": [ 74 | "{'ResponseMetadata': {'RequestId': '0BDCGRSJQE4725IML53BV0RJAJVV4KQNSO5AEMVJF66Q9ASUAAJG',\n", 75 | " 'HTTPStatusCode': 200,\n", 76 | " 'HTTPHeaders': {'server': 'Server',\n", 77 | " 'date': 'Sat, 24 Apr 2021 17:55:32 GMT',\n", 78 | " 'content-type': 'application/x-amz-json-1.0',\n", 79 | " 'content-length': '2',\n", 80 | " 'connection': 'keep-alive',\n", 81 | " 'x-amzn-requestid': '0BDCGRSJQE4725IML53BV0RJAJVV4KQNSO5AEMVJF66Q9ASUAAJG',\n", 82 | " 'x-amz-crc32': '2745614147'},\n", 83 | " 'RetryAttempts': 0}}" 84 | ] 85 | }, 86 | "execution_count": 2, 87 | "metadata": {}, 88 | "output_type": "execute_result" 89 | } 90 | ], 91 | "source": [ 92 | "table.put_item(\n", 93 | " Item={\n", 94 | " 'username': 'macs30123',\n", 95 | " 'num_followers': 100,\n", 96 | " 'num_tweets': 5\n", 97 | " }\n", 98 | ")\n", 99 | "\n", 100 | "table.put_item(\n", 101 | " Item={\n", 102 | " 'username': 'jon_c',\n", 103 | " 'num_followers': 10,\n", 104 | " 'num_tweets': 0\n", 105 | " }\n", 106 | ")" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "We can then easily get items from our table using the `get_item` method and providing our key:" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 3, 119 | "metadata": {}, 120 | "outputs": [ 121 | { 122 | "name": "stdout", 123 | "output_type": "stream", 124 | "text": [ 125 | "{'num_tweets': Decimal('5'), 'num_followers': Decimal('100'), 'username': 'macs30123'}\n" 126 | ] 127 | } 128 | ], 129 | "source": [ 130 | "response = table.get_item(\n", 131 | " Key={\n", 132 | " 'username': 'macs30123'\n", 133 | " }\n", 134 | ")\n", 135 | "item = response['Item']\n", 136 | "print(item)" 137 | ] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "metadata": {}, 142 | "source": [ 143 | "We can also update existing items using the `update_item` method:" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": 4, 149 | "metadata": {}, 150 | "outputs": [ 151 | { 152 | "data": { 153 | 
"text/plain": [ 154 | "{'ResponseMetadata': {'RequestId': '8632M7FSUI78RENPC4O73IIN37VV4KQNSO5AEMVJF66Q9ASUAAJG',\n", 155 | " 'HTTPStatusCode': 200,\n", 156 | " 'HTTPHeaders': {'server': 'Server',\n", 157 | " 'date': 'Sat, 24 Apr 2021 17:55:32 GMT',\n", 158 | " 'content-type': 'application/x-amz-json-1.0',\n", 159 | " 'content-length': '2',\n", 160 | " 'connection': 'keep-alive',\n", 161 | " 'x-amzn-requestid': '8632M7FSUI78RENPC4O73IIN37VV4KQNSO5AEMVJF66Q9ASUAAJG',\n", 162 | " 'x-amz-crc32': '2745614147'},\n", 163 | " 'RetryAttempts': 0}}" 164 | ] 165 | }, 166 | "execution_count": 4, 167 | "metadata": {}, 168 | "output_type": "execute_result" 169 | } 170 | ], 171 | "source": [ 172 | "table.update_item(\n", 173 | " Key={\n", 174 | " 'username': 'macs30123'\n", 175 | " },\n", 176 | " UpdateExpression='SET num_tweets = :val1',\n", 177 | " ExpressionAttributeValues={\n", 178 | " ':val1': 6\n", 179 | " }\n", 180 | ")" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "Then, if we take a look again at this item, we'll see that it's been updated (note, though, that DynamoDB tables are [*eventually consistent* unless we specify otherwise](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadConsistency.html), so this might not always return the expected result immediately):" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": 5, 193 | "metadata": {}, 194 | "outputs": [ 195 | { 196 | "name": "stdout", 197 | "output_type": "stream", 198 | "text": [ 199 | "{'num_followers': Decimal('100'), 'num_tweets': Decimal('6'), 'username': 'macs30123'}\n" 200 | ] 201 | } 202 | ], 203 | "source": [ 204 | "response = table.get_item(\n", 205 | " Key={\n", 206 | " 'username': 'macs30123'\n", 207 | " }\n", 208 | ")\n", 209 | "item = response['Item']\n", 210 | "print(item)" 211 | ] 212 | }, 213 | { 214 | "cell_type": "markdown", 215 | "metadata": {}, 216 | "source": [ 217 | "Note as well, that even though it is not optimal to perform complicated queries in DynamoDB tables, we can write and run SQL-like queries to run again our DynamoDB tables if we want to:" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": 6, 223 | "metadata": {}, 224 | "outputs": [ 225 | { 226 | "name": "stdout", 227 | "output_type": "stream", 228 | "text": [ 229 | "[{'num_followers': Decimal('100'), 'num_tweets': Decimal('6'), 'username': 'macs30123'}]\n" 230 | ] 231 | } 232 | ], 233 | "source": [ 234 | "response = table.meta.client.execute_statement(\n", 235 | " Statement='''\n", 236 | " SELECT *\n", 237 | " FROM twitter\n", 238 | " WHERE num_followers > 20\n", 239 | " '''\n", 240 | ")\n", 241 | "item = response['Items']\n", 242 | "print(item)" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "Finally, you should make sure to delete your table (if you no longer plan to use it), so that you do not incur further charges while it is running:" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": 7, 255 | "metadata": {}, 256 | "outputs": [ 257 | { 258 | "data": { 259 | "text/plain": [ 260 | "{'TableDescription': {'TableName': 'twitter',\n", 261 | " 'TableStatus': 'DELETING',\n", 262 | " 'ProvisionedThroughput': {'NumberOfDecreasesToday': 0,\n", 263 | " 'ReadCapacityUnits': 1,\n", 264 | " 'WriteCapacityUnits': 1},\n", 265 | " 'TableSizeBytes': 0,\n", 266 | " 'ItemCount': 0,\n", 267 | " 'TableArn': 
'arn:aws:dynamodb:us-east-1:009068789081:table/twitter',\n", 268 | " 'TableId': '7bad4d6e-c716-4cf6-a3fa-5fd0738f3258'},\n", 269 | " 'ResponseMetadata': {'RequestId': 'S6NO2AI592GHJMIH01Q6IO594BVV4KQNSO5AEMVJF66Q9ASUAAJG',\n", 270 | " 'HTTPStatusCode': 200,\n", 271 | " 'HTTPHeaders': {'server': 'Server',\n", 272 | " 'date': 'Sat, 24 Apr 2021 17:55:32 GMT',\n", 273 | " 'content-type': 'application/x-amz-json-1.0',\n", 274 | " 'content-length': '316',\n", 275 | " 'connection': 'keep-alive',\n", 276 | " 'x-amzn-requestid': 'S6NO2AI592GHJMIH01Q6IO594BVV4KQNSO5AEMVJF66Q9ASUAAJG',\n", 277 | " 'x-amz-crc32': '2358790140'},\n", 278 | " 'RetryAttempts': 0}}" 279 | ] 280 | }, 281 | "execution_count": 7, 282 | "metadata": {}, 283 | "output_type": "execute_result" 284 | } 285 | ], 286 | "source": [ 287 | "table.delete()" 288 | ] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "metadata": {}, 293 | "source": [ 294 | "## RDS\n", 295 | "\n", 296 | "We can also create and interact with scalable cloud relational databases via `boto3`. Let's launch a MySQL database via AWS's RDS service. Note that we can explicitly scale up the hardware (e.g. instance class, and allocated storage) for our database via the `create_db_instance` parameters. We can also add additional read replicas of our database instance that we launch via [the `create_db_instance_read_replica` method](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/rds.html#RDS.Client.create_db_instance_read_replica) or create a cluster of a certain size from the start using [the `create_db_cluster` method](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/rds.html#RDS.Client.create_db_cluster)." 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": 8, 302 | "metadata": {}, 303 | "outputs": [ 304 | { 305 | "name": "stdout", 306 | "output_type": "stream", 307 | "text": [ 308 | "relational-db is available at relational-db.cr2k4l3gqki8.us-east-1.rds.amazonaws.com on Port 3306\n" 309 | ] 310 | } 311 | ], 312 | "source": [ 313 | "rds = boto3.client('rds')\n", 314 | "\n", 315 | "response = rds.create_db_instance(\n", 316 | " DBInstanceIdentifier='relational-db',\n", 317 | " DBName='twitter',\n", 318 | " MasterUsername='username',\n", 319 | " MasterUserPassword='password',\n", 320 | " DBInstanceClass='db.t2.micro',\n", 321 | " Engine='MySQL',\n", 322 | " AllocatedStorage=5\n", 323 | ")\n", 324 | "\n", 325 | "# Wait until DB is available to continue\n", 326 | "rds.get_waiter('db_instance_available').wait(DBInstanceIdentifier='relational-db')\n", 327 | "\n", 328 | "# Describe where DB is available and on what port\n", 329 | "db = rds.describe_db_instances()['DBInstances'][0]\n", 330 | "ENDPOINT = db['Endpoint']['Address']\n", 331 | "PORT = db['Endpoint']['Port']\n", 332 | "DBID = db['DBInstanceIdentifier']\n", 333 | "\n", 334 | "print(DBID,\n", 335 | " \"is available at\", ENDPOINT,\n", 336 | " \"on Port\", PORT,\n", 337 | " ) " 338 | ] 339 | }, 340 | { 341 | "cell_type": "markdown", 342 | "metadata": {}, 343 | "source": [ 344 | "In order to access our MySQL database, we'll need to adjust some security settings associated with our server, though. By default, we're not able to access port 3306 on our database server over the internet and we will need to change this setting in order to connect to our database from our local machine. 
In practice, you should limit the allowed IP range as much as possible (to your home or office, for example) to avoid intruders from connecting to your databases. For the purposes of this demo, though, I am going to make it possible to connect to my database from anywhere on the internet (IP range 0.0.0.0/0):" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": 9, 350 | "metadata": {}, 351 | "outputs": [ 352 | { 353 | "name": "stdout", 354 | "output_type": "stream", 355 | "text": [ 356 | "Permissions already adjusted.\n" 357 | ] 358 | } 359 | ], 360 | "source": [ 361 | "# Get Name of Security Group\n", 362 | "SGNAME = db['VpcSecurityGroups'][0]['VpcSecurityGroupId']\n", 363 | "\n", 364 | "# Adjust Permissions for that security group so that we can access it on Port 3306\n", 365 | "# If already SG is already adjusted, print this out\n", 366 | "try:\n", 367 | " ec2 = boto3.client('ec2')\n", 368 | " data = ec2.authorize_security_group_ingress(\n", 369 | " GroupId=SGNAME,\n", 370 | " IpPermissions=[\n", 371 | " {'IpProtocol': 'tcp',\n", 372 | " 'FromPort': PORT,\n", 373 | " 'ToPort': PORT,\n", 374 | " 'IpRanges': [{'CidrIp': '0.0.0.0/0'}]}\n", 375 | " ]\n", 376 | " )\n", 377 | "except ec2.exceptions.ClientError as e:\n", 378 | " if e.response[\"Error\"][\"Code\"] == 'InvalidPermission.Duplicate':\n", 379 | " print(\"Permissions already adjusted.\")\n", 380 | " else:\n", 381 | " print(e)" 382 | ] 383 | }, 384 | { 385 | "cell_type": "markdown", 386 | "metadata": {}, 387 | "source": [ 388 | "Alright, we're ready to connect to our database! This is a MySQL database, so let's install a Python package that will allow us to effectively handle this connection:" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": 10, 394 | "metadata": {}, 395 | "outputs": [ 396 | { 397 | "name": "stdout", 398 | "output_type": "stream", 399 | "text": [ 400 | "Requirement already satisfied: mysql-connector-python in ./anaconda3/envs/macs30123/lib/python3.7/site-packages (8.0.24)\r\n", 401 | "Requirement already satisfied: protobuf>=3.0.0 in ./anaconda3/envs/macs30123/lib/python3.7/site-packages (from mysql-connector-python) (3.14.0)\r\n", 402 | "Requirement already satisfied: six>=1.9 in ./anaconda3/envs/macs30123/lib/python3.7/site-packages (from protobuf>=3.0.0->mysql-connector-python) (1.15.0)\r\n" 403 | ] 404 | } 405 | ], 406 | "source": [ 407 | "! pip install mysql-connector-python # Install mysql-connector if you haven't already" 408 | ] 409 | }, 410 | { 411 | "cell_type": "markdown", 412 | "metadata": {}, 413 | "source": [ 414 | "Then, we can just connect to the database and run queries in the same way that you have seen while working with SQLite databases in CS 122 (using the SQLite3 package). Very cool!" 
415 | ] 416 | }, 417 | { 418 | "cell_type": "code", 419 | "execution_count": 11, 420 | "metadata": {}, 421 | "outputs": [], 422 | "source": [ 423 | "import mysql.connector\n", 424 | "conn = mysql.connector.connect(host=ENDPOINT, user=\"username\", passwd=\"password\", port=PORT, database='twitter')\n", 425 | "cur = conn.cursor()" 426 | ] 427 | }, 428 | { 429 | "cell_type": "code", 430 | "execution_count": 12, 431 | "metadata": {}, 432 | "outputs": [], 433 | "source": [ 434 | "create_table = '''\n", 435 | " CREATE TABLE IF NOT EXISTS users (\n", 436 | " username VARCHAR(10),\n", 437 | " num_followers INT,\n", 438 | " num_tweets INT,\n", 439 | " PRIMARY KEY (username)\n", 440 | " )\n", 441 | " '''\n", 442 | "insert_data = '''\n", 443 | " INSERT INTO users (username, num_followers, num_tweets)\n", 444 | " VALUES \n", 445 | " ('macs30123', 100, 5),\n", 446 | " ('jon_c', 10, 0)\n", 447 | " '''\n", 448 | "\n", 449 | "for op in [create_table, insert_data]:\n", 450 | " cur.execute(op)" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": {}, 456 | "source": [ 457 | "Our relational database is optimized for performing small, fast queries like these and will tend to out-perform our DynamoDB table at these kinds of operations:" 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": 13, 463 | "metadata": {}, 464 | "outputs": [ 465 | { 466 | "name": "stdout", 467 | "output_type": "stream", 468 | "text": [ 469 | "[('jon_c', 10, 0), ('macs30123', 100, 5)]\n" 470 | ] 471 | } 472 | ], 473 | "source": [ 474 | "cur.execute('''SELECT * FROM users''')\n", 475 | "query_results = cur.fetchall()\n", 476 | "print(query_results)" 477 | ] 478 | }, 479 | { 480 | "cell_type": "code", 481 | "execution_count": 14, 482 | "metadata": {}, 483 | "outputs": [ 484 | { 485 | "name": "stdout", 486 | "output_type": "stream", 487 | "text": [ 488 | "[('macs30123',)]\n" 489 | ] 490 | } 491 | ], 492 | "source": [ 493 | "cur.execute('''SELECT username FROM users WHERE num_followers > 20''')\n", 494 | "query_results = cur.fetchall()\n", 495 | "print(query_results)" 496 | ] 497 | }, 498 | { 499 | "cell_type": "markdown", 500 | "metadata": {}, 501 | "source": [ 502 | "Once we're done executing SQL queries on our MySQL database, we can close our connection to the database and delete the database on AWS so that we're no longer charged for it:" 503 | ] 504 | }, 505 | { 506 | "cell_type": "code", 507 | "execution_count": 15, 508 | "metadata": {}, 509 | "outputs": [ 510 | { 511 | "name": "stdout", 512 | "output_type": "stream", 513 | "text": [ 514 | "deleting\n", 515 | "RDS Database has been deleted\n" 516 | ] 517 | } 518 | ], 519 | "source": [ 520 | "conn.close()\n", 521 | "response = rds.delete_db_instance(DBInstanceIdentifier='relational-db',\n", 522 | " SkipFinalSnapshot=True\n", 523 | " )\n", 524 | "print(response['DBInstance']['DBInstanceStatus'])\n", 525 | "\n", 526 | "# wait until DB is deleted before proceeding\n", 527 | "rds.get_waiter('db_instance_deleted').wait(DBInstanceIdentifier='relational-db')\n", 528 | "print(\"RDS Database has been deleted\")" 529 | ] 530 | }, 531 | { 532 | "cell_type": "markdown", 533 | "metadata": {}, 534 | "source": [ 535 | "# Data Warehousing with Redshift\n", 536 | "\n", 537 | "When you need to run especially big queries against large datasets, it can make sense to perform these in a Data Warehouse like AWS Redshift. 
Recall that Redshift clusters organize our data in columnar storage (instead of rows, like a standard relational database) and can efficiently perform operations on these columns in parallel.\n", 538 | "\n", 539 | "Let's spin up a Redshift cluster to see how this works (for our small Twitter demonstration data). Notice that we do need to provide the particular type of hardware that we want each one of our nodes to be, as well as the number of nodes that we want to include in our cluster (we can increase this for greater parallelism and storage capacity). For this demo, let's just select a two of one of the smaller nodes." 540 | ] 541 | }, 542 | { 543 | "cell_type": "code", 544 | "execution_count": 16, 545 | "metadata": {}, 546 | "outputs": [ 547 | { 548 | "name": "stdout", 549 | "output_type": "stream", 550 | "text": [ 551 | "mycluster is available at mycluster.cvbbnglvrhqs.us-east-1.redshift.amazonaws.com on Port 5439\n" 552 | ] 553 | } 554 | ], 555 | "source": [ 556 | "redshift = boto3.client('redshift')\n", 557 | "\n", 558 | "response = redshift.create_cluster(\n", 559 | " ClusterIdentifier='myCluster',\n", 560 | " DBName='twitter',\n", 561 | " NodeType='dc1.large',\n", 562 | " NumberOfNodes=2,\n", 563 | " MasterUsername='username',\n", 564 | " MasterUserPassword='Password123'\n", 565 | ")\n", 566 | "\n", 567 | "# Wait until cluster is available before proceeding\n", 568 | "redshift.get_waiter('cluster_available').wait(ClusterIdentifier='myCluster')\n", 569 | "\n", 570 | "# Describe where cluster is available and on what port\n", 571 | "cluster = redshift.describe_clusters(ClusterIdentifier='myCluster')['Clusters'][0]\n", 572 | "ENDPOINT = cluster['Endpoint']['Address']\n", 573 | "PORT = cluster['Endpoint']['Port']\n", 574 | "CLUSTERID = cluster['ClusterIdentifier']\n", 575 | "\n", 576 | "print(CLUSTERID,\n", 577 | " \"is available at\", ENDPOINT,\n", 578 | " \"on Port\", PORT,\n", 579 | " )" 580 | ] 581 | }, 582 | { 583 | "cell_type": "markdown", 584 | "metadata": {}, 585 | "source": [ 586 | "Again, we'll need to make sure that we can connect with our cluster from our local machine. For the purposes of this demo, we'll open the port up to the Internet (although, again, you should only allow a narrow IP range in your own applications)." 
587 | ] 588 | }, 589 | { 590 | "cell_type": "code", 591 | "execution_count": 17, 592 | "metadata": {}, 593 | "outputs": [ 594 | { 595 | "name": "stdout", 596 | "output_type": "stream", 597 | "text": [ 598 | "Permissions already adjusted.\n" 599 | ] 600 | } 601 | ], 602 | "source": [ 603 | "# Get Name of Security Group\n", 604 | "SGNAME = cluster['VpcSecurityGroups'][0]['VpcSecurityGroupId']\n", 605 | "\n", 606 | "# Adjust Permissions for that security group so that we can access it on Port 3306\n", 607 | "# If already SG is already adjusted, print this out\n", 608 | "try:\n", 609 | " ec2 = boto3.client('ec2')\n", 610 | " data = ec2.authorize_security_group_ingress(\n", 611 | " GroupId=SGNAME,\n", 612 | " IpPermissions=[\n", 613 | " {'IpProtocol': 'tcp',\n", 614 | " 'FromPort': PORT,\n", 615 | " 'ToPort': PORT,\n", 616 | " 'IpRanges': [{'CidrIp': '0.0.0.0/0'}]}\n", 617 | " ]\n", 618 | " )\n", 619 | "except ec2.exceptions.ClientError as e:\n", 620 | " if e.response[\"Error\"][\"Code\"] == 'InvalidPermission.Duplicate':\n", 621 | " print(\"Permissions already adjusted.\")\n", 622 | " else:\n", 623 | " print(e)" 624 | ] 625 | }, 626 | { 627 | "cell_type": "markdown", 628 | "metadata": {}, 629 | "source": [ 630 | "Redshift was originally forked from PostgreSQL, so the best way to connect with it is via a PostgreSQL Python adaptor (rather than the MySQL adaptor we used previously). We'll use `psycopg2` here." 631 | ] 632 | }, 633 | { 634 | "cell_type": "code", 635 | "execution_count": 18, 636 | "metadata": {}, 637 | "outputs": [ 638 | { 639 | "name": "stdout", 640 | "output_type": "stream", 641 | "text": [ 642 | "Requirement already satisfied: psycopg2 in ./anaconda3/envs/macs30123/lib/python3.7/site-packages (2.7.7)\r\n" 643 | ] 644 | } 645 | ], 646 | "source": [ 647 | "! pip install psycopg2" 648 | ] 649 | }, 650 | { 651 | "cell_type": "markdown", 652 | "metadata": {}, 653 | "source": [ 654 | "Note that once we import the package and connect, we can use the same workflow that we used for our MySQL database (and our local SQLite databases in CS 122) to execute SQL queries:" 655 | ] 656 | }, 657 | { 658 | "cell_type": "code", 659 | "execution_count": 19, 660 | "metadata": {}, 661 | "outputs": [ 662 | { 663 | "name": "stderr", 664 | "output_type": "stream", 665 | "text": [ 666 | "/home/jclindaniel/anaconda3/envs/macs30123/lib/python3.7/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use \"pip install psycopg2-binary\" instead. 
For details see: .\n", 667 | " \"\"\")\n" 668 | ] 669 | } 670 | ], 671 | "source": [ 672 | "import psycopg2\n", 673 | "conn = psycopg2.connect(dbname='twitter', host=ENDPOINT, user=\"username\", password=\"Password123\", port=PORT)\n", 674 | "cur = conn.cursor()\n", 675 | "\n", 676 | "for op in [create_table, insert_data]:\n", 677 | " cur.execute(op)" 678 | ] 679 | }, 680 | { 681 | "cell_type": "code", 682 | "execution_count": 20, 683 | "metadata": {}, 684 | "outputs": [ 685 | { 686 | "name": "stdout", 687 | "output_type": "stream", 688 | "text": [ 689 | "[('macs30123', 100, 5), ('jon_c', 10, 0)]\n" 690 | ] 691 | } 692 | ], 693 | "source": [ 694 | "cur.execute('''SELECT * FROM users''')\n", 695 | "query_results = cur.fetchall()\n", 696 | "print(query_results)" 697 | ] 698 | }, 699 | { 700 | "cell_type": "code", 701 | "execution_count": 21, 702 | "metadata": {}, 703 | "outputs": [ 704 | { 705 | "name": "stdout", 706 | "output_type": "stream", 707 | "text": [ 708 | "[('macs30123',)]\n" 709 | ] 710 | } 711 | ], 712 | "source": [ 713 | "cur.execute('''SELECT username FROM users WHERE num_followers > 20''')\n", 714 | "query_results = cur.fetchall()\n", 715 | "print(query_results)" 716 | ] 717 | }, 718 | { 719 | "cell_type": "markdown", 720 | "metadata": {}, 721 | "source": [ 722 | "Then, once we're done, we can close our connection and delete our Redshift cluster in the same way as our RDS instance:" 723 | ] 724 | }, 725 | { 726 | "cell_type": "code", 727 | "execution_count": 22, 728 | "metadata": {}, 729 | "outputs": [ 730 | { 731 | "name": "stdout", 732 | "output_type": "stream", 733 | "text": [ 734 | "deleting\n", 735 | "Redshift Cluster has been deleted\n" 736 | ] 737 | } 738 | ], 739 | "source": [ 740 | "conn.close()\n", 741 | "response = redshift.delete_cluster(ClusterIdentifier='myCluster',\n", 742 | " SkipFinalClusterSnapshot=True\n", 743 | " )\n", 744 | "print(response['Cluster']['ClusterStatus'])\n", 745 | "\n", 746 | "redshift.get_waiter('cluster_deleted').wait(ClusterIdentifier='myCluster')\n", 747 | "print(\"Redshift Cluster has been deleted\")" 748 | ] 749 | } 750 | ], 751 | "metadata": { 752 | "@webio": { 753 | "lastCommId": null, 754 | "lastKernelId": null 755 | }, 756 | "kernelspec": { 757 | "display_name": "Python 3", 758 | "language": "python", 759 | "name": "python3" 760 | }, 761 | "language_info": { 762 | "codemirror_mode": { 763 | "name": "ipython", 764 | "version": 3 765 | }, 766 | "file_extension": ".py", 767 | "mimetype": "text/x-python", 768 | "name": "python", 769 | "nbconvert_exporter": "python", 770 | "pygments_lexer": "ipython3", 771 | "version": "3.7.10" 772 | } 773 | }, 774 | "nbformat": 4, 775 | "nbformat_minor": 4 776 | } 777 | -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part I MapReduce/.mrjob.conf: -------------------------------------------------------------------------------- 1 | runners: 2 | emr: 3 | # Specify a pem key to start up an EMR cluster on your behalf 4 | ec2_key_pair: MACS_30123 5 | ec2_key_pair_file: ~/MACS_30123.pem 6 | 7 | # Specify type/# of EC2 instances you want your code to run on 8 | core_instance_type: m5.xlarge 9 | num_core_instances: 3 10 | region: us-east-1 11 | 12 | # if cluster idles longer than 60 minutes, terminate the cluster 13 | max_mins_idle: 60.0 14 | 15 | # to read from/write to S3; note colons instead of "=": 16 | aws_access_key_id: 17 | aws_secret_access_key: 18 | aws_session_token: 19 | 
-------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part I MapReduce/mapreduce_lab5.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Lab 5 (Part I): Batch Processing Data with MapReduce 3 | 4 | What's the most used word in 5-star customer reviews on Amazon? 5 | 6 | We can answer this question using the mrjob package to investigate customer 7 | reviews available as a part of Amazon's public S3 customer reviews dataset. 8 | 9 | For this demo, we'll use a small sample of this 100m+ review dataset that 10 | Amazon provides (s3://amazon-reviews-pds/tsv/sample_us.tsv). 11 | 12 | In order to run the code below, be sure to `pip install mrjob` if you have not 13 | done so already. 14 | ''' 15 | 16 | from mrjob.job import MRJob 17 | from mrjob.step import MRStep 18 | import re 19 | 20 | WORD_RE = re.compile(r"[\w']+") 21 | 22 | class MRMostUsedWord(MRJob): 23 | 24 | def mapper_get_words(self, _, row): 25 | ''' 26 | If a review's star rating is 5, yield all of the words in the review 27 | ''' 28 | data = row.split('\t') 29 | if data[7] == '5': 30 | for word in WORD_RE.findall(data[13]): 31 | yield (word.lower(), 1) 32 | 33 | def combiner_count_words(self, word, counts): 34 | ''' 35 | Sum all of the words available so far 36 | ''' 37 | yield (word, sum(counts)) 38 | 39 | def reducer_count_words(self, word, counts): 40 | ''' 41 | Arrive at a total count for each word in the 5 star reviews 42 | ''' 43 | yield None, (sum(counts), word) 44 | 45 | # discard the key; it is just None 46 | def reducer_find_max_word(self, _, word_count_pairs): 47 | ''' 48 | Yield the word that occurs the most in the 5 star reviews 49 | ''' 50 | yield max(word_count_pairs) 51 | 52 | def steps(self): 53 | return [ 54 | MRStep(mapper=self.mapper_get_words, 55 | combiner=self.combiner_count_words, 56 | reducer=self.reducer_count_words), 57 | MRStep(reducer=self.reducer_find_max_word) 58 | ] 59 | 60 | if __name__ == '__main__': 61 | MRMostUsedWord.run() 62 | -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part I MapReduce/mrjob_cheatsheet.md: -------------------------------------------------------------------------------- 1 | # mrjob Cheat Sheet 2 | 3 | To run/debug `mrjob` code locally from your command line: 4 | 5 | ``` 6 | python mapreduce_lab5.py sample_us.tsv 7 | ``` 8 | 9 | To run your `mrjob` code on an AWS EMR cluster, you should first ensure that your configuration file is set with your EC2 pem file name and file location, as well as your current credentials from AWS Education/Vocareum. Note that the credentials are listed with ":" here and not "=" as they are in your `credentials` file. `mrjob` assumes that this (`.mrjob.conf`) file will be located in your home directory (at `~/.mrjob.conf`), so you will need to put the file there. Otherwise, you will need to designate your configuration [as a command line option](https://mrjob.readthedocs.io/en/latest/cmd.html#create-cluster) when you start your `mrjob` job using the `-c` flag. 
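For example, if your configuration file lives somewhere other than `~/.mrjob.conf`, you can point `mrjob` at it explicitly (the path below is a placeholder for wherever you keep the file):

```
python mapreduce_lab5.py -r emr sample_us.tsv -c /path/to/your/.mrjob.conf
```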
10 | 11 | ~/.mrjob.conf 12 | ``` 13 | runners: 14 | emr: 15 | # Specify a pem key to start up an EMR cluster on your behalf 16 | ec2_key_pair: MACS_30123 17 | ec2_key_pair_file: ~/MACS_30123.pem 18 | 19 | # Specify type/# of EC2 instances you want your code to run on 20 | core_instance_type: m5.xlarge 21 | num_core_instances: 3 22 | region: us-east-1 23 | 24 | # to read from/write to S3; note ":" instead of "=" from `credentials`: 25 | aws_access_key_id: 26 | aws_secret_access_key: 27 | aws_session_token: 28 | ``` 29 | 30 | To run your `mrjob` code on an EMR cluster (of the size and type specified in your configuration file), you can run the following command on the command line (Note: before running this code, ensure that you have created the default IAM "EMR roles" for your cluster via the AWS console by following the instructions in the lab video for today of starting up and terminating an EMR cluster). Note that this command will start up a cluster, run your job, and then automatically terminate the cluster for you. 31 | 32 | ``` 33 | python mapreduce_lab5.py -r emr sample_us.tsv 34 | ``` 35 | 36 | To create a stand-alone cluster that you can use to run multiple jobs on, you can run: 37 | ``` 38 | mrjob create-cluster 39 | ``` 40 | 41 | When you create a cluster, `mrjob` will print out the ID number for the cluster ("j" followed by a bunch of numbers and characters). If you copy this job-id number and specify it after the `--cluster-id` flag on the command line, your code will be run on this already-running cluster. Here, we additionally write the results of our job out to a text file (`mr.out`) via the `> mr.out` addendum at the end of the line. 42 | 43 | ``` 44 | python mapreduce_lab5.py -r emr sample_us.tsv --cluster-id= > mr.out 45 | ``` 46 | 47 | When you're done running jobs on your cluster, you can terminate the cluster with the following command so that you don't have to pay for it any longer than you need to: 48 | 49 | ``` 50 | mrjob terminate-cluster 51 | ``` 52 | 53 | For additional configuration options, consult the [`mrjob` documentation](https://mrjob.readthedocs.io/en/latest/index.html). 54 | -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part I MapReduce/sample_us.tsv: -------------------------------------------------------------------------------- 1 | marketplace customer_id review_id product_id product_parent product_title product_category star_rating helpful_votes total_votes vine verified_purchase review_headline review_body review_date 2 | US 18778586 RDIJS7QYB6XNR B00EDBY7X8 122952789 Monopoly Junior Board Game Toys 5 0 0 N Y Five Stars Excellent!!! 2015-08-31 3 | US 24769659 R36ED1U38IELG8 B00D7JFOPC 952062646 56 Pieces of Wooden Train Track Compatible with All Major Train Brands Toys 5 0 0 N Y Good quality track at excellent price Great quality wooden track (better than some others we have tried). Perfect match to the various vintages of Thomas track that we already have. There is enough track here to have fun and get creative incorporating your key pieces with track splits, loops and bends. 2015-08-31 4 | US 44331596 R1UE3RPRGCOLD B002LHA74O 818126353 Super Jumbo Playing Cards by S&S Worldwide Toys 2 1 1 N Y Two Stars Cards are not as big as pictured. 2015-08-31 5 | US 23310293 R298788GS6I901 B00ARPLCGY 261944918 Barbie Doll and Fashions Barbie Gift Set Toys 5 0 0 N Y my daughter loved it and i liked the price and it came ... 
my daughter loved it and i liked the price and it came to me rather than shopping with a ton of people around me. Amazon is the Best way to shop! 2015-08-31 6 | US 38745832 RNX4EXOBBPN5 B00UZOPOFW 717410439 Emazing Lights eLite Flow Glow Sticks - Spinning Light LED Toy Toys 1 1 1 N Y DONT BUY THESE! Do not buy these! They break very fast I spun then for 15 minutes and the end flew off don't waste your money. They are made from cheap plastic and have cracks in them. Buy the poi balls they work a lot better if you only have limited funds. 2015-08-31 7 | US 13394189 R3BPETL222LMIM B009B7F6CA 873028700 Melissa & Doug Water Wow Coloring Book - Vehicles Toys 5 0 0 N Y Five Stars Great item. Pictures pop thru and add detail as "painted." Pictures dry and it can be repainted. 2015-08-31 8 | US 2749569 R3SORMPJZO3F2J B0101EHRSM 723424342 Big Bang Cosmic Pegasus (Pegasis) Metal 4D High Performance Generic Battling Top BB-105 Toys 3 2 2 N Y Three Stars To keep together, had to use crazy glue. 2015-08-31 9 | US 41137196 R2RDOJQ0WBZCF6 B00407S11Y 383363775 Fun Express Insect Finger Puppets 12ct Toy Toys 5 0 0 N Y Five Stars I was pleased with the product. 2015-08-31 10 | US 433677 R2B8VBEPB4YEZ7 B00FGPU7U2 780517568 Fisher-Price Octonauts Shellington's On-The-Go Pod Toy Toys 5 0 0 N Y Five Stars Children like it 2015-08-31 11 | US 1297934 R1CB783I7B0U52 B0013OY0S0 269360126 Claw Climber Goliath/ Disney's Gargoyles Toys 1 0 1 N Y Shame on the seller !!! Showed up not how it's shown . Was someone's old toy. with paint on it. 2015-08-31 12 | US 52006292 R2D90RQQ3V8LH B00519PJTW 493486387 100 Foot Multicolor Pennant Banner Toys 5 0 0 N Y Five Stars Really liked these. They were a little larger than I thought, but still fun. 2015-08-31 13 | US 32071052 R1Y4ZOUGFMJ327 B001TCY2DO 459122467 Pig Jumbo Foil Balloon Toys 5 0 0 N Y Nice huge balloon Nice huge balloon! Had my local grocery store fill it up for a very small fee, it was totally worth it! 2015-08-31 14 | US 7360347 R2BUV9QJI2A00X B00DOQCWF8 226984155 Minecraft Animal Toy (6-Pack) Toys 5 0 1 N Y Five Stars Great deal 2015-08-31 15 | US 11613707 RSUHRJFJIRB3Z B004C04I4I 375659886 Disney Baby: Eeyore Large Plush Toys 4 0 0 N Y Four Stars As Advertised 2015-08-31 16 | US 13545982 R1T96CG98BBA15 B00NWGEKBY 933734136 Team Losi 8IGHT-E RTR AVC Electric 4WD Buggy Vehicle (1/8 Scale) Toys 3 2 4 N Y ... servo so expect to spend 150 more on a good servo immediately be the stock one breaks right Comes w a 15$ servo so expect to spend 150 more on a good servo immediately be the stock one breaks right away 2015-08-31 17 | US 43880421 R2ATXF4QQ30YW B00000JS5S 341842639 Hot Wheels 48- Car storage Case With Easy Grip Carrying Case Toys 5 0 0 N Y Five Stars awesome ! Thanks! 2015-08-31 18 | US 1662075 R1YS3DS218NNMD B00XPWXYDK 210135375 ZuZo 2.4GHz 4 CH 6 Axis Gyro RC Quadcopter Drone with Camera & LED Lights, 38 x 38 x 7cm Toys 5 4 4 N N The closest relevance I have to items like these is while in the army I was trained ... I got this item for me and my son to play around with. The closest relevance I have to items like these is while in the army I was trained in the camera rc bots. This thing is awesome we tested the range and got somewhere close to 50 yards without an issue. Getting the controls is a bit tricky at first but after about twenty minutes you get the feel for it. The drone comes just about fly ready you just have to sync the controller. I am definitely a fan of the drones now. 
Only concern I have is maybe a little more silent but other than that great buy.

*Disclaimer I received this product at a discount for my unbiased review. 2015-08-31 19 | US 18461411 R2SDXLTLF92O0H B00VPXX92W 705054378 Teenage Mutant Ninja Turtles T-Machines Tiger Claw in Safari Truck Diecast Vehicle Toys 5 0 0 N Y Five Stars It was a birthday present for my grandson and he LOVES IT!! 2015-08-31 20 | US 27225859 R4R337CCDWLNG B00YRA3H4U 223420727 Franklin Sports MLB Fold Away Batting Tee Toys 3 0 1 Y N Got wrong product in the shipment Got a wrong product from Amazon Vine and unable to provide a good review. We received a pair of cute girls gloves and a baseball ball instead, while we were expecting a boys batting tee. The gloves are cute, however made for at least 6+ yrs or above...more likely 8-9 yrs old girls.

Can't provide a fair review as we were not able to use the product. 2015-08-31 21 | US 20494593 R32Z6UA4S5Q630 B009T8BSQY 787701676 Alien Frontiers: Factions Toys 1 0 0 N Y Overpriced. You need expansion packs 3-5 if you want access to the player aids for the Factions expansion. The base game of Alien Frontiers just plays so much smoother than adding Factions with the expansion packs. All this will do is pigeonhole you into a certain path to victory. 2015-08-31 22 | US 6762003 R1H1HOVB44808I B00PXWS1CY 996611871 Holy Stone F180C Mini RC Quadcopter Drone with Camera 2.4GHz 6-Axis Gyro Bonus Battery and 8 Blades Toys 5 1 1 N N Five Stars Awesome customer service and a cool little drone! Especially for the price! 2015-08-31 23 | US 25402244 R4UVQIRZ5T1FM 1591749352 741582499 Klutz Sticker Design Studio: Create Your Own Custom Stickers Craft Kit Toys 4 1 2 N Y Great product for little girls! I got these for my daughters for plane trip. I liked that the zipper pouch was attached for markers. However, that pouch fell off. But the girls have loved coloring their own stickers. Would def buy this again. 2015-08-31 24 | US 32910511 R226K8IJLRPTIR B00V5DM3RE 587799706 Yoga Joes - Green Army Men Toys Toys 5 0 1 N Y Creative and fun! My girlfriend and I are both into yoga and I gave her a set of the Yoga Joes for her new home yoga room. When she saw them, she was impressed that I had found little green army men like her brother used to play with. Then she realized they were doing yoga and she almost exploded with delight. You should have seen the look on her face. Needless to say, the gift was a huge hit. They are absolutely brilliant! 2015-08-31 25 | US 18206299 R3031Q42BKAN7J B00UMSVHD4 135383196 Lalaloopsy Girls Basic Doll- Prairie Dusty Trails Toys 4 1 1 N N i like it but i absoloutely hate that some dolls don't ... i like it but i absoloutely hate that some dolls don't have pets like this one so I'm not stoked and i really would have liked to see her pet 2015-08-31 26 | US 26599182 R44NP0QG6E98W B00JLKI69W 375626298 WOW Toys Town Advent Calendar Toys 3 1 1 N Y We love how well they are made We have MANY Wow toys in our home. We love how well they are made. The advent calendar is an exception. The plastic is thinner than our other wow toys and the barn animals won't even stand up on their own due to the head weighing more than the body and uneven base of the toy. Very disappointing when we love the concept. The story to read along with everyday is great. I would have preferred quality (and would have paid more for it) instead of a cheap "knock-off" of the rest of their toy line. It is very obvious by the weight of the toys sloppy paint jobs which items in our "Wow Toys" bin are part advent calendar vs. the rest of the toys we have. Also there is a lot of overlap between the wow town advent calendar & winter wonderland calendar, which makes me want to look for alternatives for this year's advent calendar options. 2015-08-31 27 | US 128540 R24VKWVWUMV3M3 B004S8F7QM 829220659 Cards Against Humanity Toys 5 0 0 N Y Five Stars Tons of fun 2015-08-31 28 | US 125518 R2MW3TEBPWKENS B00MZ6BR3Q 145562057 Monster High Haunted Student Spirits Porter Geiss Doll Toys 5 0 0 N Y Five Stars Love it great add to collecton 2015-08-31 29 | US 15048896 R3N01IESYEYW01 B001CTOC8O 278247652 Star Wars Clone Wars Clone Trooper Child's Deluxe Captain Rex Costume Toys 5 0 1 N Y Five Stars Exactly as described. Fits my 7-yr old well! 
2015-08-31 30 | US 12191231 RKLAK7EPEG5S6 B00BMKL5WY 906199996 LEGO Creative Tower Building Kit XXL 1600 Pieces 10664 Toys 5 1 2 N Y The children LOVED them and best part was that it helped the ... Purchased these Lego's to help aid me with teaching my Sunday School class. The children LOVED them and best part was that it helped the children remember past lessons. The true Lego brand seem to work/snap together/fit together better than the generic brands. (Plus I had some Lego snobs in my class that only would use the real Lego brand and shunned the generic brand). Only wish Lego's were a little cheaper but you get what you pay for and I would recommend for quality to purchase this set. 2015-08-31 31 | US 18409006 R1HOJ5GOA2JWM0 B00L71H0F4 692305292 Barkology Princess the Poodle Hand Puppet Toys 2 1 1 N Y My little dog can bite my hand right through the puupet. IT's OK, but not as good as the old Bite Meez puppets. This puppet is very thin. My little yorkie often bites my hands right through the stuffing. It not durable enough to really play with. 2015-08-31 32 | US 42523709 RO5VL1EAPX6O3 B004CLZRM4 59085350 Intex Mesh Lounge (Colors May Vary) Toys 1 0 0 N Y Save your money...don't buy! This was to be a gift for my husband for our new pool. Did not receive the color I ordered but most of all after only one month of use (not continuously) the mesh pulled away from the material and the inflatable side. Completely shredded and no longer of use. It was stored properly and was not kept outside or in the pool. Poorly made, better off going to W**-M*** and getting something on clearance. 2015-08-31 33 | US 45601416 R3OSJU70OIBWVE B000PEOMC8 895316207 Intex River Run I Sport Lounge, Inflatable Water Float, 53 in Diameter Toys 5 0 0 N Y but I've bought one in the past and loved Ended up sending this guy back because I didnt need it, but I've bought one in the past and loved it 2015-08-31 34 | US 47546726 R3NFZZCJSROBT4 B008W1BPWQ 397107238 Peppa Pig 7 Wood Puzzles In Wooden Storage Box (styles will vary) Toys 3 0 0 N Y Three Stars The product is good, but the box was broken. 2015-08-31 35 | US 21448082 R47XBGQFP039N B00FZX71BI 480992295 Paraboard - Parallel Charging Board for Lipos with EC5 Connectors Toys 5 0 0 N Y CAN SAVE TME IF YOU UNDESTAND HOW IT WORKS . Works well ,quality product but this style of board will charge multiple batteries at the same time SAFELY ( IF ) ALL BATTERIES ARE OF THE SAME CELL COUNT , THE SAME BATTERY COMPOSITION ( LIPO, NIMH-etc ) AND THEY MUST HAVE INDIVIDUAL CELL VOLTAGES THAT ARE VERY CLOSE AND EQUAL TO EACH BATTERY CONNECTED AT THE SAME TIME . When board is connected to most if not all chargers it can only read total and individual cell voltage of ONE OF THE BATTERIES AND MAY OVER OR UNDER CHARGE THE OTHERS TO SOME DEGREE , TOTAL RATE OF CHARGE IS DIVIDED EQUALLY BETWEEN BATTERIES CONNECTED AT THE SAME TIME . Close monitoring is a must when using like all high discharge batteies . I have only personally expeienced one lipo battery meltdown and it is a very SHORT IF NOT NON EXISTANT WINDOW OF OPPERUNITY TO PREVENT OR MINIMISE THE COLATTERAL DAMAGE ONCE THE PROCESS STARTS . Read and understand all charging and battery instructions . 2015-08-31 36 | US 12612039 R1JS8G26X4RM2G B00D4NJSJE 408940178 The Game of Life Money and Asset Board Game, Fame Edition Toys 5 0 1 N Y Five Stars Great gift! 2015-08-31 37 | US 44928701 R1ORWPFQ9EDYA0 B000HZZT7W 967346376 LCR Dice Game (Red Chips) Toys 5 0 0 N Y Five Stars We play this game quite a bit with friends. 
2015-08-31 38 | US 43173394 R1YIX4SO32U0GT B002G54DAA 57447684 BCW - Deluxe Currency Slab - Regular Bill - Dollar / Currency Collecting Supplies Toys 5 0 1 N Y BCW - Deluxe Currency Slab Fits my $20 bill perfectly. 2015-08-31 39 | US 11210951 R1W3QQZ8JKECCI B003JT0L4Y 876626440 Ocean Life Stamps Birthday Party Supplies Loot Bag Accessories 24 Pieces per Unit Toys 5 0 0 N Y Fun for birthday party favor I ordered these for my 3 year old son's birthday party as party favors. They were a huge hit & the perfect fit for a 3 year old! 2015-08-31 40 | US 12918717 RZX17JIYIPAR B00KQUNNZ8 644368246 New Age Scare Halloween Party Pumpkin and Bat Hanging Round Lantern Decoration, Paper, 9" Pack of 3 Toys 5 0 0 N N Love the prints! These paper lanterns are adorable! The colors are bright, the patterns are fun & trendy & they're a good size! They came well packaged & are easy to assemble, they're also really easy to take apart & flatten back down! We'll get to use them for a few Halloweens I'm sure! They even came with the string needed to hang! I'm glad I grabbed them, they're really cute, my kids are excited to add to the Halloween decor & are already asking to hang them!
I received this item at a discount for my unbiased review. 2015-08-31 41 | US 47781982 RIDVQ4P3WJR42 B00WTGGGRO 162262449 Pokemon - Double Dragon Energy (97/108) - XY Roaring Skies Toys 5 1 1 N Y Five Stars My Grandson loves these cards. Thank you 2015-08-31 42 | US 34874898 R1WQ3ME3JAG2O1 B00WAKEQLW 824555589 Whiffer Sniffers Mystery Pack 1 Scented Backpack Clip Toys 1 0 6 N Y One Star Received a pineapple rather than the advertised s'more 2015-08-31 43 | US 20962528 RNTPOUDQIICBF B00M5AT30G 548190970 AmiGami Fox and Owl Figure 2-Pack Toys 4 0 0 N Y Four Stars Christmas gift for 6yr 2015-08-31 44 | US 47781982 R3AHZWWOL0IAV0 B00GNDY40U 438056479 Pokemon - Gyarados (31/113) - Legendary Treasures Toys 5 0 0 N Y Five Stars My Grandson loves these cards. Thank you 2015-08-31 45 | US 13328687 R3PDXKS9O2Z20B B00WJ1OPMW 120071056 LeapFrog LeapTV Letter Factory Adventures Educational, Active Video Game Toys 5 0 0 N N they LOVE this game Even though both of my kids are at the top of this age recommendation level, they LOVE this game! I love how it caters to the kinesthetic learner by asking them to move their bodies into the shape of the letters. It even takes teamwork as sometimes two people are required to finish the letter. My kids know all of their letter sounds and shapes, but this didn't stop them from playing the game over and over. 2015-08-31 46 | US 16245463 R23URALWA7IHWL B00IGXV9UI 765869385 Disney Planes: Fire & Rescue Scoop & Spray Firefighter Dusty Toys 5 0 0 N Y Five Stars My 5 year old son loves this. 2015-08-31 47 | US 11916403 R36L8VKT9ZSUY6 B00JVY9J1M 771795950 Winston Zeddmore & Ecto-1: Funko POP! Rides x Ghostbusters Vinyl Figure Toys 5 0 0 N Y Five Stars love it 2015-08-31 48 | US 5543658 R23JRQR6VMY4TV B008AL15M8 211944547 Yu-Gi-Oh! - Solemn Judgment (GLD5-EN045) - Gold Series: Haunted Mine - Limited Edition - Ghost/Gold Hybrid Rare Toys 5 0 0 N Y Absolutely one of the best traps in the game Absolutely one of the best traps in the game. It is never a dead and always live since you can always pay half your lifepoints for its cost. It's main power is that it can stop any card. Hopefully this card comes off the Forbidden/Limited list soon. 2015-08-31 49 | US 41168357 R3T73PQZZ9F6GT B00CAEEDC0 72805974 Seat Pets Car Seat Toy Toys 5 0 0 N Y Five Stars really soft and cute 2015-08-31 50 | US 32866903 R300I65NW30Y19 B000TFLAZA 149264874 Baby Einstein Octoplush Toys 5 0 0 N Y Five Stars baby loved it - so attractive and very nice 2015-08-31 51 | -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/Lab 5 Kinesis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Lab 5 (Part II): Ingesting Streaming Data with Kinesis\n", 8 | "### MACS 30123: Large-Scale Computing for the Social Sciences\n", 9 | "\n", 10 | "In this second part of the lab, we'll explore how we can use Kinesis to ingest streaming text data, of the sort we might encounter on Twitter.\n", 11 | "\n", 12 | "To avoid requiring you to set up Twitter API access, we will create Twitter-like text and metadata using the `testdata` package to perform this demonstration. It should be easy enough to plug your streaming Twitter feed into this workflow if you desire to do so as an individual exercise (for instance, as a part of a final project!). 
Additionally, once you have this pipeline running, you can scale it up even further to include many more producers and consumers, if you would like, as discussed in lecture and the readings.\n", 13 | "\n", 14 | "Recall from the lecture and readings that in a Kinesis workflow, \"producers\" send data into a Kinesis stream and \"consumers\" draw data out of that stream to perform operations on it (i.e. real-time processing, archiving raw data, etc.). To make this a bit more concrete, we are going to implement a simplified version of this workflow in this lab, in which we spin up Producer and Consumer (t2.nano) EC2 Instances and create a Kinesis stream. Our Producer instance will run a producer script (which writes our Twitter-like text data into a Kinesis stream) and our Consumer instance will run a consumer script (which reads the Twitter-like data and calculates a simple demo statistic -- the average unique word count per tweet, as a real-time running average).\n", 15 | "\n", 16 | "You can visualize this data pipeline, like so:\n", 17 | "\n", 18 | "\n" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "To begin implementing this pipeline, let's import `boto3` and initialize the AWS services we'll be using in this lab (EC2 and Kinesis)." 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 1, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "import boto3\n", 35 | "import time\n", 36 | "\n", 37 | "session = boto3.Session()\n", 38 | "\n", 39 | "kinesis = session.client('kinesis')\n", 40 | "ec2 = session.resource('ec2')\n", 41 | "ec2_client = session.client('ec2')" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "Then, we need to create the Kinesis stream that our Producer EC2 instance will write streaming tweets to. Because we're only setting this up to handle traffic from one consumer and one producer, we'll just use one shard, but we could increase our throughput capacity by increasing the ShardCount if we wanted to do so." 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 2, 54 | "metadata": {}, 55 | "outputs": [], 56 | "source": [ 57 | "response = kinesis.create_stream(StreamName = 'test_stream',\n", 58 | " ShardCount = 1\n", 59 | " )\n", 60 | "\n", 61 | "# Is the stream active and ready to be written to/read from? Wait until it exists before moving on:\n", 62 | "waiter = kinesis.get_waiter('stream_exists')\n", 63 | "waiter.wait(StreamName='test_stream')" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "OK, now we're ready to set up our producer and consumer EC2 instances that will write to and read from this Kinesis stream. Let's spin up our two EC2 instances (specified by the `MaxCount` parameter) using one of the Amazon Linux AMIs. Notice here that you will need to specify your `.pem` file for the `KeyName` parameter, as well as create a custom security group/group ID. Designating a security group is necessary because, by default, AWS does not allow inbound ssh traffic into EC2 instances (they create custom ssh-friendly security groups each time you run the GUI wizard in the console). Thus, if you don't set this parameter, you will not be able to ssh into the EC2 instances that you create here with `boto3`. 
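If you would rather create one of these ssh-friendly security groups programmatically than through the console, a minimal sketch (with a placeholder group name and description) looks something like this:

```python
import boto3

ec2_client = boto3.client('ec2')

# Create a security group in your default VPC (name and description are placeholders)
sg = ec2_client.create_security_group(
    GroupName='MACS_30123_ssh',
    Description='Allow inbound ssh for Kinesis lab demo'
)

# Allow inbound ssh (TCP port 22); narrow the CidrIp range in real applications
ec2_client.authorize_security_group_ingress(
    GroupId=sg['GroupId'],
    IpPermissions=[
        {'IpProtocol': 'tcp',
         'FromPort': 22,
         'ToPort': 22,
         'IpRanges': [{'CidrIp': '0.0.0.0/0'}]}
    ]
)
```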
You can follow along in the lab video for further instructions on how you can set up one of these security groups.\n", 71 | "\n", 72 | "Also, we need to specify an IAM Instance Profile so that our EC2 instances will have the permissions necessary to interact with other AWS services on our behalf. Here, I'm using one of the profiles we create in Part I of Lab 5 (a default AWS profile for launching EC2 instances within an EMR cluster), as this gives us all of the necessary permissions" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 3, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "instances = ec2.create_instances(ImageId='ami-0915e09cc7ceee3ab',\n", 82 | " MinCount=1,\n", 83 | " MaxCount=2,\n", 84 | " InstanceType='t2.micro',\n", 85 | " KeyName='MACS_30123',\n", 86 | " SecurityGroupIds=['sg-0e921f64abdfac4e6'],\n", 87 | " SecurityGroups=['MACS_30123'],\n", 88 | " IamInstanceProfile=\n", 89 | " {'Name': 'EMR_EC2_DefaultRole'},\n", 90 | " )\n", 91 | "\n", 92 | "# Wait until EC2 instances are running before moving on\n", 93 | "waiter = ec2_client.get_waiter('instance_running')\n", 94 | "waiter.wait(InstanceIds=[instance.id for instance in instances])" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "While we wait for these instances to start running, let's set up the Python scripts that we want to run on each instance. First of all, we have to define a script for our Producer instance, which continuously produces Twitter-like data using the `testdata` package and puts that data into our Kinesis stream." 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 4, 107 | "metadata": {}, 108 | "outputs": [ 109 | { 110 | "name": "stdout", 111 | "output_type": "stream", 112 | "text": [ 113 | "Overwriting producer.py\n" 114 | ] 115 | } 116 | ], 117 | "source": [ 118 | "%%file producer.py\n", 119 | "\n", 120 | "import boto3\n", 121 | "import testdata\n", 122 | "import json\n", 123 | "\n", 124 | "kinesis = boto3.client('kinesis', region_name='us-east-1')\n", 125 | "\n", 126 | "# Continously write Twitter-like data into Kinesis stream\n", 127 | "while 1 == 1:\n", 128 | " test_tweet = {'username': testdata.get_username(),\n", 129 | " 'tweet': testdata.get_ascii_words(280)\n", 130 | " }\n", 131 | " kinesis.put_record(StreamName = \"test_stream\",\n", 132 | " Data = json.dumps(test_tweet),\n", 133 | " PartitionKey = \"partitionkey\"\n", 134 | " )" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "Then, we can define a script for our Consumer instance that gets the latest tweet out of the stream, one at a time. After processing each tweet, we then print out the average unique word count per processed tweet as a running average, before jumping on to the next indexed tweet in our Kinesis stream shard to do the same thing for as long as our program is running." 
142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 5, 147 | "metadata": {}, 148 | "outputs": [ 149 | { 150 | "name": "stdout", 151 | "output_type": "stream", 152 | "text": [ 153 | "Overwriting consumer.py\n" 154 | ] 155 | } 156 | ], 157 | "source": [ 158 | "%%file consumer.py\n", 159 | "\n", 160 | "import boto3\n", 161 | "import time\n", 162 | "import json\n", 163 | "\n", 164 | "kinesis = boto3.client('kinesis', region_name='us-east-1')\n", 165 | "\n", 166 | "shard_it = kinesis.get_shard_iterator(StreamName = \"test_stream\",\n", 167 | " ShardId = 'shardId-000000000000',\n", 168 | " ShardIteratorType = 'LATEST'\n", 169 | " )[\"ShardIterator\"]\n", 170 | "\n", 171 | "i = 0\n", 172 | "s = 0\n", 173 | "while 1==1:\n", 174 | " out = kinesis.get_records(ShardIterator = shard_it,\n", 175 | " Limit = 1\n", 176 | " )\n", 177 | " for o in out['Records']:\n", 178 | " jdat = json.loads(o['Data'])\n", 179 | " s = s + len(set(jdat['tweet'].split()))\n", 180 | " i = i + 1\n", 181 | "\n", 182 | " if i != 0:\n", 183 | " print(\"Average Unique Word Count Per Tweet: \" + str(s/i))\n", 184 | " print(\"Sample of Current Tweet: \" + jdat['tweet'][:20])\n", 185 | " print(\"\\n\")\n", 186 | " \n", 187 | " shard_it = out['NextShardIterator']\n", 188 | " time.sleep(0.2)" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "As our final preparation step, we'll grab all of the public DNS names of our instances (web addresses that you normally copy from the GUI console to manually ssh into and record the names of our code files, so that we can easily ssh/scp into the instances and pass them our Python scripts to run." 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 6, 201 | "metadata": {}, 202 | "outputs": [], 203 | "source": [ 204 | "instance_dns = [instance.public_dns_name \n", 205 | " for instance in ec2.instances.all() \n", 206 | " if instance.state['Name'] == 'running'\n", 207 | " ]\n", 208 | "\n", 209 | "code = ['producer.py', 'consumer.py']" 210 | ] 211 | }, 212 | { 213 | "cell_type": "markdown", 214 | "metadata": {}, 215 | "source": [ 216 | "To copy our files over to our instances and programmatically run commands on them, we can use Python's `scp` and `paramiko` packages. You'll need to install these via `pip install paramiko scp` if you have not already done so." 
217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": 7, 222 | "metadata": {}, 223 | "outputs": [ 224 | { 225 | "name": "stdout", 226 | "output_type": "stream", 227 | "text": [ 228 | "Requirement already satisfied: paramiko in /home/jclindaniel/anaconda3/envs/macs30123/lib/python3.7/site-packages (2.7.2)\r\n", 229 | "Requirement already satisfied: scp in /home/jclindaniel/anaconda3/envs/macs30123/lib/python3.7/site-packages (0.13.3)\r\n", 230 | "Requirement already satisfied: cryptography>=2.5 in /home/jclindaniel/anaconda3/envs/macs30123/lib/python3.7/site-packages (from paramiko) (3.4.6)\r\n", 231 | "Requirement already satisfied: pynacl>=1.0.1 in /home/jclindaniel/anaconda3/envs/macs30123/lib/python3.7/site-packages (from paramiko) (1.4.0)\r\n", 232 | "Requirement already satisfied: bcrypt>=3.1.3 in /home/jclindaniel/anaconda3/envs/macs30123/lib/python3.7/site-packages (from paramiko) (3.2.0)\r\n", 233 | "Requirement already satisfied: cffi>=1.1 in /home/jclindaniel/anaconda3/envs/macs30123/lib/python3.7/site-packages (from bcrypt>=3.1.3->paramiko) (1.14.5)\r\n", 234 | "Requirement already satisfied: six>=1.4.1 in /home/jclindaniel/anaconda3/envs/macs30123/lib/python3.7/site-packages (from bcrypt>=3.1.3->paramiko) (1.15.0)\r\n", 235 | "Requirement already satisfied: pycparser in /home/jclindaniel/anaconda3/envs/macs30123/lib/python3.7/site-packages (from cffi>=1.1->bcrypt>=3.1.3->paramiko) (2.20)\r\n" 236 | ] 237 | } 238 | ], 239 | "source": [ 240 | "! pip install paramiko scp" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "Once we have `scp` and `paramiko` installed, we can copy our producer and consumer Python scripts over to the EC2 instances (designating our first EC2 instance in `instance_dns` as the producer and second EC2 instance as the consumer instance). If you have a slower (or more unstable) internet connection, you might need to increase the time.sleep() time in the code and try to run this code several times in order for it to fully run.\n", 248 | "\n", 249 | "Note that, on each instance, we install `boto3` (so that we can access Kinesis through our scripts) and then copy our producer/consumer Python code over to our producer/consumer EC2 instance via `scp`. After we've done this, we install the `testdata` package on the producer instance (which it needs in order to create fake tweets) and instruct it to run our Python producer script. This will write tweets into our Kinesis stream until we stop the script and terminate the producer EC2 instance.\n", 250 | "\n", 251 | "We could also instruct our consumer to get tweets from the stream immediately after this command and this would automatically collect and process the tweets according to the consumer.py script. For the purposes of this demonstration, though, we'll manually ssh into that instance and run the code from the terminal so that we can see the real-time consumption a bit more easily." 
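(As an aside, if you would rather not ssh in manually at all, the consumer can also be launched remotely, just like the producer is in the next cell. A rough sketch, assuming the `ssh_consumer` connection and `code` list from that cell have already been set up and that `boto3` has finished installing on the consumer instance:)

```python
# Sketch: run the consumer remotely and tail a few lines of its output.
# `-u` keeps Python's stdout unbuffered so lines arrive as they are printed.
_, remote_out, _ = ssh_consumer.exec_command("python -u %s" % code[1])

for _ in range(6):
    print(remote_out.readline(), end='')
```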
252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": 9, 257 | "metadata": {}, 258 | "outputs": [ 259 | { 260 | "name": "stdout", 261 | "output_type": "stream", 262 | "text": [ 263 | "Producer Instance is Running producer.py\n", 264 | ".........................................\n", 265 | "Connect to Consumer Instance by running: ssh -i \"MACS_30123.pem\" ec2-user@ec2-54-89-73-205.compute-1.amazonaws.com\n" 266 | ] 267 | } 268 | ], 269 | "source": [ 270 | "import paramiko\n", 271 | "from scp import SCPClient\n", 272 | "ssh_producer, ssh_consumer = paramiko.SSHClient(), paramiko.SSHClient()\n", 273 | "\n", 274 | "# Initialization of SSH tunnels takes a bit of time; otherwise get connection error on first attempt\n", 275 | "time.sleep(5)\n", 276 | "\n", 277 | "# Install boto3 on each EC2 instance and Copy our producer/consumer code onto producer/consumer EC2 instances\n", 278 | "instance = 0\n", 279 | "stdin, stdout, stderr = [[None, None] for i in range(3)]\n", 280 | "for ssh in [ssh_producer, ssh_consumer]:\n", 281 | " ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())\n", 282 | " ssh.connect(instance_dns[instance],\n", 283 | " username = 'ec2-user',\n", 284 | " key_filename='/home/jclindaniel/MACS_30123.pem')\n", 285 | " \n", 286 | " with SCPClient(ssh.get_transport()) as scp:\n", 287 | " scp.put(code[instance])\n", 288 | " \n", 289 | " if instance == 0:\n", 290 | " stdin[instance], stdout[instance], stderr[instance] = \\\n", 291 | " ssh.exec_command(\"sudo pip install boto3 testdata\")\n", 292 | " else:\n", 293 | " stdin[instance], stdout[instance], stderr[instance] = \\\n", 294 | " ssh.exec_command(\"sudo pip install boto3\")\n", 295 | "\n", 296 | " instance += 1\n", 297 | "\n", 298 | "# Block until Producer has installed boto3 and testdata, then start running Producer script:\n", 299 | "producer_exit_status = stdout[0].channel.recv_exit_status() \n", 300 | "if producer_exit_status == 0:\n", 301 | " ssh_producer.exec_command(\"python %s\" % code[0])\n", 302 | " print(\"Producer Instance is Running producer.py\\n.........................................\")\n", 303 | "else:\n", 304 | " print(\"Error\", producer_exit_status)\n", 305 | "\n", 306 | "# Close ssh and show connection instructions for manual access to Consumer Instance\n", 307 | "ssh_consumer.close; ssh_producer.close()\n", 308 | "\n", 309 | "print(\"Connect to Consumer Instance by running: ssh -i \\\"MACS_30123.pem\\\" ec2-user@%s\" % instance_dns[1])" 310 | ] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "metadata": {}, 315 | "source": [ 316 | "If you run the command above (with the correct path to your actual `.pem` file), you should be inside your Consumer EC2 instance. If you run `python consumer.py`, you should also see a real-time count of the average number of unique words per tweet (along with a sample of the text in the most recent tweet), as in the screenshot:\n", 317 | "\n", 318 | "![](consumer_feed.png)\n", 319 | "\n", 320 | "Cool! Now we can scale this basic architecture up to perform any number of real-time data analyses, if we so desire. Also, if we execute our consumer code remotely via paramiko as well, the process will be entirely remote, so we don't need to keep any local resources running in order to keep streaming/processing real-time data.\n", 321 | "\n", 322 | "As a final note, when you are finished observing the real-time feed from your consumer instance, **be sure to terminate your EC2 instances and delete your Kinesis stream**. 
You don't want to be paying for these to run continuously! You can do so programmatically by running the following `boto3` code:" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": 10, 328 | "metadata": {}, 329 | "outputs": [ 330 | { 331 | "name": "stdout", 332 | "output_type": "stream", 333 | "text": [ 334 | "EC2 Instances Successfully Terminated\n", 335 | "Kinesis Stream Successfully Deleted\n" 336 | ] 337 | } 338 | ], 339 | "source": [ 340 | "# Terminate EC2 Instances:\n", 341 | "ec2_client.terminate_instances(InstanceIds=[instance.id for instance in instances])\n", 342 | "\n", 343 | "# Confirm that EC2 instances were terminated:\n", 344 | "waiter = ec2_client.get_waiter('instance_terminated')\n", 345 | "waiter.wait(InstanceIds=[instance.id for instance in instances])\n", 346 | "print(\"EC2 Instances Successfully Terminated\")\n", 347 | "\n", 348 | "# Delete Kinesis Stream (if it currently exists):\n", 349 | "try:\n", 350 | " response = kinesis.delete_stream(StreamName='test_stream')\n", 351 | "except kinesis.exceptions.ResourceNotFoundException:\n", 352 | " pass\n", 353 | "\n", 354 | "# Confirm that Kinesis Stream was deleted:\n", 355 | "waiter = kinesis.get_waiter('stream_not_exists')\n", 356 | "waiter.wait(StreamName='test_stream')\n", 357 | "print(\"Kinesis Stream Successfully Deleted\")" 358 | ] 359 | } 360 | ], 361 | "metadata": { 362 | "kernelspec": { 363 | "display_name": "Python 3", 364 | "language": "python", 365 | "name": "python3" 366 | }, 367 | "language_info": { 368 | "codemirror_mode": { 369 | "name": "ipython", 370 | "version": 3 371 | }, 372 | "file_extension": ".py", 373 | "mimetype": "text/x-python", 374 | "name": "python", 375 | "nbconvert_exporter": "python", 376 | "pygments_lexer": "ipython3", 377 | "version": "3.7.10" 378 | } 379 | }, 380 | "nbformat": 4, 381 | "nbformat_minor": 4 382 | } 383 | -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/consumer.py: -------------------------------------------------------------------------------- 1 | 2 | import boto3 3 | import time 4 | import json 5 | 6 | kinesis = boto3.client('kinesis', region_name='us-east-1') 7 | 8 | shard_it = kinesis.get_shard_iterator(StreamName = "test_stream", 9 | ShardId = 'shardId-000000000000', 10 | ShardIteratorType = 'LATEST' 11 | )["ShardIterator"] 12 | 13 | i = 0 14 | s = 0 15 | while 1==1: 16 | out = kinesis.get_records(ShardIterator = shard_it, 17 | Limit = 1 18 | ) 19 | for o in out['Records']: 20 | jdat = json.loads(o['Data']) 21 | s = s + len(set(jdat['tweet'].split())) 22 | i = i + 1 23 | 24 | if i != 0: 25 | print("Average Unique Word Count Per Tweet: " + str(s/i)) 26 | print("Sample of Current Tweet: " + jdat['tweet'][:20]) 27 | print("\n") 28 | 29 | shard_it = out['NextShardIterator'] 30 | time.sleep(0.2) 31 | -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/consumer_feed.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jonclindaniel/LargeScaleComputing_S21/1416f4c405dbe386ed8a0238f0eccaa7b087b1ca/Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/consumer_feed.png -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/producer.py: 
-------------------------------------------------------------------------------- 1 | 2 | import boto3 3 | import testdata 4 | import json 5 | 6 | kinesis = boto3.client('kinesis', region_name='us-east-1') 7 | 8 | # Continously write Twitter-like data into Kinesis stream 9 | while 1 == 1: 10 | test_tweet = {'username': testdata.get_username(), 11 | 'tweet': testdata.get_ascii_words(280) 12 | } 13 | kinesis.put_record(StreamName = "test_stream", 14 | Data = json.dumps(test_tweet), 15 | PartitionKey = "partitionkey" 16 | ) 17 | -------------------------------------------------------------------------------- /Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/simple_kinesis_architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jonclindaniel/LargeScaleComputing_S21/1416f4c405dbe386ed8a0238f0eccaa7b087b1ca/Labs/Lab 5 Ingesting and Processing Large-Scale Data/Part II Kinesis/simple_kinesis_architecture.png -------------------------------------------------------------------------------- /Labs/Lab 6 PySpark EDA and ML/7M_PySpark_Midway.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql import SparkSession 2 | from pyspark.sql.functions import * 3 | 4 | # Start Spark Session 5 | spark = SparkSession.builder.getOrCreate() 6 | 7 | # Read data 8 | data = spark.read.csv('/project2/macs30123/AWS_book_reviews/*.csv', 9 | header='true', 10 | inferSchema='true') 11 | 12 | # Recast columns to correct data type 13 | data = (data.withColumn('star_rating', col('star_rating').cast('int')) 14 | .withColumn('total_votes', col('total_votes').cast('int')) 15 | .withColumn('helpful_votes', col('helpful_votes').cast('int')) 16 | ) 17 | 18 | # Summarize data by star_rating 19 | stars_votes = (data.groupBy('star_rating') 20 | .sum('total_votes', 'helpful_votes') 21 | .sort('star_rating', ascending=False) 22 | ) 23 | 24 | # Drop rows with NaN values and then print out resulting data: 25 | stars_votes_clean = stars_votes.dropna() 26 | stars_votes_clean.show() 27 | -------------------------------------------------------------------------------- /Labs/Lab 6 PySpark EDA and ML/Local_Colab_Spark_Setup.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "yoPc0xms3gmI" 7 | }, 8 | "source": [ 9 | "\"Open" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": { 15 | "id": "qpaNU3ETh0Qc" 16 | }, 17 | "source": [ 18 | "\n", 19 | "\n", 20 | "# Setting up PySpark in a Colab Notebook\n", 21 | "\n", 22 | "You can run Spark both locally and on a cluster. Here, I'll demonstrate how you can set up Spark to run in a Colab notebook for debugging purposes.\n", 23 | "\n", 24 | "You can also set up Spark locally in a similar way if you want to take advantage of multiple CPU cores (and/or GPU) on your laptop (the setup will vary slightly, though, depending on your operating system and you'll need to figure out these specifics on your own; however, this setup does work in WSL for me if I run the follow bash script in my terminal window using `sudo`). This being said, this local option should be for testing purposes on sample datasets only. 
If you want to run big PySpark jobs, you will want to run these in an EMR notebook (with an EMR cluster as your backend) or on the Midway Cluster.\n", 25 | "\n", 26 | "First, we need to install Spark and PySpark, by running the following commands:" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": null, 32 | "metadata": { 33 | "id": "R8f1D7wfaCgF" 34 | }, 35 | "outputs": [], 36 | "source": [ 37 | "%%bash\n", 38 | "apt-get update\n", 39 | "apt-get install -y openjdk-8-jdk-headless -qq > /dev/null\n", 40 | "\n", 41 | "wget -q \"https://downloads.apache.org/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz\" > /dev/null\n", 42 | "tar -xvf spark-3.1.1-bin-hadoop2.7.tgz > /dev/null\n", 43 | "\n", 44 | "pip install pyspark findspark" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": { 50 | "id": "fHjZLeId0nvR" 51 | }, 52 | "source": [ 53 | "OK, now that we have Spark, we need to set a path to it, so PySpark knows where to find it. We do this using the `os` Python library below.\n", 54 | "\n", 55 | "On my machine (WSL, Ubuntu 20.04), where I unpacked Spark in my home directory, this can be achieved with:\n", 56 | "```\n", 57 | "os.environ[\"SPARK_HOME\"] = \"/home/jclindaniel/spark-3.1.1-bin-hadoop2.7\"\n", 58 | "```\n", 59 | "\n", 60 | "In Colab, it is automatically downloaded to the `/content` directory, so we indicate that as its location here. Then, we run `findspark` to find Spark for us on the machine, and finally start up a SparkSession running on all available cores (`local[4]` means your code will run on 4 threads locally, `local[*]` means that your code will run as many threads as there are logical cores on your machine)." 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 2, 66 | "metadata": { 67 | "id": "CljIupW0aE06" 68 | }, 69 | "outputs": [], 70 | "source": [ 71 | "# Set path to Spark\n", 72 | "import os\n", 73 | "os.environ[\"SPARK_HOME\"] = \"/content/spark-3.1.1-bin-hadoop2.7\"\n", 74 | "\n", 75 | "# Find Spark so that we can access session within our notebook\n", 76 | "import findspark\n", 77 | "findspark.init()\n", 78 | "\n", 79 | "# Start SparkSession on all available cores\n", 80 | "from pyspark.sql import SparkSession\n", 81 | "spark = SparkSession.builder.master(\"local[*]\").getOrCreate()" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": { 87 | "id": "VNTQOBLthDrC" 88 | }, 89 | "source": [ 90 | "Now that we've installed everything and set up our paths correctly, we can run (small) Spark jobs both in Colab notebooks and locally (for bigger jobs, you will want to run these jobs on an EMR cluster, though. Remember, for instance, that Google only allocates us one CPU core and up to one GPU for free)!\n", 91 | "\n", 92 | "Let's make sure our setup is working by doing couple of simple things with the pyspark.sql package on the Amazon Customer Review Sample Dataset." 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "metadata": { 99 | "id": "fbXWBQfSAX8q" 100 | }, 101 | "outputs": [], 102 | "source": [ 103 | "! 
pip install wget\n", 104 | "import wget\n", 105 | "\n", 106 | "wget.download('https://s3.amazonaws.com/amazon-reviews-pds/tsv/sample_us.tsv', 'sample_data/sample_us.tsv')" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 4, 112 | "metadata": { 113 | "id": "KrXWEMxjeFx1" 114 | }, 115 | "outputs": [], 116 | "source": [ 117 | "# Read TSV file from default data download directory in Colab\n", 118 | "data = spark.read.csv('sample_data/sample_us.tsv',\n", 119 | " sep=\"\\t\",\n", 120 | " header=True,\n", 121 | " inferSchema=True)" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 5, 127 | "metadata": { 128 | "colab": { 129 | "base_uri": "https://localhost:8080/" 130 | }, 131 | "id": "2qvOOIYqeWw9", 132 | "outputId": "44f68fd3-ab6a-442a-a94d-7c41c37ed175" 133 | }, 134 | "outputs": [ 135 | { 136 | "name": "stdout", 137 | "output_type": "stream", 138 | "text": [ 139 | "root\n", 140 | " |-- marketplace: string (nullable = true)\n", 141 | " |-- customer_id: integer (nullable = true)\n", 142 | " |-- review_id: string (nullable = true)\n", 143 | " |-- product_id: string (nullable = true)\n", 144 | " |-- product_parent: integer (nullable = true)\n", 145 | " |-- product_title: string (nullable = true)\n", 146 | " |-- product_category: string (nullable = true)\n", 147 | " |-- star_rating: integer (nullable = true)\n", 148 | " |-- helpful_votes: integer (nullable = true)\n", 149 | " |-- total_votes: integer (nullable = true)\n", 150 | " |-- vine: string (nullable = true)\n", 151 | " |-- verified_purchase: string (nullable = true)\n", 152 | " |-- review_headline: string (nullable = true)\n", 153 | " |-- review_body: string (nullable = true)\n", 154 | " |-- review_date: string (nullable = true)\n", 155 | "\n" 156 | ] 157 | } 158 | ], 159 | "source": [ 160 | "data.printSchema()" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 6, 166 | "metadata": { 167 | "colab": { 168 | "base_uri": "https://localhost:8080/" 169 | }, 170 | "id": "ngb25JINcUNr", 171 | "outputId": "8f3f2dd2-b47d-449b-d6c0-34ead56d5e7d" 172 | }, 173 | "outputs": [ 174 | { 175 | "name": "stdout", 176 | "output_type": "stream", 177 | "text": [ 178 | "+-----------+----------------+\n", 179 | "|star_rating|sum(total_votes)|\n", 180 | "+-----------+----------------+\n", 181 | "| 5| 13|\n", 182 | "| 4| 3|\n", 183 | "| 3| 8|\n", 184 | "| 2| 2|\n", 185 | "| 1| 8|\n", 186 | "+-----------+----------------+\n", 187 | "\n" 188 | ] 189 | } 190 | ], 191 | "source": [ 192 | "(data.groupBy('star_rating')\n", 193 | " .sum('total_votes')\n", 194 | " .sort('star_rating', ascending=False)\n", 195 | " .show()\n", 196 | ")" 197 | ] 198 | } 199 | ], 200 | "metadata": { 201 | "colab": { 202 | "collapsed_sections": [], 203 | "name": "Local/Colab Spark Setup", 204 | "provenance": [], 205 | "toc_visible": true 206 | }, 207 | "kernelspec": { 208 | "display_name": "Python 3", 209 | "language": "python", 210 | "name": "python3" 211 | }, 212 | "language_info": { 213 | "codemirror_mode": { 214 | "name": "ipython", 215 | "version": 3 216 | }, 217 | "file_extension": ".py", 218 | "mimetype": "text/x-python", 219 | "name": "python", 220 | "nbconvert_exporter": "python", 221 | "pygments_lexer": "ipython3", 222 | "version": "3.7.6" 223 | } 224 | }, 225 | "nbformat": 4, 226 | "nbformat_minor": 1 227 | } 228 | -------------------------------------------------------------------------------- /Labs/Lab 6 PySpark EDA and ML/ec2_limit_increase.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/jonclindaniel/LargeScaleComputing_S21/1416f4c405dbe386ed8a0238f0eccaa7b087b1ca/Labs/Lab 6 PySpark EDA and ML/ec2_limit_increase.png -------------------------------------------------------------------------------- /Labs/Lab 6 PySpark EDA and ML/gpu_cluster_instructions.md: -------------------------------------------------------------------------------- 1 | ## Launching a Spark GPU Cluster via AWS EMR 2 | 3 | In order to launch a Spark GPU cluster via AWS EMR (and interact with it via EMR notebook), you will first need to create a personal AWS account. AWS doesn't currently allow us to launch these GPU clusters from within our free AWS educate accounts. 4 | 5 | Once you have a personal account, you will need to request a limit increase on the number of GPU instances you can use from within the account (to follow along with these instructions, you should request 16 vCPUs on G-series EC2 instances), as displayed in the limit increase request form available in the EC2 dashboard: 6 | 7 | ![](ec2_limit_increase.png) 8 | 9 | Once AWS has increased your allowed number of GPU EC2 instances: 10 | 1. Log into your AWS Console with an IAM User account (you may encounter problems starting a PySpark kernel in your EMR Notebook if you do not complete this step). 11 | 2. Go to the EMR dashboard and click "Create Cluster," and then "Advanced" 12 | 3. In "Step 1: Software and Steps," Select EMR release label 6.2.0, and ensure that the only applications that are selected are Hadoop, Spark, Livy, and JupyterEnterpriseGateway. 13 | 4. Also in "Step 1: Software and Steps," but in the "Edit software settings" section, copy and paste everything from `my-configurations.json` (located in this directory) into the text box. Alternatively, you can upload the JSON to S3 and read it in. 14 | 5. In "Step 2: Hardware", select 1 Master node of style "m5.xlarge," 1 core node of type "g4dn.2xlarge," and 1 task node of type "g4dn.2xlarge" (for a total of 2 NVIDIA T4 GPUs in your cluster). 15 | 6. In "Step 3: General Cluster Settings," Enter bootstrap file location in the "Bootstrap Actions" section by clicking "Custom action" and then "Configure and add" (the necessary bootstrap file is available in a public S3 bucket at: `s3://macs30123/my-bootstrap-action.sh` and you can also modify the file and upload it to your own S3 bucket). 16 | 7. Accept all other defaults (but select an EC2 key pair/PEM if you would like to `ssh` into your cluster) and click the "Next" button until your cluster launches. 17 | 8. Then, once your cluster has launched, you can connect it to an EMR Notebook workspace like normal and run your PySpark code on a GPU cluster! Just as a heads-up these GPU instances each cost $0.75/hr, so be careful how long you run your cluster! 
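If you prefer to script the cluster launch rather than clicking through the console, the same setup can be described to the EMR API via `boto3`. The sketch below mirrors the steps above (EMR 6.2.0; one m5.xlarge master, one g4dn.2xlarge core node, and one g4dn.2xlarge task node; the `my-configurations.json` settings; and the public bootstrap script). The cluster name, bootstrap-action name, and key pair are placeholders you would swap for your own:

```python
import json

import boto3

emr = boto3.client('emr', region_name='us-east-1')

# Same JSON you would paste into "Edit software settings" in the console
with open('my-configurations.json') as f:
    configurations = json.load(f)

response = emr.run_job_flow(
    Name='spark-gpu-cluster',                      # placeholder name
    ReleaseLabel='emr-6.2.0',
    Applications=[{'Name': app} for app in
                  ['Hadoop', 'Spark', 'Livy', 'JupyterEnterpriseGateway']],
    Instances={
        'InstanceGroups': [
            {'InstanceRole': 'MASTER', 'InstanceType': 'm5.xlarge',    'InstanceCount': 1},
            {'InstanceRole': 'CORE',   'InstanceType': 'g4dn.2xlarge', 'InstanceCount': 1},
            {'InstanceRole': 'TASK',   'InstanceType': 'g4dn.2xlarge', 'InstanceCount': 1},
        ],
        'Ec2KeyName': 'MACS_30123',                # placeholder: your own key pair
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    Configurations=configurations,
    BootstrapActions=[{
        'Name': 'gpu-bootstrap',
        'ScriptBootstrapAction': {'Path': 's3://macs30123/my-bootstrap-action.sh'}
    }],
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
    VisibleToAllUsers=True,
)
print('Cluster starting:', response['JobFlowId'])
```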
18 | -------------------------------------------------------------------------------- /Labs/Lab 6 PySpark EDA and ML/my-bootstrap-action.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -ex 4 | 5 | sudo chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct 6 | sudo chmod a+rwx -R /sys/fs/cgroup/devices 7 | -------------------------------------------------------------------------------- /Labs/Lab 6 PySpark EDA and ML/my-configurations.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "Classification":"spark", 4 | "Properties":{ 5 | "enableSparkRapids":"true" 6 | } 7 | }, 8 | { 9 | "Classification":"yarn-site", 10 | "Properties":{ 11 | "yarn.nodemanager.resource-plugins":"yarn.io/gpu", 12 | "yarn.resource-types":"yarn.io/gpu", 13 | "yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices":"auto", 14 | "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables":"/usr/bin", 15 | "yarn.nodemanager.linux-container-executor.cgroups.mount":"true", 16 | "yarn.nodemanager.linux-container-executor.cgroups.mount-path":"/sys/fs/cgroup", 17 | "yarn.nodemanager.linux-container-executor.cgroups.hierarchy":"yarn", 18 | "yarn.nodemanager.container-executor.class":"org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor" 19 | } 20 | }, 21 | { 22 | "Classification":"container-executor", 23 | "Properties":{ 24 | 25 | }, 26 | "Configurations":[ 27 | { 28 | "Classification":"gpu", 29 | "Properties":{ 30 | "module.enabled":"true" 31 | } 32 | }, 33 | { 34 | "Classification":"cgroups", 35 | "Properties":{ 36 | "root":"/sys/fs/cgroup", 37 | "yarn-hierarchy":"yarn" 38 | } 39 | } 40 | ] 41 | }, 42 | { 43 | "Classification":"spark-defaults", 44 | "Properties":{ 45 | "spark.plugins":"com.nvidia.spark.SQLPlugin", 46 | "spark.sql.sources.useV1SourceList":"", 47 | "spark.executor.resource.gpu.discoveryScript":"/usr/lib/spark/scripts/gpu/getGpusResources.sh", 48 | "spark.submit.pyFiles":"/usr/lib/spark/jars/xgboost4j-spark_3.0-1.0.0-0.2.0.jar", 49 | "spark.executor.extraLibraryPath":"/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native", 50 | "spark.rapids.sql.concurrentGpuTasks":"2", 51 | "spark.executor.resource.gpu.amount":"1", 52 | "spark.executor.cores":"8", 53 | "spark.task.cpus ":"1", 54 | "spark.task.resource.gpu.amount":"0.125", 55 | "spark.rapids.memory.pinnedPool.size":"2G", 56 | "spark.executor.memoryOverhead":"2G", 57 | "spark.locality.wait":"0s", 58 | "spark.sql.shuffle.partitions":"200", 59 | "spark.sql.files.maxPartitionBytes":"256m", 60 | "spark.sql.adaptive.enabled":"false" 61 | } 62 | }, 63 | { 64 | "Classification":"capacity-scheduler", 65 | "Properties":{ 66 | "yarn.scheduler.capacity.resource-calculator":"org.apache.hadoop.yarn.util.resource.DominantResourceCalculator" 67 | } 68 | } 69 | ] 70 | -------------------------------------------------------------------------------- /Labs/Lab 6 PySpark EDA and ML/spark.sbatch: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #SBATCH --job-name=spark 4 | #SBATCH --partition=broadwl 5 | #SBATCH --output=spark.out 6 | #SBATCH --error=spark.err 7 | #SBATCH --ntasks=10 8 | 9 | module load python/anaconda-2019.03 10 | module load spark/2.4.3 11 | 12 | export 
PYSPARK_DRIVER_PYTHON=/software/Anaconda3-2019.03-el7-x86_64/bin/python 13 | 14 | spark-submit --master local[*] 7M_PySpark_Midway.py 15 | -------------------------------------------------------------------------------- /Labs/Lab 7 Exploring the Larger Spark Ecosystem/bootstrap: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | set -x -e 3 | 4 | echo -e 'export PYSPARK_PYTHON=/usr/bin/python3 5 | export HADOOP_CONF_DIR=/etc/hadoop/conf 6 | export SPARK_JARS_DIR=/usr/lib/spark/jars 7 | export SPARK_HOME=/usr/lib/spark' >> $HOME/.bashrc && source $HOME/.bashrc 8 | 9 | sudo python3 -m pip install awscli boto spark-nlp 10 | 11 | set +x 12 | exit 0 13 | -------------------------------------------------------------------------------- /Labs/Lab 7 Exploring the Larger Spark Ecosystem/spark_streaming_emr.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "93541f0c", 6 | "metadata": {}, 7 | "source": [ 8 | "## Querying Streaming Spark DataFrames in an EMR Notebook\n", 9 | "\n", 10 | "In this notebook, we will read data from a modified version of the Kinesis stream from Lab 5 into a Spark streaming DataFrame. Once we've loaded our streaming DataFrame, we'll perform a simple query on it and write the results of our query to S3 for further analysis.\n", 11 | "\n", 12 | "We've modified the producer from Lab 5 to send tweet-like JSON data into our `test_stream` Kinesis stream in the form of `{\"username\": ..., \"age\": ..., \"num_followers\": ..., \"tweet\": ...}` (by adding additional test data for `age` and `num_followers`). If you're following along with the code in this notebook, be sure to use a similar producer script to put data into a Kinesis stream:\n", 13 | "\n", 14 | "```\n", 15 | "import boto3\n", 16 | "import testdata\n", 17 | "import json\n", 18 | "\n", 19 | "kinesis = boto3.client('kinesis', region_name='us-east-1')\n", 20 | "\n", 21 | "# Continously write Twitter-like data into Kinesis stream\n", 22 | "while 1 == 1:\n", 23 | " test_tweet = {'username': testdata.get_username(),\n", 24 | " 'age': testdata.get_int(18, 100),\n", 25 | " 'num_followers': testdata.get_int(0, 10000),\n", 26 | " 'tweet': testdata.get_ascii_words(280)\n", 27 | " }\n", 28 | " kinesis.put_record(StreamName = \"test_stream\",\n", 29 | " Data = json.dumps(test_tweet),\n", 30 | " PartitionKey = \"partitionkey\"\n", 31 | " )\n", 32 | "```\n", 33 | "\n", 34 | "First, let's add the [Spark Structured Streaming package](https://spark.apache.org/docs/2.4.7/structured-streaming-programming-guide.html) to our session configuration (we'll specifically add a version that makes it possible to interact with Kinesis streams):" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "id": "15f174dc", 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "%%configure -f\n", 45 | "{ \"conf\": {\"spark.jars.packages\": \"com.qubole.spark/spark-sql-kinesis_2.11/1.1.3-spark_2.4\" }}" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "id": "b7fd35cd", 51 | "metadata": {}, 52 | "source": [ 53 | "Then, we're ready to start reading from our Kinesis stream. 
For this demonstration, we'll start with the latest data in the stream, but we could get more granular if we would like to do so as well:" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 2, 59 | "id": "43beb294", 60 | "metadata": {}, 61 | "outputs": [ 62 | { 63 | "data": { 64 | "application/vnd.jupyter.widget-view+json": { 65 | "model_id": "4e7d52b9d2c543dfbe8fdd0faaeaee6c", 66 | "version_major": 2, 67 | "version_minor": 0 68 | }, 69 | "text/plain": [ 70 | "VBox()" 71 | ] 72 | }, 73 | "metadata": {}, 74 | "output_type": "display_data" 75 | }, 76 | { 77 | "name": "stdout", 78 | "output_type": "stream", 79 | "text": [ 80 | "Starting Spark application\n" 81 | ] 82 | }, 83 | { 84 | "data": { 85 | "text/html": [ 86 | "\n", 87 | "
[Spark/Livy session info table: ID 13, YARN application ID application_1620656441511_0014, kind pyspark, state idle, with Spark UI and driver log links]
" 89 | ], 90 | "text/plain": [ 91 | "" 92 | ] 93 | }, 94 | "metadata": {}, 95 | "output_type": "display_data" 96 | }, 97 | { 98 | "data": { 99 | "application/vnd.jupyter.widget-view+json": { 100 | "model_id": "", 101 | "version_major": 2, 102 | "version_minor": 0 103 | }, 104 | "text/plain": [ 105 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 106 | ] 107 | }, 108 | "metadata": {}, 109 | "output_type": "display_data" 110 | }, 111 | { 112 | "name": "stdout", 113 | "output_type": "stream", 114 | "text": [ 115 | "SparkSession available as 'spark'.\n" 116 | ] 117 | }, 118 | { 119 | "data": { 120 | "application/vnd.jupyter.widget-view+json": { 121 | "model_id": "", 122 | "version_major": 2, 123 | "version_minor": 0 124 | }, 125 | "text/plain": [ 126 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 127 | ] 128 | }, 129 | "metadata": {}, 130 | "output_type": "display_data" 131 | }, 132 | { 133 | "name": "stdout", 134 | "output_type": "stream", 135 | "text": [ 136 | "======================\n", 137 | "DataFrame is streaming" 138 | ] 139 | } 140 | ], 141 | "source": [ 142 | "from pyspark.sql import SparkSession\n", 143 | "from pyspark.sql.functions import from_json, col, json_tuple\n", 144 | "import time\n", 145 | "\n", 146 | "stream_df = spark.readStream \\\n", 147 | " .format('kinesis') \\\n", 148 | " .option('streamName', 'test_stream') \\\n", 149 | " .option('endpointUrl', 'https://kinesis.us-east-1.amazonaws.com')\\\n", 150 | " .option('region', 'us-east-1') \\\n", 151 | " .option('startingposition', 'LATEST')\\\n", 152 | " .load()\n", 153 | "\n", 154 | "if stream_df.isStreaming:\n", 155 | " print('======================')\n", 156 | " print('DataFrame is streaming')" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "id": "3dce2a28", 162 | "metadata": {}, 163 | "source": [ 164 | "Now that we have our streaming DataFrame ready, let's use Spark SQL `select` and `where` methods to query our streaming DataFrame. We'll then write this data out to one of an S3 bucket (you'll need to specify your own and then append it with `/data` and `/checkpoints` directories to follow along). Individual CSVs will be produced for each set of data that is processed in a micro-batch." 
165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 3, 170 | "id": "a697c944", 171 | "metadata": {}, 172 | "outputs": [ 173 | { 174 | "data": { 175 | "application/vnd.jupyter.widget-view+json": { 176 | "model_id": "80acf397249c4d71924d15b417143965", 177 | "version_major": 2, 178 | "version_minor": 0 179 | }, 180 | "text/plain": [ 181 | "VBox()" 182 | ] 183 | }, 184 | "metadata": {}, 185 | "output_type": "display_data" 186 | }, 187 | { 188 | "data": { 189 | "application/vnd.jupyter.widget-view+json": { 190 | "model_id": "", 191 | "version_major": 2, 192 | "version_minor": 0 193 | }, 194 | "text/plain": [ 195 | "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" 196 | ] 197 | }, 198 | "metadata": {}, 199 | "output_type": "display_data" 200 | } 201 | ], 202 | "source": [ 203 | "# start process of querying streaming data\n", 204 | "query = stream_df.selectExpr('CAST(data AS STRING)', 'CAST(approximateArrivalTimestamp as TIMESTAMP)') \\\n", 205 | " .select('approximateArrivalTimestamp', \n", 206 | " json_tuple(col('data'), 'username', 'age', 'num_followers', 'tweet'\n", 207 | " ).alias('username', 'age', 'num_followers', 'tweet')) \\\n", 208 | " .select('approximateArrivalTimestamp', 'username', 'age') \\\n", 209 | " .where('age > 35') \\\n", 210 | " .writeStream \\\n", 211 | " .queryName('counts') \\\n", 212 | " .outputMode('append') \\\n", 213 | " .format('csv') \\\n", 214 | " .option('path', 's3://mrjob-9caa69460249cdb9/data') \\\n", 215 | " .option('checkpointLocation','s3://mrjob-9caa69460249cdb9/checkpoints') \\\n", 216 | " .start()\n", 217 | "\n", 218 | "# let streaming query run for 15 seconds (and continue sending results to CSV in S3), then stop it\n", 219 | "time.sleep(15)\n", 220 | "\n", 221 | "# Stop query; look at results of micro-batch queries in S3 bucket in `/data` directory\n", 222 | "query.stop()" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "id": "dc9c0e66", 228 | "metadata": {}, 229 | "source": [ 230 | "Cool! If we take a look at one of our resulting CSVs over in our S3 bucket (see head below), we can see that it produces the expected results (a selection of columns from the streaming data that is filtered by age). 
This is a great way to quickly process streaming data!\n", 231 | "\n", 232 | "```\n", 233 | "2021-05-10T22:03:40.787Z,Hailiejade,83\n", 234 | "2021-05-10T22:03:40.824Z,Fischer,79\n", 235 | "2021-05-10T22:03:40.866Z,Leonard,46\n", 236 | "2021-05-10T22:03:40.902Z,Vasquez,65\n", 237 | "2021-05-10T22:03:40.937Z,Porter,86\n", 238 | "2021-05-10T22:03:40.978Z,Joan,56\n", 239 | "```" 240 | ] 241 | } 242 | ], 243 | "metadata": { 244 | "kernelspec": { 245 | "display_name": "PySpark", 246 | "language": "", 247 | "name": "pysparkkernel" 248 | }, 249 | "language_info": { 250 | "codemirror_mode": { 251 | "name": "python", 252 | "version": 3 253 | }, 254 | "mimetype": "text/x-python", 255 | "name": "pyspark", 256 | "pygments_lexer": "python3" 257 | } 258 | }, 259 | "nbformat": 4, 260 | "nbformat_minor": 5 261 | } 262 | -------------------------------------------------------------------------------- /Labs/Lab 8 Parallel Computing with Dask/Part I - Dask on EMR/Dask on an EMR Cluster.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Dask on an AWS EMR Cluster\n", 8 | "\n", 9 | "This notebook is intended to be run on an AWS EMR cluster, configured using the steps listed in dask_bootstrap_workflow.md tutorial in this lab directory. The EMR cluster used in this tutorial has two worker m5.xlarge instances within it, each of which has 4 virtual CPU cores and 16 GB of memory (you're welcome to scale your cluster beyond this, though!). If you would like to learn more about working with Dask on EMR clusters, [check out the dask-yarn documentation](https://yarn.dask.org/en/latest/aws-emr.html)." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 7, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "from dask_yarn import YarnCluster\n", 19 | "from dask.distributed import Client" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 15, 25 | "metadata": {}, 26 | "outputs": [ 27 | { 28 | "name": "stderr", 29 | "output_type": "stream", 30 | "text": [ 31 | "distributed.scheduler - INFO - Clear task state\n", 32 | "distributed.scheduler - INFO - Scheduler at: tcp://172.31.21.225:33211\n", 33 | "distributed.scheduler - INFO - dashboard at: :45301\n", 34 | "distributed.scheduler - INFO - Receive client connection: Client-d2dd3a2c-2534-11eb-b6af-632ed99d0685\n", 35 | "distributed.core - INFO - Starting established connection\n", 36 | "distributed.scheduler - INFO - Register worker \n", 37 | "distributed.scheduler - INFO - Starting worker compute stream, tcp://172.31.21.182:38719\n", 38 | "distributed.core - INFO - Starting established connection\n", 39 | "distributed.scheduler - INFO - Register worker \n", 40 | "distributed.scheduler - INFO - Starting worker compute stream, tcp://172.31.25.119:40597\n", 41 | "distributed.core - INFO - Starting established connection\n", 42 | "distributed.scheduler - INFO - Register worker \n", 43 | "distributed.scheduler - INFO - Starting worker compute stream, tcp://172.31.25.119:37545\n", 44 | "distributed.core - INFO - Starting established connection\n", 45 | "distributed.scheduler - INFO - Register worker \n", 46 | "distributed.scheduler - INFO - Starting worker compute stream, tcp://172.31.20.204:44979\n", 47 | "distributed.core - INFO - Starting established connection\n", 48 | "distributed.scheduler - INFO - Register worker \n", 49 | "distributed.scheduler - INFO - Starting worker compute stream, 
tcp://172.31.20.204:45273\n", 50 | "distributed.core - INFO - Starting established connection\n", 51 | "distributed.scheduler - INFO - Register worker \n", 52 | "distributed.scheduler - INFO - Starting worker compute stream, tcp://172.31.27.243:40497\n", 53 | "distributed.core - INFO - Starting established connection\n", 54 | "distributed.scheduler - INFO - Register worker \n", 55 | "distributed.scheduler - INFO - Starting worker compute stream, tcp://172.31.27.243:43533\n", 56 | "distributed.core - INFO - Starting established connection\n" 57 | ] 58 | } 59 | ], 60 | "source": [ 61 | "# Create a cluster where each worker has 1 cores and 4 GiB of memory:\n", 62 | "cluster = YarnCluster(environment=\"/home/hadoop/environment.tar.gz\",\n", 63 | " worker_vcores = 1,\n", 64 | " worker_memory = \"4GiB\"\n", 65 | " )\n", 66 | "\n", 67 | "# Scale cluster out to 8 such workers:\n", 68 | "cluster.scale(8)\n", 69 | "\n", 70 | "# Connect to the cluster (before proceeding, you should wait for workers to be registered by the dask scheduler, as below):\n", 71 | "client = Client(cluster)" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "Once everyone is registered, we can see our workers, virtual cores, and the amount of memory that our cluster is using overall; we can adjust all of this as need be if it doesn't match our hardware well. You click the link below to show your task graphs and execution status of your code as you run it." 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 18, 84 | "metadata": {}, 85 | "outputs": [ 86 | { 87 | "data": { 88 | "text/html": [ 89 | "\n", 90 | "\n", 91 | "\n", 98 | "\n", 106 | "\n", 107 | "
\n", 92 | "

Client

\n", 93 | "
    \n", 94 | "
  • Scheduler: tcp://172.31.21.225:33211
  • \n", 95 | "
  • Dashboard: /proxy/45301/status
  • \n", 96 | "
\n", 97 | "
\n", 99 | "

Cluster

\n", 100 | "
    \n", 101 | "
  • Workers: 7
  • \n", 102 | "
  • Cores: 7
  • \n", 103 | "
  • Memory: 30.06 GB
  • \n", 104 | "
\n", 105 | "
" 108 | ], 109 | "text/plain": [ 110 | "" 111 | ] 112 | }, 113 | "execution_count": 18, 114 | "metadata": {}, 115 | "output_type": "execute_result" 116 | } 117 | ], 118 | "source": [ 119 | "client" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "To start, let's do some simple Dask array operations to demonstrate how arrays and array operations are split up into equal chunks across our 8 workers:" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 19, 132 | "metadata": {}, 133 | "outputs": [ 134 | { 135 | "data": { 136 | "text/html": [ 137 | "\n", 138 | "\n", 139 | "\n", 152 | "\n", 178 | "\n", 179 | "
\n", 140 | "\n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | "
[Dask array summary table: 800 B total in 8 chunks of 112 B each; shape (100,) split into (14,) chunks; 8 tasks, 8 chunks; dtype float64 (numpy.ndarray)]
" 180 | ], 181 | "text/plain": [ 182 | "dask.array" 183 | ] 184 | }, 185 | "execution_count": 19, 186 | "metadata": {}, 187 | "output_type": "execute_result" 188 | } 189 | ], 190 | "source": [ 191 | "import dask.array as da\n", 192 | "\n", 193 | "n = len(client.scheduler_info()['workers'])\n", 194 | "a = da.ones(100, chunks=int(100/n))\n", 195 | "a" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 22, 201 | "metadata": { 202 | "scrolled": true 203 | }, 204 | "outputs": [ 205 | { 206 | "data": { 207 | "text/plain": [ 208 | "100.0" 209 | ] 210 | }, 211 | "execution_count": 22, 212 | "metadata": {}, 213 | "output_type": "execute_result" 214 | } 215 | ], 216 | "source": [ 217 | "a.sum().compute()" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": 23, 223 | "metadata": {}, 224 | "outputs": [ 225 | { 226 | "data": { 227 | "text/plain": [ 228 | "1.0000125081975038" 229 | ] 230 | }, 231 | "execution_count": 23, 232 | "metadata": {}, 233 | "output_type": "execute_result" 234 | } 235 | ], 236 | "source": [ 237 | "x = da.random.random((10000, 10000), chunks=(1000, 1000))\n", 238 | "y = x + x.T\n", 239 | "y.mean().compute()" 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "metadata": {}, 245 | "source": [ 246 | "We can also interact with large data sources in S3 via Dask DataFrames, using a lot of the familiar methods we employ in smaller scale applications in Pandas. Here, we read in 10GB of Amazon's customer book data and perform a few simple operations. If you take a look at the Dask task graph while this is running, you can see that our groupby and sum operations are being performed in parallel by our workers." 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": 24, 252 | "metadata": {}, 253 | "outputs": [], 254 | "source": [ 255 | "import dask.dataframe as dd\n", 256 | "\n", 257 | "df = dd.read_parquet(\"s3://amazon-reviews-pds/parquet/product_category=Books/*.parquet\",\n", 258 | " storage_options={'anon': True, 'use_ssl': False},\n", 259 | " engine='fastparquet')" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 25, 265 | "metadata": {}, 266 | "outputs": [ 267 | { 268 | "data": { 269 | "text/html": [ 270 | "
\n", 271 | "\n", 284 | "\n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | "
[sum(helpful_votes) by star_rating -- 1: 10985502, 2: 5581929, 3: 7021927, 4: 11100563, 5: 44825468]
\n", 318 | "
" 319 | ], 320 | "text/plain": [ 321 | " helpful_votes\n", 322 | "star_rating \n", 323 | "1 10985502\n", 324 | "2 5581929\n", 325 | "3 7021927\n", 326 | "4 11100563\n", 327 | "5 44825468" 328 | ] 329 | }, 330 | "execution_count": 25, 331 | "metadata": {}, 332 | "output_type": "execute_result" 333 | } 334 | ], 335 | "source": [ 336 | "helpful_by_star = (df[['star_rating', 'helpful_votes']].groupby('star_rating')\n", 337 | " .sum())\n", 338 | "helpful_df = helpful_by_star.compute() # returns Pandas DataFrame\n", 339 | "helpful_df" 340 | ] 341 | }, 342 | { 343 | "cell_type": "markdown", 344 | "metadata": {}, 345 | "source": [ 346 | "We can also plot and explore our data using standard Matplotlib plotting functionality:" 347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": 26, 352 | "metadata": {}, 353 | "outputs": [ 354 | { 355 | "data": { 356 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWoAAAEPCAYAAABr4Y4KAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAAASZklEQVR4nO3df7CVdZ3A8fcnYMUWAkap1XC75KyWImLc0CwVtIANxnQmAgfT0mT9sSpt1uS4s+qUrbbMtqW5ruMqZo6Y2M5q1jZNhT9W0+6VK0loZsuut0wvKiCCxI/P/nEOcMWr94D3nPPl3vdrhuHec55zzmeeC28envM950RmIkkq19uaPYAk6c0ZakkqnKGWpMIZakkqnKGWpMIZakkqXN1CHRE3RsTzEfF4Ddt+IyI6qr9+ExGr6zWXJO1pol7rqCPiWGAd8J3MHLcLtzsfOCIzz6jLYJK0h6nbEXVm3ge82P2yiDgwIv4rItoj4v6IeF8PNz0FuK1ec0nSnmZwgx/veuDszHwqIo4ErgWO33ZlRLwHGAv8rMFzSVKxGhbqiBgGHA3cERHbLt5rp83mAIszc0uj5pKk0jXyiPptwOrMnPAm28wBzmvMOJK0Z2jY8rzMXAv8T0TMAoiKw7ddHxEHA6OAhxo1kyTtCeq5PO82KtE9OCI6I+JMYC5wZkQ8BiwHPtHtJqcAi9K385Ok16jb8jxJUt/wlYmSVDhDLUmFq8uqj3333TdbWlrqcdeS1C+1t7evyszRPV1Xl1C3tLTQ1tZWj7uWpH4pIv73ja7z1IckFc5QS1LhDLUkFa5hLyHftGkTnZ2dvPrqq416SPVi6NChjBkzhiFDhjR7FElvomGh7uzsZPjw4bS0tNDtTZnUJJnJCy+8QGdnJ2PHjm32OJLeRMNOfbz66qvss88+RroQEcE+++zj/3CkPUBDz1Eb6bL485D2DD6ZKEmFa/QnvGzX8uV7+vT+Vl45482vX7mSmTNn8vjjvX7WLgCXXXYZw4YN46KLLnrDbTZu3MiMGTNYtWoVF198MbNnz+5xu4ULF9LW1sY111xT02P3ZuHChUydOpX999+/T+5P2pP1dUt2V28NeiuaFur+YOnSpWzatImOjo6GPu7ChQsZN26coZYGiAF16mPLli2cddZZHHrooUydOpUNGzbw9NNPM336dCZOnMgxxxzDE0888brbTZ48mfnz53P00Uczbtw4HnnkEZ5//nlOPfVUOjo6mDBhAk8//TQtLS2sWrUKgLa2NiZPntzrTGvWrKGlpYWtW7cCsH79eg444IDt/wAcddRRjB8/npNPPpmXXnqJxYsX09bWxty5c5kwYQIbNmygvb2d4447jokTJzJt2jSeffZZAL71rW9xyCGHMH78eObMmdN3O1JSQw2oUD/11FOcd955LF++nJEjR3LnnXcyb948rr76atrb21mwYAHnnntuj7d95ZVXePDBB7n22ms544wzeOc738kNN9zAMcccQ0dHBwceeOBuzTRixAgOP/xw7r33XgDuvvtupk2bxpAhQzjttNO46qqrWLZsGYcddhiXX345n/zkJ2ltbeXWW2+lo6ODwYMHc/7557N48WLa29s544wzuOSSSwC48sorWbp0KcuWLeO6667bvZ0mqekG1KmPsWPHMmHCBAAmTpzIypUrefDBB5k1a9b2bTZu3NjjbU855RQAjj32WNauXcvq1av7bK7Zs2dz++23M2XKFBYtWsS5557LmjVrWL16NccddxwAp59++mvm3ObJJ5/k8ccf52Mf+xhQ+V/DfvvtB8D48eOZO3cuJ510EieddFKfzSupsQZUqPfaa8eHng8aNIjnnnuOkSNH1nSOeeelbD0tbRs8ePD2Uxi7sj75xBNP5OKLL+bFF1+kvb2d448/nnXr1tV028zk0EMP5aGHXv9Rk/fccw/33Xcfd911F1/5yldYvnw5gwcPqB+51C8MqFMfO3vHO97B2LFjueOOO4BK9B577LEet7399tsBeOCBBxgxYgQjRox43TYtLS20t7cDcOedd9Y8x7Bhw5g0aRIXXnghM2fOZNCgQYwYMYJRo0Zx//33A3DLLbdsP7oePnw4L7/8MgAHH3wwXV1d20O9adMmli9fztatW3nmmWeYMmUKX//611m9enXN8ZdUlqYdXtVzKcuuuPXWWznnnHP46le/yqZNm5gzZw6HH37467YbNWoURx99NGvXruXGG2/s8b4uvfRSzjzzTL72ta9x5JFH7tIcs2fPZtasWSxZsmT7ZTfffDNnn30269ev573vfS833XQTAJ/5zGc4++yz2XvvvXnooYdYvHgxF1xwAWvWrGHz5s3Mnz+fgw46iFNPPZU1a9aQmXz+859n5MiRuzSTpDLU5cNtW1tbc+cPDlixYgXvf//7+/yxGmHy5MksWLCA1tbWZo/S5/bkn4sE/WcddUS0Z2aPkRnQpz4kaU/gM0s16H464q244oortp8P32bWrFnbl9NJUk8MdQNdcsklRlnSLmvoqY96nA/X7vPnIe0ZGhbqoUOH8sILLxiHQmz74IChQ4c2exRJvWjYqY8xY8bQ2dlJV1dXox5Svdj2UVySytawUA8ZMsSPfJKk3eDyPEkqnKGWpMIZakkqXM2hjohBEbE0In5Qz4EkSa+1K0fUFw
Ir6jWIJKlnNYU6IsYAM4Ab6juOJGlntR5R/wvwJWBr/UaRJPWk11BHxEzg+cxs72W7eRHRFhFtvqhFkvpOLUfUHwZOjIiVwCLg+Ij47s4bZeb1mdmama2jR4/u4zElaeDqNdSZeXFmjsnMFmAO8LPMPLXuk0mSANdRS1Lxdum9PjJzCbCkLpNIknrkEbUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFc5QS1LhDLUkFa7XUEfE0Ih4JCIei4jlEXF5IwaTJFUMrmGbjcDxmbkuIoYAD0TEjzLzF3WeTZJEDaHOzATWVb8dUv2V9RxKkrRDTeeoI2JQRHQAzwM/ycyH6zqVJGm7mkKdmVsycwIwBpgUEeN23iYi5kVEW0S0dXV19fGYkjRw7dKqj8xcDSwBpvdw3fWZ2ZqZraNHj+6b6SRJNa36GB0RI6tf7w18FHiiznNJkqpqWfWxH3BzRAyiEvbvZeYP6juWJGmbWlZ9LAOOaMAskqQe+MpESSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSqcoZakwhlqSSpcr6GOiAMi4ucRsSIilkfEhY0YTJJUMbiGbTYDX8jMRyNiONAeET/JzF/XeTZJEjUcUWfms5n5aPXrl4EVwLvrPZgkqWKXzlFHRAtwBPBwXaaRJL1OzaGOiGHAncD8zFzbw/XzIqItItq6urr6ckZJGtBqCnVEDKES6Vsz8/s9bZOZ12dma2a2jh49ui9nlKQBrZZVHwH8O7AiM/+5/iNJkrqr5Yj6w8CngeMjoqP66+N1nkuSVNXr8rzMfACIBswiSeqBr0yUpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkqnKGWpMIZakkq3OBmDyBp17V8+Z5mjwDAyitnNHuEAaHXUEfEjcBM4PnMHFf/kfxDKEnd1XLqYyEwvc5zSJLeQK+hzsz7gBcbMIskqQc+mShJheuzUEfEvIhoi4i2rq6uvrpbSRrw+izUmXl9ZrZmZuvo0aP76m4lacDz1IckFa7XUEfEbcBDwMER0RkRZ9Z/LEnSNr2uo87MUxoxiCSpZ576kKTCGWpJKpyhlqTCGWpJKpyhlqTCGWpJKpyhlqTCGWpJKpyhlqTC+VFc2mP4yT8aqDyilqTCGWpJKpyhlqTCGWpJKpyhlqTCGWpJKpzL8wrnkjRJHlFLUuEMtSQVzlBLUuEMtSQVzlBLUuEMtSQVzlBLUuEMtSQVzlBLUuEMtSQVzlBLUuEMtSQVzlBLUuEMtSQVzlBLUuEMtSQVzlBLUuEMtSQVzlBLUuEMtSQVzlBLUuFqCnVETI+IJyPitxHx5XoPJUnaoddQR8Qg4NvAXwOHAKdExCH1HkySVFHLEfUk4LeZ+bvM/BOwCPhEfceSJG1TS6jfDTzT7fvO6mWSpAaIzHzzDSJmAdMy83PV7z8NTMrM83fabh4wr/rtwcCTfT/uLtkXWNXkGUrhvtjBfbGD+2KHEvbFezJzdE9XDK7hxp3AAd2+HwP8YeeNMvN64PrdGq8OIqItM1ubPUcJ3Bc7uC92cF/sUPq+qOXUxy+Bv4qIsRHxZ8Ac4K76jiVJ2qbXI+rM3BwRfwv8GBgE3JiZy+s+mSQJqO3UB5n5Q+CHdZ6lrxVzGqYA7osd3Bc7uC92KHpf9PpkoiSpuXwJuSQVzlBLUuEMdT8UEe+LiBMiYthOl09v1kzNEhGTIuKD1a8PiYi/i4iPN3uuZouI7zR7hlJExEeqfy6mNnuWN9Lvz1FHxGcz86Zmz9EoEXEBcB6wApgAXJiZ/1m97tHM/EATx2uoiLiUynvUDAZ+AhwJLAE+Cvw4M69o3nSNExE7L6cNYArwM4DMPLHhQzVRRDySmZOqX59F5e/LfwBTgbsz88pmzteTgRDq/8vMv2z2HI0SEb8CPpSZ6yKiBVgM3JKZ34yIpZl5RHMnbJzqvpgA7AX8ERiTmWsjYm/g4cwc38z5GiUiHgV+DdwAJJVQ30blNRFk5r3Nm67xuv89iIhfAh/PzK6I+HPgF5l5WHMnfL2alueVLiKWvdFVwLsaOUsBBmXmOoDMXBkRk4HFEfEeKvtjINmcmVuA9RHxdGauBcjMDRGxtcmzNVIrcCFwCfDFzOyIiA0DLdDdvC0iRlE59RuZ2QWQma9ExObmjtazfhFqKjGeBry00+UBPNj4cZrqjxExITM7AKpH1jOBG4HijhTq7E8R8fbMXA9M3HZhRIwABkyoM3Mr8I2IuKP6+3P0n7/7u2ME0E6lDxkRf5GZf6w+p1PkwUx/+WH9ABi2LU7dRcSShk/TXKcBrzkqyMzNwGkR8W/NGalpjs3MjbA9VtsMAU5vzkjNk5mdwKyImAGsbfY8zZKZLW9w1Vbg5AaOUrN+f45akvZ0Ls+TpMIZakkqnKGWpMIZahUrIuZHxNsb/JiTI+Lobt+fHRGnNXIGaWc+mahiRcRKoDUza/6IpIgYVF07/WbbDK6uhOnpusuAdZm5YFdmlerJUKsI1VeFfY/KR70NAu6g8gKNJ4FVmTklIv4V+CCwN7A4My+t3nYllXXiU4FrMnNRD/e/hMqa+g9T+YSi3wB/D/wZ8AIwt3q/vwC2AF3A+cAJVMNdvY+Hqbz8eiRwZmbeXz3qXwi8j8pL91uA8zKzrY92jwa4/rKOWnu+6cAfMnMGbH9RymeBKd2OqC/JzBcjYhDw04gYn5nbXpX6amZ+pJfHGJmZx1XvfxRwVGZmRHwO+FJmfiEirqPbEXVEnLDTfQzOzEnVN3a6lMr7hpwLvJSZ4yNiHNDxFvaD9Dqeo1YpfgV8NCKuiohjMnNND9t8qvq+FUuBQ4FDul13ew2P0X2bMcCPq+8H8sXq/dXi+9Xf26kcOQN8BFgEkJmPA2/0lgbSbjHUKkJm/obKy7x/BfxjRPxD9+sjYixwEXBC9c2U7gGGdtvklRoepvs2V1M5TXIY8Dc73deb2Vj9fQs7/kda5MuO1X8YahUhIvYH1mfmd4EFwAeAl4Hh1U3eQSW0ayLiXVTevvStGAH8vvp195eTd3/MWj0AfAoq73nNw
HtPFdWZ56hVisOAf6q+q90m4BzgQ8CPIuLZ6pOJS4HlwO+A/36Lj3cZcEdE/J7KE4hjq5ffTeXdBj9B5cnEWlwL3Fx9F8elVE599HTqRtotrvqQ3qLqk5tDMvPViDgQ+ClwUGb+qcmjqZ/wiFp6694O/DwihlA5X32OkVZf8oha/UpEfJvKWunuvjmQPo5N/Y+hlqTCuepDkgpnqCWpcIZakgpnqCWpcIZakgr3/6h7+/dH/MSEAAAAAElFTkSuQmCC\n", 357 | "text/plain": [ 358 | "
" 359 | ] 360 | }, 361 | "metadata": { 362 | "needs_background": "light" 363 | }, 364 | "output_type": "display_data" 365 | } 366 | ], 367 | "source": [ 368 | "%matplotlib inline\n", 369 | "import matplotlib.pyplot as plt\n", 370 | "\n", 371 | "helpful_df.plot(kind=\"bar\");" 372 | ] 373 | } 374 | ], 375 | "metadata": { 376 | "kernelspec": { 377 | "display_name": "Python 3", 378 | "language": "python", 379 | "name": "python3" 380 | }, 381 | "language_info": { 382 | "codemirror_mode": { 383 | "name": "ipython", 384 | "version": 3 385 | }, 386 | "file_extension": ".py", 387 | "mimetype": "text/x-python", 388 | "name": "python", 389 | "nbconvert_exporter": "python", 390 | "pygments_lexer": "ipython3", 391 | "version": "3.7.6" 392 | } 393 | }, 394 | "nbformat": 4, 395 | "nbformat_minor": 4 396 | } 397 | -------------------------------------------------------------------------------- /Labs/Lab 8 Parallel Computing with Dask/Part I - Dask on EMR/bootstrap-dask: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | HELP="Usage: bootstrap-dask [OPTIONS] 3 | 4 | Adapted from Example AWS EMR Bootstrap Action to install and configure Dask and 5 | Jupyter: 6 | https://github.com/dask/dask-yarn/blob/master/deployment_resources/aws-emr/bootstrap-dask 7 | 8 | By default it does the following things: 9 | - Installs miniconda 10 | - Installs dask, distributed, dask-yarn, pyarrow, and s3fs. This list can be 11 | extended using the --conda-packages flag below. 12 | - Packages this environment for distribution to the workers. 13 | - Installs and starts a jupyter notebook server running on port 8888. This can 14 | be disabled with the --no-jupyter flag below. 15 | 16 | Options: 17 | --jupyter / --no-jupyter Whether to also install and start a Jupyter 18 | Notebook Server. Default is True. 19 | --password, -pw Set the password for the Jupyter Notebook 20 | Server. Default is 'dask-user'. 21 | --conda-packages Extra packages to install from conda. 22 | " 23 | 24 | set -e 25 | 26 | # Parse Inputs. This is specific to this script, and can be ignored 27 | # ----------------------------------------------------------------- 28 | JUPYTER_PASSWORD="dask-user" 29 | EXTRA_CONDA_PACKAGES="" 30 | JUPYTER="true" 31 | 32 | while [[ $# -gt 0 ]]; do 33 | case $1 in 34 | -h|--help) 35 | echo "$HELP" 36 | exit 0 37 | ;; 38 | --no-jupyter) 39 | JUPYTER="false" 40 | shift 41 | ;; 42 | --jupyter) 43 | JUPYTER="true" 44 | shift 45 | ;; 46 | -pw|--password) 47 | JUPYTER_PASSWORD="$2" 48 | shift 49 | shift 50 | ;; 51 | --conda-packages) 52 | shift 53 | PACKAGES=() 54 | while [[ $# -gt 0 ]]; do 55 | case $1 in 56 | -*) 57 | break 58 | ;; 59 | *) 60 | PACKAGES+=($1) 61 | shift 62 | ;; 63 | esac 64 | done 65 | EXTRA_CONDA_PACKAGES="${PACKAGES[@]}" 66 | ;; 67 | *) 68 | echo "error: unrecognized argument: $1" 69 | exit 2 70 | ;; 71 | esac 72 | done 73 | 74 | 75 | # ----------------------------------------------------------------------------- 76 | # 1. Check if running on the master node. If not, there's nothing do. 77 | # ----------------------------------------------------------------------------- 78 | grep -q '"isMaster": true' /mnt/var/lib/info/instance.json \ 79 | || { echo "Not running on master node, nothing to do" && exit 0; } 80 | 81 | 82 | # ----------------------------------------------------------------------------- 83 | # 2. 
Install Miniconda 84 | # ----------------------------------------------------------------------------- 85 | echo "Installing Miniconda" 86 | curl https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.3-Linux-x86_64.sh -o /tmp/miniconda.sh 87 | bash /tmp/miniconda.sh -b -p $HOME/miniconda 88 | rm /tmp/miniconda.sh 89 | echo -e '\nexport PATH=$HOME/miniconda/bin:$PATH' >> $HOME/.bashrc 90 | source $HOME/.bashrc 91 | conda update conda -y 92 | 93 | 94 | # ----------------------------------------------------------------------------- 95 | # 3. Install packages to use in packaged environment 96 | # 97 | # We install a few packages by default, and allow users to extend this list 98 | # with a CLI flag: 99 | # 100 | # - dask-yarn, for deploying Dask on YARN. 101 | # - pyarrow for working with hdfs, parquet, ORC, etc... 102 | # - s3fs for access to s3 (Dask DF requires version 0.4.0 or higher) 103 | # - conda-pack for packaging the environment for distribution 104 | # ----------------------------------------------------------------------------- 105 | echo "Installing base packages" 106 | conda install \ 107 | -c conda-forge \ 108 | -y \ 109 | -q \ 110 | dask-yarn=0.9.0 \ 111 | pyarrow=4.0.0 \ 112 | s3fs=0.4.0 \ 113 | conda-pack=0.6.0 \ 114 | tornado=6.1 \ 115 | $EXTRA_CONDA_PACKAGES 116 | 117 | # ----------------------------------------------------------------------------- 118 | # 4. Package the environment to be distributed to worker nodes 119 | # ----------------------------------------------------------------------------- 120 | echo "Packaging environment" 121 | conda pack -q -o $HOME/environment.tar.gz 122 | 123 | 124 | # ----------------------------------------------------------------------------- 125 | # 5. List all packages in the worker environment 126 | # ----------------------------------------------------------------------------- 127 | echo "Packages installed in the worker environment:" 128 | conda list 129 | 130 | 131 | # ----------------------------------------------------------------------------- 132 | # 6. Configure Dask 133 | # 134 | # This isn't necessary, but for this particular bootstrap script it will make a 135 | # few things easier: 136 | # 137 | # - Configure the cluster's dashboard link to show the proxied version through 138 | # jupyter-server-proxy. This allows access to the dashboard with only an ssh 139 | # tunnel to the notebook. 140 | # 141 | # - Specify the pre-packaged python environment, so users don't have to 142 | # 143 | # - Set the default deploy-mode to local, so the dashboard proxying works 144 | # 145 | # - Specify the location of the native libhdfs library so pyarrow can find it 146 | # on the workers and the client (if submitting applications). 147 | # ------------------------------------------------------------------------------ 148 | echo "Configuring Dask" 149 | mkdir -p $HOME/.config/dask 150 | cat <> $HOME/.config/dask/config.yaml 151 | distributed: 152 | dashboard: 153 | link: "/proxy/{port}/status" 154 | 155 | yarn: 156 | environment: /home/hadoop/environment.tar.gz 157 | deploy-mode: local 158 | 159 | worker: 160 | env: 161 | ARROW_LIBHDFS_DIR: /usr/lib/hadoop/lib/native/ 162 | 163 | client: 164 | env: 165 | ARROW_LIBHDFS_DIR: /usr/lib/hadoop/lib/native/ 166 | EOT 167 | # Also set ARROW_LIBHDFS_DIR in ~/.bashrc so it's set for the local user 168 | echo -e '\nexport ARROW_LIBHDFS_DIR=/usr/lib/hadoop/lib/native' >> $HOME/.bashrc 169 | 170 | 171 | # ----------------------------------------------------------------------------- 172 | # 7. 
If Jupyter isn't requested, we're done 173 | # ----------------------------------------------------------------------------- 174 | if [[ "$JUPYTER" == "false" ]]; then 175 | exit 0 176 | fi 177 | 178 | 179 | # ----------------------------------------------------------------------------- 180 | # 8. Install jupyter notebook server and dependencies 181 | # 182 | # We do this after packaging the worker environments to keep the tar.gz as 183 | # small as possible. 184 | # 185 | # We install the following packages: 186 | # 187 | # - notebook: the Jupyter Notebook Server 188 | # - ipywidgets: used to provide an interactive UI for the YarnCluster objects 189 | # - jupyter-server-proxy: used to proxy the dask dashboard through the notebook server 190 | # ----------------------------------------------------------------------------- 191 | if [[ "$JUPYTER" == "true" ]]; then 192 | echo "Installing Jupyter" 193 | conda install \ 194 | -c conda-forge \ 195 | -y \ 196 | -q \ 197 | notebook \ 198 | ipywidgets \ 199 | jupyter-server-proxy 200 | fi 201 | 202 | 203 | # ----------------------------------------------------------------------------- 204 | # 9. List all packages in the client environment 205 | # ----------------------------------------------------------------------------- 206 | echo "Packages installed in the client environment:" 207 | conda list 208 | 209 | 210 | # ----------------------------------------------------------------------------- 211 | # 10. Configure Jupyter Notebook 212 | # ----------------------------------------------------------------------------- 213 | echo "Configuring Jupyter" 214 | mkdir -p $HOME/.jupyter 215 | HASHED_PASSWORD=`python -c "from notebook.auth import passwd; print(passwd('$JUPYTER_PASSWORD'))"` 216 | cat <> $HOME/.jupyter/jupyter_notebook_config.py 217 | c.NotebookApp.password = u'$HASHED_PASSWORD' 218 | c.NotebookApp.open_browser = False 219 | c.NotebookApp.ip = '0.0.0.0' 220 | EOF 221 | 222 | 223 | # ----------------------------------------------------------------------------- 224 | # 11. Define an upstart service for the Jupyter Notebook Server 225 | # 226 | # This sets the notebook server up to properly run as a background service. 227 | # ----------------------------------------------------------------------------- 228 | echo "Configuring Jupyter Notebook Upstart Service" 229 | cat < /tmp/jupyter-notebook.conf 230 | description "Jupyter Notebook Server" 231 | start on runlevel [2345] 232 | stop on runlevel [016] 233 | respawn 234 | respawn limit unlimited 235 | exec su - hadoop -c "jupyter notebook" >> /var/log/jupyter-notebook.log 2>&1 236 | EOF 237 | sudo mv /tmp/jupyter-notebook.conf /etc/init/ 238 | 239 | # ----------------------------------------------------------------------------- 240 | # 12. 
Start the Jupyter Notebook Server 241 | # ----------------------------------------------------------------------------- 242 | echo "Starting Jupyter Notebook Server" 243 | sudo initctl reload-configuration 244 | sudo initctl start jupyter-notebook 245 | -------------------------------------------------------------------------------- /Labs/Lab 8 Parallel Computing with Dask/Part I - Dask on EMR/dask_bootstrap_workflow.md: -------------------------------------------------------------------------------- 1 | # Launching a Dask Cluster with AWS EMR 2 | 3 | To launch a Dask-compatible EMR cluster, you should first upload the bootstrap-dask script provided in this lab directory to an S3 bucket that you have access to (I uploaded mine to the aws-emr-resources bucket that AWS automatically created for storing EMR notebooks for Labs 7 and 8). This bootstrap script installs Miniconda, Dask, and some other Python packages to your EMR cluster (you can add additional packages via the --bootstrap-actions "Args", as in the example below). It also launches a Jupyter Server on the cluster for you, so that you can use Dask in a Jupyter notebook environment. 4 | 5 | Run the following on your local terminal (wherever you installed AWS CLI) to launch your cluster. Note that you should provide your unique S3 paths for the location that your logs will be written to (the --log-uri field) and the spot that you uploaded your bootstrap script. You should also provide the name of your PEM file in the --ec2-attributes "KeyName" field. Note as well that you can launch whatever types of instances and counts that you wish. Here, we launch 3 m5.xlarge instances in our cluster: 1 instance for the EMR "Master" and 2 worker instances. 6 | 7 | ``` 8 | aws emr create-cluster --name "Dask-Cluster" \ 9 | --release-label emr-5.29.0 \ 10 | --applications Name=Hadoop \ 11 | --log-uri s3://aws-logs-545721747693-us-east-1/elasticmapreduce/ \ 12 | --instance-type m5.xlarge \ 13 | --instance-count 3 \ 14 | --bootstrap-actions Path=s3://aws-emr-resources-545721747693-us-east-1/bootstrap-dask,Args="[--conda-packages,bokeh,fastparquet,python-snappy,snappy,matplotlib]" \ 15 | --use-default-roles \ 16 | --region us-east-1 \ 17 | --ec2-attributes '{"KeyName":"MACS_30123"}' 18 | ``` 19 | 20 | It can take up to ten minutes for your cluster to launch. While the cluster is launching, you'll want to adjust the permissions of the Security Group associated with your "Master" EC2 instance in the cluster so that it will allow you to ssh into it. To do so, you'll need to access the Summary tab of your EMR cluster in the AWS Console and scroll down to the "Security and access" header. Click the security group ID number that is beside the "Security groups for Master" description. On the next screen, click this same security group ID again (corresponding to the name "ElasticMapReduce-master"). Now click the "Edit inbound rules" button and add an inbound rule of type "SSH" and Source type "Anywhere" (you can also limit ssh access to your IP if you'd like for this to be more secure) to your cluster. Finally, click "save rules" to save your new inbound rule. 
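If you would rather add that inbound SSH rule programmatically, the `boto3` sketch below does the equivalent of the console clicks described above. It is a sketch under assumptions: `sg-XXXXXXXX` is a placeholder for the `ElasticMapReduce-master` security group ID you looked up in the cluster's Summary tab, and `0.0.0.0/0` mirrors the "Anywhere" source (swap in your own IP with a `/32` mask to restrict access).

```
# Hypothetical sketch: open port 22 on the EMR master security group so you
# can ssh in. Replace the placeholder group ID with the ElasticMapReduce-master
# group ID from your cluster's Summary tab.
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

ec2.authorize_security_group_ingress(
    GroupId='sg-XXXXXXXX',  # placeholder
    IpPermissions=[{
        'IpProtocol': 'tcp',
        'FromPort': 22,
        'ToPort': 22,
        # "Anywhere" -- use your own IP with a /32 mask for tighter security
        'IpRanges': [{'CidrIp': '0.0.0.0/0',
                      'Description': 'SSH access to EMR master node'}]
    }]
)
```

If the rule already exists on the security group, EC2 will reject the call as a duplicate, which is harmless here.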
21 | 22 | Once your cluster has finished bootstrapping and is running, you can issue the following command on your local machine terminal (assuming a Unix-style terminal) to forward port 8888 from the Jupyter server running on the cluster to your local port 8888 (entering the path to your PEM file along with the correct address for your master node, which is available after the "Master public DNS" header in your cluster's summary tab in the AWS EMR console): 23 | 24 | ``` 25 | ssh -N -L localhost:8888:localhost:8888 -i ~/YOUR_PEM.pem hadoop@YOUR_ADDRESS_HERE 26 | ``` 27 | 28 | Be sure to keep this local terminal window open to maintain your connection to the cluster. Now, if you go to `localhost:8888` in your web browser on your local machine, you can log in to your Dask cluster! The default password for the Jupyter server is `dask-user`. To test to see if Dask is correctly configured, try running the short "Dask on an EMR Cluster" Jupyter notebook provided here in this lab directory (just upload it from within your Jupyter window and run it). The notebook demonstrates how you can launch your cluster and perform basic Dask array and DataFrame operations that scale up the capabilities of Python libraries like NumPy and Pandas. 29 | 30 | When you're done, remember to save your work and download anything you want to save back to your local machine (you can do this by clicking File>Download as>.ipynb), or wherever you are saving your files. When you have everything saved, remember to terminate your EMR cluster (you can do this via the AWS CLI or the AWS Console). You will be charged for as long as your EC2 instances are running. -------------------------------------------------------------------------------- /Labs/Lab 9 Accelerating Dask/8W_Dask_Rapids.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "8W_dask_rapids.ipynb", 7 | "provenance": [], 8 | "collapsed_sections": [], 9 | "toc_visible": true 10 | }, 11 | "kernelspec": { 12 | "name": "python3", 13 | "display_name": "Python 3" 14 | }, 15 | "language_info": { 16 | "name": "python" 17 | }, 18 | "accelerator": "GPU" 19 | }, 20 | "cells": [ 21 | { 22 | "cell_type": "markdown", 23 | "metadata": { 24 | "id": "UhCHb7X_nOyq" 25 | }, 26 | "source": [ 27 | "# Accelerating Dask with GPUs (via RAPIDS)\n", 28 | "\n", 29 | "\"Open\n", 30 | "\n", 31 | "We've seen in lecture how the [RAPIDS libraries](https://rapids.ai/) make it possible to accelerate common analytical workflows on GPUs using libraries like `cudf` (for GPU DataFrames) and `cuml` (for basic GPU machine learning operations on DataFrames). When your data gets especially large (e.g. exceeding the memory capacity of a single GPU) or your computations get especially cumbersome, Dask makes it possible to scale these workflows out even further -- distributing work out across a cluster of GPUs.\n", 32 | "\n", 33 | "In this notebook, we'll work on Colab (with a single GPU in our Dask GPU cluster) so that you can follow along. In AWS Educate, recall that we cannot create GPU clusters. However, this notebook should be runnable on multi-GPU EC2 instances and clusters (on AWS) if you use a personal account to request these resources.\n", 34 | "\n", 35 | "To run this notebook in Colab, let's make sure that we have a GPU allocated. 
In Colab, click \"Runtime\" > \"Change Runtime Type\" to confirm that your \"Hardware Accelerator\" type is \"GPU.\" [Note: if you are allocated a K80 GPU (like we have access to on Midway), you will not be able to run this notebook and you will need to terminate (and restart) your session until you receive a newer GPU].\n", 36 | "\n", 37 | "If we run the command below, you'll see the type of GPU that is being used:" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "metadata": { 43 | "colab": { 44 | "base_uri": "https://localhost:8080/" 45 | }, 46 | "id": "8zKj9laOmtoq", 47 | "outputId": "f0a9a260-3cfa-447a-b1e2-0a4abd5515e2" 48 | }, 49 | "source": [ 50 | "!nvidia-smi" 51 | ], 52 | "execution_count": 1, 53 | "outputs": [ 54 | { 55 | "output_type": "stream", 56 | "text": [ 57 | "Tue May 18 19:47:23 2021 \n", 58 | "+-----------------------------------------------------------------------------+\n", 59 | "| NVIDIA-SMI 465.19.01 Driver Version: 460.32.03 CUDA Version: 11.2 |\n", 60 | "|-------------------------------+----------------------+----------------------+\n", 61 | "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n", 62 | "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n", 63 | "| | | MIG M. |\n", 64 | "|===============================+======================+======================|\n", 65 | "| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |\n", 66 | "| N/A 60C P8 10W / 70W | 0MiB / 15109MiB | 0% Default |\n", 67 | "| | | N/A |\n", 68 | "+-------------------------------+----------------------+----------------------+\n", 69 | " \n", 70 | "+-----------------------------------------------------------------------------+\n", 71 | "| Processes: |\n", 72 | "| GPU GI CI PID Type Process name GPU Memory |\n", 73 | "| ID ID Usage |\n", 74 | "|=============================================================================|\n", 75 | "| No running processes found |\n", 76 | "+-----------------------------------------------------------------------------+\n" 77 | ], 78 | "name": "stdout" 79 | } 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": { 85 | "id": "o51ycJzBrHHm" 86 | }, 87 | "source": [ 88 | "Then, we need to install RAPIDS by running the following cell [Note: this will take some time (and take up much of your available disk-space), so be patient!]" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "metadata": { 94 | "colab": { 95 | "base_uri": "https://localhost:8080/" 96 | }, 97 | "id": "pi7fFRHfnMNX", 98 | "outputId": "864223a0-8986-4970-d6dd-cd8812cc5f06" 99 | }, 100 | "source": [ 101 | "# Install RAPIDS\n", 102 | "!git clone https://github.com/rapidsai/rapidsai-csp-utils.git\n", 103 | "!bash rapidsai-csp-utils/colab/rapids-colab.sh stable\n", 104 | "\n", 105 | "import sys, os, shutil\n", 106 | "\n", 107 | "sys.path.append('/usr/local/lib/python3.7/site-packages/')\n", 108 | "os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'\n", 109 | "os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'\n", 110 | "os.environ[\"CONDA_PREFIX\"] = \"/usr/local\"\n", 111 | "for so in ['cudf', 'rmm', 'nccl', 'cuml', 'cugraph', 'xgboost', 'cuspatial']:\n", 112 | " fn = 'lib'+so+'.so'\n", 113 | " source_fn = '/usr/local/lib/'+fn\n", 114 | " dest_fn = '/usr/lib/'+fn\n", 115 | " if os.path.exists(source_fn):\n", 116 | " print(f'Copying {source_fn} to {dest_fn}')\n", 117 | " shutil.copyfile(source_fn, dest_fn)\n", 118 | "if not os.path.exists('/usr/lib64'):\n", 119 | " os.makedirs('/usr/lib64')\n", 120 | "for so_file in 
os.listdir('/usr/local/lib'):\n", 121 | " if 'libstdc' in so_file:\n", 122 | " shutil.copyfile('/usr/local/lib/'+so_file, '/usr/lib64/'+so_file)\n", 123 | " shutil.copyfile('/usr/local/lib/'+so_file, '/usr/lib/x86_64-linux-gnu/'+so_file)" 124 | ], 125 | "execution_count": 1, 126 | "outputs": [ 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": { 132 | "id": "ws9pAFs2ER4o" 133 | }, 134 | "source": [ 135 | "OK, with RAPIDS installed, let's use `dask_cuda`'s API to launch a GPU cluster and pass this cluster object to our `dask.distributed` client. `LocalCUDACluster()` will count each available GPU in our cluster (in this case, 1 GPU) as a Dask worker and assign it work." 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "metadata": { 141 | "id": "L2BX7q1CnDgG" 142 | }, 143 | "source": [ 144 | "from dask_cuda import LocalCUDACluster\n", 145 | "from dask.distributed import Client\n", 146 | "\n", 147 | "cluster = LocalCUDACluster() # Identify all available GPUs\n", 148 | "client = Client(cluster)" 149 | ], 150 | "execution_count": 2, 151 | "outputs": [] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "metadata": { 156 | "id": "0GkgjBR7FPsG" 157 | }, 158 | "source": [ 159 | "From here, we can use `dask_cudf` to automate the process of partitioning our data across our GPU workers and instantiating a GPU-based DataFrame on our GPU that we can work with. Let's load in the same AirBnB data that we were working with in the `numba` + `dask` CPU demonstration:" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "metadata": { 165 | "colab": { 166 | "base_uri": "https://localhost:8080/", 167 | "height": 450 168 | }, 169 | "id": "BND3eVWathF_", 170 | "outputId": "8046f29a-76a8-4a0a-af88-104fae24248b" 171 | }, 172 | "source": [ 173 | "import dask_cudf\n", 174 | "\n", 175 | "df = dask_cudf.read_csv('listings*.csv')\n", 176 | "df.head()" 177 | ], 178 | "execution_count": 3, 179 | "outputs": [ 180 | { 181 | "output_type": "execute_result", 182 | "data": { 183 | "text/html": [ 184 | "
\n", 185 | "\n", 198 | "\n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | "
[df.head() preview; columns: id, name, host_id, host_name, neighbourhood_group, neighbourhood, latitude, longitude, room_type, price, minimum_nights, number_of_reviews, last_review, reviews_per_month, calculated_host_listings_count, availability_365; first five rows are Boston listings (ids 3781, 6695, 10813, 10986, 13247)]
\n", 318 | "
" 319 | ], 320 | "text/plain": [ 321 | " id ... availability_365\n", 322 | "0 3781 ... 106\n", 323 | "1 6695 ... 40\n", 324 | "2 10813 ... 307\n", 325 | "3 10986 ... 293\n", 326 | "4 13247 ... 0\n", 327 | "\n", 328 | "[5 rows x 16 columns]" 329 | ] 330 | }, 331 | "metadata": { 332 | "tags": [] 333 | }, 334 | "execution_count": 3 335 | } 336 | ] 337 | }, 338 | { 339 | "cell_type": "markdown", 340 | "metadata": { 341 | "id": "y4ikXLjh4RRW" 342 | }, 343 | "source": [ 344 | "Once we have that data, we can perform many of the standard DataFrame operations we perform on CPUs -- just accelerated by our GPU cluster!" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "metadata": { 350 | "colab": { 351 | "base_uri": "https://localhost:8080/" 352 | }, 353 | "id": "GdtSdyF0u8xs", 354 | "outputId": "cd6b5fde-cf8f-483b-ddce-13b4a33066df" 355 | }, 356 | "source": [ 357 | "df.groupby(['neighbourhood', 'room_type']) \\\n", 358 | " .price \\\n", 359 | " .mean() \\\n", 360 | " .compute()" 361 | ], 362 | "execution_count": 4, 363 | "outputs": [ 364 | { 365 | "output_type": "execute_result", 366 | "data": { 367 | "text/plain": [ 368 | "neighbourhood room_type \n", 369 | "North Center Private room 75.818182\n", 370 | "Ashburn Entire home/apt 100.857143\n", 371 | "Edgewater Entire home/apt 140.142857\n", 372 | "South Lawndale Entire home/apt 79.826087\n", 373 | "Auburn Gresham Entire home/apt 135.000000\n", 374 | " ... \n", 375 | "Lakeshore Entire home/apt 205.500000\n", 376 | "Brighton Park Shared room 39.000000\n", 377 | "Lake View Hotel room 656.400000\n", 378 | "North Beach Shared room 31.900000\n", 379 | "Clearing Entire home/apt 90.000000\n", 380 | "Name: price, Length: 341, dtype: float64" 381 | ] 382 | }, 383 | "metadata": { 384 | "tags": [] 385 | }, 386 | "execution_count": 4 387 | } 388 | ] 389 | }, 390 | { 391 | "cell_type": "markdown", 392 | "metadata": { 393 | "id": "mXoQOrdD-vz9" 394 | }, 395 | "source": [ 396 | "One thing to note, though, is that not all of the functionality we might expect out of CPU clusters is available yet in the `cudf`/`dask_cudf` DataFrame implementation.\n", 397 | "\n", 398 | "For instance (and of particular note!), our ability to apply custom functions is still pretty limited. `cudf` uses Numba's CUDA compiler to translate this code for the GPU and [many standard `numpy` operations are not supported](https://numba.pydata.org/numba-doc/dev/cuda/cudapysupported.html#numpy-support) (for instance, if you try to apply the distance calculation with performed in the Numba+Dask CPU demonstration notebook for today, this will fail to compile correctly for the GPU).\n", 399 | "\n", 400 | "That being said, we can perform many base-Python operations inside of custom functions, so if you can express your custom functions in this way, it might be worth your while to do this work on a GPU. 
For example, let's create a custom price index that indicates whether an AirBnB is \"Cheap\" (0), \"Moderately Expensive\" (1), or \"Very Expensive\" (2) using `cudf`'s [`apply_rows` method](https://docs.rapids.ai/api/cudf/stable/guide-to-udfs.html#DataFrame-UDFs):\n" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "metadata": { 406 | "colab": { 407 | "base_uri": "https://localhost:8080/", 408 | "height": 195 409 | }, 410 | "id": "DOo_LkCI0fhl", 411 | "outputId": "8c3f2db9-d4de-4d5a-e1f8-a9fc144ac6b6" 412 | }, 413 | "source": [ 414 | "def expensive(x, price_index):\n", 415 | " # passed through Numba's CUDA compiler and auto-parallelized for GPU\n", 416 | " # for loop is automatically parallelized\n", 417 | " for i, price in enumerate(x):\n", 418 | " if price < 50:\n", 419 | " price_index[i] = 0\n", 420 | " elif price < 100:\n", 421 | " price_index[i] = 1\n", 422 | " else:\n", 423 | " price_index[i] = 2\n", 424 | "\n", 425 | "# Use cudf's `apply_rows` API for applying function to every row in DataFrame\n", 426 | "df = df.apply_rows(expensive,\n", 427 | " incols={'price':'x'},\n", 428 | " outcols={'price_index': int})\n", 429 | "\n", 430 | "# Confirm that price index created correctly\n", 431 | "df[['price', 'price_index']].head()" 432 | ], 433 | "execution_count": 5, 434 | "outputs": [ 435 | { 436 | "output_type": "execute_result", 437 | "data": { 438 | "text/html": [ 439 | "
\n", 440 | "\n", 453 | "\n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | "
priceprice_index
01252
11692
2961
3961
4751
\n", 489 | "
" 490 | ], 491 | "text/plain": [ 492 | " price price_index\n", 493 | "0 125 2\n", 494 | "1 169 2\n", 495 | "2 96 1\n", 496 | "3 96 1\n", 497 | "4 75 1" 498 | ] 499 | }, 500 | "metadata": { 501 | "tags": [] 502 | }, 503 | "execution_count": 5 504 | } 505 | ] 506 | }, 507 | { 508 | "cell_type": "markdown", 509 | "metadata": { 510 | "id": "IB5kFM5C3TTC" 511 | }, 512 | "source": [ 513 | "In addition to preprocessing and analyzing data on GPUs, we can also train (a limited set of) Machine Learning models directly on our GPU cluster using the `cuml` library in the RAPIDS ecoystem as well. \n", 514 | "\n", 515 | "For instance, let's train a linear regression model based on our data from San Francisco, Chicago, and Boston to predict the price of an AirBnB based on other values in its listing information (e.g. \"reviews per month\" and \"minimum nights\"). We'll then use this model to make predictions about the price of AirBnBs in another city (NYC):" 516 | ] 517 | }, 518 | { 519 | "cell_type": "code", 520 | "metadata": { 521 | "id": "3frzCDx-Pkzb" 522 | }, 523 | "source": [ 524 | "from cuml.dask.linear_model import LinearRegression\n", 525 | "import numpy as np\n", 526 | "\n", 527 | "X = df[['reviews_per_month', 'minimum_nights']].astype(np.float32).dropna()\n", 528 | "y = df[['price']].astype(np.float32).dropna()\n", 529 | "fit = LinearRegression().fit(X, y)" 530 | ], 531 | "execution_count": 6, 532 | "outputs": [] 533 | }, 534 | { 535 | "cell_type": "markdown", 536 | "metadata": { 537 | "id": "fJR31Zss2VgY" 538 | }, 539 | "source": [ 540 | "Then, we can read in the NYC dataset and make predictions about what prices will be in NYC on the basis of the model we trained on data from our three original cities:" 541 | ] 542 | }, 543 | { 544 | "cell_type": "code", 545 | "metadata": { 546 | "colab": { 547 | "base_uri": "https://localhost:8080/" 548 | }, 549 | "id": "7C18rvfJDjM_", 550 | "outputId": "f8483688-d395-40ce-959a-c23858a4d1e4" 551 | }, 552 | "source": [ 553 | "df_nyc = dask_cudf.read_csv('test*.csv')\n", 554 | "X_test = df_nyc[['reviews_per_month', 'minimum_nights']].astype(np.float32) \\\n", 555 | " .dropna()\n", 556 | "fit.predict(X_test) \\\n", 557 | " .compute() \\\n", 558 | " .head()" 559 | ], 560 | "execution_count": 7, 561 | "outputs": [ 562 | { 563 | "output_type": "execute_result", 564 | "data": { 565 | "text/plain": [ 566 | "0 184.802887\n", 567 | "1 188.286636\n", 568 | "2 184.802887\n", 569 | "3 183.658218\n", 570 | "4 186.646774\n", 571 | "dtype: float32" 572 | ] 573 | }, 574 | "metadata": { 575 | "tags": [] 576 | }, 577 | "execution_count": 7 578 | } 579 | ] 580 | } 581 | ] 582 | } 583 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Large-Scale Computing for the Social Sciences 2 | ### Spring 2021 - MACS 30123/MAPS 30123/PLSC 30123 3 | 4 | | Instructor Information | **TA Information** | **TA Information** | Course Information | 5 | | :------------- | :------------- | :------------- | :------------ | 6 | | Jon Clindaniel | Yongfei Lu | Luxin Tian | Location: [Online](https://canvas.uchicago.edu/courses/34598) | 7 | | 1155 E. 60th Street, Rm. 
215 | | | Monday/Wednesday | 8 | | jclindaniel@uchicago.edu | yongfeilu@uchicago.edu | luxintian@uchicago.edu | 9:10-11:10 AM (CDT) | 9 | | **Office Hours:** [Schedule](https://appoint.ly/s/jclindaniel/office-hours)\* | **Office Hours:** [Schedule](https://appoint.ly/s/yongfeilu/20-min)\* | **Office Hours:** [Schedule](https://appoint.ly/s/luxintian/macs30123)\*| **Lab:** Prerecorded, [Online](https://canvas.uchicago.edu/courses/34598) | 10 | 11 | \* Office Hours held via Zoom 12 | 13 | ## Course Description 14 | Computational social scientists increasingly need to grapple with data that is either too big for a local machine and/or code that is too resource intensive to process on a local machine. In this course, students will learn how to effectively scale their computational methods beyond their local machines. The focus of the course will be social scientific applications, ranging from training machine learning models on large economic time series to processing and analyzing social media data in real-time. Students will be introduced to several large-scale computing frameworks such as MPI, MapReduce, Spark, Dask, and OpenCL, with a special emphasis on employing these frameworks using cloud resources and the Python programming language. 15 | 16 | *Prerequisites: MACS 30121 and MACS 30122, or equivalent (e.g. CAPP 30121 and CAPP 30122).* 17 | 18 | ## Course Structure 19 | This course is structured into several modules, or overarching thematic learning units, focused on teaching students fundamental large-scale computing concepts, as well as giving them the opportunity to apply these concepts to Computational Social Science research problems. Students can access the content in these modules on the [Canvas course site](https://canvas.uchicago.edu/courses/34598). Each module consists of a series of asynchronous lectures, readings, and assignments, which will make up a portion of the instruction time each week. If students have any questions about the asynchronous content, they should post these questions in the Ed Discussion forum for the course, which they can access by clicking the "Ed Discussion" tab on the left side of the screen on the Canvas course site. To see an overall schedule and syllabus for the course, as well as access additional course-related files, please visit the GitHub Course Repository, available here. 20 | 21 | Additionally, students will attend short, interactive Zoom sessions during regular class hours (starting at 9:10 AM CDT) on Mondays and Wednesdays meant to give them the opportunity to discuss and apply the skills they have learned asynchronously and receive live instructor feedback. Attendance to the synchronous class sessions is mandatory and is an important component of the final course grade. Students can access the Zoom sessions by clicking on the "Zoom" tab on the left side of the screen on the Canvas course site. Students should prepare for these classes by reading the assigned readings ahead of every session. All readings are available online and are linked in the course schedule below. 22 | 23 | Students will also virtually participate in a hands-on Python lab section each week, meant to instill practical large-scale computing skills related to the week’s topic. These labs are accessible on the Canvas course site and related code will be posted here in this GitHub repository. 
In order to practice these large-scale computing skills and complete the course assignments, students will be given free access to UChicago's [Midway Cluster](https://rcc.uchicago.edu/docs/), [Amazon Web Services (AWS)](https://aws.amazon.com/) cloud computing resources, and [DataCamp](https://www.datacamp.com/). More information about accessing these resources will be provided to registered students in the first several weeks of the quarter. 24 | 25 | ## Grading 26 | There will be an assignment due at the end of each unit (3 in total). Each assignment is worth 20% of the overall grade, with all assignments together worth a total of 60%. Additionally, attendance and participation will be worth 10% of the overall grade. Finally, students will complete a final project that is worth 30% of the overall grade (25% for the project itself, and 5% for an end-of-quarter presentation). 27 | 28 | | Course Component | Grade Percentage | 29 | | :------------- | :------------- | 30 | | Assignments (Total: 3) | 60% | 31 | | Attendance/Participation | 10% | 32 | | Final Project | 5% (Presentation) | 33 | | | 25% (Project) | 34 | 35 | Grades are not curved in this class or, at least, not in the traditional sense. We use a standard set of grade boundaries: 36 | * 95-100: A 37 | * 90-95: A- 38 | * 85-90: B+ 39 | * 80-85: B 40 | * 75-80: B- 41 | * 70-75: C+ 42 | * <70: Dealt on a case-by-case basis 43 | 44 | We curve only to the extent we might lower the boundaries for one or more letter grades, depending on the distribution of the raw scores. We will not raise the boundaries in response to the distribution. 45 | 46 | So, for example, if you have a total score of 82 in the course, you are guaranteed to get, at least, a B (but may potentially get a higher grade if the boundary for a B+ is lowered). 47 | 48 | ## Final Project 49 | For their final project (due June 4th, 2021), students will write large-scale computing code that solves a social science research problem of their choosing. For instance, students might perform a computationally intensive demographic simulation, or they may choose to collect, analyze, and visualize large, streaming social media data, or do something else that employs large-scale computing strategies. Students will additionally record a short video presentation about their project. Detailed descriptions and grading rubrics for the project and presentation are available [on the Canvas course site.](https://canvas.uchicago.edu/courses/34598) 50 | 51 | ## Late Assignments/Projects 52 | Unexcused Late Assignment/Project Submissions will be penalized 10 percentage points for every hour they are late. For example, if an assignment is due on Wednesday at 2:00pm, the following percentage points will be deducted based on the time stamp of the last commit. 53 | 54 | | Example last commit | Percentage points deducted | 55 | | ---- | ---- | 56 | | 2:01pm to 3:00pm | -10 percentage points | 57 | | 3:01pm to 4:00pm |-20 percentage points | 58 | | 4:01pm to 5:00pm | -30 percentage points | 59 | | 5:01pm to 6:00pm | -40 percentage points | 60 | | ... | ... | 61 | | 11:01pm and beyond | -100 percentage points (no credit) | 62 | 63 | ## Plagiarism on Assignments/Projects 64 | Academic honesty is an extremely important principle in academia and at the University of Chicago. 65 | * Writing assignments must quote and cite any excerpts taken from another work. 
66 | * If the cited work is the particular paper referenced in the Assignment, no works cited or references are necessary at the end of the composition. 67 | * If the cited work is not the particular paper referenced in the Assignment, you MUST include a works cited or references section at the end of the composition. 68 | * Any copying of other students' work will result in a zero grade and potential further academic discipline. 69 | If you have any questions about citations and references, consult with your instructor. 70 | 71 | ## Statement of Diversity and Inclusion 72 | The University of Chicago is committed to diversity and rigorous inquiry from multiple perspectives. The MAPSS, CIR, and Computation programs share this commitment and seek to foster productive learning environments based upon inclusion, open communication, and mutual respect for a diverse range of identities, experiences, and positions. 73 | 74 | Any suggestions for how we might further such objectives both in and outside the classroom are appreciated and will be given serious consideration. Please share your suggestions or concerns with your instructor, your preceptor, or your program’s Diversity and Inclusion representatives: Darcy Heuring (MAPSS), Matthias Staisch (CIR), and Chad Cyrenne (Computation). You are also welcome and encouraged to contact the Faculty Director of your program. 75 | 76 | This course is open to all students who meet the academic requirements for participation. Any student who has a documented need for accommodation should contact Student Disability Services (773-702-6000 or disabilities@uchicago.edu) and the instructor as soon as possible. 77 | 78 | ## Course Schedule 79 | | Unit | Week | Day | Topic | Readings | Assignment | 80 | | --- | --- | --- | --- | --- | --- | 81 | | Fundamentals of Large-Scale Computing | Week 1: Introduction to Large-Scale Computing for the Social Sciences | 3/29/2021 | Introduction to the Course | | | 82 | | | | 3/31/2021 | General Considerations for Large-Scale Computing | [Robey and Zamora 2020 (Chapter 1)](https://livebook.manning.com/book/parallel-and-high-performance-computing/chapter-1), [Faster code via static typing (Cython)](http://docs.cython.org/en/latest/src/quickstart/cythonize.html), [A ~5 minute guide to Numba](https://numba.readthedocs.io/en/stable/user/5minguide.html) | | 83 | | | Week 2: On-Premise Large-Scale CPU-computing with MPI | 4/5/2021 | An Introduction to CPUs and Computing Clusters | [Pacheco 2011](https://canvas.uchicago.edu/files/5233897/download?download_frd=1) (Ch. 1-2), [Midway RCC User Guide](https://rcc.uchicago.edu/docs/) | | 84 | | | | 4/7/2021 | Cluster Computing via Message Passing Interface (MPI) for Python | [Pacheco 2011](https://canvas.uchicago.edu/files/5233897/download?download_frd=1) (Ch. 3), [Dalcín et al. 2008](https://www-sciencedirect-com.proxy.uchicago.edu/science/article/pii/S0743731507001712?via%3Dihub) | | 85 | |||| Lab: Hands-on Introduction to UChicago's Midway Computing Cluster and mpi4py ||| 86 | | | Week 3: On-Premise GPU-computing with OpenCL | 4/12/2021 | An Introduction to GPUs and Heterogenous Computing with OpenCL | [Scarpino 2012](https://canvas.uchicago.edu/files/5233898/download?download_frd=1) (Read Ch. 1, Skim Ch. 2-5,9) | | 87 | | | | 4/14/2021 | Harnessing GPUs with PyOpenCL | [Klöckner et al. 
2012](https://arxiv.org/pdf/0911.3456.pdf) | | 88 | |||| Lab: Introduction to PyOpenCL and GPU Computing on UChicago's Midway Computing Cluster ||| 89 | | Architecting Computational Social Science Data Solutions in the Cloud | Week 4: An Introduction to Cloud Computing and Cloud HPC Architectures | 4/19/2021 | An Introduction to the Cloud Computing Landscape and AWS | [Jorissen and Bouffler 2017](https://canvas.uchicago.edu/files/5233891/download?download_frd=1) (Read Ch. 1, Skim Ch. 4-7), [Armbrust et al. 2009](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf), [Jonas et al. 2019](https://arxiv.org/pdf/1902.03383.pdf) | | 90 | | | | 4/21/2021 | Architectures for Large-Scale Computation in the Cloud | [Introduction to HPC on AWS](https://d1.awsstatic.com/whitepapers/Intro_to_HPC_on_AWS.pdf), [HPC Architectural Best Practices](https://d1.awsstatic.com/whitepapers/architecture/AWS-HPC-Lens.pdf) | Due: Assignment 1 (11:59 PM) | 91 | |||| Lab: Running "Serverless" HPC Jobs in the AWS Cloud ||| 92 | | | Week 5: Architecting Large-Scale Data Solutions in the Cloud | 4/26/2021 | "Data Lake" Architectures | [Data Lakes and Analytics on AWS](https://aws.amazon.com/big-data/datalakes-and-analytics/), [AWS Data Lake Whitepaper](https://d1.awsstatic.com/whitepapers/Storage/data-lake-on-aws.pdf), [*Introduction to AWS Boto in Python*](https://campus.datacamp.com/courses/introduction-to-aws-boto-in-python) (DataCamp Course; Practice working with S3 Data Lake in Python) | | 93 | | | | 4/28/2021 | Architectures for Large-Scale Data Structuring and Storage | General, Open Source: ["What is a Database?" (YouTube),](https://www.youtube.com/watch?v=Tk1t3WKK-ZY) ["How to Choose the Right Database?" (YouTube),](https://www.youtube.com/watch?v=v5e_PasMdXc) [“NoSQL Explained,”](https://www.mongodb.com/nosql-explained) AWS-specific solutions: ["Which Database to Use When?" (YouTube),](https://youtu.be/KWOSGVtHWqA) Optional: [Data Warehousing on AWS Whitepaper](https://d0.awsstatic.com/whitepapers/enterprise-data-warehousing-on-aws.pdf), [AWS Big Data Whitepaper](https://d1.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS.pdf) | | 94 | |||| Lab: Exploring Large Data Sources in an S3 Data Lake with the AWS Python SDK, Boto ||| 95 | | | Week 6: Large-Scale Data Ingestion and Processing | 5/3/2021 | Stream Ingestion and Processing with Apache Kafka, AWS Kinesis | [Narkhede et al. 2017](https://canvas.uchicago.edu/files/5233889/download?download_frd=1) (Read Ch. 1, Skim 3-6,11), [Dean and Crettaz 2019 (Ch. 4)](https://livebook.manning.com/book/event-streams-in-action/chapter-4/), [AWS Kinesis Whitepaper](https://d0.awsstatic.com/whitepapers/whitepaper-streaming-data-solutions-on-aws-with-amazon-kinesis.pdf) || 96 | | | | 5/5/2021 | Batch Processing with Apache Hadoop and MapReduce | [White 2015](https://canvas.uchicago.edu/files/5233885/download?download_frd=1) (read Ch. 
1-2, Skim 3-4), [Dean and Ghemawat 2004](https://www.usenix.org/legacy/publications/library/proceedings/osdi04/tech/full_papers/dean/dean.pdf), ["What is Amazon EMR?"](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html), Running MapReduce Jobs with Python’s “mrjob” package on EMR ([Fundamentals](https://mrjob.readthedocs.io/en/latest/guides/quickstart.html) and [Elastic MapReduce Quickstart](https://mrjob.readthedocs.io/en/latest/guides/emr-quickstart.html)) | | 97 | |||| Lab: Ingesting and Processing Batch Data with MapReduce on AWS EMR and Streaming Data with AWS Kinesis | || 98 | | High-Level Paradigms for Large-Scale Data Analysis, Prediction, and Presentation | Week 7: Spark | 5/10/2021 | Large-Scale Data Processing and Analysis with Spark | [Karau et al. 2015](https://canvas.uchicago.edu/files/5233908/download?download_frd=1) (Read Ch. 1-4, Skim 9-11), [*Introduction to PySpark*](https://learn.datacamp.com/courses/introduction-to-pyspark) (DataCamp Course) | | 99 | | | | 5/12/2021 | A Deeper Dive into Spark (with a survey of Social Science Machine Learning, NLP, and Network Analysis Applications) | [*Machine Learning with PySpark*](https://campus.datacamp.com/courses/machine-learning-with-pyspark) (DataCamp Course), [Guller 2015](https://canvas.uchicago.edu/files/5233903/download?download_frd=1), [Hunter 2017](https://www.youtube.com/watch?v=NmbKst7ny5Q) (Spark Summit Talk), [GraphFrames Documentation for Python](https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-python.html), [Spark NLP Documentation](https://nlp.johnsnowlabs.com/), Optional: [*Feature Engineering with PySpark*](https://learn.datacamp.com/courses/feature-engineering-with-pyspark) (DataCamp Course), Videos about accelerating Spark with GPUs (via [Horovod](https://www.youtube.com/watch?v=D1By2hy4Ecw) for deep learning, and the RAPIDS libraries for both [ETL and ML acceleration in Spark 3.0](https://www.youtube.com/watch?v=4MI_LYah900)) | Due: Assignment 2 (11:59 PM) | 100 | |||| Lab: Running PySpark in an AWS EMR Notebook for Large-Scale Data Analysis and Prediction ||| 101 | | | Week 8: Dask | 5/17/2021 | Introduction to Dask | [“Why Dask,”](https://docs.dask.org/en/latest/why.html) [Dask Slide Deck](https://dask.org/slides.html), [*Parallel Programming with Dask*](https://learn.datacamp.com/courses/parallel-programming-with-dask-in-python) (DataCamp Course) | | 102 | | | | 5/19/2021 | Natively Scaling the Python Ecosystem with Dask | | | 103 | |||| Lab: Scaling up Python Analytical Workflows in the Cloud and on the RCC in a Jupyter Notebook with Dask | || 104 | | | Week 9: Presenting Data and Insights from Large-Scale Data Pipelines | 5/24/2021 | Building and Deploying (Scalable) Public APIs and Web Applications with Flask and AWS Elastic Beanstalk | Documentation for [Flask](https://flask-doc.readthedocs.io/en/latest/index.html), and [Elastic Beanstalk](https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/Welcome.html) | | 105 | || | 5/26/2021 | Visualizing Large Data | Documentation for [DataShader](https://datashader.org/index.html) and [Bokeh](https://bokeh.org/), and integrating the two libraries using [HoloViews](http://holoviews.org/user_guide/Large_Data.html) | | 106 | | | | 5/28/2021 | | | Due: Assignment 3 (11:59 PM) | 107 | | Student Projects | Week 10: Final Projects | 6/4/2021 ||| Due: Final Project + Presentation Video (11:59 PM) | 108 | 109 | ## Works Cited 110 | 111 | "A ~5 minute guide to Numba." 
https://numba.readthedocs.io/en/stable/user/5minguide.html. Accessed 3/2021. 112 | 113 | Armbrust, Michael, Fox, Armando, Griffith, Rean, Joseph, Anthony D., Katz, Randy H., Konwinski, Andrew, Lee, Gunho, Patterson, David A., Rabkin, Ariel, Stoica, Ion, and Matei Zaharia. 2009. "Above the Clouds: A Berkeley View of Cloud Computing." Technical report, EECS Department, University of California, Berkeley. 114 | 115 | ["AWS Big Data Analytics Options on AWS." December 2018.)](https://d1.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS.pdf) AWS Whitepaper. 116 | 117 | "AWS Elastic Beanstalk Developer Guide." https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/Welcome.html. Accessed 3/2021. 118 | 119 | ["Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility." July 2017.](https://d1.awsstatic.com/whitepapers/Storage/data-lake-on-aws.pdf) 120 | 121 | Dalcín, Lisandro, Paz, Rodrigo, Storti, Mario, and Jorge D'Elía. 2008. "MPI for Python: Performance improvements and MPI-2 extensions." *J. Parallel Distrib. Comput.* 68: 655-662. 122 | 123 | ["Data Warehousing on AWS." March 2016.](https://d0.awsstatic.com/whitepapers/enterprise-data-warehousing-on-aws.pdf) AWS Whitepaper. 124 | 125 | "DataShader Documentation." https://datashader.org/index.html. Accessed 3/2021. 126 | 127 | Dean, Alexander, and Valentin Crettaz. 2019. *Event Streams in Action: Real-time event systems with Kafka and Kinesis*. Shelter Island, NY: Manning. 128 | 129 | Dean, Jeffrey, and Sanjay Ghemawat. 2004. "MapReduce: Simplified data processing on large clusters." In *Proceedings of Operating Systems Design and Implementation (OSDI)*. San Francisco, CA. 137-150. 130 | 131 | Evans, Robert and Jason Lowe. "Deep Dive into GPU Support in Apache Spark 3.x." https://www.youtube.com/watch?v=4MI_LYah900. Accessed 3/2021. 132 | 133 | "Faster code via static typing." http://docs.cython.org/en/latest/src/quickstart/cythonize.html. Accessed 3/2021 134 | 135 | *Feature Engineering with PySpark*. https://learn.datacamp.com/courses/feature-engineering-with-pyspark. Accessed 3/2020. 136 | 137 | "Flask Documentation." https://flask-doc.readthedocs.io/en/latest/index.html. Accessed 3/2021. 138 | 139 | "GraphFrames user guide - Python." https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-python.html. Accessed 3/2020. 140 | 141 | Guller, Mohammed. 2015. "Graph Processing with Spark." In *Big Data Analytics with Spark*. New York: Apress. 142 | 143 | ["High Performance Computing Lens AWS Well-Architected Framework." December 2019.](https://d1.awsstatic.com/whitepapers/architecture/AWS-HPC-Lens.pdf) AWS Whitepaper. 144 | 145 | Hunter, Tim. October 26, 2017. "GraphFrames: Scaling Web-Scale Graph Analytics with Apache Spark." https://www.youtube.com/watch?v=NmbKst7ny5Q. 146 | 147 | *Introduction to AWS Boto in Python*. https://campus.datacamp.com/courses/introduction-to-aws-boto-in-python. Accessed 3/2020. 148 | 149 | ["Introduction to HPC on AWS." n.d.](https://d1.awsstatic.com/whitepapers/Intro_to_HPC_on_AWS.pdf) AWS Whitepaper. 150 | 151 | *Introduction to PySpark*. https://learn.datacamp.com/courses/introduction-to-pyspark. Accessed 3/2020. 152 | 153 | Jonas, Eric, Schleier-Smith, Johann, Sreekanti, Vikram, and Chia-Che Tsai. 2019. "Cloud Programming Simplified: A Berkeley View on Serverless Computing." Technical report, EECS Department, University of California, Berkeley. 154 | 155 | Jorissen, Kevin, and Brendan Bouffler. 2017. *AWS Research Cloud Program: Researcher's Handbook*. 
Amazon Web Services. 156 | 157 | Kane, Frank. March 23, 2018. ["How to Choose the Right Database? - MongoDB, Cassandra, MySQL, HBase"](https://www.youtube.com/watch?v=v5e_PasMdXc). https://www.youtube.com/watch?v=v5e_PasMdXc 158 | 159 | Karau, Holden, Konwinski, Andy, Wendell, Patrick, and Matei Zaharia. 2015. *Learning Spark*. Sebastopol, CA: O'Reilly. 160 | 161 | Klöckner, Andreas, Pinto, Nicolas, Lee, Yunsup, Catanzaro, Bryan, Ivanov, Paul, and Ahmed Fasih. 2012. "PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation." *Parallel Computing* 38(3): 157-174. 162 | 163 | Linux Academy. July 10, 2019. ["What is a Database?"](https://www.youtube.com/watch?v=Tk1t3WKK-ZY). https://www.youtube.com/watch?v=Tk1t3WKK-ZY 164 | 165 | *Machine Learning with PySpark*. https://campus.datacamp.com/courses/machine-learning-with-pyspark. Accessed 3/2020. 166 | 167 | "mrjob v0.7.1 documentation." https://mrjob.readthedocs.io/en/latest/index.html. Accessed 3/2020. 168 | 169 | Narkhede, Neha, Shapira, Gwen, and Todd Palino. 2017. *Kafka: The Definitive Guide*. Sebastopol, CA: O'Reilly. 170 | 171 | “NoSQL Explained.” https://www.mongodb.com/nosql-explained. Accessed 3/2020. 172 | 173 | Pacheco, Peter. 2011. *An Introduction to Parallel Programming*. Burlington, MA: Morgan Kaufmann. 174 | 175 | *Parallel Programming with Dask*. https://learn.datacamp.com/courses/parallel-programming-with-dask-in-python. Accessed 3/2020. 176 | 177 | Petrossian, Tony, and Ian Meyers. November 30, 2017. "Which Database to Use When?" https://youtu.be/KWOSGVtHWqA. AWS re:Invent 2017. 178 | 179 | "RCC User Guide." rcc.uchicago.edu/docs/. Accessed March 2020. 180 | 181 | Robey, Robert and Yuliana Zamora. 2020. *Parallel and High Performance Computing*. Manning Early Access Program. 182 | 183 | Scarpino, Matthew. 2012. *OpenCL in Action*. Shelter Island, NY: Manning. 184 | 185 | Sergeev, Alex. March 28, 2019. "Distributed Deep Learning with Horovod." https://www.youtube.com/watch?v=D1By2hy4Ecw. 186 | 187 | "Spark NLP Documentation." https://nlp.johnsnowlabs.com/. Accessed 3/2021. 188 | 189 | ["Streaming Data Solutions on AWS with Amazon Kinesis." July 2017.](https://d0.awsstatic.com/whitepapers/whitepaper-streaming-data-solutions-on-aws-with-amazon-kinesis.pdf) AWS Whitepaper. 190 | 191 | "The Bokeh Visualization Library Documentation." https://bokeh.org/. Accessed 3/2021. 192 | 193 | "What is Amazon EMR." https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html. Accessed 3/2020. 194 | 195 | White, Tom. 2015. *Hadoop: The Definitive Guide*. Sebastopol, CA: O'Reilly. 196 | 197 | "Why Dask." https://docs.dask.org/en/latest/why.html. Accessed 3/2020. 198 | 199 | "Working with large data using datashader." http://holoviews.org/user_guide/Large_Data.html. Accessed 3/2021. 200 | --------------------------------------------------------------------------------