├── lecture2 ├── parts │ ├── file2.txt │ ├── file1.txt │ ├── example_file.txt │ ├── file3.txt │ ├── subfolder │ │ └── mod.py │ ├── 6-conclusion.py │ ├── 2-commands.py │ ├── 1-introduction.py │ ├── 3-informational.py │ ├── 5-git.py │ └── 4-help-and-doing-stuff.py └── README.md ├── lecture1 ├── .gitignore ├── extras │ ├── main_test.py │ ├── module_test.py │ ├── cut.py │ └── failures.py ├── parts │ ├── life-expectancy-row1.csv │ ├── verify.py │ ├── throughput_latency.py │ ├── 1-introduction.py │ ├── 6-conclusion.py │ ├── 2-etl.py │ ├── 4-properties.py │ ├── 5-performance.py │ └── 3-dataflow-graphs.py └── README.md ├── .gitignore ├── lecture7 ├── file.txt ├── example-policy.json └── README.md ├── lecture0 ├── Lecture 0 Slides.pdf └── README.md ├── lecture5 ├── extras │ ├── narrow_wide.png │ ├── parallel_test.py │ ├── exercises.py │ ├── examples.py │ ├── cut.py │ └── dataframe.py ├── README.md └── parts │ ├── 5-dataframes.py │ ├── 6-latency-throughput.py │ └── 1-RDDs.py ├── lecture4 ├── extras │ ├── scaling-example.png │ ├── dataflow-graph-example.png │ ├── resources.md │ └── data-race-example.py ├── README.md └── parts │ ├── 6-distribution.py │ ├── 1-motivation.py │ ├── 2-parallelism.py │ └── 5-quantifying.py ├── lecture3 ├── README.md └── old │ └── README.md ├── exams ├── final.md ├── midterm_study_list.md ├── poll_answers.md └── final_study_list.md ├── LICENSE ├── lecture6 ├── parts │ ├── orders.md │ ├── 4-end-notes.py │ └── 2-microbatching.py ├── extras │ └── streaming.py └── README.md └── schedule.md /lecture2/parts/file2.txt: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /lecture2/parts/file1.txt: -------------------------------------------------------------------------------- 1 | ../lecture1 2 | -------------------------------------------------------------------------------- /lecture1/.gitignore: -------------------------------------------------------------------------------- 1 | output.csv 2 | save.txt 3 | -------------------------------------------------------------------------------- /lecture2/parts/example_file.txt: -------------------------------------------------------------------------------- 1 | Commit test 2 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | __pycache__/ 3 | notes/ 4 | -------------------------------------------------------------------------------- /lecture7/file.txt: -------------------------------------------------------------------------------- 1 | Test data 2 | Apple banana orange 3 | -------------------------------------------------------------------------------- /lecture2/parts/file3.txt: -------------------------------------------------------------------------------- 1 | ../lecture1 2 | ../lecture2 3 | ../lecture3 4 | ../lecture4 5 | -------------------------------------------------------------------------------- /lecture1/extras/main_test.py: -------------------------------------------------------------------------------- 1 | import lecture 2 | 3 | print("Hello from main_test.py") 4 | -------------------------------------------------------------------------------- /lecture0/Lecture 0 Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DavisPL-Teaching/119/HEAD/lecture0/Lecture 0 Slides.pdf 
-------------------------------------------------------------------------------- /lecture5/extras/narrow_wide.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DavisPL-Teaching/119/HEAD/lecture5/extras/narrow_wide.png -------------------------------------------------------------------------------- /lecture4/extras/scaling-example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DavisPL-Teaching/119/HEAD/lecture4/extras/scaling-example.png -------------------------------------------------------------------------------- /lecture2/parts/subfolder/mod.py: -------------------------------------------------------------------------------- 1 | """ 2 | A new module that we created from the command line 3 | """ 4 | 5 | print("Hello from submodule") 6 | -------------------------------------------------------------------------------- /lecture4/extras/dataflow-graph-example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DavisPL-Teaching/119/HEAD/lecture4/extras/dataflow-graph-example.png -------------------------------------------------------------------------------- /lecture1/parts/life-expectancy-row1.csv: -------------------------------------------------------------------------------- 1 | Entity,Code,Year,Period life expectancy at birth - Sex: all - Age: 0 2 | Afghanistan,AFG,1950,27.7275 3 | -------------------------------------------------------------------------------- /lecture0/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 0: Course Introduction 2 | 3 | This was the first lecture (on Wednesday, Sep 24). 4 | We went over an introduction to the course, syllabus, and schedule! 5 | The slides can be found in `Lecture 0 Slides.pdf`. 6 | -------------------------------------------------------------------------------- /lecture3/README.md: -------------------------------------------------------------------------------- 1 | # Spring 2025 2 | 3 | This shorter lecture was skipped for Spring 2025. 4 | 5 | Some aspects of Pandas were covered on HW1. 6 | 7 | We will move directly to Lecture 4. 8 | 9 | ## Please note 10 | 11 | - Any uncovered material will NOT be covered on the midterm. 12 | -------------------------------------------------------------------------------- /lecture4/extras/resources.md: -------------------------------------------------------------------------------- 1 | ### Resources and further reading 2 | 3 | [Parallel Computing: Theory and Practice](https://www.cs.cmu.edu/afs/cs/academic/class/15210-f15/www/tapp.html) 4 | 5 | A good textbook on parallel computing (written by Umut A. Acar at CMU). 6 | Covers the concepts, design, and implementation of parallel algorithms in more detail, work/span analysis, fork/join parallelism, etc. 7 | -------------------------------------------------------------------------------- /exams/final.md: -------------------------------------------------------------------------------- 1 | # Final details 2 | 3 | Thursday, Dec 11, 8-10am, same room as lecture 4 | 5 | Closed-book, on paper, one-sided cheat sheet allowed (handwritten or typed). 6 | 7 | Similar structure to the midterm: 8 | 10 true/false, 8 multiple choice / short answer, 2 free response 9 | 10 | Study topic list: see `final_study_list.md` 11 | 12 | Study questions: go over the in-class polls! 
13 | 14 | No questions that ask you to hand-write code 15 | (concepts, not syntax). 16 | -------------------------------------------------------------------------------- /lecture7/example-policy.json: -------------------------------------------------------------------------------- 1 | { 2 | "Id": "Policy1733525110646", 3 | "Version": "2012-10-17", 4 | "Statement": [ 5 | { 6 | "Sid": "Stmt1733525108327", 7 | "Action": "s3:*", 8 | "Effect": "Allow", 9 | "Resource": "arn:aws:s3:::119-test-bucket-2", 10 | "Principal": { 11 | "AWS": [ 12 | "arn:aws:iam::472501947158:user/caleb" 13 | ] 14 | } 15 | } 16 | ] 17 | } 18 | -------------------------------------------------------------------------------- /lecture1/extras/module_test.py: -------------------------------------------------------------------------------- 1 | """ 2 | This is a little test script to talk about 3 | Python modules and scope. 4 | 5 | We may get to it in lecture 1 or in a future lecture. 6 | """ 7 | 8 | print("Python modules and scope") 9 | 10 | import sys 11 | import types 12 | def imports(): 13 | for name, val in globals().items(): 14 | if isinstance(val, types.ModuleType): 15 | yield val.__name__ 16 | 17 | print("__name__:", __name__) 18 | print("All modules:", sys.modules.keys()) 19 | print("Local modules:", list(imports())) 20 | print("This module:", sys.modules[__name__]) 21 | print("This module:", sys.modules[__name__].__name__) 22 | -------------------------------------------------------------------------------- /lecture5/extras/parallel_test.py: -------------------------------------------------------------------------------- 1 | """ 2 | A little test to show how RDDs are parallelized. 3 | """ 4 | 5 | from pyspark.sql import SparkSession 6 | spark = SparkSession.builder.appName("DataflowGraphExample").getOrCreate() 7 | sc = spark.sparkContext 8 | 9 | # Modify as needed 10 | N = 1_000_000_000 11 | 12 | result = (sc 13 | .parallelize(range(1, N)) 14 | # Uncomment to force only a single partition 15 | # .map(lambda x: (0, x)) # first element of ordered pair is the key I want to parallelize on 16 | # .partitionBy(1) 17 | # .map(lambda x: x[1]) 18 | .map(lambda x: x ** 2) 19 | .filter(lambda x: x >= 100 and x < 1000) 20 | .collect() 21 | ) 22 | 23 | print(result) 24 | -------------------------------------------------------------------------------- /lecture5/extras/exercises.py: -------------------------------------------------------------------------------- 1 | """ 2 | Some MapReduce exercises 3 | 4 | (Skipped - will appear on the homework.) 5 | 6 | https://github.com/DavisPL-Teaching/119-hw2 7 | """ 8 | 9 | # Spark boilerplate (remember to always add this at the top of any Spark file) 10 | import pyspark 11 | from pyspark.sql import SparkSession 12 | spark = SparkSession.builder.appName("DataflowGraphExample").getOrCreate() 13 | sc = spark.sparkContext 14 | 15 | """ 16 | 1. Among the numbers from 1 to 1000, which digit is most common? 17 | the least common? 18 | 19 | 2. Among the numbers from 1 to 1000, written out in English, which character is most common? 20 | the least common? 21 | 22 | 3. Does the answer change if we have the numbers from 1 to 1,000,000? 
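(Not the HW solutions -- just a sketch of the general counting pattern these
exercises use, so the syntax is in one place. It assumes the `sc` SparkContext
from the boilerplate above: `flatMap` emits one record per digit, `reduceByKey`
totals the counts, and `max`/`min` with a key function pick out the extremes.)

    digit_counts = (sc
        .parallelize(range(1, 1001))
        .flatMap(lambda n: list(str(n)))     # e.g. 123 -> ['1', '2', '3']
        .map(lambda d: (d, 1))
        .reduceByKey(lambda a, b: a + b))

    most_common = digit_counts.max(key=lambda pair: pair[1])
    least_common = digit_counts.min(key=lambda pair: pair[1])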
23 | """ 24 | -------------------------------------------------------------------------------- /lecture1/extras/cut.py: -------------------------------------------------------------------------------- 1 | """ 2 | Additional cut material 3 | 4 | === Advantages of software view === 5 | 6 | Advantages of thinking of data processing pipelines as software: 7 | 8 | - Software *design* matters: structuring code into modules, classes, functions 9 | - Software can be *tested*: validating functions, validating inputs, unit & integration tests 10 | - Software can be *reused* and maintained (not just a one-off script) 11 | - Software can be developed collaboratively (Git, GitHub) 12 | - Software can be optimized for performance (parallelism, distributed computing, etc.) 13 | 14 | It is a little more work to structure our code this way! 15 | But it helps ensure that our work is reusable and integrates well with other teams, projects, etc. 16 | """ 17 | -------------------------------------------------------------------------------- /lecture5/extras/examples.py: -------------------------------------------------------------------------------- 1 | """ 2 | Just some simple examples for syntax reference. 3 | """ 4 | 5 | ### RDD part 6 | 7 | # Start a Spark session 8 | from pyspark.sql import SparkSession 9 | spark = SparkSession.builder.appName("DataflowGraphExample").getOrCreate() 10 | sc = spark.sparkContext 11 | 12 | data = sc.parallelize(range(1, 11)) # RDD containing integers 1 to 10 13 | 14 | mapped_data = data.map(lambda x: x ** 2) # [1, 4, 9, ..., 100] 15 | 16 | filtered_data = mapped_data.filter(lambda x: x > 50) # [64, 81, 100] 17 | 18 | ### DataFrame part 19 | 20 | # Start a Spark session 21 | from pyspark.sql import SparkSession 22 | from pyspark.sql.functions import col 23 | spark = SparkSession.builder.appName("DataFrameExample").getOrCreate() 24 | 25 | # Create a DataFrame with integers from 1 to 10 26 | data = spark.createDataFrame([(i,) for i in range(1, 11)], ["number"]) 27 | 28 | mapped_data = data.withColumn("squared", col("number") ** 2) 29 | 30 | filtered_data = mapped_data.filter(col("squared") > 50) 31 | 32 | filtered_data.show() 33 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 Caleb Stanford 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /lecture6/parts/orders.md: -------------------------------------------------------------------------------- 1 | # Orders (copy paste into `nc` window) 2 | 3 | {"order_number": 1, "item": "Apple", "timestamp": "2025-11-24 10:00:00", "qty": 2} 4 | {"order_number": 2, "item": "Banana", "timestamp": "2025-11-24 10:01:00", "qty": 3} 5 | {"order_number": 3, "item": "Orange", "timestamp": "2025-11-24 10:02:00", "qty": 1} 6 | 7 | ### More examples 8 | {"order_number": 4, "item": "Apple", "timestamp": "2025-11-24 10:03:00", "qty": 2} 9 | {"order_number": 5, "item": "Banana", "timestamp": "2025-11-24 10:04:00", "qty": 1} 10 | {"order_number": 6, "item": "Orange", "timestamp": "2025-11-24 10:05:00", "qty": 1} 11 | 12 | ### Stress testing 13 | 14 | {"order_number": 6, "item": "Orange", "timestamp": "2025-11-24 10:05:00", "qty": 100} 15 | {"order_number": 6, "item": "Orange", "timestamp": "2025-11-24 10:05:00", "qty": 10000} 16 | {"order_number": 3, "item": "Grapes", "timestamp": "2025-11-27 15:44:00", "qty": 5} 17 | {"order_number": 3, "item": "Grapes", "timestamp": "2025-11-27 15:44:00", "qty": 500} 18 | {"order_number": 3, "item": "Grapes", "timestamp": "2025-11-27 15:44:00", "qty": 50000} 19 | {"order_number": 3, "item": "Orange", "timestamp": "2025-11-27 15:44:00", "qty": 5000000} 20 | -------------------------------------------------------------------------------- /lecture7/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 7: Brief lecture on Cloud Computing 2 | 3 | **This is an old version of the lecture from the Fall 2024 iteration of the course. It has not yet been updated for Fall 2025.** 4 | 5 | ## Dec 6 6 | 7 | Announcements: 8 | https://piazza.com/class/m12ef423uj5p5/post/165 9 | 10 | - OH today and Monday 11 | 12 | - HW2, HW1 makeup, all in-class polls due Monday EOD 13 | 14 | - Final is Wednesday, 6-8pm 15 | 16 | ## Outline for today 17 | 18 | Start with the poll 19 | 20 | Cloud computing in AWS: 21 | 22 | - Getting started; what to know up front 23 | 24 | - Storing data (S3) 25 | 26 | - Running computations (EC2 and Lambda) 27 | 28 | If we have an extra 10-15 minutes at the end of class, I will reserve it for an open Q+A. 29 | 30 | Questions about HW2 or the final or anything else? 31 | 32 | ## Poll 33 | 34 | Getting the poll out of the way: 35 | 36 | https://forms.gle/3e6vJHShMJfaD2pH8 37 | 38 | A streaming system processes 20 input rows over the duration of 1 minute that the system is running. The system may use parallelism, meaning that at any given point in time, more than one input may be processed. 39 | 40 | Given this information, what is the minimum and maximum latency for individual input rows? 41 | -------------------------------------------------------------------------------- /lecture6/extras/streaming.py: -------------------------------------------------------------------------------- 1 | """ 2 | A minimal example of a streaming pipeline in PySpark 3 | using Structured Streaming. 
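The pipeline below reads newline-delimited JSON from a local TCP socket
(localhost:9999), parses each line against the schema defined in this file,
and prints the parsed rows to the console in append mode.

One way to try it out (a sketch; assumes netcat is installed and that you run
the script from this directory):

    # terminal 1: open the socket that the job reads from
    nc -lk 9999

    # terminal 2: start the streaming job
    python3 streaming.py

Then paste JSON lines such as those in lecture6/parts/orders.md, e.g.
    {"order_number": 1, "item": "Apple", "timestamp": "2025-11-24 10:00:00", "qty": 2}
into the netcat window and watch the parsed batch appear in terminal 2.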
4 | 5 | (Remember to use nc -lk 9999 to run) 6 | """ 7 | 8 | from pyspark.sql import SparkSession 9 | from pyspark.sql.functions import from_json, col 10 | from pyspark.sql.types import StructType, StructField, StringType, IntegerType 11 | 12 | # Create a Spark session 13 | spark = SparkSession.builder.appName("streaming").getOrCreate() 14 | 15 | # Define the schema of the incoming JSON data 16 | schema = StructType([ 17 | StructField("order_number", IntegerType()), 18 | StructField("item", StringType()), 19 | StructField("timestamp", StringType()), 20 | StructField("qty", IntegerType()) 21 | ]) 22 | 23 | # Use local socket as a streaming source 24 | streaming_df = spark.readStream.format("socket") \ 25 | .option("host", "localhost") \ 26 | .option("port", 9999) \ 27 | .load() 28 | 29 | # Parse the JSON data 30 | parsed_df = streaming_df.select(from_json(col("value").cast("string"), schema).alias("parsed_value")) 31 | 32 | # Start the streaming query 33 | query = parsed_df.writeStream.outputMode("append").format("console").start() 34 | 35 | # Wait for the streaming to finish 36 | query.awaitTermination() 37 | -------------------------------------------------------------------------------- /lecture1/parts/verify.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | import sys 4 | import matplotlib.pyplot as plt 5 | import pyspark.sql as sql 6 | 7 | def verify(): 8 | print() 9 | print("Python version:", sys.version) 10 | print("Numpy version:", np.__version__) 11 | print("Pandas version:", pd.__version__) 12 | print("Matplotlib version:", plt.matplotlib.__version__) 13 | print() 14 | 15 | # Simple numpy and pandas operations to verify functionality 16 | # - Create a numpy array 17 | array = np.array([1, 2, 3]) 18 | print("Numpy array:", array) 19 | 20 | # - Add stuff to the array 21 | array = array + 4 22 | print("Numpy array after addition:", array) 23 | 24 | # - Make a pandas DataFrame 25 | df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}) 26 | print("Pandas DataFrame:\n", df1) 27 | print() 28 | 29 | # A basic pyspark test 30 | spark = sql.SparkSession.builder.appName("Verify").getOrCreate() 31 | df2 = spark.createDataFrame([ 32 | sql.Row(a=1, b=2., c='string1'), 33 | sql.Row(a=2, b=3., c='string2'), 34 | sql.Row(a=4, b=5., c='string3'), 35 | ]) 36 | df2.show() 37 | print("Spark version:", spark.version) 38 | print() 39 | 40 | # Plot the first DataFrame using matplotlib 41 | plt.plot(df1) 42 | plt.show() 43 | 44 | if __name__ == "__main__": 45 | verify() 46 | -------------------------------------------------------------------------------- /lecture6/parts/4-end-notes.py: -------------------------------------------------------------------------------- 1 | """ 2 | Part 4: End notes 3 | 4 | === Poll === 5 | 6 | This is the last poll! 7 | 8 | What type of time corresponds to each of the following scenarios? 
9 | 
10 | (Real time, event time, system time, logical time)
11 | 
12 | Select all that apply
13 | 
14 | https://forms.gle/NCXfDV4J3ySWiyiT6
15 | 
16 | === Summary ===
17 | 
18 | We've seen:
19 | 
20 | - Streaming systems: differ from batch processing systems in that they process
21 |   one item (or row) in your input at a time
22 | 
23 | - This is useful for "latency-critical" applications where you want, say, sub-second
24 |   or sub-millisecond level response times
25 | 
26 | - Measuring latency at an individual item level:
27 | 
28 |     Recall formula for latency:
29 | 
30 |     (exit time item X) - (start time item X)
31 | 
32 | - Microbatching: an optimization that trades latency for higher throughput
33 | 
34 |   Microbatching can be based on different notions of time! Usually event time or system time
35 | 
36 |   Microbatching - still a streaming system!
37 | 
38 | - Time: Real, Event, System, Logical
39 | 
40 |   + Monotonic time
41 | 
42 | === Discussion and Failure Cases ===
43 | 
44 | The major advantage of streaming pipelines (e.g., in Spark Streaming)
45 | is better latency.
46 | 
47 | However, they have additional failure cases compared to their batch counterparts.
48 | Let's cover a few of these:
49 | 
50 | - Out-of-order data (late arrivals)
51 | 
52 | - Clock drift and non-monotonic clocks
53 | 
54 |   (a streaming system cares about time - a batch processing system didn't!)
55 | 
56 | - Too much data
57 | 
58 | .
59 | .
60 | .
61 | 
62 | Q: How do we deal with out-of-order data?
63 | 
64 | Q: How do we deal with clocks being wrong?
65 | 
66 | Q: How do we deal with too much data?
67 | 
68 | Q: What happens when our pipeline is overloaded with too much data, and the above techniques fail?
69 | """
--------------------------------------------------------------------------------
/schedule.md:
--------------------------------------------------------------------------------
1 | # ECS 119 Tentative Course Schedule - Fall 2025
2 | 
3 | **Important note:**
4 | This schedule is subject to change.
5 | I will try to keep it up to date, but please see Piazza for the latest information and homework deadlines.
6 | 7 | ## Section 1: Data Processing Basics 8 | 9 | | Week | Date | Topic | Readings & HW | Lecture \# and part | 10 | | --- | --- | --- | --- | --- | 11 | | 0 | Sep 24 | Introduction | | 0 | 12 | | | Sep 26 | Introduction to Data Processing Software | | 1.1 | 13 | | 1 | Sep 29 | | | 1.2 | 14 | | | Oct 1 | **No class** | HW0 Due | | 15 | | | Oct 3 | | | 1.3 | 16 | | 2 | Oct 6 | | HW1 Available | 1.4 | 17 | | | Oct 8 | | | 1.5 | 18 | | | Oct 10 | | | 1.6 | 19 | | 3 | Oct 13 | The Shell | | 2.1 | 20 | | | Oct 15 | | | 2.2 | 21 | | | Oct 17 | | HW1 Due | 2.3 | 22 | | 4 | Oct 20 | | | 2.4 | 23 | | | Oct 22 | | | 2.5, 2.6 | 24 | | | Oct 24 | Parallelism | | 4.1 | 25 | 26 | ## Section 2: Parallelism 27 | 28 | | Week | Date | Topic | Readings & HW | Lecture # | 29 | | --- | --- | --- | --- | --- | 30 | | 5 | Oct 27 | | | 4.2 | 31 | | | Oct 29 | | | 4.3 | 32 | | | Oct 31 | | | 4.3, 4.4 | 33 | | 6 | Nov 3 | Review or Overflow | | 4.4 | 34 | | | Nov 5 | **Midterm** | | | 35 | | | Nov 7 | | | 4.5 | 36 | | 7 | Nov 10 | | | 4.5, 4.6 | 37 | | | Nov 12 | Distributed Pipelines | HW2 available | 5.1 | 38 | | | Nov 14 | | | 5.2 | 39 | | 8 | Nov 17 | | | 5.3 | 40 | | | Nov 19 | | | 5.4 | 41 | | | Nov 21 | | | 5.4, 5.5, 5.6 | 42 | 43 | ## Section 3: Distributed Computing 44 | 45 | | Week | Date | Topic | Readings & HW | Lecture # | 46 | | --- | --- | --- | --- | --- | 47 | | 9 | Nov 24 | Streaming Pipelines | HW2 due | 6.1 | 48 | | | Nov 26 | | | 6.2, 6.3, 6.4 | 49 | | | Nov 28 | **No Class** (Thanksgiving) | 50 | | 10 | Dec 1 | selection of additional topics[^1] | | 7 | 51 | | | Dec 3 | selection of additional topics[^1] | | 7 | 52 | | | Dec 5 | selection of additional topics[^1] | | 7 | 53 | | 11 | Dec 11 | **Final Exam (8am)** | | | 54 | 55 | ## Notes 56 | 57 | [^1]: Possible additional topics (depending on time and preference): 58 | cloud computing; 59 | data validation and integrity; 60 | data cleaning; 61 | missing data; 62 | distributed programming: failures and consistency requirements; 63 | containerization and orchestration; 64 | cloud computing: platforms, services, and resources. 65 | -------------------------------------------------------------------------------- /lecture5/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 5: Distributed Pipelines 2 | 3 | ## Nov 12 4 | 5 | Announcements: 6 | 7 | - HW2 now available! (Early release) 8 | 9 | https://github.com/DavisPL-Teaching/119-hw2 10 | 11 | Due Monday, Nov 24 12 | 13 | Some minor changes possible before Monday, will be announced on Piazza 14 | 15 | Plan: 16 | 17 | - Start with poll 18 | 19 | - Lecture 5, part 1: introduction to distributed pipelines and PySpark. 20 | 21 | Questions? 22 | 23 | ## Friday, November 14 24 | 25 | Announcements: 26 | 27 | - HW2 due Nov 24 28 | 29 | I made a few clarifications based on fixes from last year! 30 | Please pull to get the latest. 31 | 32 | + [Diff 1](https://github.com/DavisPL-Teaching/119-hw2/commit/f81558e317fbe427367da3b4c5828265ab4085be) 33 | 34 | + [Diff 2](https://github.com/DavisPL-Teaching/119-hw2/commit/7e6beaab93cdf1f30ea1fbc55535c2db9f99208a) 35 | 36 | Plan: 37 | 38 | - Poll 39 | 40 | - Lecture 5, part 2: Properties of RDDs 41 | 42 | - (If time) continue to Lecture 6, part 3: MapReduce. 43 | 44 | Questions? 
45 | 46 | ## Monday, Nov 17 47 | 48 | Reminders: 49 | 50 | - HW2 due Nov 24 (1 week from today) 51 | 52 | OH: today 415pm, Friday 11am, Monday 24th 4:15pm 53 | 54 | Plan: 55 | 56 | - Finish loose ends from Lecture 5, part 2 57 | 58 | - Poll 59 | 60 | - Part 3: introduction to MapReduce. 61 | 62 | Questions? 63 | 64 | ## Wednesday, Nov 19 65 | 66 | Announcements and reminders: 67 | 68 | - HW2 due Monday (in 5 days) 69 | 70 | Partial autograder is now available! 71 | Try it out ahead of time 72 | 73 | It will give you a preliminary score out of a maximum of 28. 74 | (Currently 38 - will be updated after this evening, will be 28.) 75 | 76 | - ICYMI: 77 | 78 | + `github_help.md` file for setting up your own Git repo! 79 | 80 | + `hints.md` has hints for some problems. 81 | 82 | - Some extra MapReduce material in `extras/` - some of this is covered as part of your homework 83 | 84 | Plan: 85 | 86 | - Part 4: Data Partitioning 87 | 88 | - (If time) Part 5 on DataFrames 89 | 90 | Any questions? 91 | 92 | ## Friday, Nov 21st 93 | 94 | Reminders: 95 | 96 | - HW2 due on Monday 11:59pm 97 | 98 | Plan: 99 | 100 | - Narrow and wide operators (finishing up Part 4) 101 | 102 | - Poll 103 | 104 | - DataFrames (finishing up part 5 - this will be covered relatively briefly) 105 | 106 | - Part 6: end note on disadvantages of Spark. 107 | -------------------------------------------------------------------------------- /lecture3/old/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 3: Data Operators 2 | 3 | **This is an old version of the lecture from the Fall 2024 iteration of the course. It has not yet been updated for Fall 2025.** 4 | 5 | ## Oct 14 6 | 7 | Announcements: 8 | 9 | - HW1 Part 1 is available -- due a week from Friday 10 | 11 | https://piazza.com/class/m12ef423uj5p5/post/34 12 | https://github.com/DavisPL-Teaching/119-hw1 13 | 14 | - All the parts are part of the same assignment repository. 15 | Parts 2 and 3 will be released on Wednesday and Friday. 16 | 17 | Plan: 18 | 19 | - Poll 20 | 21 | - Finish loose ends from Lecture 2 22 | 23 | - Start Lecture 3 24 | 25 | ### Poll 26 | 27 | 1. Which is the correct sequence of 3 commands? 28 | 29 | 2. Which of the following are probably correct reason(s) that git requires running 3 commands to publish your code instead of just one? 30 | 31 | https://forms.gle/MpSRmPyWVJfbm3kv6 32 | https://tinyurl.com/bdh3d2zp 33 | 34 | ## Oct 16 35 | 36 | Announcements: 37 | 38 | - HW1 part 2 available! 39 | 40 | + https://github.com/DavisPL-Teaching/119-hw1 41 | 42 | + Please come to office hours! And get started early 43 | 44 | + Part 3 + instructions to submit will be released on Friday. 45 | 46 | + Due Friday, Oct 25. 47 | 48 | - Midterm will likely be moved to week of Nov 4/6/8 -- the date will 49 | be confirmed on Monday, Oct 21. 50 | 51 | Questions about HW1? 52 | 53 | Plan: 54 | 55 | - Start with the poll 56 | 57 | - Note on Python in vscode 58 | 59 | - Continue lecture 3 60 | 61 | ### Poll 62 | 63 | Which of the following are true statements about Pandas data frames? 
64 | 65 | - Every value in a row must have the same type 66 | - Every value in a column must have the same type 67 | - Data frames must have at least one row 68 | - Data frames must have at least one column 69 | - Data frames are 2-dimensional 70 | - Data frames cannot have null values 71 | - Rows must be indexed by integer values 72 | 73 | https://forms.gle/KBenqsqHD6A71Moq8 74 | https://tinyurl.com/27mtfxyn 75 | 76 | ## Oct 18 77 | 78 | Announcements: 79 | 80 | - HW1 fully available: https://piazza.com/class/m12ef423uj5p5/post/46 81 | 82 | - The "project proposal" part has been moved to a later HW; instead, Part 3 83 | is a shorter series of exercises on the shell. 84 | 85 | - To get the latest changes: git add ., git commit -m "message", git pull, **then** resolve any merge conflicts! https://piazza.com/class/m12ef423uj5p5/post/44 86 | 87 | - OH today (starting 415 -- I will stay past 5 if there are still people around 88 | asking questions) 89 | 90 | HW submission: 91 | 92 | - We are now using Gradescope for submission instead of GitHub Classroom. 93 | You should have been added; please see Piazza for the link and HW1 for details. 94 | You can submit either via GitHub or via a zip file. 95 | 96 | - Clarification to late policy: https://piazza.com/class/m12ef423uj5p5/post/47 97 | 98 | - Questions? 99 | 100 | Plan: 101 | 102 | - Poll 103 | 104 | - Continue Lecture 3: 105 | remaining SQL operators, common "gotchas", and survey a more general 106 | view of data processing operators 107 | 108 | ### Poll 109 | 110 | https://forms.gle/TTX5Tvp72AGESsxK8 111 | https://tinyurl.com/mu6e73sx 112 | -------------------------------------------------------------------------------- /lecture2/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 2: The Shell 2 | 3 | ## Monday Oct 13 4 | 5 | A couple of scheduling announcements: 6 | 7 | - **This Wednesday:** 8 | I need to switch lecture & discussion section, and 9 | lecture will be on Zoom. 10 | 11 | So: 12 | + Lecture at 11am on Zoom 13 | ^^ meeting invite is moved in Canvas so you should be able 14 | to find it there. 15 | + Discussion section at 3:10pm in the usual classroom (walker hall) 16 | 17 | - **Midterm date: moved** to Wednesday, November 5 18 | 19 | In class midterm - in this classroom (Walker Hall) 20 | 21 | Reminders: 22 | 23 | - HW1 due this Friday 24 | + Office hours today, Friday! 25 | 26 | Plan: 27 | 28 | - Start with the discussion question & poll. 29 | 30 | - Introduction to the shell! 31 | 32 | - 3-part model for interacting with the shell. 33 | 34 | ## Wednesday Oct 15 35 | 36 | Reminders: 37 | 38 | - Discussion section at 3:10pm in lecture classroom 39 | 40 | - HW1 due Friday 41 | 42 | Plan: 43 | 44 | - 3-part model for interacting with the shell. 45 | 46 | Questions? 47 | 48 | ## Friday Oct 17 49 | 50 | Reminders: 51 | 52 | - HW1 due today 53 | 54 | Good luck! 55 | Please keep the questions coming on Piazza! We will try to monitor leading 56 | up to the deadline 57 | 58 | Autograder score: 59 | 60 pts available now, remainder will be graded with 60 | the full version of the autograder after the deadline. 61 | 62 | Plan: 63 | 64 | - 3-part model for interacting with the shell. 65 | 66 | - We will aim to finish Lecture 2 today and Monday. 67 | 68 | Questions? 69 | 70 | ## Monday Oct 20 71 | 72 | Announcements: 73 | 74 | - Canvas and Piazza outage today :-) 75 | 76 | + Canvas still down, Piazza is back up 77 | 78 | - My office hours are cancelled today! 
(I have to make a different appointment) 79 | 80 | **If you need to reach me:** I can be available for Zoom hours, I plan to hold these hours either today or tomorrow evening or Wednesday late afternoon. Please email me if you would like to attend office hours at one of these times and let me know which of these times you would be available. 81 | 82 | - A quick note about uploading to Gradescope 83 | 84 | + Please do try to download your code and run it! 85 | 86 | + Part of why we are doing all of this work on the shell is to get you used to figuring out how/why 87 | programs do and don't run :-) 88 | 89 | + It is your responsibility to ensure that the autograder runs on your code. 90 | 91 | Plan: 92 | 93 | - Start with discussion question & poll 94 | 95 | - Help commands and doing stuff commands 96 | 97 | - Hope to get through most/all of remainder of lecture 2. 98 | 99 | Questions? 100 | 101 | ## Wednesday Oct 22 102 | 103 | Announcements: 104 | 105 | - Full disclosure: we are a bit behind from where I wanted to be 106 | at this point in the quarter! 107 | 108 | I know I said we would try to finish the shell lecture last time, 109 | but there are a couple of topics left that I want to cover 110 | and these are part of the material for the midterm. 111 | 112 | - I will discuss more details about the midterm on Friday. 113 | 114 | Questions? 115 | 116 | Plan: 117 | 118 | - Poll 119 | 120 | - Git 121 | 122 | - Dangers of the shell 123 | 124 | - Overview of advanced topics / things we didn't cover. 125 | -------------------------------------------------------------------------------- /lecture1/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 1: Introduction to data processing pipelines 2 | 3 | ## Friday, September 26 4 | 5 | Announcements 6 | 7 | - Homework 0 is now available, due Wednesday, October 1. 8 | https://forms.gle/XVDDVZPJNRZKhrN36 9 | 10 | - This is an installation help homework. Please come to office hours and 11 | discussion section to get help! 12 | + Monday OH after class 13 | + Wednesday discussion section will cover installation help 14 | 15 | - **There will be no class on Wednesday (October 1)** as I will be away at a conference. 16 | 17 | - Enrollment updates 18 | + Currently 90 students enrolled, 13 on waitlist 19 | + I have received some questions from those of you on the waitlist - more on this in the slides! 20 | 21 | Plan for today: 22 | 23 | 1. Finish Lecture 0 slides (syllabus overview) 24 | 25 | 2. In-class poll 26 | 27 | 3. Following along with lectures 28 | 29 | 4. Begin Lecture 1: Introduction to data processing pipelines. 30 | 31 | ## Monday, September 29 32 | 33 | Reminders: 34 | 35 | - **No class on Wednesday** - HW0 due on Wednesday 11:59pm 36 | 37 | - OH after class today, discussion section to get installation help! 38 | 39 | + Windows issues: use WSL, downgrade to Java 11 40 | https://piazza.com/class/mfvn4ov0kuc731/post/22 41 | 42 | Plan for today: 43 | 44 | - Clone repo to follow along! 45 | 46 | - Discussion question / in-class poll 47 | 48 | - Continue Lecture 1 on ETL and dataflow graphs. 49 | 50 | Questions about HW0 or plan for today? 51 | 52 | ## Friday, October 3 53 | 54 | Thanks to the TA for running installation help office hours - hopefully you have been able to get your installation issues resolved! 55 | 56 | - As of yesterday, most of you have everything working up to PySpark (with one or two still working on PySpark). You will need PySpark working for the latter part of the course. 
57 | We found that: 58 | - for Windows, Java 11, Python 3.12.3, Pyspark 3.5.3 work 59 | - for Mac/WSL: Java 21 or 22 works. 60 | 61 | Plan for today: 62 | 63 | - Start with poll 64 | 65 | - From ETL to Dataflow graphs 66 | 67 | - A more realistic example to go through 68 | 69 | - (If time) Failures and risks 70 | 71 | ## Monday, October 6 72 | 73 | Announcements: 74 | 75 | - HW1 is released! Due: Friday, October 17 76 | 77 | https://github.com/DavisPL-Teaching/119-hw1 78 | 79 | Please get started early! 80 | 81 | - I will need to end OH early today at 5pm 82 | 83 | Plan: 84 | 85 | - Practice with Dataflow Graphs :) 86 | 87 | - A little bit about data validation 88 | 89 | - Measuring performance 90 | 91 | ## Wednesday, October 8 92 | 93 | Reminders: 94 | 95 | - HW1 due Fri Oct 17 96 | 97 | Announcements: 98 | 99 | - ~~Midterm date: tenatively set for **Friday, November 7**~~ 100 | 101 | - Discussion section Zoom link/recording will be available in Canvas going forward! 102 | 103 | - Waitlist update 104 | 105 | Plan: 106 | 107 | - Start with discussion question 108 | 109 | - Talk about performance 110 | 111 | - (If time) Talk about failures and risks in pipelines 112 | 113 | - We will finish up Lecture 1 today and Friday and then move to Lecture 2 on the Shell. 114 | 115 | ## Friday, October 10 116 | 117 | Announcements: 118 | 119 | - Waitlist update: 120 | They were able to admit a few students off the waitlist. I was told that they emailed out PTAs. 121 | 122 | - Reminder: HW1 due Fri Oct 17 123 | 124 | - Questions on Piazza -- also, join the student-run Discord! 125 | 126 | - HW1 prelminary autograder available! 127 | 128 | Plan: 129 | 130 | - Poll (and a quick check-in on pacing) 131 | 132 | - Finish performance discussion. 133 | 134 | Questions? 135 | -------------------------------------------------------------------------------- /lecture6/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 6: Streaming Pipelines 2 | 3 | This is the last "full" lecture! 4 | 5 | ## Monday, November 24 6 | 7 | HW2 due today! 8 | 9 | - Time limit for part 3: 30 minutes total 10 | (autograder will run for 40 minutes on the whole submission) 11 | 12 | Try it out and please let us know if you are having any difficulties! 13 | 14 | - Max autograder score: 28/28 15 | 16 | - Also check out: https://piazza.com/class/mfvn4ov0kuc731/post/126 17 | if you are running into a runtime error related to the Spark Context. 18 | 19 | I have OH after class today 20 | 21 | Plan: 22 | 23 | - Introduction to streaming pipelines. 24 | 25 | Questions about HW2? 26 | 27 | ## Wednesday, November 26 28 | 29 | Happy Thanksgiving! 30 | 31 | Announcements 32 | (see [Piazza](https://piazza.com/class/mfvn4ov0kuc731/post/145)): 33 | 34 | 1. No homework 3 35 | 36 | 2. Homework make-up option 37 | 38 | 3. Course evaluations open this Friday 39 | 40 | I will need to end lecture 5-10 minutes early today. 41 | 42 | Lecture today: 43 | 44 | - Part 2 on microbatching. 45 | 46 | - Part 3 on time. 47 | 48 | Questions? 49 | 50 | ## Wednesday, Dec 3 51 | 52 | Announcements: 53 | 54 | - ALL in-class polls due this Friday 11:59pm 55 | 56 | - Course eval due this Friday 11:59pm - you should have received by email 57 | 58 | - HW2 grades were released 59 | 60 | + Autograder timeout issue - please submit on Gradescope! 
61 | 62 | + Part 2 part2.png naming issue - please submit on Gradescope also (will reduce to a -5 pt deduction) 63 | 64 | - HW make-up option: 65 | 66 | + HW1 make-up option due by Friday 11:59pm 67 | 68 | + HW2 make-up option and regrades due by Monday 11:59pm 69 | 70 | + Pick one of the two if you received < 90 score, 2/3 of the points back. 71 | 72 | - Final study list: see `exams/final_study_list.md`. 73 | 74 | - Final details: `exams/final.md` 75 | 76 | - OH after class today on Zoom. 77 | 78 | Plan: 79 | 80 | - Poll 81 | 82 | - Go over different notions of time (Part 3) 83 | 84 | - Go over details for the final and study guide. 85 | 86 | Questions? 87 | 88 | ## Friday, Dec 5 89 | 90 | Last day of class! 91 | 92 | Announcements: 93 | 94 | - All in-class polls are due by EOD today! The final poll will occur in class today and then I will sync the polls to Canvas one more time (after class) prior to the deadline. 95 | 96 | - Final: next Thursday (Dec 11) at 8-10am! 97 | See `exams/` and Piazza for final study materials and the practice final. 98 | 99 | - I hope that we have resolved the HW2 grading issues! 100 | I apologize for the stress. :-) 101 | Please fill out a regrade request if we haven't resolved your case. 102 | HW make-up option: HW1 due today 11:59pm, HW2 due Monday 11:59pm. 103 | 104 | - I will hold OH either on 105 | + Monday at 415pm - on Zoom 106 | + Tuesday at 445pm - in person 107 | (or both) 108 | If you plan to attend, please let me know if you plan to attend one of these two times! 109 | 110 | Plan: 111 | 112 | - Review & finish Part 3 on time 113 | 114 | - Poll on different types of time 115 | 116 | - End notes & wrapping up Lecture 6. 117 | 118 | If there is time: 119 | 120 | - A note on HW2 latency & throughput graphs! 121 | And an easy mistake to make in Python 122 | 123 | - Since I am skipping Lecture 7, we will likely have some extra time at the end, I can answer questions or go over practice questions from the practice final. 124 | 125 | - Lecture 7 will not appear on the final, 126 | but you are welcome to review the materials in lecture7/ on your own time! 127 | It is a crash course covering the basics of AWS services: 128 | 129 | - AWS S3, 130 | - AWS EC2, and 131 | - AWS Lambda. 132 | 133 | Questions? 134 | -------------------------------------------------------------------------------- /lecture1/parts/throughput_latency.py: -------------------------------------------------------------------------------- 1 | """ 2 | Throughput and latency calculation example. 3 | 4 | Throughput: Number of items processed per unit time 5 | Latency: Time taken to process a single item 6 | """ 7 | 8 | from lecture import pipeline, get_life_expectancy_data 9 | 10 | """ 11 | Timeit: 12 | https://docs.python.org/3/library/timeit.html 13 | 14 | Library that allows us to measure the running time of a Python function 15 | 16 | Example syntax: 17 | timeit.timeit('"-".join(str(n) for n in range(100))', number=10000) 18 | """ 19 | import timeit 20 | 21 | IN_FILE_THROUGHPUT = "life-expectancy.csv" 22 | IN_FILE_LATENCY = "life-expectancy-row1.csv" 23 | OUT_FILE = "output.csv" 24 | 25 | def throughput(num_runs): 26 | """ 27 | Measure the throughput of our example pipeline, in data items per second. 
28 | 29 | We need to run it a bunch of times to get an accurate number, 30 | that's why we're taking a num_runs parameter and passing it to 31 | timeit 32 | 33 | Updated formula if we run multiple times 34 | N = input dataset size 35 | T = total running time 36 | num_runs = number of times I ran the pipeline 37 | Throughput = 38 | N / (T / num_runs) 39 | """ 40 | 41 | print(f"Measuring throughput over {num_runs} runs...") 42 | 43 | # Number of items 44 | num_items = len(get_life_expectancy_data(IN_FILE_THROUGHPUT)) 45 | 46 | # ^^ Notice I'm doing this cmoputation outside of the actual measurement 47 | # (timeit) code - if you start measuring things as you're computing 48 | # them, this leads to distortion of the results 49 | # (Kind of like Quantum Mechanics - if you measure it, it changes) 50 | 51 | # Function to run the pipeline 52 | def f(): 53 | pipeline(IN_FILE_THROUGHPUT, "output.csv") 54 | 55 | # Measure execution time 56 | execution_time = timeit.timeit(f, number=num_runs) 57 | 58 | # Print and return throughput 59 | # Display to the nearest integer 60 | throughput = num_items * num_runs / execution_time 61 | print(f"Throughput: {int(throughput)} items/second") 62 | return throughput 63 | 64 | def latency(num_runs): 65 | """ 66 | Measure the average latency of our example pipeline, in seconds. 67 | 68 | (In today's poll: we saw an example where we measured latency 69 | not just with one input row) 70 | 71 | *Key point:* 72 | We use a one-row version of the pipeline to measure latency 73 | """ 74 | 75 | print(f"Measuring latency over {num_runs} runs...") 76 | 77 | # Function to run the pipeline 78 | def f(): 79 | pipeline(IN_FILE_LATENCY, "output.csv") 80 | 81 | # Measure execution time 82 | execution_time = timeit.timeit(f, number=num_runs) 83 | 84 | # Print and return latency (in milliseconds) 85 | # Display to 5 decimal places 86 | latency = execution_time / num_runs * 1000 87 | print(f"Latency: {latency:.5f} ms") 88 | return latency 89 | 90 | if __name__ == "__main__": 91 | throughput(1000) 92 | latency(1000) 93 | 94 | """ 95 | Observations? 96 | 97 | ~4.1M items/second 98 | Pandas is very fast! 99 | (We are just doing max/min/avg, probably things would slow 100 | down for something more complicated) 101 | 102 | Somewhat stable across runs - this is because we run the 103 | pipeline always multiple times 104 | 105 | (Lesson: always run multiple times for best practice) 106 | 107 | Is throughput always constant with the number of input items? 108 | No! 109 | 110 | (Try deleting items from input dataset - what happens?) 111 | 112 | Latency: about ~0.7 ms 113 | 114 | Latency != 1 / Throughput 115 | 116 | Latency is much greater for a dataset with one row. 117 | """ 118 | -------------------------------------------------------------------------------- /lecture4/extras/data-race-example.py: -------------------------------------------------------------------------------- 1 | """ 2 | A more extended data race example - related to the Oct 31 in-class poll - 3 | that discusses the relationship between data races and UB. 4 | 5 | This is a subtle topic! 6 | TL;DR: The following are both correct: 7 | 8 | - x will always between 2 and 200 in Python using multiprocess (multiprocess.Array and multiprocess.Value) 9 | 10 | - x could be any integer value making no assumptions about the programming language implementation of += or the underlying architecture. 
11 | 12 | The latter bullet (any value) is also the technically correct answer in C/C++, where data races are something called "undefined behavior", meaning that the compiler is allowed to compile your code to do something entirely different than what you said. 13 | (And it may depend on the compiler!) 14 | 15 | In a language with undefined behavior: 16 | 17 | *if there is a read and a write concurrently to the same data (or two concurrent writes), the value of that data is indeterminate.* 18 | 19 | Longer discussion: 20 | 21 | Try running the example! What happens? 22 | 23 | As some of you suspected, in our Python implementation (using multiprocess), it happens to be the case that: 24 | 25 | - when run, the result is a value between 100 and 200 in all test runs 26 | (often 10 or 20 for the N = 10 case -- sometimes some value in between!) 27 | 28 | - we do observe a data race 29 | (if there were no data race, all += 1s would happen, and the value would always end up at 200) 30 | 31 | So why did I say that x can be *any* value? 32 | 33 | This has to do with what assumptions we are making about how Python represents integers 34 | -- for example, assuming that it consistently handles reads and writes to integers (+= consistently). 35 | This assumption is very dangerous! It is generally *not* true when using operations that aren't designed to be safely called in a concurrent programming context, such as big integers. 36 | (A Python integer is not just a single byte, it contains multiple bytes! That makes it something called a big integer. It may be the case that a read and a write to a Python integer at the same time don't just execute one after the other, but actually completely invalidate that integer, by modifying the different bytes in different ways -- or even, moving the integer somewhere else in memory.) 37 | 38 | This actually occurs and is not just a hypothetical: 39 | 40 | - It will occur in Python when using data structures more than just a single byte, if not protected by a shared lock 41 | 42 | - It will occur in any data structure implementation in C, C++, or Java or any implementation that stores raw pointers in memory. 43 | 44 | That's why the way I want you to think about data races in this class 45 | is with the "simplified view" above. 46 | That is, data races cause data to be invalidated and it could represent any value after the race occurs. 47 | 48 | Further reading: 49 | 50 | - Why += 1 is so-called "undefined behavior" (UB) in C: https://stackoverflow.com/a/39396999/2038713 51 | 52 | - More on "undefined behavior" and why: 53 | https://news.ycombinator.com/item?id=16247958 54 | https://davmac.wordpress.com/2018/01/28/understanding-the-c-c-memory-model/ 55 | 56 | - Data races in Python: https://verdagon.dev/blog/python-data-races 57 | which does not discuss the UB issue. 
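Side note (a sketch, not part of the demo code below, which deliberately keeps
the race): the standard multiprocessing fix is to use a synchronized Value and
hold its lock around the read-modify-write:

    from multiprocessing import Value
    x = Value(ctypes.c_uint64, 0)    # unlike RawValue, this carries a lock

    def worker(x):
        for i in range(N):
            with x.get_lock():       # the += is now done atomically
                x.value += 1

With both workers written like this, the data race goes away and the final
result is always 2 * N (200 for the N = 100 default).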
58 | """ 59 | 60 | import ctypes 61 | from multiprocessing import RawValue, Process, freeze_support 62 | 63 | N = 100 64 | 65 | # Comment out other examples to try: 66 | # N = 1 67 | # N = 10 68 | # N = 10_000_000 69 | 70 | def worker1(x): 71 | for i in range(N): 72 | x.value += 1 73 | 74 | def worker2(x): 75 | for i in range(N): 76 | x.value += 1 77 | 78 | if __name__ == "__main__": 79 | # Guard to ensure only one worker runs the main code 80 | freeze_support() 81 | 82 | # Set up the shared memory 83 | start = 0 84 | x = RawValue(ctypes.c_uint64, start) 85 | print(f"Start value: {x.value}") 86 | 87 | # Run the two workers 88 | p1 = Process(target=worker1, args=(x,)) 89 | p2 = Process(target=worker2, args=(x,)) 90 | p1.start() 91 | p2.start() 92 | p1.join() 93 | p2.join() 94 | 95 | # Get the result 96 | print(f"Final result: x = {x.value}") 97 | -------------------------------------------------------------------------------- /lecture5/extras/cut.py: -------------------------------------------------------------------------------- 1 | 2 | """ 3 | === Cut material === 4 | 5 | ===== Cut material on MapReduce ===== 6 | 7 | Some of this material will appear as exercises on HW2. 8 | 9 | The above is written very abstractly, what does it mean? 10 | 11 | Let's walk through each part: 12 | 13 | Keys: 14 | 15 | The first thing I want to point out is that all the data is given as 16 | (key, value) 17 | 18 | pairs. (K1 and K2) 19 | 20 | Generally speaking, we use the first coordinate (key) for partitioning, 21 | and the second one to compute values. 22 | 23 | Ignore the keys for now, we'll come back to that. 24 | 25 | Map: 26 | map: (K1, T1) -> list((K2, T2)) 27 | 28 | - we might want to transform the data into a different type 29 | T1 and T2 30 | - we might want to output zero or more than one output -- why? 31 | list(K2, T2) 32 | 33 | Examples: 34 | (write pseudocode for the corresponding lambda function) 35 | 36 | - Compute a list of all Carbon-Fluorine bonds 37 | 38 | - Compute the total number of Carbon-Fluorine bonds 39 | 40 | - Compute the average of the ratio F / C for every molecule that has at least one Carbon 41 | (our original example) 42 | 43 | In Spark: 44 | https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.flatMap.html 45 | """ 46 | 47 | def map_general_ex(rdd, f): 48 | # TODO 49 | raise NotImplementedError 50 | 51 | # Uncomment to run 52 | # map_general_ex() 53 | 54 | """ 55 | What about Reduce? 56 | 57 | reduce: (K2, list(T2)) -> list(T2) 58 | 59 | The following is a common special case: 60 | 61 | reduce_by_key: (T2, T2) -> T2 62 | 63 | Reduce: 64 | - data has keys attached. Keys are used for partitioning 65 | - we aggregate the values *by key* instead of over the entire dataset. 66 | 67 | Examples: 68 | (write the corresponding Python lambda function) 69 | (you can use the simpler reduce_by_key version) 70 | 71 | - To compute a total for each key? 72 | 73 | - To compute a count for each key? 74 | 75 | - To compute an average for each key? 76 | 77 | - To compute an average over the entire dataset? 78 | 79 | Important note: 80 | K1 and K2 are different! Why? 
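One way to answer (a sketch, with a made-up data shape): the map stage is free
to *re-key* the data, so the reduce stage can group by a different key than
the one the input arrived with. For example, if the input is keyed by molecule
ID (K1) and the value is a list of element symbols, the map can re-key by
element name (K2) so that the reduce stage produces per-element totals:

    counts = (rdd
        .flatMap(lambda kv: [(elem, 1) for elem in kv[1]])
        .reduceByKey(lambda a, b: a + b))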
81 | 82 | In Spark: 83 | 84 | https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.reduceByKey.html 85 | """ 86 | 87 | def reduce_general(rdd, f): 88 | # TODO 89 | raise NotImplementedError 90 | 91 | """ 92 | Finally, let's use our generalized map and reduce functions to re-implement our original task, 93 | computing the average Fluorine-to-Carbon ratio in our chemical 94 | dataset, among molecules with at least one Carbon. 95 | """ 96 | 97 | def fluorine_carbon_ratio_map_reduce(data): 98 | # TODO 99 | raise NotImplementedError 100 | 101 | """ 102 | ===== Understanding latency (abstract) ===== 103 | (review if you are interested in a more abstract view) 104 | 105 | Why isn't optimizing latency the same as optimizing for throughput? 106 | 107 | First of all, what is latency? 108 | Imagine this. I have 10M inputs, my pipeline processes 1M inputs/sec (pretty well parallelized. 109 | 110 | (input) --- (process) 111 | 10M 1M / sec 112 | 113 | Latency is always about response time. 114 | To have a latency you have to know what change or input you're making to the system, 115 | and what result you want to observe -- the latency is the difference (in time) 116 | between the two. 117 | 118 | Latency is always measured in seconds; it's not a time "per item" or dividing the total time by number 119 | of items (that doesn't tell us how long it took to see a response for each individual item!) 120 | 121 | So, what's the latency if I just put 1 item into the pipeline? 122 | (That is, I run on an input of just 1 item, and wait for a response to come back)? 123 | 124 | We can't say! We don't know for example, whether the 125 | 1M inputs/sec means all 1M are processed in parallel, and they all take 1 second, 126 | or the 1M inputs/sec means that 127 | it's really 100K inputs every tenth of a second at a time. 128 | 129 | This is why: 130 | Latency DOES NOT = 1 / throughput 131 | in general and it's also why optimizing for throughput doesn't always benefit latency 132 | (or vice versa). 133 | We will get to this more in Lecture 6. 134 | """ 135 | -------------------------------------------------------------------------------- /exams/midterm_study_list.md: -------------------------------------------------------------------------------- 1 | # Midterm Study List 2 | 3 | Study list of topics for the midterm. 4 | 5 | **The midterm will cover Lecture 1, Lecture 2, and Lecture 4 up through Data Parallelism (in Part 4).** 6 | 7 | You should know all of the following concepts, but I won't test you on syntax. 8 | - For example, you won't be asked to write code on the exam, 9 | but you might be asked to explain how you might to dome task in words 10 | or asked to calculate how much time a task would take given some 11 | assumptions about how long the parts take. 
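  (A hypothetical example of that kind of calculation: if each item takes 2 ms
  to process and 10,000 items are split evenly across 4 workers running in
  parallel, the work takes roughly (10,000 x 2 ms) / 4 = 5 seconds, versus
  about 20 seconds on a single worker.)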
12 | 13 | ## Lecture 1 (Introduction to data processing) 14 | 15 | - Data pipelines: the ETL model 16 | 17 | + Know each stage and why it exists 18 | 19 | - Dataflow Graphs 20 | 21 | + Know how to draw a dataflow graph 22 | 23 | + Definition of edges & dependencies 24 | 25 | + Definition and application of: sources, data operators, sinks 26 | 27 | + Definition of an operator 28 | 29 | + Relation between ETL and dataflow graphs 30 | 31 | - Coding practices: Python classes, functions, modules, unit tests 32 | 33 | - Performance: 34 | 35 | Latency and throughput: 36 | concepts, definitions, formulas and how to calculate them 37 | 38 | - Pandas: what a DataFrame is and basic properties 39 | 40 | (some number of rows and columns) 41 | 42 | (DataFrame = table) 43 | 44 | - Also know: 45 | 46 | + Know a little bit about data validation: things that can go wrong in your input! 47 | 48 | (check for null values, check that values are positive) 49 | 50 | + Some things about exploration time vs. development time: 51 | I might ask you what step(s) you might do to explore a dataset 52 | after reading it in to Python or Pandas 53 | 54 | + Know the following term: vectorization 55 | 56 | ## Lecture 2 (the Shell) 57 | 58 | - "Looking around, getting help, doing stuff" model 59 | 60 | + 3 types of commands 61 | 62 | + Underlying state of the system 63 | 64 | (current working directory, file system, environment variables) 65 | 66 | - Basic command purposes (I won't test you on syntax!) 67 | ls, cd, cat, git, python3, echo, man, mkdir, touch, open, rm, cp 68 | $PATH, $PWD 69 | 70 | - shell commands can be run from Python and vice versa 71 | 72 | - hidden files and folders, environment variables 73 | 74 | - platform dependence; definition of a "platform" 75 | 76 | platform = operating system, architecture, any installed software requirements or depdencies 77 | 78 | - named arguments vs. positional arguments 79 | 80 | - Git, philoslophy of git 81 | 82 | - dangers of the shell, file deletion, concept of "ambient authority" 83 | 84 | ## Lecture 4 (Parallelism) 85 | 86 | Part 1: 87 | 88 | - Scaling: scaling in input data or # of tasks; Pandas does not scale 89 | 90 | Part 2: 91 | 92 | - Definitions of parallelism, concurrency, and distribution; 93 | how to identify these in tasks; conveyer belt analogy 94 | 95 | - Should be able to give an example scenario using these terms, or an example 96 | scenario satisfying soem of parallelism, concurrency, and distribution but not 97 | the others 98 | 99 | - How parallelism speeds up a pipeline; be able to estimate how fast a program would take given some assumptions about how long it takes to process each item and how 100 | it is parallelized 101 | 102 | Part 3: 103 | 104 | - Know terminology: Concurrency, race condition, data race, undefined behavior, contention, deadlock, consistency, nondeterminism 105 | 106 | - "Conflicting operations": for example, a read/write are conflicting, two writes are conflicting, two reads are not conflicting as they can both occur simultaneously (this is also part of the definition of a data race) 107 | 108 | - Know possible concurrent executions of a program: you won't be asked a scenario as complex as the poll on Oct 31. However, you might be asked about a simpler program, like one with two workers, one increments x += 1, the other implements x += 1 twice, etc. 109 | 110 | - Why we avoid concurrency in data pipelines: difficult to program with; no one 111 | should write any program with data races in it. 
112 | 113 | Part 4: 114 | 115 | - Motivation: is parallelism present? how much parallelism? 116 | Want to parallelize without writing concurrent code ourselves 117 | 118 | - Types of parallelism: task parallelism, data parallelism 119 | 120 | - How to identify each of these in the dataflow graph. 121 | -------------------------------------------------------------------------------- /exams/poll_answers.md: -------------------------------------------------------------------------------- 1 | # In-class poll answers 2 | 3 | Sep 24: 4 | Tools required and characteristics and needs of your application will change drastically with the size of the dataset. 5 | 6 | Sep 26: 7 | N/A 8 | 9 | Sep 29: 10 | 1) One possible answer: input file does not exist 11 | 2) No, because the maximum row of a dataset is not always unique. 12 | 13 | Oct 3: 14 | All except B ("Will speed up the development of a one-off script") 15 | 16 | Oct 6: 17 | 1) Edges from: 18 | read -> max, min, and avg 19 | max, min, and avg -> print 20 | max, min, and avg -> save 21 | 22 | 2) read -> print or read -> save (give a specific example) 23 | 24 | Oct 8: 25 | True, False, False. 26 | 27 | Oct 10: 28 | 1) Throughput = 1,000 records/hour assuming the full pipeline is measured from 9am to 9pm. 29 | 2) 30 minutes on average 30 | 3) From the perspective of the patient (individual row level): uniformly distributed between 0min and 60min delay. 31 | 32 | Oct 13: 33 | 1) hrs, ms, s, ns 34 | 2) F F F T T F 35 | 36 | Oct 15: 37 | Correct answers: 1, 2, 3, 4, 5, and 7 (all except 6: "To load & use pandas to calculate the max and average of a DataFrame") 38 | 39 | Oct 17: 40 | 1) B, C, and D 41 | 2) B, C, D, E, and F (all except "A: A python3 'Hello, world!' program works only on certain operating systems") 42 | 43 | Oct 20: 44 | ls, ls -alh, echo $PATH, python3 --version, conda list, git status, cat, less 45 | 46 | Oct 22: 47 | 1) Some possible answers: `cd folder/`, `cp file1.txt file2.txt` 48 | 2) Some possible answers: `ls -alh`, `python3 --version` 49 | 50 | Oct 24: 51 | B, E, and F. 52 | 53 | Oct 27: 54 | 1, 2: no one correct answer, most answers were clustered around 8 GB or 16 GB 55 | 3: Pandas requires 5-10x the amount of RAM as your dataset, so for 16 GB you should have gotten 1.6 GB to 3.2 GB for the largest dataset you can handle. 56 | 57 | Oct 29: 58 | 1. Parallel 59 | 2. Parallel + Concurrent 60 | 3. Parallel + Concurrent 61 | 4. Concurrent + Distributed 62 | 5. Parallel + Distributed 63 | 6. Parallel + Concurrent + Distributed 64 | 65 | Oct 31: 66 | 1. Intended answer was between 0 and 200; 67 | to be fully precise, the correct answer should be any value between 2 and 200 (inclusive). 68 | 69 | 2. ABDEFG 70 | Concurrency, Parallelism, Contention, Race Condition, Data Race - 71 | and Spooky Halloween Vibes, because data races are scary!! :-) 72 | 73 | Nov 3: 74 | Data parallelism (at tasks 1 and 2) 75 | No task parallelism 76 | Pipeline parallelism from 1 -> 2 and 2 -> 3. 77 | (Pipeline parallelism will not appear on midterm) 78 | 79 | Nov 7: 80 | 1. Data parallelism: **single node** 81 | Task parallelism: **between a pair of nodes** 82 | Pipeline parallelism: **between a pair of nodes** 83 | 84 | 2. Answer is yes. For instance, splitting one node "task" into two separate tasks could reveal additional pipeline and task parallelism that would not be present in the graph. 85 | 86 | Nov 10: 87 | T = 300 ms 88 | S = 3 ms 89 | Speedup <= T / S = 100x. 90 | Maximum speedup is 100x (same as the # of data items - this is not a coincidence).
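(Not part of the original answer key: a quick arithmetic check of the Nov 10 numbers, with T and S as given above -- the total time vs. the longest fully-sequential part.)

```python
T_ms = 300  # total work/time
S_ms = 3    # sequential (critical-path) part
print(T_ms / S_ms)  # 100.0 -> the speedup is bounded by 100x
```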
91 | 92 | Nov 12: 93 | C and E: CPU cores & RAM available 94 | 95 | Nov 14: 96 | 5, 6, and 8. 97 | Note that, depending on assumptions about how regular the timestamps are or if the input data is sorted by row #, it may be possible to make these data-parallel also. 98 | It is just not quite as straightforward as the others. 99 | 100 | Nov 17: 101 | 1: Lazy, 2: Lazy, 3: Not Lazy 102 | Bonus: 5ms + 5ms + 5ms = 15 ms. 103 | 104 | Nov 19: 105 | Multiple solutions are possible 106 | Map stage should describe a map on each input row (T1) to an output row (T2) 107 | Reduce stage should describe how to combine two output rows (two T2s, get a single T2) 108 | Example solution: 109 | Map stage: map each row to (city name, avg_temp / population) 110 | Reduce stage: for (city1, ratio1), (city2, ratio2), return (city3, ratio3) where ratio3 = max(ratio1, ratio2) and city3 is the corresponding city. 111 | 112 | Nov 21: 113 | Narrow: 1, 4, 5, 7 114 | Wide: 2, 3, 6 115 | 116 | Nov 24: 117 | C and I only: "They are not optimized for latency" and "Reduce functions may be applied out-of-order" 118 | 119 | Nov 26: 120 | B (serving GPT), D (high frequency trading), E (login), F (order qty) 121 | 122 | Dec 3: 123 | 1. 2ms 124 | 3. Yes, the latency would be higher (specifically, around 4.5ms on average). 125 | 126 | Dec 5: 127 | 1. Event Time 128 | 2. Logical Time 129 | 3. System Time 130 | 4. Real Time 131 | 5. Event Time 132 | 6. Logical Time + System Time 133 | 7. Logical Time + System Time 134 | 8. Real Time + Logical Time 135 | 9. Event Time 136 | 10. Event Time + Logical Time 137 | -------------------------------------------------------------------------------- /lecture2/parts/6-conclusion.py: -------------------------------------------------------------------------------- 1 | """ 2 | Part 6: Dangers of the Shell (and a few Loose Ends) 3 | 4 | We covered this part briefly at the end of Wednesday, October 22. 5 | 6 | Finishing up the shell: 7 | 8 | === Dangers of the shell === 9 | 10 | The shell has something called "ambient authority" 11 | which is a term from computer security basically meaning that 12 | you can do anything that you want to, if you just ask. 13 | 14 | Be aware! 15 | 16 | - rm -f part1.py -- permanently delete your code (and changes), 17 | no way to recover 18 | rm -- remove 19 | -f: force removal (don't ask first) 20 | -r: remove all subfiles and subdirectories 21 | 22 | - rm -rf "/" 23 | 24 | removes all files on the system. 25 | 26 | Many modern systems will actually complain if you try to do this. 27 | 28 | - bash fork bomb :-) 29 | 30 | :(){ :|:& };: 31 | 32 | """ 33 | 34 | def rm_rf_slash(): 35 | raise RuntimeError("This command is very dangerous! If you are really sure you want to run it, you can comment out this exception first.") 36 | 37 | # Remove the root directory on the system 38 | subprocess.run(["rm", "-rf", "/"]) 39 | 40 | # rm_rf_slash() 41 | 42 | """ 43 | sudo: run a command in true "admin" mode 44 | 45 | sudo rm -rf / 46 | ^^^^^^^^^^^^^ Delete the whole system, in administrator mode 47 | """ 48 | 49 | """ 50 | Aside: This is part of what makes the shell so useful, but it is also 51 | what makes the shell so dangerous! 52 | 53 | All shell commands are assumed to be executed by a "trusted" user. 54 | It's like the admin console for the computer. 55 | 56 | Example: 57 | person who gave an LLM agent access to their shell: 58 | https://twitter.com/bshlgrs/status/1840577720465645960 59 | 60 | "At this point I was amused enough to just let it continue. 
Unfortunately, the computer no longer boots." 61 | """ 62 | 63 | # sudo rm -rf "/very/important/operating-system/file" 64 | 65 | """ 66 | =============== Closing material (discusses advanced topics and recap; feel free to review on your own time!) =============== 67 | 68 | === What is the Shell? (revisited) === 69 | 70 | The shell IS: 71 | 72 | - the de facto standard for interacting with real systems, 73 | including servers, supercomputers, and even your own operating system. 74 | 75 | - a way to "glue together" different programs, by chaining them together 76 | 77 | The shell is NOT (necessarily): 78 | 79 | - a good way to write complex programs or scripts (use Python instead!) 80 | 81 | - free from errors (it is often easy to make mistakes in the shell) 82 | 83 | - free from security risks (rm -rf /) 84 | 85 | === Q+A === 86 | 87 | Q: How is this useful for data processing? 88 | 89 | A: Many possible answers! In decreasing order of importance: 90 | 91 | - Interacting with software dev tools (like git, Docker, and package managers) 92 | -- many tools are built to be accessed through the shell. 93 | 94 | - Give us a better understanding of how programs run "under the hood" 95 | and how the filesystem and operating system work 96 | (this is where almost all input/output happens!) 97 | 98 | - Gives you another option to write more powerful functions in Python 99 | by directly calling into the shell (subprocess) 100 | (e.g. fetching data with git; connecting to a 101 | database implementation or a network API) 102 | 103 | - Writing quick-and-dirty data processing scripts direclty in the shell 104 | (Common but we will not be doing this in this class). 105 | 106 | Example: Input as a CSV, filter out lines that are not relevant, and 107 | add up the results to sort by most common keywords or labels. 108 | 109 | Q: How is the shell similar/different from Python? 110 | 111 | A: Both of these are useful "glue" languages -- ways to 112 | connect together different programs. 113 | 114 | Python is more high-level, and the shell is more like what happens 115 | under the hood. 116 | 117 | Knowing the shell can improve your Python scripts and vice versa. 118 | 119 | === Some skipped topics === 120 | 121 | Things we didn't cover: 122 | 123 | - Using the shell for cleaning, filtering, finding, and modifying files 124 | 125 | + cf.: grep, find, sed, awk 126 | 127 | - Regular expressions for pattern matching in text 128 | 129 | === Miscellaneous further resources === 130 | 131 | Future of the shell paper: 132 | 133 | - https://dl.acm.org/doi/pdf/10.1145/3458336.3465296 134 | 135 | Regular expressions 136 | (for if you are using grep or find): 137 | 138 | - Regex debugger: https://regex101.com/ 139 | 140 | - Regex explainer: https://regexr.com/ 141 | 142 | Example to try for a URL: [a-zA-Z]+\\.[a-z]+( |\\.|\n) 143 | 144 | End. 145 | """ 146 | -------------------------------------------------------------------------------- /lecture5/parts/5-dataframes.py: -------------------------------------------------------------------------------- 1 | """ 2 | Part 5: DataFrames 3 | 4 | In the interest of time, we will cover this part relatively briefly. 5 | 6 | === Discussion Question & Poll === 7 | 8 | This was the poll I accidentally shared last time :-) 9 | 10 | https://forms.gle/TB823v4HSWqYadP88 11 | 12 | Consider the following scenario where a temperature dataset is partitioned in Spark across several locations. 
Which of the following tasks on the input dataset can be done with a narrow operator, and which will require a wide operator? 13 | 14 | Assume the input dataset consists of locations: 15 | US state, city, population, avg temperature 16 | 17 | It is partitioned into one dataset per US state (50 partitions total). 18 | 19 | 1. Add one to each temperature 20 | 21 | 2. Compute a 5-number summary 22 | 23 | 3. Throw out duplicate city names (multiple cities in the US with the same name) 24 | 25 | 4. Throw out cities that are below 100,000 residents 26 | 27 | 5. Throw out "outlier" temperatures below -50 F or above 150 F 28 | 29 | 6. Throw out "outlier" temperatures 3 std deviations above or 3 std deviations below the mean 30 | 31 | 7. Filter the dataset to include only California cities 32 | 33 | . 34 | . 35 | . 36 | . 37 | . 38 | 39 | ================== 40 | 41 | We said that PySpark supports at least two scalable collection types. 42 | 43 | Our first example was RDDs. 44 | 45 | Our second example of a collection type is DataFrame. 46 | 47 | A DataFrame is like a DataFrame in Pandas - but it's scalable :-) 48 | 49 | Here is an example: 50 | """ 51 | 52 | # Boilerplate and dataset from previous part 53 | import pyspark 54 | from pyspark.sql import SparkSession 55 | spark = SparkSession.builder.appName("SparkExample").getOrCreate() 56 | sc = spark.sparkContext 57 | 58 | CHEM_NAMES = [None, "H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne"] 59 | CHEM_DATA = { 60 | # H20 61 | "water": [0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0], 62 | # N2 63 | "nitrogen": [0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0], 64 | # O2 65 | "oxygen": [0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0], 66 | # F2 67 | "fluorine": [0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0], 68 | # CO2 69 | "carbon dioxide": [0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0], 70 | # CH4 71 | "methane": [0, 4, 0, 0, 0, 0, 1, 0, 0, 0, 0], 72 | # C2 H6 73 | "ethane": [0, 6, 0, 0, 0, 0, 2, 0, 0, 0, 0], 74 | # C8 H F15 O2 75 | "PFOA": [0, 1, 0, 0, 0, 0, 8, 0, 2, 15, 0], 76 | # C H3 F 77 | "Fluoromethane": [0, 3, 0, 0, 0, 0, 1, 0, 0, 1, 0], 78 | # C6 F6 79 | "Hexafluorobenzene": [0, 0, 0, 0, 0, 0, 6, 0, 0, 6, 0], 80 | } 81 | 82 | """ 83 | DataFrame is like a Pandas DataFrame. 84 | 85 | https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html 86 | 87 | The main difference with RDDs is we need to create the dataframe 88 | using a tuple or dictionary. 89 | 90 | We can also create one from an RDD by doing 91 | 92 | .map(lambda x: (x,)).toDF() 93 | 94 | For more examples of creating dataframes from RDDs, see extras/dataframe.py. 95 | """ 96 | 97 | def ex_dataframe(data): 98 | # What we need (similar to Pandas): list of columns, iterable of rows. 99 | 100 | # For the columns, use our CHEM_NAMES list 101 | columns = ["chemical"] + CHEM_NAMES[1:] 102 | 103 | # For the rows: any iterable -- i.e. any sequence -- of rows 104 | # For the rows: can use [] or a generator expression () 105 | rows = ((name, *(counts[1:])) for name, counts in CHEM_DATA.items()) 106 | 107 | # Equiv: 108 | # rows = [(name, *(counts[1:])) for name, counts in CHEM_DATA.items()] 109 | # Also equiv: 110 | # for name, counts in CHEM_DATA.items(): 111 | # ... 
112 | 113 | df3 = spark.createDataFrame(rows, columns) 114 | 115 | # Breakpoint for inspection 116 | # breakpoint() 117 | 118 | # Adding a new column: 119 | from pyspark.sql.functions import col 120 | df4 = df3.withColumn("H + C", col("H") + col("C")) 121 | df5 = df4.withColumn("H + F", col("H") + col("F")) 122 | 123 | # This is the equiv of Pandas: df3["H + C"] = df3["H"] + df3["C"] 124 | 125 | # Uncomment to debug: 126 | # breakpoint() 127 | 128 | # We could continue this example further (showing other Pandas operation equivalents). 129 | 130 | # Uncomment to run 131 | # ex_dataframe(CHEM_DATA) 132 | 133 | """ 134 | Notes: 135 | 136 | - We can use .show() to print out - nicer version of .collect()! 137 | Only available on dataframes. 138 | 139 | - DataFrames are based on RDDs internally. 140 | A little picture: 141 | 142 | DataFrames 143 | | 144 | RDDs 145 | | 146 | MapReduce 147 | 148 | - Web interface gives us a more helpful dataflow graph this time: 149 | 150 | localhost:4040/ 151 | 152 | (see under Stages and click on a "collect" job for the dataflow graph) 153 | 154 | - DataFrames are higher level than RDDs! They are "structured" data -- 155 | we can work with them using SQL and relational abstractions. 156 | """ 157 | -------------------------------------------------------------------------------- /lecture5/extras/dataframe.py: -------------------------------------------------------------------------------- 1 | """ 2 | This is an example which shows how to create a data frame from a Python dict. 3 | """ 4 | 5 | # Boilerplate 6 | import pyspark 7 | from pyspark.sql import SparkSession 8 | spark = SparkSession.builder.appName("SparkExample").getOrCreate() 9 | sc = spark.sparkContext 10 | 11 | # Dataset 12 | people = spark.createDataFrame([ 13 | {"deptId": 1, "age": 40, "name": "Hyukjin Kwon", "gender": "M", "salary": 50}, 14 | {"deptId": 1, "age": 50, "name": "Takuya Ueshin", "gender": "M", "salary": 100}, 15 | {"deptId": 2, "age": 60, "name": "Xinrong Meng", "gender": "F", "salary": 150}, 16 | {"deptId": 3, "age": 20, "name": "Haejoon Lee", "gender": "M", "salary": 200} 17 | ]) 18 | 19 | people_filtered = people.filter(people.age > 30) 20 | 21 | people_filtered.show() 22 | 23 | people2 = sc.parallelize([ 24 | {"deptId": 1, "age": 40, "name": "Hyukjin Kwon", "gender": "M", "salary": 50}, 25 | {"deptId": 1, "age": 50, "name": "Takuya Ueshin", "gender": "M", "salary": 100}, 26 | {"deptId": 2, "age": 60, "name": "Xinrong Meng", "gender": "F", "salary": 150}, 27 | {"deptId": 3, "age": 20, "name": "Haejoon Lee", "gender": "M", "salary": 200} 28 | ]) 29 | 30 | people2_filtered = people2.filter(lambda x: x["age"] > 30) 31 | 32 | result = people2_filtered.collect() 33 | 34 | print(result) 35 | 36 | """ 37 | More ways to create a DataFrame: 38 | """ 39 | 40 | 41 | CHEM_NAMES = [None, "H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne"] 42 | CHEM_DATA = { 43 | # H20 44 | "water": [0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0], 45 | # N2 46 | "nitrogen": [0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0], 47 | # O2 48 | "oxygen": [0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0], 49 | # F2 50 | "fluorine": [0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0], 51 | # CO2 52 | "carbon dioxide": [0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0], 53 | # CH4 54 | "methane": [0, 4, 0, 0, 0, 0, 1, 0, 0, 0, 0], 55 | # C2 H6 56 | "ethane": [0, 6, 0, 0, 0, 0, 2, 0, 0, 0, 0], 57 | # C8 H F15 O2 58 | "PFOA": [0, 1, 0, 0, 0, 0, 8, 0, 2, 15, 0], 59 | # C H3 F 60 | "Fluoromethane": [0, 3, 0, 0, 0, 0, 1, 0, 0, 1, 0], 61 | # C6 F6 62 | "Hexafluorobenzene": [0, 0, 0, 0, 0, 0, 6, 0, 0, 6, 
0], 63 | } 64 | 65 | def ex_dataframe_methods(data): 66 | # Load the data (CHEM_DATA) and turn it into a DataFrame 67 | 68 | # A few ways to do this 69 | 70 | """ 71 | Method 1: directly from the RDD 72 | """ 73 | rdd = sc.parallelize(data.values()) 74 | 75 | # RDD is just a collection of items where the items can have any Python type 76 | # a DataFrame requires the items to be rows. 77 | 78 | df1 = rdd.map(lambda x: (x,)).toDF() 79 | 80 | # Breakpoint for inspection 81 | # breakpoint() 82 | 83 | # Try: df1.show() 84 | 85 | # What happened? 86 | 87 | # Not very useful! Let's try a different way. 88 | # Our lambda x: (x,) map looks a bit sus. Does anyone see why? 89 | 90 | """ 91 | Method 2: unpack the data into a row more appropriately by constructing the row 92 | """ 93 | # don't need to do the same thing again -- RDDs are persistent and immutable! 94 | # rdd = sc.parallelize(data.values()) 95 | 96 | # In Python you can unwrap an entire list as a tuple by using *x. 97 | df2 = rdd.map(lambda x: (*x,)).toDF() 98 | 99 | # Breakpoint for inspection 100 | # breakpoint() 101 | 102 | # What happened? 103 | 104 | # Better! 105 | 106 | """ 107 | Method 3: create the DataFrame directly with column headers 108 | (the correct way) 109 | """ 110 | 111 | # What we need (similar to Pandas): list of columns, iterable of rows. 112 | 113 | # For the columns, use our CHEM_NAMES list 114 | columns = ["chemical"] + CHEM_NAMES[1:] 115 | 116 | # For the rows: any iterable -- i.e. any sequence -- of rows 117 | # For the rows: can use [] or a generator expression () 118 | rows = ((name, *(counts[1:])) for name, counts in CHEM_DATA.items()) 119 | 120 | # Equiv: 121 | # rows = [(name, *(counts[1:])) for name, counts in CHEM_DATA.items()] 122 | # Also equiv: 123 | # for name, counts in CHEM_DATA.items(): 124 | # ... 125 | 126 | df3 = spark.createDataFrame(rows, columns) 127 | 128 | # Breakpoint for inspection 129 | # breakpoint() 130 | 131 | # What happened? 132 | 133 | # Now we don't have to worry about RDDs at all. We can use all our favorite DataFrame 134 | # abstractions and manipulate directly using SQL operations. 135 | 136 | # Adding a new column: 137 | from pyspark.sql.functions import col 138 | df4 = df3.withColumn("H + C", col("H") + col("C")) 139 | df5 = df4.withColumn("H + F", col("H") + col("F")) 140 | 141 | # This is the equiv of Pandas: df3["H + C"] = df3["H"] + df3["C"] 142 | 143 | # Uncomment to debug: 144 | # breakpoint() 145 | 146 | # We could continue this example further (showing other Pandas operation equivalents). 147 | 148 | # Uncomment to run 149 | # ex_dataframe(CHEM_DATA) 150 | -------------------------------------------------------------------------------- /lecture5/parts/6-latency-throughput.py: -------------------------------------------------------------------------------- 1 | """ 2 | Part 6: End notes 3 | 4 | Latency and throughput, revisited 5 | and 6 | Disadvantages of Spark 7 | 8 | === Latency and throughput === 9 | 10 | So, we know how to build distributed pipelines. 11 | 12 | The **only** change from a sequential pipline is that 13 | tasks work over scalable collection types, instead of regular data types. 14 | Tasks are then interpreted as operators over the scalable collections. 15 | In other words, data parallelism comes for free! 16 | 17 | Scalable collections are a good way to think about parallel AND/OR distributed 18 | pipelines. 
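A tiny sketch of that "only change" (a hypothetical example; `sc` is the SparkContext from the boilerplate in the earlier parts of this lecture):

    # Sequential version: an ordinary Python list
    data = [1, 2, 3, 4, 5]
    total = sum(x * x for x in data)

    # Scalable version: the same logic over an RDD
    rdd = sc.parallelize(data)
    total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)

Same operators (a map and a reduce), different collection type.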
Operators/tasks can be: 19 | - lazy or not lazy (how they are evaluated) 20 | - wide or narrow (how data is partitioned) 21 | 22 | But there is just one problem with what we have so far :) 23 | Spark and MapReduce are optimized for throughput. 24 | 25 | It's what we call a *batch processing engine.* That means? 26 | 27 | A: It processes all the data "as one batch", or as "one job", 28 | then comes back with an answer 29 | 30 | But doesn't optimizing for throughput always optimize for latency? Not necessarily! 31 | 32 | Let's talk a little bit about latency... 33 | 34 | === Understanding latency (intuitive) === 35 | 36 | A more intuitive real-world example: 37 | imagine a restaurant that has to process lots of orders. 38 | 39 | - Throughput is how many orders we were able to process per hour. 40 | 41 | - Latency is how long *one* person waits for their order. 42 | 43 | Some of you wrote on the midterm: throughput == 1 / latency 44 | 45 | That's not true! 46 | throughput != 1 / latency 47 | 48 | These are not the same! Why not? Two extreme cases: 49 | 50 | ---- 51 | 52 | Suppose the restuarant processes 10 orders over the course of being open for 1 hour 53 | 54 | Throughput = 55 | 10 orders / hour 56 | 57 | Latency is not the same as 1 / throughput! Two extreme cases: 58 | 59 | 1. 60 | Every customer waits for the entire hour! 61 | Every customer submitted their order at the start of the hour, 62 | and got it back at the end. 63 | 64 | Latency = 1 hour 65 | 66 | - customers are not very happy 67 | 68 | - BUT the restaurant can do things very efficiently! 69 | 70 | 2. 71 | One order submitted every 6 minutes, 72 | and completed 6 minutes later. 73 | 74 | Latency = 6 minutes 75 | 76 | - customers are happy 77 | 78 | - BUT the restaurant may have a harder time optimizing their process 79 | as they have to make each order individually. 80 | 81 | (A more abstract example of this is given below in the "Understanding latency (abstract)" section below.) 82 | 83 | Throughput is the same in both cases! 84 | 10 events / 1 hour 85 | 86 | Throughput = N / T 87 | 88 | In first case, latency = T 89 | 90 | In second case, latency = T / N 91 | 92 | (The first scenario is similar to a parallel execution, 93 | the second scenario more similar to a sequential execution.) 94 | 95 | The other formula which is not true: 96 | 97 | Latency != total time / number of orders 98 | True in the second scenario but not in the first scenario. 99 | 100 | How can we visualize this? 101 | 102 | (Draw a timeline from 0 to 1 hour and draw a line for each order) 103 | 104 | So, optimizing latency can look very different from optimizing throughput. 105 | 106 | In a batch processing framework like Spark, 107 | it waits until we ask, and then collects *all* results at once! 108 | So we always get the worst possible latency, in fact we get the maximum latency 109 | on each individual item. We don't get some results sooner and some results later. 110 | 111 | Grouping together items (via lazy transformations) helps optimize the pipeline, but it 112 | *doesn't* necessarily help get results as soon as possible when they're needed. 113 | (Remember: laziness poll/example) 114 | That's why there is a tradeoff between throughput and latency. 115 | 116 | Data processing for low-latency applications is known as "streaming" or "stream processing" 117 | and systems for this case are known as "stream processing applications". 
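Before the quote below, here is a small numeric sketch of the two restaurant scenarios above (hypothetical arrival/completion times, in minutes):

    # (arrival, completion) times for 10 orders over one hour
    case_1 = [(0, 60)] * 10                              # everyone waits the full hour
    case_2 = [(6 * i, 6 * (i + 1)) for i in range(10)]   # one order every 6 minutes

    for orders in (case_1, case_2):
        latencies = [done - arrived for arrived, done in orders]
        # throughput (orders/hour, T = 1 hour), average latency (minutes)
        print(len(orders) / 1.0, sum(latencies) / len(latencies))

Both cases print a throughput of 10 orders/hour, but the average latency is 60 minutes in the first case and 6 minutes in the second.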
118 | 119 | "To achieve low latency, a system must be able to perform 120 | message processing without having a costly storage operation in 121 | the critical processing path...messages should be processed “in-stream” as 122 | they fly by." 123 | 124 | From "The 8 Requirements of Real-Time Stream Processing": 125 | Mike Stonebraker, Ugur Çetintemel, and Stan Zdonik 126 | https://cs.brown.edu/~ugur/8rulesSigRec.pdf 127 | 128 | Another term for the applications that require low latency requirements (typically, sub-second, sometimes 129 | milliseconds) is "real-time" applications or "streaming" applications. 130 | 131 | === Summary: disadvantages of Spark === 132 | 133 | So that's where we're going next, 134 | talking about applications where you might want your pipeline to respond in real time to data that 135 | is coming in. 136 | We'll use a different API in Spark called Spark Structured Streaming. 137 | """ 138 | -------------------------------------------------------------------------------- /lecture1/parts/1-introduction.py: -------------------------------------------------------------------------------- 1 | """ 2 | Lecture 1: Introduction to data processing pipelines 3 | 4 | Part 1: Introduction 5 | 6 | This lecture will cover the required background for the rest of the course. 7 | 8 | Please bear with us if you have already seen some of this material before! 9 | I will use the polls to get a sense of your prior background and adjust the pacing accordingly. 10 | 11 | === Note on materials from prior iteration of the course === 12 | 13 | The GitHub repository contains the lecture notes from a prior iteration of the course (Fall 2024). 14 | You are welcome to look ahead of the notes, but please note that most content will change as I revise each lecture. 15 | I will generally post the revised lecture before and after each class period. 16 | 17 | **Changes from last year:** 18 | I plan to skip or condense Lecture 3 (Pandas) based on feedback and your prior background. 19 | I will also cover some Pandas in Lecture 1. 20 | (I will confirm this after the responses to HW0.) 21 | 22 | === Poll === 23 | 24 | Today's poll is to help me understand the overall class background in command line/Git. 25 | (I will also ask about your background in more detail on HW0.) 26 | 27 | https://forms.gle/2eYFVpxT1Q8JJRaMA 28 | 29 | ^^ find the link in the lecture notes on GitHub 30 | 31 | Piazza -> GitHub -> lecture1 -> lecture.py 32 | https://piazza.com/ 33 | 34 | === Following along with the lectures === 35 | 36 | Try this! 37 | 38 | 1. You will need to have Git installed (typically installed with Xcode on Mac, or with Git for Windows). Follow the guide here: 39 | 40 | https://www.atlassian.com/git/tutorials/install-git 41 | 42 | Feel free to work on this as I am talking and to get help from your neighbors. 43 | I can help with any issues after class. 44 | 45 | (Note on Mac: you can probably also just `brew install git`) 46 | 47 | 2. You will also need to create an account on GitHub and log in. 48 | 49 | 3. Go to: https://github.com/DavisPL-Teaching/119 50 | 51 | 4. If that's all set up, then click the green "Code" button, click "SSH", and click to copy the command: 52 | 53 | git@github.com:DavisPL-Teaching/119.git 54 | 55 | 5. Open a terminal and type: 56 | 57 | git clone git@github.com:DavisPL-Teaching/119.git 58 | 59 | 6. Type `ls`. 60 | 61 | You should see a new folder called "119" in your home folder. This contains the lecture notes and source files for the class. 62 | 63 | 7. 
Type `cd `119/lecture1`, then type `ls`. 64 | 65 | 8. Lastly type `python3 lecture.py`. You should see the message below. 66 | """ 67 | 68 | print("Hello, ECS 119!") 69 | 70 | """ 71 | Let's see if that worked! 72 | 73 | If some step above didn't work, you may be missing some of the software we 74 | need installed. Please complete HW0 first and then let us know if you 75 | are still having issues. 76 | 77 | === The basics === 78 | 79 | I will introduce the class through a basic model of what a data processing 80 | pipeline is, that we will use throughout the class. 81 | 82 | We will also see: 83 | - Constraints that data processing pipelines have to satisfy 84 | - How they interact with one another 85 | - Sneak peak of some future topics covered in the class. 86 | 87 | To answer these questions, we need a basic model of "data processing pipeline" - Dataflow Graphs. 88 | 89 | Recall discussion question from last lecture: 90 | 91 | EXAMPLE: 92 | You have compiled a spreadsheet of website traffic data for various popular websites (Google, Instagram, chatGPT, Reddit, Wikipedia, etc.). You have a dataset of user sessions, each together with time spent, login sessions, and click-through rates. You want to put together an app which identifies trends in website popularity, duration of user visits, and popular website categories over time. 93 | 94 | What are the main "abstract" components of the pipeline in this scenario? 95 | 96 | - A dataset 97 | - Processing steps 98 | - Some kind of user-facing output 99 | 100 | closely related: 101 | "Extract, Transform, Load" model (ETL) 102 | 103 | What is an ETL job? 104 | 105 | - **Extract:** Load in some data from an input source 106 | (e.g., CSV file, spreadsheet, a database) 107 | 108 | - **Transform:** Do some processing on the data 109 | 110 | - **Load:** (perhaps a confusing name) 111 | we save the output to an output source. 112 | (e.g. CSV file, spreadsheet, a database) 113 | 114 | """ 115 | 116 | data = { 117 | "User": ["Alice", "Alice", "Charlie"], 118 | "Website": ["Google", "Reddit", "Wikipedia"], 119 | "Time spent (seconds)": [120, 300, 240], 120 | } 121 | 122 | # As dataframe: 123 | import pandas as pd 124 | df = pd.DataFrame(data) 125 | 126 | # print(data) 127 | # print(df) 128 | 129 | """ 130 | Recap: 131 | 132 | - We spent some time getting everyone up to speed: 133 | After completing HW0, you should be able to follow along with the lectures 134 | locally on your laptop device 135 | 136 | - We started to introduce the abstract model that we will use throughout the class 137 | for data processing pipelines - this will be called the Dataflow Graph model 138 | 139 | - We began by introducing a simpler concept called Extract, Transform, Load (ETL). 140 | 141 | ***** Where we ended for Friday ***** 142 | """ 143 | -------------------------------------------------------------------------------- /lecture4/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 4: Parallelism 2 | 3 | ## Friday, October 24 4 | 5 | Changing gears today! 6 | 7 | Announcements: 8 | 9 | - A few details about the midterm (scheduled for Wed, Nov 5): 10 | 11 | Format: 12 | 13 | + Cheat sheet, single sided 14 | 15 | + Roughly: 10 questions, 8 MC/short answer, 2 free response 16 | 17 | The free response questions are longer! Each ~1-2 pages 18 | 19 | + Time length is limited to class time 20 | 21 | Please try to arrive early! 
22 | 23 | + A sample exam (from last year) will be released sometime next week 24 | 25 | Studying: 26 | 27 | + Review the polls! `exams/poll_answers.md` 28 | 29 | + Study guide: `exams/midterm_study_list.md` 30 | 31 | Contains up to Lecture 1, 2, so far 32 | 33 | More topics will be added after next week! 34 | 35 | + Please ask questions at the OHs and discussion section and on Piazza! 36 | 37 | - I have to leave class right at 4pm today - I will end class 5 minutes early 38 | (3:55) in case there are questions after class. 39 | 40 | Plan for today: 41 | 42 | - Study guide 43 | 44 | - Poll 45 | 46 | - Motivation: scaling your application 47 | 48 | - Key definitions: parallel, concurrent, and distributed computing 49 | 50 | Questions? 51 | 52 | ## Monday, October 27 53 | 54 | Reminders: 55 | 56 | - OH after class today, Friday 11am 57 | 58 | Plan for today: 59 | 60 | I am trying a slightly new format for the lecture notes by dividing them into 61 | parts, roughly one per class period. 62 | For today, see `parts/1-motivation.py` and `parts/2-definitions.py` 63 | 64 | Plan: 65 | 66 | - Finish small activity on RAM and Pandas + poll 67 | 68 | - Example pipeline and running it with parallel workers 69 | 70 | - Key definitions: parallel/concurrent/distributed distinction 71 | 72 | Questions? 73 | 74 | ## Wednesday, October 29 75 | 76 | Announcements: 77 | 78 | - First part of practice midterm released; second part released late this week 79 | 80 | - HW1 grading underway; please stay tuned for a Piazza post when official grades are released. 81 | 82 | - I have sorted the rest of the lecture material into parts 83 | 84 | Plan: 85 | 86 | - Finish parallel/concurrent/distributed distinction (`2-parallelism.py`) 87 | 88 | - Poll 89 | 90 | - Concurrency. 91 | 92 | Questions? 93 | 94 | ## Friday, October 31 (Happy Halloween!) 95 | 96 | Announcements: 97 | 98 | - HW1 is graded -- please submit any grade complaints through Gradescope 99 | 100 | ** Please reserve grade complaints for cases where you disagree with the grade according to the 101 | rubric, NOT where you disagree with the rubric! 102 | 103 | Grade complaints due Nov 5 3pm (before midterm) 104 | 105 | - Mid-quarter survey is available! (See Piazza) 106 | 107 | 1 pt extra credit 108 | 109 | Due Nov 5 3pm (before midterm) 110 | 111 | Plan: 112 | 113 | - Wrap up concurrency with a little bit of terminology 114 | 115 | - Poll 116 | 117 | - Part 4: Types of parallelism. 118 | 119 | - End of class: midterm topics; will post full list + practice midterm on Piazza. 120 | 121 | Questions? 122 | 123 | ## Monday, November 3 124 | 125 | Announcements: 126 | 127 | - The midterm is this Wednesday in class! 128 | 129 | Reminders: In class, single-sided cheat sheet (typed or handwritten) 130 | Topics: see Piazza, midterm_study_list (Lecture 4, up to data parallelism in part 4) 131 | 132 | Help with midterm: 133 | 134 | + **Today** OH after class: I will go over any requested topic, any in-class poll, or any practice exam question 135 | 136 | + **Wednesday** discussion section will be a midterm review 137 | 138 | + **Piazza** for all other questions 139 | 140 | - Last chance to submit mid-quarter survey! (Due Wednesday before class) 141 | 142 | - Next week, Nov 10 and Nov 12 class will be remote on Zoom 143 | 144 | Plan: 145 | 146 | - Finish Part 4: Pipeline parallelism 147 | (Note that pipeline parallelism will not appear on the midterm, 148 | but data+task parallelism will.) 
149 | 150 | - Discussion question & poll 151 | 152 | - (~about 3:45) 153 | I'll reserve the last 15 minutes of class in case there are 154 | any questions from the midterm study list, polls, or practice exam 155 | that anyone wants to go over. 156 | (Else, we can begin part 5.) 157 | 158 | Questions? 159 | 160 | ## Friday, November 7 161 | 162 | Announcements: 163 | 164 | - Next week, Monday (Nov 10) and Wed (Nov 12) class will be remote on Zoom 165 | 166 | OH also remote on Zoom 167 | 168 | - Midterm grading is underway 169 | 170 | Plan: 171 | 172 | Midterm review: 173 | 174 | - Poll with a review question 175 | 176 | Continuing lecture 4: 177 | 178 | - Part 5: Quantifying parallelism and Amdahl's Law 179 | 180 | - If time: Part 6 on distribution and conclusions. 181 | 182 | Questions? 183 | 184 | ## Monday, November 10 185 | 186 | Announcements: 187 | 188 | - Midterm grades released! 189 | 190 | Statistics & midterm answers on Piazza: https://piazza.com/class/mfvn4ov0kuc731/post/112 191 | 192 | Regrade requests due one week from today 193 | 194 | Suggested questions to review: q3, q7, q8, q9, q10 195 | 196 | - OH today after class will be on Zoom. (I will share link on Piazza) 197 | 198 | - We have time to go over one midterm question; if you would like to go over any other midterm questions or answers, please post a note on Piazza or come to office hours! 199 | 200 | Plan: 201 | 202 | - Finish Part 5: Quantifying parallelism and Amdahl's Law 203 | 204 | - Poll about Amdahl's law 205 | 206 | - Part 6 on distribution and conclusions. 207 | 208 | Plan to begin Lecture 5 on Wednesday. 209 | -------------------------------------------------------------------------------- /lecture1/parts/6-conclusion.py: -------------------------------------------------------------------------------- 1 | """ 2 | October 10 3 | 4 | Part 6: 5 | Recap on Throughput & Latency and Conclusion 6 | 7 | Recap from last time: 8 | 9 | Throughput: 10 | Measured in number items processed / second 11 | 12 | N = number of input items (size of input dataset(s)) 13 | T = running time of your full pipeline 14 | Formula = 15 | N / T 16 | 17 | Latency: 18 | Measured for a specific input item and specific output 19 | 20 | Formula = 21 | (time output is produced) - (time input is received) 22 | 23 | Often (but not always) measured for a pipeline with just 24 | one input item. 25 | 26 | Discussion question: 27 | 28 | A health company's servers process 12,000 medical records per day. 29 | The medical records come in at a uniform rate between 9am and 9pm every day (1,000 records per hour). 30 | The company's servers submit the records to a back-end service that collects them throughout the hour, and then 31 | processes them at the end of each hour to update a central database. 32 | 33 | What is the throughput of the pipeline? 34 | 35 | What number would best describe the *average latency* of the pipeline? 36 | Describe the justification for your answer. 37 | 38 | https://forms.gle/AFL2SrBr5MhwVV3h7 39 | """ 40 | 41 | """ 42 | ... 43 | 44 | Let's see an example 45 | 46 | We need a pipeline so that we can measure the total running time & the throughput. 47 | 48 | I've taken the pipeline from earlier for country data and rewritten it below. 49 | 50 | see throughput_latency.py 51 | """ 52 | 53 | def get_life_expectancy_data(filename): 54 | return pd.read_csv(filename) 55 | 56 | # Wrap up our pipeline - as a single function! 57 | # You will do a similar thing on the HW to measure performance. 
58 | def pipeline(input_file, output_file): 59 | df = get_life_expectancy_data(input_file) 60 | min_year = df["Year"].min() 61 | max_year = df["Year"].max() 62 | # (Commented out the print statements) 63 | # print("Minimum year: ", min_year) 64 | # print("Maximum year: ", max_year) 65 | avg = df["Period life expectancy at birth - Sex: all - Age: 0"].mean() 66 | # print("Average life expectancy: ", avg) 67 | # Save the output 68 | out = pd.DataFrame({"Min year": [min_year], "Max year": [max_year], "Average life expectancy": [avg]}) 69 | out.to_csv(output_file, index=False) 70 | 71 | # SEE throughput_latency.py. 72 | 73 | # import timeit 74 | 75 | def f(): 76 | pipeline("life-expectancy.csv", "output.csv") 77 | 78 | # Run the pipeline 79 | # f() 80 | 81 | """ 82 | === Latency (additional notes - SKIP) === 83 | 84 | What is latency? 85 | 86 | Sometimes, we care about not just the time it takes to run the pipeline... 87 | but the time on each specific input item. 88 | 89 | Why? 90 | - Imagine crawling the web at Google. 91 | The overall time to crawl the entire web is... 92 | It might take a long time to update ALL websites. 93 | But I might wonder, 94 | what is the time it takes from when I update my website 95 | ucdavis-ecs119.com 96 | to when this gets factored into Google's search results. 97 | 98 | This "individual level" measure of time is called latency. 99 | 100 | *Tricky point* 101 | 102 | For the pipelines we have been writing, the latency is the same as the running time of the entire pipeline! 103 | 104 | Why? 105 | 106 | Let's measure the performance of our toy pipeline. 107 | """ 108 | 109 | """ 110 | === Memory usage (also skip :) ) === 111 | 112 | What about the equivalent of memory usage? 113 | 114 | I will not discuss this in detail at this point, but will offer a few important ideas: 115 | 116 | - Input size: 117 | 118 | - Output size: 119 | 120 | - Window size: 121 | 122 | - Distributed notions: Number of machines, number of addresses on each machine ... 123 | 124 | Which of the above is most useful? 125 | 126 | How does memory relate to running time? 127 | For traditional programs? 128 | For data processing programs? 129 | """ 130 | 131 | """ 132 | === Overview of the rest of the course === 133 | 134 | Overview of the schedule (tentative), posted at: 135 | https://github.com/DavisPL-Teaching/119/blob/main/schedule.md 136 | 137 | === Closing quotes === 138 | 139 | Fundamental theorem of computer science: 140 | 141 | "Every problem in computer science can be solved by another layer of abstraction." 142 | 143 | - Based on a statement attributed to Butler Lampson 144 | https://en.wikipedia.org/wiki/Fundamental_theorem_of_software_engineering 145 | 146 | A dataflow graph is an abstraction (why?), but it is a very useful one. 147 | It will help put all problems about data processing into context and help us understand how 148 | to develop, understand, profile, and maintain data processing jobs. 149 | 150 | It's a good human-level way to understand pipelines, and 151 | it will provide a common framework for the rest of the course. 152 | """ 153 | 154 | # Main function: the default thing that you run when running a program. 155 | 156 | # print("Hello from outside of main function") 157 | 158 | if __name__ == "__main__": 159 | # Insert code here that we want to be run by default when the 160 | # program is executed. 161 | 162 | # print("Hello from inside of main function") 163 | 164 | # What we can do: add additional code here 165 | # to test various functions. 
166 | # Simple & convenient way to test out your code. 167 | 168 | # Call our pipeline 169 | # pipeline("life-expectancy.csv", "output.csv") 170 | 171 | pass 172 | 173 | # NB: If importing lecture.py as 174 | # a library, the main function (above) doesn't get run. 175 | # If running it directly from the terminal, 176 | # the main function does get run. 177 | # See test file: main_test.py 178 | -------------------------------------------------------------------------------- /exams/final_study_list.md: -------------------------------------------------------------------------------- 1 | # Final Study List 2 | 3 | Study list of topics for the final. 4 | 5 | **The final will cover Lectures 1-6.** 6 | 7 | ## Lectures 1-4 8 | 9 | For Lectures 1-4, please refer to the midterm study list `midterm_study_list.md`. 10 | 11 | ## Review topics from the midterm 12 | 13 | Suggested review topics based on the midterm: 14 | 15 | - How to draw and interpret a dataflow graph 16 | 17 | + I'm looking for a conceptual understanding of what happens when 18 | you "run" the pipeline, what tasks need to be completed in what order. 19 | 20 | - Understanding throughput and latency conceptually given a dataflow graph 21 | 22 | + Estimating running time in the sequential case, parallel cases; applying formulas 23 | 24 | - Concurrency problems: two concurrent executions like (x += 1; x += 1) vs. x += 1 25 | 26 | - Data validation: put the concepts in context: 27 | If asked about what you would do on a dataset or a specific 28 | real-world example, we're really looking for things specific to that 29 | dataset or real-world example. 30 | 31 | - Not covered on the midterm, but will be covered on the final: 32 | Amdahl's law (throughput <= T / S) and maximum speedup case (S) 33 | 34 | ## Lecture 5 35 | 36 | - Scalable collection types 37 | 38 | + Differences from normal Python collections 39 | + Types of scaling - vertical & horizontal scaling 40 | + Benefits/drawbacks 41 | + Examples (RDDs, PySpark DataFrames) and their properties 42 | 43 | - Operators 44 | 45 | + Map 46 | + Filter 47 | + Reduce 48 | 49 | - Operator concepts 50 | 51 | + Immutability 52 | + Evaluation: Lazy vs. not-lazy (transormation vs. action) 53 | - why laziness matters / why it is useful 54 | + Partitoning: Wide vs. 
narrow 55 | - What operators should be wide vs narrow 56 | + How partitioning works, what it means, how it affects performance 57 | + Key-based partitioning (see MapReduce, HW2) 58 | 59 | - MapReduce 60 | 61 | + For the purposes of the final, we will use either 62 | the simple version of MapReduce from class, 63 | or the generalized one from HW2 64 | (I will remind you of the type of map/reduce for the exam) 65 | 66 | + simplified model (map and reduce, conceptually) 67 | + general model (that we saw on HW2) assuming I give you 68 | the actual types for map and reduce for reference 69 | + you may be asked to describe how to do a computation as a MapReduce 70 | pipeline - describe the precise function 71 | 72 | for map: function that takes 1 input row, produces 1 output row 73 | for reduce: function that takes 2 output rows, returns 1 output row 74 | 75 | - Implementation details: In general, you do not need to know implementation details of Spark, but you should know: 76 | + Number of partitions and how it affects performance 77 | * too low, too high 78 | + Running on a local cluster, running on a distributed computing cluster 79 | + Fault tolerance: you may assume that Spark tolerates node failures (RDDs can recover from a computer or worker crash) 80 | 81 | - Drawing a PySpark or MapReduce computation as a dataflow graph 82 | 83 | - Limitations of Spark 84 | 85 | ## Lecture 6 86 | 87 | - Understanding latency 88 | 89 | + Intuitive: for example, given 10 orders in a 1 s time interval are 90 | processed, what can you say about the latency of each order 91 | 92 | + Refined def of latency: 93 | latency of item X = (end or exit time X) - (start or arrival time X) 94 | 95 | - List of summary points: 96 | + Latency = Response Time 97 | + Latency can only be measured by focusing on a single item or row. (response time on that row) 98 | + Latency-critical, real-time, or streaming applications are those for which we are looking for low latency (typically, sub-second or even millisecond response times). 99 | + Latency is NOT the same as 1 / Throughput 100 | * If it were, we wouldn't need two different words! 101 | + Latency is NOT the same as processing time 102 | * It's processing time for a specific event 103 | + If throughput is about quantity (how many orders processed), latency is about quality (how fast individual orders processed). 104 | 105 | - Batch vs. streaming pipelines 106 | 107 | + When streaming is useful (application scenarios) 108 | 109 | + Latency in both cases 110 | 111 | + How to derive latency given the dataflow graph 112 | 113 | + Batch/stream analogy 114 | 115 | - Implementation details of streaming pipelines: 116 | 117 | + Microbatching and possible microbatching strategies 118 | 119 | + Spark timestamp (assigned to all members of a microbatch) 120 | 121 | - Time 122 | 123 | + Why it matters: measuring latency, measuring progress in the system, assigning microbatches 124 | 125 | + Reasons that time is complicated (time zones, clock resets) 126 | 127 | + Kinds of time: Real time, event time, system time, logical time 128 | 129 | + Monotonic time 130 | * which of the above or monotonic 131 | 132 | + Measuring time: entrance time, processing time, exit time (These are all versions of system time.) 133 | 134 | ## Lecture 7 135 | 136 | Will not be covered on the final. 137 | 138 | TBD: the lecture is very brief and the last 1-2 days of class. 139 | 140 | Example multiple choice question: 141 | 142 | Match each of the following cloud provider services to its use case. 
143 | 144 | Major AWS cloud services: S3, EC2, Lambda. 145 | 146 | S3: useful for data storage 147 | 148 | EC2: useful for purchasing compute (basically, cloud computers that you can log into and run via the terminal) 149 | 150 | Lambda: useful for triggering events and running asynchronous code. 151 | 152 | ## Notes 153 | 154 | Some things you do **not** need to know: 155 | Python, Pandas, and PySpark syntax. 156 | Implementation details of PySpark and Spark Streaming, except where mentioned above. 157 | Lecture 7. 158 | -------------------------------------------------------------------------------- /lecture1/parts/2-etl.py: -------------------------------------------------------------------------------- 1 | """ 2 | Monday, September 29 3 | 4 | Part 2: Extract, Transform, Load (ETL) 5 | 6 | === REMINDER: FOLLOWING ALONG === 7 | 8 | https://github.com/DavisPL-Teaching/119 9 | 10 | - Open terminal (Cmd+Space Terminal on Mac) 11 | 12 | - `git clone ` 13 | 14 | + if you have already cloned, do a `git stash` or `git reset .` 15 | 16 | - `git pull` 17 | 18 | - **Why use the command line?** 19 | 20 | Short answer: it's an important skill! 21 | 22 | Long answer: 23 | I do require learning how to use the command line for this course. 24 | More in Lecture 2. 25 | GUI tools only work if someone else already wrote them (they used the command line to write the tool) 26 | You'll find that it is SUPER helpful to know the basics of the command line for stuff like installing software, managing dependencies, and debugging why installation didn't work. 27 | The command prompt is how all internal commands work on your computer - and it's an important skill for data engineering in practice. 28 | 29 | === Continuing our example === 30 | 31 | Recall from last time: 32 | 33 | - Want: a general model of data processing pipelines 34 | 35 | - First-cut model: Extract Transform Load (ETL) 36 | 37 | Any data process job can be split into three stages, 38 | input, processing, output 39 | (extract, transform, load) 40 | 41 | Example on finding popular websites: 42 | """ 43 | 44 | # (Re-copying from above) 45 | data = { 46 | "User": ["Alice", "Alice", "Charlie"], 47 | "Website": ["Google", "Reddit", "Wikipedia"], 48 | "Time spent (seconds)": [120, 300, 240], 49 | } 50 | df = pd.DataFrame(data) 51 | 52 | # Some logic to compute the maximum length of time website sessions 53 | u = df["User"] 54 | w = df["Website"] 55 | t = df["Time spent (seconds)"] 56 | # Max of t 57 | max = t.max() 58 | # Filter 59 | max_websites = df[df["Time spent (seconds)"] == max] 60 | 61 | # Let's print our data and save it to a file 62 | with open("save.txt", "w") as f: 63 | print(max_websites, file=f) 64 | 65 | """ 66 | Running the code 67 | 68 | It can be useful to have open a Python shell while developing Python code. 69 | 70 | There are two ways to run Python code from the command line: 71 | - python3 lecture.py 72 | - python3 -i lecture.py 73 | 74 | Let's try both. 75 | """ 76 | 77 | """ 78 | First step: can we abstract this as an ETL job? 
79 | """ 80 | 81 | def extract(): 82 | data = { 83 | "User": ["Alice", "Alice", "Charlie"], 84 | "Website": ["Google", "Reddit", "Wikipedia"], 85 | "Time spent (seconds)": [120, 300, 240], 86 | } 87 | df = pd.DataFrame(data) 88 | return df 89 | 90 | def transform(df): 91 | u = df["User"] 92 | w = df["Website"] 93 | t = df["Time spent (seconds)"] 94 | # Max of t 95 | max = t.max() 96 | # Filter 97 | # This syntax in Pandas for filtering rows 98 | # df[colname] 99 | # df[row filter] (row filter is some sort of Boolean condition) 100 | return df[df["Time spent (seconds)"] == max] 101 | 102 | def load(df): 103 | # Save the dataframe somewhere 104 | with open("save.txt", "w") as f: 105 | print(df, file=f) 106 | 107 | # Uncomment to run 108 | df = extract() # get the input 109 | df = transform(df) # process the input 110 | # print(df) # printing (optional) 111 | load(df) # save the new data. 112 | 113 | """ 114 | We have a working pipeline! 115 | But this may seem rather silly ... why rewrite the pipeline 116 | to achieve the same behavior? 117 | 118 | === Tangent: advantages of abstraction === 119 | 120 | Q: why abstract the steps into Python functions? 121 | 122 | (instead of just using a plain script) 123 | 124 | ETL steps are not done just once! 125 | 126 | A possible development lifecycle: 127 | 128 | - Exploration time: 129 | Thinking about my data, thinking about what I might 130 | want to build, exploring insights 131 | -> there is no pipeline yet, we're just exploring 132 | 133 | - Development time: 134 | Building or developing a working pipeline 135 | -> a script or abstracted functions would both work! 136 | 137 | - Production time: 138 | Deploying my pipeline & reusing it for various purposes 139 | (e.g., I want to run it like 5x per day) 140 | -> pipeline needs to be reused multiple times 141 | -> we could even think about more stages, like 142 | maintaining the pipeline as separate items after production time. 143 | 144 | In general, for this class we will think most about production time, 145 | because we are ultimately interested in being able to fully automate and 146 | maintain pipelines (not just one-off scripts). 147 | 148 | Some of you may have used tools like Jupyter notebooks; 149 | (very good for exploration time!) 150 | while excellent tools, 151 | I will generally be working directly in Python in this course. 152 | 153 | Reasons: I want to get used to thinking of processing directly "as code", 154 | good abstractions via functions and classes, and follow good practices like 155 | unit tests, etc. to integrate the code into a larger project. 156 | 157 | Abstractions mean we can test the code: 158 | """ 159 | 160 | import pytest 161 | 162 | # Unit test example 163 | # @pytest.mark.skip # uncomment to skip this test 164 | def test_extract(): 165 | df = extract() 166 | # What do we want to test here? 167 | # Test that the result has the data type we expect 168 | assert type(df) is not None 169 | assert type(df) == pd.DataFrame 170 | # check the dimensions (I'll skip this) 171 | # Sanity check - check that the values are the correct type! 172 | 173 | # @pytest.mark.skip # uncomment to skip this test 174 | def test_transform(): 175 | df = extract() 176 | df = transform(df) 177 | # check that there is exactly one output 178 | assert df.count().values[0] == 1 179 | 180 | # Run: 181 | # - pytest lecture.py 182 | 183 | """ 184 | Discussion Question / Poll: 185 | 186 | 1. Can you think of any scenario where test_extract() will fail? 187 | 188 | 2. 
Will test_transform() always pass, no matter the input data set? 189 | 190 | https://forms.gle/j99n5ZN7jsJ6gHB2A 191 | 192 | ********** Where we ended for September 29 ********** 193 | """ 194 | -------------------------------------------------------------------------------- /lecture4/parts/6-distribution.py: -------------------------------------------------------------------------------- 1 | """ 2 | Part 6: Distribution and concluding thoughts 3 | 4 | === Distributed computing: some examples === 5 | 6 | === What is distribution? === 7 | 8 | Distribution means that we have multiple workers and belts 9 | **in different physical warehouses** 10 | can process and fail independently. 11 | 12 | The workers must be on different physical computers or physical devices. 13 | (Why does it matter?) 14 | 15 | Running on the same physical device, both workers have 16 | access to the same resources; 17 | Running on two different devices, they access different resources, and one worker could crash even if the other 18 | one doesn't. 19 | So, it introduces new challenges. 20 | 21 | For this one, it's more difficult to simulate in Python directly. 22 | We can imagine that our workers are computed by an external 23 | server, rather than being computed locally on our machine. 24 | 25 | To give a simple instance of this, let's use ssh to connect to a remote 26 | server, then use the server to compute the sum of the numbers. 27 | 28 | (You won't be able to use this code; it's connecting to my own SSH server!) 29 | """ 30 | 31 | # for os.popen to run a shell command (like we did on HW1 part 3) 32 | import os 33 | 34 | def ssh_run_command(cmd): 35 | result = os.popen("ssh cdstanfo@set.cs.ucdavis.edu " + cmd).read() 36 | # Just print the result for now 37 | print(f"result: {result.strip()}") 38 | 39 | def worker1_distributed(): 40 | ssh_run_command("seq 1 1000000 | awk '{s+=$1} END {print s}'") 41 | print("Worker 1 finished") 42 | 43 | def worker2_distributed(): 44 | ssh_run_command("seq 1000001 2000000 | awk '{s+=$1} END {print s}'") 45 | print("Worker 2 finished") 46 | 47 | def average_numbers_distributed(): 48 | worker1_distributed() 49 | worker2_distributed() 50 | print("Distributed computation complete") 51 | 52 | # Uncomment to run 53 | # This won't work on your machine! 54 | average_numbers_distributed() 55 | 56 | # This waits until the first connection finishes before 57 | # starting the next connection; but we could easily modify 58 | # the code to make them both run in parallel. 59 | 60 | """ 61 | Questions: 62 | 63 | Q1: can we have distribution without parallelism? 64 | 65 | A: Yes, we just did 66 | 67 | Q2: can we have distribution with parallelism? 68 | 69 | A: Yes, we could allow the server to run and compute 70 | an answer while we continue to compute other stuff, 71 | or while we run a separate connection to a second 72 | server. 73 | 74 | Q3: can we have distribution without concurrency? 75 | 76 | A: Yes, for example: we have two databases or database 77 | partitions running separately (and they don't interact) 78 | 79 | Q4: can we have distribution with concurrency? 80 | 81 | Yes, we often do, for example when distributed workers 82 | communicate via passing messages to each other 83 | """ 84 | 85 | """ 86 | === Parallelizing our code in Pandas? (Skip) === 87 | 88 | We don't want to parallelize our code by hand. 89 | (why? See problems with concurrency from last week!) 
90 | 91 | Dask is a simple library that works quite well for parallelizing datasets 92 | on a single machine as a drop-in replacement for Pandas. 93 | """ 94 | 95 | # conda install dask or pip3 install dask 96 | import dask 97 | 98 | def dask_example(): 99 | # Example dataset 100 | df = dask.datasets.timeseries() 101 | 102 | # Dask is "lazy" -- it only generates data when you ask it to. 103 | # (More on laziness later). 104 | print(type(df)) 105 | print(df.head(5)) 106 | 107 | # Use a standard Pandas filter access 108 | df2 = df[df.y > 0] 109 | print(type(df2)) 110 | print(df2.head(5)) 111 | 112 | # Do a group by operation 113 | df3 = df2.groupby("name").x.mean() 114 | print(type(df3)) 115 | print(df3.head(5)) 116 | 117 | # Compute results -- this processes the whole dataframe 118 | print(df3.compute()) 119 | 120 | # If you just want parallelism on a single machine, 121 | # Dask is a great lightweight solution. 122 | 123 | # Uncomment to run. 124 | # dask_example() 125 | 126 | """ 127 | === A final definition and end note: Vertical vs. horizontal scaling === 128 | 129 | - Vertical: scale "up" resources at a single machine (hardware, parallelism) 130 | - Horizontal: scale "out" resources over multiple machines (distribution) 131 | 132 | This lecture, we have only seen *vertical scaling*. 133 | But vertical scaling has a limit! 134 | Remember that we are still limited in the size of the dataset we can 135 | process on a single machine 136 | (recall Wes McKinney estimate of how large a table can be). 137 | Even without Pandas overheads, 138 | we still can't process data if we run out of memory! 139 | 140 | So, to really scale we may need to distribute our dataset over many 141 | machines -- which we do using a distributed data processing framework 142 | like Spark. 143 | This also gives us a convenient way to think about data pipelines 144 | in general, and visualize them. 145 | We will tour PySpark in the next lecture. 146 | 147 | Everything we have said about identifying and quantifying parallelism also applies to 148 | distributing the code (for the most part -- we will only see exceptions to this if we cover 149 | distributed consistency issues and faults and crashes, this is an optional topic that we will 150 | get to only if we have time.) 151 | 152 | In addition to scaling even further, distribution + parallelism can offer an even 153 | more seamless performance compared to parallelism alone as it can eliminate 154 | many coordination overheads and contention between workers 155 | (see partitioning: different partitions of the database are operated entirely independently by different machines). 156 | 157 | Recap: 158 | 159 | - Finished Amdahl's law, did an example, and a practice example with the poll 160 | 161 | - Connected Amdahl's law back to latency & throughput (with two formulas) 162 | 163 | - We talked about distribution; ran our same running example as a distributed pipeline over ssh 164 | 165 | - We talked about vertical vs horizontal scaling 166 | 167 | - We contrasted parallelism with distributed scaling - where we will be going next in Lecture 5. 168 | """ 169 | -------------------------------------------------------------------------------- /lecture1/parts/4-properties.py: -------------------------------------------------------------------------------- 1 | """ 2 | Monday, October 6 3 | 4 | Part 4: Proeprties of Dataflow Graphs 5 | 6 | On Friday, I introduced the concept of dataflow graphs. 
7 | Recall: 8 | To build a dataflow graph, we divide our pipeline into a series of "stages" 9 | To build the graph, we draw: 10 | - One node per stage of the pipeline 11 | - An edge from node A to B (A -> B) if node B directly uses the output of node A. 12 | 13 | === Practice with dataflow graphs === 14 | 15 | At the end of last class period, we introduced a dataset for life expectancy. 16 | We saw a simple data pipeline for this dataset. 17 | Let's separate it into stages as follows: 18 | 19 | (read) = load the CSV input 20 | (max) = compute the max 21 | (min) = compute the min 22 | (avg) = compute the avg 23 | (print) = Print the max, min, and avg 24 | (save) = Save the max, min, and avg in a dataframe to a file 25 | 26 | === Discussion Question and Poll === 27 | 28 | Suppose we draw a dataflow graph with the above nodes. 29 | 30 | 1. What edges will the graph have? 31 | (draw/write all edges) 32 | 33 | 2. Give an example of two stages A and B, where the output for B depends on A, but there is no edge from A to B. 34 | 35 | https://forms.gle/6FB5hhwKpokTHhit9 36 | 37 | Answer: 38 | 39 | -> (max) ----|--> (print) 40 | (read) -> (min) ----| 41 | -> (avg) ----|--> (save) 42 | 43 | Key points: 44 | 45 | Two "independent" computations will not have an edge one way or the other 46 | (printing produces output to the terminal, save produces output to a file, 47 | neither one is used by the other) 48 | 49 | We can read off dependence information from the graph! If there is a path 50 | from A to B, then B depends (either directly or indirectly) on A. 51 | 52 | What graph we get depends on the precise details of our stages. 53 | Ex.: if we load the input three different times, once for the max, once for the min, 54 | once for the avg (and this is listed in our description of the computation), 55 | we would get a different graph with 8 nodes instead of 6. 56 | 57 | In order to draw this thing, we should refer to the particular way that we wrote out 58 | our computation. 59 | 60 | === A few more things === 61 | 62 | A couple of more definitions: 63 | 64 | - A stage B *depends on* a stage A if... 65 | 66 | there is a path from A to B 67 | 68 | point: The dataflow graph reveals exactly which computations depend on which others! 69 | 70 | - A *source* is a node without any input edges 71 | (typically, a node which loads data from an external source) 72 | (corresponds to the E stage of the ETL model) 73 | 74 | - A *sink* is a node without any output edges 75 | (typically, a node which saves data to an external source) 76 | (corresponds to the L stage of the ETL model) 77 | 78 | - A small correction from last time: let's define 79 | an *operator* is any node that is not a source or a sink. 80 | Operators take input data, and produce output data 81 | (corresponds to the T stage of the ETL model) 82 | 83 | Points: 84 | 85 | Every node in the dataflow graph is one of the above 3 types 86 | 87 | The dataflow graph reveals exactly where the I/O operations are for your pipeline. 88 | 89 | We can use the dataflow graph to reveal (visually and conceptually) many useful features of our pipeline. 90 | 91 | In Python: 92 | We could write each node as a stage, as we have been doing before. 
93 | 94 | Let's just write one example, in the interest of time 95 | """ 96 | 97 | def max_stage(df): 98 | return df["Year"].max() 99 | 100 | """ 101 | Reminders for why this helps: 102 | 103 | (Maybe it's overkill for a one-liner example like this) 104 | 105 | - Better code re-use 106 | - Better ability to write unit tests 107 | - Separation of concerns between different features, developers, or development efforts 108 | - Makes the software easier to maintain (or modify later) 109 | - Makes the software easier to debug 110 | 111 | Zooming in on one of these... 112 | (pick one) 113 | 114 | Q: How does this correspond to ETL model? 115 | 116 | ETL is basically a dataflow graph with 3 nodes. 117 | """ 118 | 119 | """ 120 | === Data validation === 121 | 122 | We will talk more about data validation at some point, most likely as part of Lecture 3. 123 | (See failures.py for a further discussion) 124 | 125 | Where in a pipeline is data validation most important? 126 | 127 | (There is more than one place where validation could help, but what's the most obvious place to start?) 128 | 129 | A: Right before transformations 130 | (After sources) 131 | 132 | Why? 133 | - Most common problem: malformed input 134 | - I might want to validate that all of my rows have the type that I'm expecting before 135 | I move to any further processing. 136 | - This might even simplify or speed up the later stages as in those stages I'm allowed 137 | to assume that the data is well-formed. 138 | 139 | NB: You can validate at any point in the graph. (And it can be useful!) 140 | 141 | Validation in a dataflow graph: 142 | we may view each edge as having some "constraints" that are validated by the previous stage, 143 | and assumed by the next. 144 | 145 | === Performance === 146 | 147 | Let's touch on one other thing that we can do with dataflow graphs: 148 | we can use them to think about performance. 149 | 150 | Dataflow graphs are basically the "data processing" equivalent of programs. 151 | 152 | For traditional programs, there are two notions of performance that matter: 153 | 154 | - Runtime or time complexity 155 | - Memory usage or space complexity 156 | 157 | For data processing programs? 158 | 159 | We'll care about the most: 160 | - Running time corresponds to: Throughput & Latency 161 | - Memory usage: you can also measure, we'll talk briefly about ways of thinking about this. 162 | 163 | ********** 164 | 165 | Recap: 166 | 167 | We reviewed the definition of dataflow graph 168 | - divided into sources, operators, and sinks 169 | - def of when to draw an edge 170 | 171 | We practiced drawing dataflow graphs 172 | 173 | We used dataflow graphs to explore various features of a data processing computation 174 | 175 | We argued that analagous to regular computer programs for the traditional computing world, 176 | dataflow graphs are the right notion of computer programs for the data processing world. 177 | 178 | ********** where we ended for today ********** 179 | """ 180 | -------------------------------------------------------------------------------- /lecture2/parts/2-commands.py: -------------------------------------------------------------------------------- 1 | """ 2 | Wednesday, Oct 15 3 | 4 | Part 2: Commands and Platform Dependence 5 | 6 | Continuing the shell. 7 | 8 | Poll: 9 | Which of the following are reasons you might want to use the shell? 
(Select all that apply) 10 | 11 | 12 | 13 | https://forms.gle/YrsjyyXe5Ve1aqEM7 14 | 15 | ----- 16 | 17 | Last time we saw: ls, cd, python3 18 | 19 | (btw: ls is short for "list") 20 | (cd: . = current folder, .. = parent folder) 21 | 22 | (autocomplete; up/down arrow) 23 | 24 | Remaining commands: 25 | 26 | - pytest .py: 27 | Run pytest (Python unit testing framework) on a Python program 28 | 29 | - conda install : 30 | Conda = package manager for various data science libaries & frameworks 31 | This command installs a software package using Conda 32 | 33 | - pip3 install : 34 | Somewhat deprecated nowadays in favor of better package managers 35 | Install python libraries / packages 36 | 37 | Better? 38 | + Use conda 39 | + Use your package manager through your operating system 40 | brew for macOS 41 | apt for Linux 42 | 43 | For modern Python projects: 44 | You should be using venv - makes a virtual package environment per-Python project 45 | 46 | If you ever see a file like .venv in a GitHub repository, that's what that is 47 | 48 | What do all of these programs have in common? 49 | 50 | Commonalities: 51 | They all involved working with system resources in some way. 52 | 53 | Differences: 54 | 55 | - ls: mostly was "informational" command - just figuring out what folder we're 56 | curently inside 57 | 58 | - cd, conda, pip3, python3 - "doing stuff" commands - we're actually modifying the 59 | state of the system when running these. 60 | 61 | Other answers (skip): 62 | 63 | - Different programs may have been developed by different people, in different 64 | teams, in different languages, etc. 65 | 66 | - We can't assume someone wrote a nice GUI for us to connect these programs 67 | or pieces together! (Sadly, often they didn't.) 68 | 69 | Some examples of running these: 70 | """ 71 | 72 | # Try: 73 | # python3, ls, pytest, conda 74 | 75 | # ls: doesn't show hidden folders and files 76 | # On Mac: anything starting with a . is hidden 77 | # Hidden files are used for many important purposes, 78 | # e.g., storing program data, caching information, writing 79 | # configuration for tools like Git, etc. 80 | 81 | """ 82 | when submitting code to others: 83 | best to remove hidden files & folders! 84 | 85 | These can clutter up a project, resulting in a large 86 | .zip file with lots of extra junk/files. 87 | 88 | (Similarly, they can also clutter up a Git repository 89 | - which is why we use .gitignore to tell Git to ignore 90 | certain stuff.) 91 | """ 92 | 93 | # To show hidden + other metadata 94 | # ls -alh 95 | # ^^^^^^^ TL;DR use this to show all the stuff in a folder 96 | 97 | """ 98 | Observations: 99 | 100 | - You can run shell commands in Python 101 | 102 | - You can run Python programs from the shell 103 | (we've already seen how to do this) 104 | 105 | Let's see an example 106 | """ 107 | 108 | # 1. Using the built-in os library 109 | 110 | # os, sys - Python libraries for interacting with 111 | # system resources 112 | 113 | # os is how Python interacts with the operating system 114 | import os 115 | 116 | def ls_1(): 117 | # Listdir: input a folder, show me all the files 118 | # and folders inside it 119 | # The . refers to the current directory 120 | # Also does not include hidden files/folders. 121 | print(os.listdir(".")) 122 | 123 | # ls_1() 124 | 125 | # 2. Running general/arbitrary commands 126 | 127 | # Library for running other commands 128 | import subprocess 129 | 130 | def ls_2(): 131 | subprocess.run(["ls", "-alh"]) 132 | # Equivalent: ls -alh in the shell! 
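
# (An aside, not shown in lecture: subprocess.run can also *capture* the command's
# output as a Python string instead of letting it print straight to the terminal.
# `capture_output=True` and `text=True` are standard arguments in Python 3.7+;
# the name ls_3 is just mine, following the ls_1/ls_2 pattern above.)
def ls_3():
    result = subprocess.run(["ls", "-alh"], capture_output=True, text=True)
    # result.stdout is now an ordinary string we can parse or process further
    print(result.stdout)

# ls_3()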
133 | 134 | # ls_2() 135 | # ^^^^ same output as if I ran the command line directly 136 | 137 | """ 138 | Re: Q in chat 139 | When working with the shell, you are often doing very 140 | platform-specific stuff (platform = operating system, architecture, etc.). 141 | 142 | Example differences across platforms: 143 | - syntax for arguments in Mac/Linux vs Windows Powershell 144 | - capitalization of folders 145 | example: on Mac I run ls Subfolder - not case-sensitive - 146 | to get the files inside "subfolder" 147 | 148 | Won't work on another platform! 149 | 150 | "Works on my machine" 151 | But doesn't work on someone else's. 152 | 153 | Summary points: 154 | 155 | Sometimes in Python we just directly call into 156 | commands, and knowing shell syntax is useful as it 157 | gives a very powerful way for Python programs to interact 158 | with the outside world. 159 | 160 | Everything that can be done in the shell can be done in a Python script 161 | (Why?) 162 | 163 | Everything that can be done in Python can be done in the shell 164 | (Why?) 165 | 166 | So knowing shell stuff might help you with running 167 | systems-level stuff in Python, and vice versa. 168 | 169 | ===== A model for interacting with the shell: 3 types of command ===== 170 | 171 | We saw how to run basic commands in the shell and what it means. 172 | 173 | Three types of commands: 174 | 175 | 1. Information 176 | 2. Getting help 177 | 3. Doing something 178 | 179 | === Informational commands: looking around === 180 | 181 | An analogy: 182 | There used to be a whole genre of text-based adventure games. 183 | the shell is kind of like this. 184 | 185 | e.g. 186 | - Zork (1977): 187 | https://textadventures.co.uk/games/play/5zyoqrsugeopel3ffhz_vq 188 | - Peasant's Quest (2004): 189 | https://homestarrunner.com/disk4of12 190 | 191 | Back in the day you would then open up the game (and be provided no information to help. :-) ) 192 | What would you do first? 193 | 194 | If you like, play around with Zork offline, it can be a fun game/distraction 195 | (bit of a blast from the past) 196 | 197 | https://textadventures.co.uk/games/play/5zyoqrsugeopel3ffhz_vq 198 | 199 | If you know how to play Zork then you know how to work with the shell. 200 | 201 | Recap: 202 | 203 | - We saw some stuff about hidden files/folders (starting with .) 204 | 205 | - We talked about running shell commands from Python using subprocess, and Python system libraries 206 | 207 | - We talked about platform differences and how these can be an issue when working 208 | with the shell 209 | 210 | - We introduced a 3-category analogy for shell commands: Info commands, Help commands, and "Do something" commands. 211 | 212 | ********** Where we ended for today ********** 213 | """ 214 | -------------------------------------------------------------------------------- /lecture1/parts/5-performance.py: -------------------------------------------------------------------------------- 1 | """ 2 | October 8 3 | 4 | Part 5: Performance 5 | 6 | Let's talk about performance! 7 | 8 | But first, the poll. 9 | 10 | === Poll / discussion question === 11 | 12 | True or false: 13 | 14 | 1. Two different ways of writing the same overall computation can have two different dataflow graphs. 15 | 16 | 2. Operators always take longer to run than sources and sinks. 17 | 18 | 3. Typically, every node in a dataflow graph takes the same amount of time to run. 19 | 20 | https://forms.gle/wAAyXqbJaCkEzyZP9 21 | 22 | Correct answers: T, F, F 23 | 24 | For 1: 25 | Max and min? 
26 | Say I have a dataset in a dataframe df with fields x and y 27 | And I want to do df["x"].max() 28 | and df["x"].min() 29 | """ 30 | 31 | # df = load_input_dataset() 32 | # min = df["x"].min() 33 | # max = df["x"].max() 34 | 35 | """ 36 | Dataflow graph with nodes 37 | (load_input_dataset) node 38 | (max) node 39 | (min) node 40 | 41 | --> (max) 42 | (load) 43 | --> (min) 44 | 45 | --> (min) 46 | (load) 47 | --> (max) 48 | 49 | Same graph! Has the same nodes, and has the same edges. 50 | 51 | This is actually a great example of a slightly different phenomenon: 52 | 53 | 1b. Two different ways of writing the same overall computation can have *the same* dataflow graph. 54 | 55 | A different example? 56 | 57 | If one operator does depend on the other, BUT the answer doesn't depend on the order, you could rearrange them to get an example where 58 | - the overall computation was the same, but 59 | - the dataflow graph was different 60 | 61 | example: 62 | - Get row with values x, y, and z 63 | - First, we compute a = x + y 64 | - Then we compute a + z = x + y + z. 65 | 66 | OR, we could 67 | - First, compute b = x + z 68 | - Then we compute b + y = x + y + z. 69 | 70 | We could get the same answer in two different ways. 71 | 72 | And in this example, the dataflow graph is also different: 73 | 74 | (input) -> (compute x + y) => (compute a + z) 75 | (input) -> (compute x + z) => (compute b + z). 76 | 77 | An easier example is .describe() from last time. 78 | """ 79 | 80 | """ 81 | Last time, we reviewed the notions of performance for traditional programs. 82 | 83 | There's two types of performance that matter: time & space. 84 | 85 | For data processing programs? 86 | 87 | ===== Running time for data processing programs ===== 88 | 89 | Running time has two analogs. We will see how these are importantly different! 90 | 91 | - Throughput 92 | 93 | What is throughput? 94 | 95 | Most pipelines run slower the more input items you have! 96 | 97 | Think about how long it will take to run an application that 98 | processes a dataset of university rankings, and finds the top 10 99 | universities by ranking. 100 | 101 | You will find that if measuring the running time of such an application, 102 | a single variable dominates ... 103 | the number of rows in your dataset. 104 | 105 | Example: 106 | 1000 rows => 1 ms 107 | 10,000 rows => ~10 ms 108 | 100,000 rows => ~100 ms 109 | 110 | - Often linear: the more input items, the longer it will take to run 111 | 112 | So it makes sense to measure the performance in a way that takes this 113 | into account: 114 | 115 | running time = (number of input items) * (running time per item) 116 | 117 | (running time per item) = (running time) / (number of input items) 118 | 119 | Throughput is the inverse of this: 120 | Definition / formula: 121 | (Number of input items) / (Total running time). 122 | 123 | Intuitively: how many rows my pipeline is capable of processing, 124 | per unit time 125 | 126 | There's many real-world examples of this concept: 127 | 128 | -> the number of electrons passing through a wire per second 129 | 130 | -> the number of drops of water passing through a stream per second 131 | 132 | -> the number of orders processed by a restaurant per hour 133 | 134 | "number of things done per unit time" 135 | 136 | Is this the only way to measure performance? 137 | 138 | We also care about the individual level view: how long it takes to process 139 | a *specific* item or order. 
140 | 141 | We also might measure, for an individual order, how long it takes for 142 | results for that order to come out of our pipeline. 143 | 144 | Latency = 145 | (time at which output is produced) - (time at which input is received) 146 | 147 | This is called latency. 148 | 149 | It almost seems like we've defined the same thing twice? 150 | 151 | But these are not the same. 152 | Simplest way to see this is that we might process more than one item at 153 | the same time. 154 | 155 | Ex: 156 | Restaurant processes 60 orders per hour 157 | 158 | Scenario I: 159 | Process 5 orders every 5 minutes, get those done, and move on to 160 | the next batch 161 | 162 | Scenario II: 163 | Process 1 order every 1 minute, get it done, and then move on to 164 | the next order. 165 | 166 | In either case, at the end of the hour, I've processed all 60 orders! 167 | 168 | Throughput in Scenario I? In Scenario II? 169 | I: 170 | Throughput = (Number of items processed) / (Total running time) 171 | 172 | 60 orders / 60 minutes = 1 order / minute. 173 | II: 174 | Throughput = (Number of items processed) / (Total running time) 175 | 176 | 60 orders / 60 minutes = 1 order / minute 177 | 178 | What about latency? 179 | 180 | I: 181 | (time at which output is produced) - (time at which input is received) 182 | 183 | = roughly 5 minutes 184 | 185 | II: 186 | (time at which output is produced) - (time at which input is received) 187 | 188 | = roughly 1 minutes 189 | 190 | Both measures of running time at a "per item" or "per row" level, 191 | but they can be very different. 192 | 193 | It is NOT always the case that Throughput = 1 / Latency 194 | or that Throughput and Latency are directly correlated (or inversely correlated). 195 | 196 | ===== Recap ===== 197 | 198 | We talked about how computations are represented as dataflow graphs 199 | to illustrate some important points: 200 | - The same computation (computed in different ways) can have two different dataflow graphs 201 | - The same computation (computed in different ways) could have two of the same dataflow graph 202 | 203 | We introduced throughput + latency 204 | - Restaurant analogy 205 | - We saw formulas for each 206 | - Both measures of performance in terms of running time at an "individual row" level, but throughput is an aggregate measure and latency is viewed at the level of an individual row. 207 | 208 | ********** Where we ended for today ********** 209 | """ 210 | -------------------------------------------------------------------------------- /lecture4/parts/1-motivation.py: -------------------------------------------------------------------------------- 1 | """ 2 | Lecture 4: Parallelism 3 | 4 | Part 1: Motivation 5 | (Oct 24) 6 | 7 | === Discussion Question and Poll === 8 | 9 | Which of the following is an accurate statement about Git's core philosophies? 10 | 11 | https://forms.gle/zB1qhdrP2xXswHMX8 12 | 13 | ===== Introduction ===== 14 | 15 | So far, we know how to set up a basic working data processing pipeline 16 | for our project: 17 | 18 | - We have a prototype of the input, processing, and output stages 19 | and we can think about our pipeline as a dataflow graph (Lecture 1) 20 | 21 | - We have used scripts and shell commands to download any necessary 22 | data, dependencies, and set up any other system configuration (Lecture 2) 23 | (running these automatically if needed, see HW1 part 3) 24 | 25 | How did we build our pipeline? 
26 | 27 | So far (in Lecture 1 / HW1), we have been building our pipelines in Pandas 28 | 29 | - Pandas can efficiently representing data in memory as a DataFrame 30 | 31 | - Pandas uses vectorization - you studied this a little in HW1 part 2 32 | 33 | The next thing we need to do is **scale** our application. 34 | 35 | === What is scaling? === 36 | 37 | Scalability is the ability of a system to handle an increasing amount of work. 38 | 39 | === Why scaling? === 40 | 41 | Basically we should scale if we want one of two things: 42 | 43 | 1. Running on **more** input data 44 | e.g.: 45 | + training your ML model on the entire internet instead of a dataset of 46 | Wikipedia pages you downloaded to a folder) 47 | + update our population pipeline to calculate some analytics by individual city, 48 | instead of just at the country or continent level 49 | 50 | 2. Running **more often** or on **more up-to-date** data 51 | e.g.: 52 | + re-running your pipeline once a day on the latest analytics; 53 | + re-running every hour to ensure the model or results stay fresh; or even 54 | + putting your application online as part of a live application 55 | that responds to input from users in real-time 56 | (more on this later in the quarter!) 57 | 58 | Questions we might ask: 59 | 60 | - How likely would it be that you want to scale for a toy project? 61 | For an industry project? 62 | 63 | A: probably more likely for an industry project. 64 | 65 | - What advantages would scaling have on an ML training pipeline? 66 | 67 | === An example: GPT-4 === 68 | 69 | Some facts: 70 | + trained on roughly 13 trillion tokens 71 | + 25,000 A100 processors 72 | + span of 3 months 73 | + over $100M cost according to Sam Altman 74 | https://www.kdnuggets.com/2023/07/gpt4-details-leaked.html 75 | https://en.wikipedia.org/wiki/GPT-4 76 | 77 | (And the next generation of models have taken even more) 78 | 79 | Contrast: 80 | 81 | our population dataset in HW1 is roughly 60,000 lines and roughly 1.6 MB on disk.' 82 | 83 | Over 1 million times less data than the amount of tokens for GPT-4! 84 | 85 | Conclusion: scaling matters. 86 | NVIDIA stock: 87 | 88 | https://www.google.com/finance/quote/NVDA:NASDAQ?window=5Y 89 | 90 | === Thinking about scalability === 91 | 92 | Points: 93 | 94 | - We can think of scaling in terms of throughput and latency 95 | 96 | See extras/scaling-example.png for an example! 97 | 98 | If your application is scaling successfully, 99 | double the # of workers or processors => double the throughput 100 | (best case scenario) 101 | 102 | double the # of processors => half the latency? (Y or N) 103 | 104 | A: Not necessarily 105 | 106 | Often latency is not affected in the same way. 107 | 108 | If we can scale our application successfully, 109 | we are typically looking to increase the throughput of the application. 110 | 111 | === A note about Pandas === 112 | 113 | Disdavantage of Pandas: does not scale! 114 | 115 | - Wes McKinney, the creator of Pandas: 116 | "my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset" 117 | 118 | https://wesmckinney.com/blog/apache-arrow-pandas-internals/ 119 | 120 | Exercise: 121 | 122 | Check how much RAM your personal laptop has. 123 | According to McKinney's estimate, how much how large of a dataset in population.csv 124 | could your laptop handle? 
125 | 126 | (Let me show how to do this) 127 | 128 | My answer: 18 GB 129 | 130 | Next time at the start of class, we'll poll the class for various 131 | answers and figure out how large of a dataset we could handle in Pandas. 132 | 133 | Recap: 134 | 135 | - We went over midterm topics 136 | - We saw some motivation for why you might want to scale your application 137 | - We defined scalability and types of scalability (running on more data vs. running more often) 138 | - Pandas does not scale in this sense. 139 | 140 | ----- Where we ended for today ----- 141 | 142 | (Finishing up) 143 | 144 | Recall: statement from Wes McKinney 145 | 146 | Uncomment and run this code to find out your RAM 147 | """ 148 | 149 | import subprocess 150 | import platform 151 | 152 | def get_ram_1(): 153 | system = platform.system() 154 | if system == "Darwin": 155 | subprocess.run(["sysctl", "-nh", "hw.memsize"]) 156 | elif system == "Linux": 157 | subprocess.run(["grep", "MemTotal", "/proc/meminfo"]) 158 | elif system == "Windows": 159 | print("Windows unsupported, try running in WSL?") 160 | else: 161 | print(f"Unsupported platform: {system}") 162 | 163 | # Run the above 164 | print("Amount of RAM on this machine (method 1):") 165 | get_ram_1() 166 | 167 | import psutil 168 | def get_ram_2(): 169 | ram_bytes = psutil.virtual_memory().total 170 | ram_gb = ram_bytes / (1024 ** 3) 171 | print(f"{ram_gb:.2f} GB") 172 | 173 | print("Amount of RAM on this machine (method 2):") 174 | get_ram_2() 175 | 176 | """ 177 | Poll: 178 | 179 | Answers: 180 | 181 | Method 1: 182 | 8.5, 8.6, 17.1, 15, 17.1, 16.5, 17.1 183 | 184 | Method 2: 185 | 8, 8, 16, 7.62, 15.61, 14.45, 15.4, 7.82, 16, 15.3, 31.6 186 | """ 187 | 188 | method1 = [8.5, 8.6, 17.1, 15, 17.1, 16.5, 17.1] 189 | method2 = [8, 8, 16, 7.62, 15.61, 14.45, 15.4, 7.82, 16, 15.3, 31.6] 190 | 191 | average1 = sum(method1) / len(method1) 192 | average2 = sum(method2) / len(method2) 193 | 194 | print(f"method 1 avg: {average1}", f"method 2 avg: {average2}") 195 | 196 | ram_needed = average2 / 10 197 | 198 | print(f"Using method 2: we can handle a dataset up to size: {ram_needed} GB") 199 | 200 | """ 201 | Please fill out your answers in the poll: 202 | 203 | https://forms.gle/sqGrHBdQBrykDoSdA 204 | 205 | Roughly, we can process like _____ as much data 206 | and then we'll a bottleneck. 207 | 208 | population.csv from HW1: 1.5 MB 209 | 210 | According to the class average, we could go up to 1000x the population 211 | and still handle it with Pandas, beyond that we will run out of space, 212 | according to McKinney's statement. 213 | """ 214 | -------------------------------------------------------------------------------- /lecture2/parts/1-introduction.py: -------------------------------------------------------------------------------- 1 | """ 2 | Lecture 2: The Shell 3 | 4 | Part 1: Introduction and Motivation 5 | 6 | This lecture will cover the shell (AKA: terminal or command line), 7 | including: 8 | 9 | - command line basics: cd, ls, cat, less, touch, text editors 10 | 11 | - calling scripts from the command line 12 | 13 | - calling command line functions from a Python script 14 | 15 | - git basics: status, add, push, commit, pull, rebase, branch 16 | 17 | - (if time) searching / pattern matching: find, grep, awk, sed 18 | 19 | Background: 20 | I will assume no background on the shell other than the 21 | one or two basic commands we have been typing in class. 
22 | (Like `python3 lecture.py`) 23 | 24 | === Discussion Question & Poll === 25 | 26 | Review from last time 27 | 28 | 1. Which of the following are valid units of latency? 29 | Hours 30 | Items / second 31 | Milliseconds 32 | Nanoseconds 33 | Rows / item 34 | Rows / minute 35 | Seconds 36 | Seconds / item 37 | 38 | 2. True or false? 39 | 40 | Throughput always (increases, decreases, is constant) with the size of the dataset 41 | Running time generally increases with the size of the dataset 42 | Latency is often measured using a dataset with only one item 43 | Latency always decreases if more than one item is processed at the same time 44 | 45 | https://forms.gle/JE4R1bMU13JAvAE36 46 | 47 | Running time generally increases - True 48 | 49 | Throughput = N / T 50 | 51 | N = number of input items 52 | T = total running time 53 | 54 | Sometimes throughput goes up, sometimes it goes down 55 | 56 | N / T - often roughly linear, but not exactly linear! 57 | 58 | - if N = 1, often the system can't benefit from "scale", 59 | so throughput will be quite low 60 | 61 | - is N increases (10, 100, 1000, ...) the system will start to 62 | benefit from scale, so throughput will increase 63 | 64 | - if N -> infinity (more data than the laptop/machine can handle at all), throughput will again tank because the system will 65 | just completely crash or lag / be unable to do things. 66 | 67 | ===== Introduction ===== 68 | 69 | === Scripting and the "Glue code" problem === 70 | 71 | Programs are not very useful on their own. 72 | We need a way for programs to talk to each other! 73 | That is, we need to "glue" programs or program components together. 74 | 75 | Examples: 76 | 77 | - Our data processing pipeline talks to the operating system when it 78 | asks for the input file life-expectancy.csv. 79 | 80 | - Python module example: If another script wants to use our code, it must import it 81 | which requires the Python runtime to find the code on the system 82 | (see `module_test.py` for example) 83 | 84 | - Much of Pandas and Numpy are written in C. So we need our Python 85 | code to call into C code. 86 | 87 | What tools do people use to "glue" programs together? 88 | 89 | 1. Using system libraries (like os and sys in Python) 90 | 91 | 2. Module systems within a programming language (`import` in Python) 92 | 93 | 3. Shell: To talk to a a C program from Python, one way would be to run commands through the shell 94 | 95 | Other solutions: 96 | 97 | 4. Scripting languages: Python, others (e.g. Ruby, Perl) 98 | 99 | For the most part, we can assume in this class that much of this interaction 100 | happens in Python 101 | (In fact, we will see how to do most things in today's lecture both 102 | in the shell, AND in Python!) 103 | But it is still useful to know how this program 104 | interaction happens "under the hood": 105 | 106 | When Python interacts with the operating system and with programming languages other than Python, *internally* a common way to do this is through the shell. 107 | 108 | ---- 109 | 110 | Let's open up the shell now. 111 | """ 112 | 113 | # Mac: Cmd+Space Terminal 114 | # VSCode: Ctrl+` (backtick) 115 | # GitHub Codespaces: Bottom part of the screen 116 | 117 | # Once we have a shell open, we have a "command prompt" where 118 | # we can run arbitrary commands/programs on the computer (so it's 119 | # like an admin window into your machine.) 120 | 121 | """ 122 | Questions: 123 | 124 | + If I can run commands from Python (we'll see that you can), then why should I use the shell? 
125 | 126 | + If I can use a well-designed GUI app (such as my Git installation), why should I use the shell? 127 | 128 | Examples where programmers and data scientists regularly use the shell: 129 | 130 | - You have bought a new server machine from Dell, and you want to connect to 131 | it to install some software on it. 132 | 133 | - You bought a virtual machine instance on AWS, and you want to connect to it 134 | to run your program. 135 | 136 | - You want to set up a Docker container with your application so that anyone 137 | can run and interact with it. You need to write a Dockerfile to do this. 138 | 139 | - Debugging software installations - missing dependencies, missing libraries 140 | 141 | (I have Python3 installed, but my program isn't recognizing it) 142 | Where is the software? Where is it expected to be? 143 | ---> move it to the correct location 144 | 145 | - You want to compile and run an experimental tool that was published on GitHub 146 | 147 | Or even, simply: 148 | 149 | - You have written some code, you want to send it to me so I can try it out. 150 | 151 | Shell on different operating systems? 152 | 153 | - Mac, Linux: Terminal app 154 | - Windows is a bit different, commands by default are very different 155 | option 1: 156 | recommend the most: WSL (Windows Subsystem for Linux) 157 | dropdown next to your shell window -> choose which type of 158 | shell you want 159 | With WSL, should be able to select a Ubuntu shell. 160 | 161 | option 2: 162 | Use the shell built into VSCode 163 | 164 | (don't recommend powershell) 165 | """ 166 | 167 | # python3 lecture.py 168 | print("Welcome to ECS 119 Lecture 2.") 169 | 170 | # python3 -i lecture.py 171 | # Quitting: Ctrl-C, Ctrl-D 172 | 173 | """ 174 | === What is the Shell? === 175 | 176 | The shell is a way to talk to your computer, run arbitrary commands, 177 | and connect those commands together. 178 | 179 | Examples we have seen: 180 | 181 | - ls (stands for "list") 182 | Show all files/folders in the current folder 183 | 184 | - cd: change directory 185 | 186 | (ls/cd often work together) 187 | 188 | - python3 .py: Run the python code found in .py 189 | 190 | NOTE: tab-autocomplete: very useful 191 | (saves keystrokes) 192 | (will cycle through theh options if there's more than one.) 193 | 194 | Very quick recap: 195 | We introduced the shell/terminal/command prompt as a way to solve the "glue code" problem 196 | 197 | We went through some motivation for when data scientists might need to use the shell (esp. to interact with things like remote servers), and saw some basic commands. 198 | 199 | We'll pick this up on Wednesday, and remember that we will be at 200 | 11am on Zoom, with discussion section in the usual classroom/class time. 201 | 202 | ***** Where we ended for today. ***** 203 | """ 204 | -------------------------------------------------------------------------------- /lecture2/parts/3-informational.py: -------------------------------------------------------------------------------- 1 | """ 2 | Friday, October 17 3 | 4 | Part 3: Informational Commands 5 | 6 | Discussion Question & Poll: 7 | 8 | 1. Which of the following are a good use cases for things to list in .gitignore? (Select all that apply) 9 | 10 | 11 | 2. Platform-specific things to be aware of could include... (Select all that apply) 12 | 13 | 14 | https://forms.gle/HmmT8BjXtiBvferRA 15 | 16 | Some notes: 17 | 18 | Q1: 19 | - Not all hidden files are unimportant! 
20 | Some, like .gitignore may be important, or may be useful to track 21 | with the repository. 22 | 23 | Other hidden files, like .DS_Store are not important and can be ignored. 24 | 25 | Q2: 26 | - Python is cross-platform (at least for things like a Hello World! program) 27 | and will work on any operating system 28 | 29 | - definition: what is a Platform anyway? 30 | 31 | Platform = The operating system + the architecture + any libraries or other environment packages that are installed 32 | 33 | Review poll answers: exams/poll_answers.md 34 | 35 | Continuing the shell: 36 | 37 | Last time: 38 | 39 | I introduced a model for interacting with the shell which I called the 40 | 3-part model: 41 | - Informational commands 42 | - Help commands 43 | - Doing stuff commands 44 | 45 | Analogy: 46 | I mentioned this is kind of like playing a text-based adventure game 47 | like "Zork" (1970s), many other old games 48 | 49 | === Informational commands === 50 | 51 | Just as in a text-based adventure, 52 | the most important thing you need to know when opening a shell is 53 | how to "look" around. What do you see? 54 | 55 | Key features of such commands: 56 | - Don't modify your system state at all 57 | - Might tell you some information about your system and things around you, 58 | and what you might want to do next. 59 | 60 | The same approach to progressing the game in Zork applies to the shell! 61 | Including external tools people have built, and even commands outside of the shell, like 62 | functions in Python: 63 | knowing how to "see" the current 64 | state relevant to your command is often the first step to get more comfortable with the command. 65 | 66 | So how do we "look around"? 67 | 68 | - ls 69 | we have already seen - list files/directories in the current location 70 | 71 | "current working directory" 72 | 73 | - echo $PWD -- get our current location 74 | PWD = Print Working Directory 75 | echo = Repeat whatever I said 76 | echo "text" -- repeat text 77 | 78 | $VAR - means a variable with name VAR 79 | 80 | These are called "environment variables" 81 | 82 | - echo $PATH -- get other locations where the shell looks for programs 83 | If you've had any difficulties installing software, you may have heard of 84 | the path! 85 | 86 | In order for software to actually run, you need to add it to your PATH. 87 | python3 -- "command not found"??? 88 | conda -- "command not found"??? 89 | It's possible that you need to add something to your PATH. 90 | 91 | When we do something like `python3 --version` -- we're checking that you 92 | have the software installed, AND that it's been added to your path. 93 | 94 | It's a common source of confusion to have multiple installations of the same 95 | software on your machine -- this is common, for example, with Python 96 | You need to add all installations, or, only the most relevant/recent installation 97 | to your PATH to ensure that that's the one that gets run. 98 | 99 | Programs in $PATH are available both programmatically (to other programs), 100 | and to the user. 101 | 102 | - echo $SHELL -- get the current shell we are using 103 | 104 | Point: 105 | There are different shells and terminal implementations. 106 | The default one on MacOS these days is zsh 107 | bash is another common/very well-used shell (for example, on Linux systems) 108 | 109 | If you're on Windows I recommend using WSL so that you have access to a 110 | similar shell (usually bash) 111 | 112 | I can run one shell from another shell. 
Try: 113 | - bash 114 | - zsh 115 | - Kind of like the Python3 command prompt 116 | 117 | Are there advantages to one shell over another? 118 | 119 | Yes, there are also some advanced/modern shells that you can install 120 | 121 | - maybe with some interesting graphical interface 122 | - maybe with some interesting color coding 123 | - maybe with some AI support 124 | 125 | Most shells try to support a similar syntax so that people don't get 126 | confused going from one shell to another. 127 | 128 | ---- other possible answers (skipped) ---- 129 | 130 | Usability: Some modern shells have fancy things like syntax highlighting, 131 | GUIs that you can click around in, etc. 132 | 133 | Portability: You'll want a shell that sort of behaves as a Unix-like shell 134 | Avoid: PowerShell (Windows syntax) 135 | 136 | (Mac and Linux systems are Unix-based. Windows is based on a totally different 137 | OS architecture.) 138 | 139 | A system will come with a built-in shell that you would start out with 140 | If you want a different one you could use that shell to install another shell. 141 | 142 | === Environment variables === 143 | 144 | The $ are called "environment variables" -- there are others! 145 | These represent the current state of our shell environment. 146 | 147 | When I write x = 3 in Python, x becomes a local variable assigned to the integer "3" 148 | Similarly, $PATH and $PWD are local variables in the shell. 149 | They're assigned to some values, and when written the shell will expand them out 150 | to whatever values they're assigned to. 151 | 152 | When we run `echo $PWD` what's actually happening: 153 | - $PWD gets expanded out to its value (/Users/..../119/lecture2) 154 | - This value gets printed back out to the shell output by `echo` 155 | 156 | You can also define and set your own environment variables. 157 | 158 | Are environment variables local? Will they persist after the shell session terminates? 159 | 160 | A: 161 | No, they won't 162 | But there is a way to make things persist and these are the shell config files like 163 | - .bash_profile, .bashrc, .zshrc, ... 164 | - These are files with random code in them that gets executed whenever you open a shell. 165 | + For zsh, every time I open a shell, .zshrc is executed 166 | - This is why we don't have to keep adding Python, conda, etc. to the $PATH every 167 | time we open a new shell. 168 | 169 | This system - of environment variables and $PATH and .zshrc, etc. is 170 | the precarious fabric on which all software installation is working under the 171 | hood. 172 | 173 | In case you need to access similar functionality from a Python script: 174 | """ 175 | 176 | # with a built-in Python library 177 | def pwd_1(): 178 | print(os.getcwd()) 179 | 180 | # pwd_1() 181 | 182 | # with subprocess (run an arbitrary command) 183 | # This one is a bit harder 184 | def pwd_2(): 185 | # os.environ is the Python equivalent of the shell $ indicator. 186 | subprocess.run(["echo", os.environ["PWD"]]) 187 | 188 | # pwd_2() 189 | 190 | # In fact, we could just use this directly as well, and this offers a third way 191 | def pwd_3(): 192 | print(os.environ["PWD"]) 193 | 194 | # pwd_3() 195 | 196 | # Q: What happens when we run this from a different folder? 197 | 198 | # It matters what folder you run a program or command from! 
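
# (One extra illustration, my own and not from lecture: the working directory is a
# property of the *running process*, and it can change while a program runs.
# os.chdir updates what os.getcwd() reports, but it does NOT rewrite the $PWD
# environment variable inherited from the shell -- another way the two can disagree.)
import os  # harmless if it was already imported above

def pwd_4():
    print("Before:", os.getcwd())
    os.chdir("..")  # move up to the parent folder
    print("After: ", os.getcwd())
    print("$PWD still says:", os.environ.get("PWD", "(not set)"))

# pwd_4()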
199 | 200 | """ 201 | Recap: 202 | 203 | - We talked about informational commands - ways to get the state of the system 204 | 205 | + Current working directory 206 | + Files/folders in the system (or in the current woroking directory) 207 | + The shell that's running 208 | 209 | - We talked about environment variables 210 | 211 | $PWD, $PATH 212 | 213 | These are important pieces of system information 214 | 215 | - We talked a little bit about .zshrc, .bash_profile, etc. which are 216 | shell configuration files 217 | 218 | + Lists of shell commands that run when you open a shell. 219 | 220 | BTW, virtual machines and things like Docker also have similar such config files 221 | 222 | Dockerfile -- list of shell commands that gets run. 223 | 224 | Next time we will talk about: 225 | - help commands, doing stuff commands 226 | """ 227 | -------------------------------------------------------------------------------- /lecture1/parts/3-dataflow-graphs.py: -------------------------------------------------------------------------------- 1 | """ 2 | Friday, October 3 3 | 4 | Part 3: 5 | From ETL to Dataflow Graphs. 6 | 7 | === Poll === 8 | 9 | Which of the following are most likely advantages of writing or rewriting 10 | a data processing pipeline as well-structured software (separated into modules, classes, and functions)? 11 | 12 | https://forms.gle/akNYHe8SY1CSU5KT9 13 | 14 | === Dataflow graphs === 15 | 16 | We can view all of the above steps as something called a dataflow graph. 17 | 18 | ETL jobs can be thought of visually like this: 19 | 20 | (Extract) -> (Transform) -> (Load) 21 | 22 | This is a Directed Acyclic Graph (DAG) 23 | 24 | A graph is set a nodes and a set of edges 25 | 26 | () () -> () 27 | () -> () 28 | 29 | Nodes = points 30 | Edges = arrows between points 31 | 32 | We're going to generalize the ETL model to allow an arbitrary 33 | number of nodes and edges. 34 | 35 | Q: What are the nodes? What are the edges? 36 | 37 | - Nodes: 38 | Each node is a function that takes input data and produces output data. 39 | 40 | Such a function is called an *operator.* 41 | 42 | In Python: 43 | """ 44 | 45 | # example: operator with 3 inputs: 46 | def stage(input1, input2, input3): 47 | # do some processing 48 | # return the output 49 | return output 50 | 51 | # example: operator with 0 inputs: 52 | def stage(): 53 | # do some processing 54 | # return the output 55 | return output 56 | 57 | """ 58 | - Edges: 59 | Edges are datasets! 60 | They can either be: 61 | - individual rows of data, or 62 | - multiple rows of data... 63 | 64 | - More specifically, we draw an edge 65 | from a node (A) to a node (B) if 66 | the operator B directly uses the output of operator (A). 67 | 68 | In our previous example? 69 | 70 | (1) Extract = loading the set of websites sessions into a Pandas dataframe 71 | + Input: None (because we loaded from a file) 72 | + Output: A Pandas DataFrame 73 | 74 | (2) Transform = taking in the Pandas DataFrame and returning the session with the 75 | maximum time spent 76 | + Input: The pandas DataFrame from the Extract stage 77 | + Output: the maximum session 78 | 79 | (3) Load = taking the maximum session and saving that to a file (in our case, save.txt) 80 | + Input: the maximum session from the Transform stage 81 | + Output: None (because we saved to a file) 82 | 83 | Graph: 84 | 85 | (1) -> (2) -> (3) 86 | 87 | Questions: 88 | 89 | - Why is there an edge from (1) to (2)? 
90 | Because the Transform stage uses the output from the Extract stage 91 | 92 | - Why is there NOT an edge from (2) to (1)? 93 | Stage (1) doesn't use the output from stage (2) 94 | 95 | - Why is there NOT an edge from (1) to (3)? 96 | Stage (3) doesn't directly use the output from stage (1). 97 | 98 | The graph is **acyclic,** meaning it does not have loops. 99 | 100 | (This is why it's a "directed acyclic graph" (DAG)). 101 | 102 | - (Why can we assume this?) 103 | + It doesn't appear to make sense for stage (1) to use stage (2)'s output, 104 | and stage (2) to use stage (1)'s output. 105 | + Generalizations of this are possible, but we will not get into this now. 106 | 107 | === Q: Do all data processing pipelines look like series of stages? === 108 | 109 | A "very complicated" data processing job: 110 | a long series of stages 111 | 112 | (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> ... -> (10) 113 | 114 | But not all data processing dataflow graphs have this form! 115 | Let's do a quick example 116 | 117 | Suppose that in addition to the maximum session, we want the minimum session. 118 | """ 119 | 120 | def stage1(): 121 | data = { 122 | "User": ["Alice", "Alice", "Charlie"], 123 | "Website": ["Google", "Reddit", "Wikipedia"], 124 | "Time spent (seconds)": [120, 300, 240], 125 | } 126 | df = pd.DataFrame(data) 127 | return df 128 | 129 | def stage2(df): 130 | t = df["Time spent (seconds)"] 131 | # Max of t 132 | max = t.max() 133 | # Filter 134 | # This syntax in Pandas for filtering rows 135 | # df[colname] 136 | # df[row filter] (row filter is some sort of Boolean condition) 137 | return df[df["Time spent (seconds)"] == max] 138 | 139 | def stage3(df): 140 | # Save the dataframe somewhere 141 | with open("save.txt", "w") as f: 142 | print(df, file=f) 143 | 144 | # New stage: 145 | # Also compute the minimum session 146 | def stage4(df): 147 | t = df["Time spent (seconds)"] 148 | min = t.min() 149 | return df[df["Time spent (seconds)"] == min] 150 | 151 | # Finally, print the output from stage4 152 | def stage5(df): 153 | # print: the .head() of the dataframe, which will give you the first 154 | # few rows. 155 | print(df.head()) 156 | 157 | # Try running the pipeline 158 | df1 = stage1() 159 | df2 = stage2(df1) 160 | df3 = stage3(df2) # uses output from stage 2 161 | df4 = stage4(df1) 162 | df5 = stage5(df4) # uses output from stage 4 163 | 164 | """ 165 | As a dataflow graph: 166 | 167 | (1) -> (2) -> (3) 168 | -> (4) -> (5) 169 | 170 | This is our dataflow graph for this example. 171 | 172 | Seems like a simple idea, but this can be done for any data processing pipeline! 173 | 174 | We will see that this is a very powerful abstraction. 175 | 176 | === Why is this useful? === 177 | 178 | - We'll use this to think about development, testing, and validation 179 | - We'll use this to think about parallelism 180 | - We'll use this to think about performance. 181 | """ 182 | 183 | """ 184 | A slightly more realistic example 185 | 186 | What I want to practice: 187 | 1. Thinking about the stages involved in a data processing computation as separate stages 188 | (List out all of the data processing stages) 189 | 2. Writing down the dataflow graph 190 | 3. Translating that to Python code using PySpark by writing a separate Python function 191 | for each stage 192 | 193 | Let's consider how to write a minimal data processing pipeline 194 | as a more structured software pipeline. 195 | 196 | First thing we need is a dataset! 
197 | 198 | Useful sites: 199 | - https://ourworldindata.org/data 200 | - https://datasetsearch.research.google.com/ 201 | - sklearn: https://scikit-learn.org/stable/api/sklearn.datasets.html 202 | 203 | The dataset we are going to use: 204 | life-expectancy.csv 205 | 206 | We will also use Pandas, as we have been using: 207 | 208 | Useful tutorial on Pandas: 209 | https://pandas.pydata.org/docs/user_guide/10min.html 210 | https://pandas.pydata.org/docs/user_guide/indexing.html 211 | """ 212 | 213 | # Step 1: Getting a data source 214 | # creating a DataFrame 215 | # DataFrame is just a table: it has rows and columns, and importantly, 216 | # each column has a type (all items in the column must share the same 217 | # type, e.g., string, number, etc.) 218 | df = pd.read_csv("life-expectancy.csv") 219 | 220 | # To play around with our dataset: 221 | # python3 -i lecture.py 222 | 223 | # We find that our data contains: 224 | # - country, country code, year, life expectancy 225 | 226 | # What should we compute about this data? 227 | 228 | # # A simple example: 229 | # min_year = df["Year"].min() 230 | # max_year = df["Year"].max() 231 | # print("Minimum year: ", min_year) 232 | # print("Maximum year: ", max_year) 233 | # avg = df["Period life expectancy at birth - Sex: all - Age: 0"].mean() 234 | # print("Average life expectancy: ", avg) 235 | 236 | # # Tangent: 237 | # # We can do all of the above with df.describe() 238 | 239 | # # Save the output 240 | # out = pd.DataFrame({"Min year": [min_year], "Max year": [max_year], "Average life expectancy": [avg]}) 241 | # out.to_csv("output.csv", index=False) 242 | 243 | """ 244 | Q for next time: rewrite this as a Dataflow graph using the steps above 245 | 246 | === Recap === 247 | 248 | We learned that ETL jobs are a special case of dataflow graphs, 249 | where we have a set of nodes (operators/stages) and edges (which are drawn when the output 250 | of one operator or stage depends on the output of the previous operator or stage) 251 | 252 | Revisiting the steps above: 253 | 1. Write down all the stages in our pipeline 254 | 2. Draw a dataflow graph (one node per stage) 255 | 3. Implement the code (one Python function per stage) 256 | 257 | We have done 1 and (sort of) 3, we will do 2 at the start of class next time. 258 | 259 | ********* Where we ended for today ********** 260 | """ 261 | -------------------------------------------------------------------------------- /lecture2/parts/5-git.py: -------------------------------------------------------------------------------- 1 | """ 2 | Wednesday, October 22 3 | 4 | Part 5: Git 5 | 6 | Poll quesiton: 7 | 8 | 1. Give an example of a command that uses a positional argument 9 | 10 | 2. Give an example of a command that uses a named argument 11 | 12 | 3. Why do you think that commands have both positional and named arguments? 
13 | 14 | A) There is no reason for this, it's a historical accident 15 | B) Positional arguments are more often optional, named arguments are more often required 16 | C) Named arguments are more often optional, positional arguments are more often required 17 | D) Named arguments can be combined with positional arguments to specify options or modify command behavior 18 | E) Named arguments emphasize the intended purpose of the argument for readability purposes 19 | F) Named arguments allow easily specifying Boolean flags (like turning debug mode on or off) 20 | 21 | https://forms.gle/UNCmxWcRE53MkLNv7 22 | 23 | Comments: 24 | 25 | Analogy: 26 | positional argument 27 | cd dir 28 | def cd(dir) 29 | 30 | named argument 31 | python --version 32 | def python(version=True) 33 | 34 | Correct answers: C, D, E, F 35 | 36 | Named arguments are more often optional 37 | 38 | Named arguments are often used as configuration flags 39 | 40 | python3 -i <--- modifies the "way" that we run Python 41 | 42 | Another difference that wasn't mentioned: 43 | 44 | Typically the order does not matter in named arguments! 45 | 46 | ls -alh 47 | ls -lah 48 | ls -a -l -h 49 | 50 | If the user might not want to remember which order to call 51 | the args in, another good use case for named. 52 | 53 | ===== Git ===== 54 | 55 | Git follows the same model as other shell commands! 56 | 57 | Informational commands: 58 | - git status 59 | 60 | Info returned by git status? 61 | + "On branch" main 62 | + Whether my branch is up to date 63 | + Info about modified files 64 | + Do I have any changes to commit 65 | 66 | - git log 67 | 68 | (down arrow, up arrow, q to quit) 69 | 70 | Shows a list of commits that have been made to the repository. 71 | 72 | Mental model of git: it's a tree 73 | 74 | root: 75 | (Instructor initialized the repository) 76 | 77 | Every time a change is made to the code, it grows the tip of the tree 78 | After instructor posts lecture 1 and 2, 79 | 80 | root 81 | | 82 | lecture1 83 | | 84 | lecture2 85 | 86 | If the TA simultaneously makes changes to lecture1 87 | Two diverging branches: 88 | 89 | root 90 | | 91 | lecture1 92 | | \ 93 | lecture2 (instructor's changes) lecture1 (TA's changes) 94 | 95 | When you check out the code, you're at a particular point in 96 | the tree. 97 | 98 | Git is sort of the opposite of keeping everything up to date :-) 99 | 100 | Modern webapp philosophy: everything should sync automatically! 101 | 102 | Git philosophy: nothing should sync automatically! 103 | 104 | That means that everyone opts in to what changes they do/do not 105 | want for the code. 106 | 107 | Side note: interesting questions about philosophy of collaboration 108 | and why we may or may not want to share our work/progress with others. 109 | 110 | Imagine two people working on the same branch at once, 111 | why would that be a problem? 112 | One person's work could break another person's work :-( 113 | 114 | + This gets more common the more people are working on a shared 115 | project. 116 | 117 | + Unit tests fail, code fails to compile, etc. 118 | 119 | + In this scenario, Git saves you: it says, you get to "check 120 | out" your copy of the code and be assured that your copy is 121 | yours to play around with. 
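(A concrete way to see this tree on your own machine -- we did not run this in class, but the flags are standard git: `git log --graph --oneline --all` draws the commit history as ASCII art, with diverging branches shown as diverging lines. As with the other commands, you could also call it from Python, e.g. with a small helper like this sketch:)

def git_tree():
    # Draw the commit tree for all branches as ASCII art (assumes subprocess is imported)
    subprocess.run(["git", "log", "--graph", "--oneline", "--all"])

# git_tree()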
122 | 123 | Corollaries: 124 | 125 | - Everyone can be at a different point of the tree 126 | 127 | - Different people can work on different branches 128 | 129 | - If two people try to "push" the code - publishing it 130 | for others, we need to have some way of determining whose 131 | code wins the race, or how to combine the different changes 132 | to the code. 133 | 134 | ===> "merge conflict problem" 135 | 136 | - Each individual working on a branch may not need to see 137 | the entire tree at once to do their work. 138 | 139 | ======> You need the list of changes up until your point in 140 | the tree 141 | 142 | ======> You only need a "local view" or local window into 143 | the tree, which contains a copy of some of the changes. 144 | 145 | When you git clone, or "git stash, git pull" the lecture notes, 146 | you are creating an instance of this philosophy - essentially, 147 | you're working on your own little branch of the tree. 148 | 149 | In that context: 150 | 151 | First two lines of git status: where I am in the tree 152 | 153 | Changes: changes I've made on my local copy of the tree. 154 | 155 | - git log --oneline 156 | 157 | List of changes, one per line 158 | 159 | - git branch -v 160 | 161 | More information about the branch you're on 162 | 163 | - git diff 164 | 165 | Most useful second to git status -- what changes you've made 166 | to the code on your local copy. 167 | 168 | What about help commands? Try: 169 | - man git 170 | - git status --help 171 | - git log --help 172 | - git add --help 173 | - git commit --help 174 | 175 | BTW: log, add, commit -- "subcommands" 176 | You can think of them like a special type of positional argument 177 | 178 | Finally, doing stuff: 179 | 180 | For getting others' changes: 181 | - git pull 182 | 183 | Pull the latest code from the "published" version of the branch 184 | 185 | - git push 186 | 187 | Attempt to push your code to the "published" version of the branch 188 | 189 | If one branch is behind the other, great! we can just grow that 190 | branch to make it equal to the other one 191 | 192 | We're gonna have a problem if they are on diverging or two separate 193 | branches, git does a "merge" 194 | and try to shove the two things together. 195 | 196 | Sometimes it works, sometimes it doesn't and you have to debug. 197 | 198 | Two things could fail: 199 | 200 | 1. Git could not know how to merge the changes 201 | (usually happens if both people modified) 202 | Results in: "Merge conflict" 203 | 204 | 2. Merge succeeds, but the code breaks 205 | 206 | Two conflicting features, one feature breaks someone else's 207 | unit test, etc. 208 | 209 | (Related commands -- not as worried about:) 210 | - git fetch 211 | - git checkout 212 | 213 | For sharing/publishing your own changes 214 | (a common sequence of three to run): 215 | - git add . 216 | 217 | After a git add, I usually do a: 218 | git status 219 | git diff --staged 220 | 221 | AND: 222 | Run the code again just to make sure everything looks good 223 | 224 | - git commit -m "Commit message" 225 | 226 | Modify what you just did: 227 | git commit --amend 228 | 229 | Then I would do a git status again 230 | 231 | **Most important:** 232 | To publish your code to the main branch 233 | 234 | Magic sequence of 3 commands: 235 | git add, git commit, git push. 236 | 237 | This is a multi-step process because git wants you to 238 | be deliberate about all changes. 239 | 240 | git add = what changes you want 241 | 242 | git commit = why? 243 | 244 | git push = publish it. 
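(Aside: like the other shell commands in this lecture, the three-step sequence can also be scripted from Python with subprocess. A rough sketch, not the course's code -- it assumes `import subprocess` as in the earlier parts, and check=True just stops the sequence if any step fails:

    def publish_changes(message):
        # git add = what changes you want
        subprocess.run(["git", "add", "."], check=True)
        # git commit = why
        subprocess.run(["git", "commit", "-m", message], check=True)
        # git push = publish it
        subprocess.run(["git", "push"], check=True)

In practice you would still want to run git status / git diff --staged and re-run your code in between, as described above.)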
245 | 246 | git add, git commit = your local branch only 247 | 248 | git push = shares it publicly. 249 | 250 | ========================================= 251 | 252 | ===== Other miscellaneous things ===== 253 | 254 | i) Text editors in the shell 255 | 256 | Running `git commit` without the `-m` option opens up a text 257 | editor! 258 | 259 | Vim: dreaded program for many new command line users 260 | 261 | Get stuck -- don't know how to quit vim! 262 | 263 | :q + enter 264 | 265 | The most "accessible" of these is probably nano. 266 | 267 | Sometimes files open by default in vim and you have to 268 | know how to close the file. 269 | 270 | Use nano (most accessible), don't use vim and emacs. 271 | """ 272 | 273 | def edit_file(file): 274 | subprocess.run(["nano", file]) 275 | 276 | # Let's edit the lecture file and add something here. 277 | # print("Hello, world!") 278 | 279 | """ 280 | Text editors get opened when you run git commands 281 | like git commit without a message. 282 | 283 | ii) Variations of git diff 284 | 285 | git diff 286 | git diff --word-diff (word level diff) 287 | git diff --word-diff-regex=. (character level diff) 288 | 289 | git diff --staged -- after you do a git add, shows diff from green 290 | changes 291 | 292 | iii) Other git commands (selected most useful): 293 | 294 | - git merge -- merge together different conflicting versions of the code 295 | - git rebase 296 | - git rebase -i -- often useful for modifying commit messages 297 | - git branch -- create a new branch, often useful for developing new features. 298 | 299 | Just like before, we can also run these commands in Python. 300 | """ 301 | 302 | def git_status(): 303 | # TODO 304 | raise NotImplementedError 305 | -------------------------------------------------------------------------------- /lecture4/parts/2-parallelism.py: -------------------------------------------------------------------------------- 1 | """ 2 | Part 2: Definitions and Parallel/Concurrent Distinction 3 | 4 | Parallel computing: speeding up our pipeline by doing more than 5 | one thing at a time! 6 | 7 | === Getting Started === 8 | 9 | Parallel, concurrent, and distributed computing 10 | 11 | They're different, but often get (incorrectly) used synonymously! 12 | 13 | What is the difference between the three? 14 | 15 | Let's make a toy example. 16 | 17 | We will forgo Pandas for this lecture 18 | and just work in plain Python for the sake of clarity, 19 | but all of this applies to a data processing pipeline written using 20 | vectorized operations as well (as we will see soon). 21 | 22 | Baseline: 23 | (sequential pipeline) 24 | 25 | Sequential means "not parallel", i.e. one thing happening at a time, 26 | run on a single worker. 27 | 28 | (Rule 0 of parallel computing: 29 | Any time we're measuring parallelism we want to start 30 | with a sequential version of the code!) 31 | 32 | Scalability! But at what COST? 33 | https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf 34 | 35 | Q: can I think of sequential computing like a monolithic application, 36 | and parallel computing like microservices? 37 | 38 | That's related to the distibuted computing part - we'll talk more about 39 | the difference between parallel/distributed soon. 
40 | 41 | For now think of: 42 | 43 | Sequential baseline = 1 machine, only 1 CPU runs (1 worker) 44 | 45 | Parallel = multiple workers (machines or CPUs) 46 | """ 47 | 48 | def average_numbers(N): 49 | sum = 0 50 | count = 0 51 | for i in range(N): 52 | # Busy loop with some computation in it 53 | sum += i 54 | count += 1 55 | return sum / count 56 | 57 | # Uncomment to run: 58 | # N = 200_000_000 59 | # result = average_numbers(N) 60 | # print(f"Result: {result}") 61 | 62 | # baseline (Sequential performance) is at 9.07s 63 | 64 | # From the command line: 65 | # time python3 lecture.py 66 | 67 | # How are we doing on CPU usage: 68 | # Activity Monitor in MacOS (Window -> CPU Usage) 69 | 70 | """ 71 | Task gets moved from CPUs from time to time, making it a little difficult to see, 72 | but at any given time one CPU is being used to run our program. 73 | 74 | What if we want to do more than one thing at a time? 75 | 76 | Let's say we want to make our pipeline twice as fast. 77 | 78 | We're adding the numbers from 1 to N... so we could: 79 | 80 | - Have worker1 add up the first half of the numbers 81 | 82 | - Have worker2 add up the second half 83 | 84 | At the end, combine worker1's and worker2's results 85 | 86 | Our hope: we take about half the time to complete the computation. 87 | 88 | """ 89 | 90 | # ************************************** 91 | # ********** Ignore this part ********** 92 | # ************************************** 93 | 94 | # NOTE: Python has something called a global interpreter lock (GIL) 95 | # which often prevents code from running in parallel (via threads) 96 | # We are using this purely for illustration, but Python is generally 97 | # not a good fit for parallel and concurrent code. 98 | 99 | from multiprocessing import Process, freeze_support 100 | 101 | def run_in_parallel(*tasks): 102 | running_tasks = [Process(target=task) for task in tasks] 103 | for running_task in running_tasks: 104 | running_task.start() 105 | for running_task in running_tasks: 106 | result = running_task.join() 107 | 108 | # ************************************** 109 | # ************************************** 110 | # ************************************** 111 | 112 | def worker1(): 113 | sum = 0 114 | count = 0 115 | for i in range(N // 2): 116 | sum += i 117 | count += 1 118 | print(f"Worker 1 result: {sum} {count}") 119 | # return (sum, count) 120 | 121 | def worker2(): 122 | sum = 0 123 | count = 0 124 | for i in range(N // 2, N): 125 | sum += i 126 | count += 1 127 | print(f"Worker 2 result: {sum} {count}") 128 | # return (sum, count) 129 | 130 | def average_numbers_parallel(): 131 | results = run_in_parallel(worker1, worker2) 132 | print(f"Computation finished") 133 | 134 | # Uncomment to run 135 | # N = 200_000_000 136 | # if __name__ == '__main__': 137 | # freeze_support() # another boilerplate line to ignore 138 | # average_numbers_parallel() 139 | 140 | # time python3 lecture.py: 5.2s 141 | # CPU usage 142 | 143 | """ 144 | New result: roughly half the time! (Twice as fast) 145 | 146 | Not exactly twice as fast - why? 147 | 148 | One reason is because of the additional boilerplate required 149 | to run multiple workers and combine the results. 150 | 151 | We actually didn't combine the results! 152 | (We should add the results from worker1 and worker2) 153 | This would also add a small amount of overhead 154 | 155 | We've successfully achieved parallelism! 156 | We have two workers running at the same time. 157 | 158 | === What is parallelism? 
=== 159 | 160 | Imagine a conveyor belt, where our numbers are coming in on the belt... 161 | 162 | ASCII art: 163 | 164 | ==> | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | ==> 165 | ... ========================================== worker1 166 | worker2 167 | 168 | Our worker takes the items off the belt and 169 | adds them up as they come by. 170 | 171 | Worker: 172 | (sum, count) 173 | (0, 0) -> (1, 1) -> (3, 2) -> (6, 3) -> (10, 4) -> ... 174 | 175 | When is this parallel? 176 | 177 | There are multiple workers working at the same time. 178 | 179 | The workers could be working on the same conveyor belt 180 | or two different conveyor belts 181 | 182 | Worker1 and worker2 are working on separate conveyer belts! 183 | 184 | ==> | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | ==> 185 | ... ========================================== worker1 186 | 187 | ==> | | ... | 1000002 | 1000001 | 1000000 | ==> 188 | ... ========================================== worker2 189 | 190 | === What is concurrency? === 191 | 192 | Concurrency is when there are multiple tasks happening that might overlap or conflict. 193 | 194 | - If the workers are working on the same conveyer belt ... then operations might conflict 195 | 196 | - If the workers are working on different conveyer belts ... then operations won't conflict! 197 | 198 | ----- Where we ended for today ----- 199 | 200 | ----------------------------- 201 | 202 | Oct 29 203 | (Finishing up a few things) 204 | 205 | Recap: 206 | 207 | - Once our pipeline is working sequentially, we want to figure 208 | out how to **scale** to more data and more frequent updates 209 | 210 | - We talked about parallelism: multiple workers working at once 211 | 212 | - Conveyer belt analogy: 213 | parallel = multiple workers working at the samt eim. 214 | 215 | === Related definitions and sneak peak === 216 | 217 | Difference between parallelism & concurrency & distribution: 218 | 219 | - Parallelism: multiple workers working at the same time 220 | - Concurrency: multiple workers accessing the same data (even at different times) by performing potentially conflicting operations 221 | 222 | For now: think about it as multiple workers modifying or moving items 223 | on the same conveyer belt. 224 | 225 | - Distribution: workers operating over multiple physical devices which are independently controlled and may fail independently. 226 | 227 | Good analogy: 228 | Distributed computing is like multiple warehouses, each with its own 229 | workers and conveyer belts. 230 | 231 | For the purposes of this class: if code is running on multiple devices, 232 | it is distributed; otherwise it's not. 233 | 234 | In the conveyor belt analogy, this means... 235 | 236 | - Parallelism can exist without concurrency! 237 | (How?) 238 | 239 | Multiple belts, each worker has its own belt 240 | 241 | - Concurrency can exist without parallelism! 242 | (How?) 243 | 244 | Operations can conflict even if the two workers 245 | are not working at the same time! 246 | 247 | Worker1 takes an item off the belt 248 | Worker1 makes some modifications and puts it back 249 | Then Worker 1 goes on break 250 | Worker2 comes to the belt 251 | Doesn't realize that worker 1 was doing anything here 252 | Takes the item off the belt 253 | Worker 2 tries to make the same modifications. 254 | 255 | We have a conflict! 256 | 257 | In fact, this is what happens in Python if you 258 | use threads. 259 | (Threads vs. processes) 260 | 261 | Multiple workers working concurrently, only one 262 | active at a given time. 
263 | 264 | - Both parallelism and concurrency can exist with/without distribution! 265 | 266 | Are the different conveyer belts operated by different computers? 267 | Do they function and fail independently? 268 | 269 | Are the different workers running on different computers? 270 | Do they function and fail independently? 271 | """ 272 | -------------------------------------------------------------------------------- /lecture6/parts/2-microbatching.py: -------------------------------------------------------------------------------- 1 | """ 2 | Part 2: 3 | Spark Streaming and Microbatching 4 | 5 | === Poll === 6 | 7 | Which of the following are most likely application scenarios for which latency is a primary concern? 8 | 9 | . 10 | . 11 | . 12 | 13 | https://forms.gle/Le4NZTDEujzcqmg47 14 | 15 | === Spark Streaming === 16 | 17 | In particular: Structured Streaming 18 | Structured = using relational and SQL abstractions 19 | Structured Streaming syntax is similar (often almost identical) to Spark DataFrames 20 | 21 | There's an analogy going on here! 22 | Batch processing application using DataFrames <---> Streaming application using Structured Streaming 23 | 24 | Let's see our streaming example in more detail. 25 | 26 | (We demoed this example last time) 27 | """ 28 | 29 | # Old imports 30 | import pyspark 31 | from pyspark.sql import SparkSession 32 | spark = SparkSession.builder.appName("OrderProcessing").getOrCreate() 33 | sc = spark.sparkContext 34 | 35 | # New imports 36 | from pyspark.sql.functions import array_repeat, from_json, col, explode 37 | from pyspark.sql.types import StructType, StructField, StringType, IntegerType 38 | 39 | # Define the schema of the incoming JSON data 40 | schema = StructType([ 41 | StructField("order_number", IntegerType()), 42 | StructField("item", StringType()), 43 | StructField("timestamp", StringType()), 44 | StructField("qty", IntegerType()) 45 | ]) 46 | 47 | def process_orders_stream(order_stream): 48 | """ 49 | important: 50 | order_stream: now a stream handle, instead of a list of plain data! 51 | """ 52 | 53 | # Parse the JSON data 54 | df0 = order_stream.select(from_json(col("value").cast("string"), schema).alias("parsed_value")) 55 | 56 | # First cut: just return the parsed orders 57 | # return df0 58 | 59 | ### Full computation 60 | 61 | # df0 is all bunched up in a single column, can we expand it? 62 | 63 | # Yes: select the data we want 64 | df1 = df0.select( 65 | col("parsed_value.order_number").alias("order_number"), 66 | col("parsed_value.item").alias("item"), 67 | col("parsed_value.timestamp").alias("timestamp"), 68 | col("parsed_value.qty").alias("qty") 69 | ) 70 | 71 | # (Notice this looks very similar to SQL! 72 | # Structured Streams uses an almost identical API to Spark DataFrames.) 73 | 74 | # return df1 75 | 76 | # Create a new field which is a list [item, item, ...] for each qty 77 | df2 = df1.withColumn("order_numbers", array_repeat(col("order_number"), col("qty"))) 78 | 79 | # Explode the list into separate rows 80 | df3 = df2.select(explode(col("order_numbers")).alias("order_number"), col("item"), col("timestamp")) 81 | 82 | return df3 83 | 84 | """ 85 | We need to decide where to get our input! Spark supports getting input from: 86 | - Apache Kafka (good production option) 87 | - Files on a distributed file system (HDFS (Hadoop File system), S3) 88 | - A network socket (basically a connection to some other worker process or network service) 89 | 90 | We're looking for a toy example, so let's use a plain socket. 
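(Aside, not part of the class demo: any program that listens on localhost:9999 and writes one JSON object per line can act as the source here, since Spark's socket reader connects to that port as a client. A rough Python sketch of such a sender, reusing the field names from the schema above -- the example values are made up, and you would start it in a separate terminal before the Spark job:

    import socket, json, time

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("localhost", 9999))
    server.listen(1)
    conn, _addr = server.accept()   # wait for Spark to connect
    for i in range(1, 6):
        order = {"order_number": i, "item": "apple",
                 "timestamp": "2025-01-01 00:00:00", "qty": 2}
        conn.sendall((json.dumps(order) + "\n").encode())
        time.sleep(1)               # one order per second
    conn.close()

)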
91 | 92 | This will require us to open up another terminal and run the following command: 93 | 94 | nc -lk 9999 95 | 96 | """ 97 | 98 | # We need an input (source) 99 | 100 | # (Uncomment to run) 101 | # Set up the input stream using a local network socket 102 | # One of the ways to get input - from a network socket 103 | order_stream = spark.readStream.format("socket") \ 104 | .option("host", "localhost") \ 105 | .option("port", 9999) \ 106 | .load() 107 | 108 | # Call the function 109 | out_stream = process_orders_stream(order_stream) 110 | 111 | # We need an output (sink) 112 | 113 | # Print the output stream and run the computation 114 | out = out_stream.writeStream.outputMode("append").format("console").start() 115 | 116 | # Run the pipeline 117 | 118 | # Run until the connection closes. 119 | out.awaitTermination() 120 | 121 | """ 122 | There are actually two streaming APIs in Spark, 123 | the old DStream API and the newer Structured Streaming API. 124 | 125 | Above uses the Structured Streaming API (which is more modern and flexible 126 | and solves some problems with DStreams, I also personally found it better 127 | to work with on my machine.) 128 | 129 | === Q + A === 130 | 131 | Q: How is the syntax different from a batch pipeline? 132 | 133 | A: It's nearly identical to DataFrames, except the input/output 134 | input: pass in a stream instead of a dataframe 135 | output: we call .writeStream.outputMode(...) 136 | 137 | Q: How is the behavior different from a batch pipeline? 138 | 139 | A: 140 | It groups events into "microbatches", and processes each microbatch 141 | in real time (aiming to achieve low latency) 142 | 143 | You can also set the microbatch duration 144 | (1 batch every 1 second, 1 batch every 0.5 seconds, 1 batch every 0.25 seconds, ...) 145 | depending on your application needs 146 | 147 | Q: How does Spark determine when to "run" the pipeline? 148 | 149 | A: Calling .start() 150 | 151 | Q: How do we know when the pipeline is finished? 152 | 153 | A: 154 | With .awaitTermination() we just wait for the user to terminate the connection 155 | (ctrl-C) 156 | You can also configure different options, for example terminate after inactivity 157 | for 1 hour, etc. 158 | 159 | Upshot: 160 | input/output configuration is different, 161 | actual application logic of the pipline is the same. 162 | 163 | === Microbatching === 164 | 165 | To process a streaming pipeline, Spark groups several recent orders into something 166 | called a "microbatch", and processes that all at once. 167 | 168 | (Side note: this isn't how all streaming frameworks work, but this is 169 | the main idea behind how Spark Streaming works.) 170 | 171 | Why does Spark do this? 172 | 173 | - If you process every item one at a time, you get minimum latency! 174 | 175 | Every user will get their order processed right away. 176 | 177 | But, throughput will suck. 
178 | 179 | - We never benefit from parallelism (because we're never doing more than one order at the same time) 180 | 181 | - We never benefit from techniques like vectorization (we can't put multiple orders into vector operations on a CPU or GPU) 182 | 183 | (Recall: turning operatinos into vector or matrix multiplications is often faster and 184 | I can only do that if I have many rows that look the same) 185 | 186 | So: by grouping items into these "microbatches", Spark hopes to still provide good latency (by processing microbatches frequently, e.g., every half a second) but at the same time benefit from 187 | throughput optimizations, e.g., parallelism and vectorization. 188 | 189 | That leads to an interesting question: how do we determine the microbatches? 190 | 191 | - There is a tension here; the smaller the batches, the better the latency; 192 | but the worse the throughput! 193 | 194 | Possible ways? 195 | 196 | 1. Wait 0.5 seconds, collect all items during that second and group it into a batch 197 | 2. Set a limit on the number of items per batch (e.g. 100), once you get 100 198 | items, close the batch 199 | 3. Set a limit on the number of items, OR wait for a 2 second timeout 200 | 201 | Some observations: 202 | 203 | - Suggestion 1 will always result in latency < 0.5 s, but batches may be small 204 | 205 | - Suggestion 2 has a serious problem, on a non-busy day, imagine there's only 1 Amazon 206 | user. 207 | 208 | Amazon user submits their order -> we wait for the other 99 orders to come in (which 209 | never happens, or takes hours) 210 | 211 | Our one user will never get their results. 212 | 213 | - Suggestion 3 fixes the problem with suggestion 2, by imposing a timeout. 214 | 215 | Suggestion 3 tries to achieve large batch sizes, but caps out at a certain maximum; 216 | latency will always be at most 2 seconds (often smaller if there are many orders). 217 | 218 | === Another way of thinking about this === 219 | 220 | It comes down to a question of "time" and how to measure progress. 221 | 222 | All distributed systems measure progress by enforcing some notion of "time" 223 | 224 | Suggestion #1 measures time in terms of the operating system clock 225 | (e.g., time.time()) 226 | 227 | Suggestions #2 measures time in terms of how many items arrive in the pipeline, 228 | and uses that to decide when to move forward. 229 | 230 | This is related to something called "logical time", and we will cover it in 231 | the following part 3. 232 | 233 | Both of these suggestions become more interesting/complicated when you consider 234 | a distributed application, where you might have (say) 5-10 different machines 235 | taking in input requests, and all of them have their own notion of time that 236 | they are measuring and enforcing. 237 | 238 | This turns out to be very important, so it is the next thing we will 239 | cover in the context of streaming pipelines. 240 | 241 | It is also important to how we measure latency and can be important 242 | to the actual behavior of our pipeline. 243 | 244 | That's a bit about why time is important, and we'll get into 245 | different notions of time in this context next time. 
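For reference: in Spark Structured Streaming, suggestion 1 roughly corresponds to the processing-time trigger. By default Spark starts the next microbatch as soon as the previous one finishes; here is a sketch of how you could pin the interval on the writeStream from earlier in this file (the trigger option is a real API, the 1-second choice is just an example):

    out = out_stream.writeStream \
        .outputMode("append") \
        .format("console") \
        .trigger(processingTime="1 second") \
        .start()

Shrinking the interval improves latency at the cost of throughput -- exactly the tension described above.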
246 | """ 247 | -------------------------------------------------------------------------------- /lecture1/extras/failures.py: -------------------------------------------------------------------------------- 1 | 2 | """ 3 | === Failures and risks === 4 | 5 | Failures and risks are problems 6 | which might invalidate our pipeline (wrong results) 7 | or cause it to misbehave (crash or worse). 8 | 9 | What could go wrong in our pipeline above? 10 | Let's go through each stage at a time: 11 | 12 | 1. Input stage 13 | 14 | What could go wrong here? 15 | 16 | - Malformed data and type mismatches 17 | - Wrong data 18 | - Missing data 19 | - Private data 20 | """ 21 | 22 | """ 23 | Problem: input data could be malformed 24 | """ 25 | 26 | # Exercise 4: Insert a syntax error by adding an extra comma into the CSV file. What happens? 27 | 28 | # A: that row gets shifted over by one 29 | # All data in each column is now misaligned; 30 | # some columns contain a mix of year and life expectancy data. 31 | 32 | # Exercise 5: Insert a row with a value that is not a number. What happens? 33 | 34 | # A: changing the year on just one entry to a string, 35 | # the "Year" field turned into a string field. 36 | 37 | # Reminder about dataframes: every column has a uniform 38 | # type. (Integer, string, real/float value, etc.) 39 | 40 | # Take home point: even a single mislabeled or 41 | # malformed row can mess up the entire DataFrame 42 | 43 | # Solutions? 44 | 45 | # - be careful about input data (get your data from 46 | # a good source and ensure that it's well formed) 47 | 48 | # - validation: write and run unit tests to check 49 | # check that the input data has the properties we 50 | # want. 51 | 52 | # e.g.: write a test_year function that goes through 53 | # the year column and checks that we have integers. 54 | 55 | """ 56 | Problem: input data could be wrong 57 | """ 58 | 59 | # Example: 60 | # Code with correct input data: 61 | # avg. 61.61799192059744 62 | # Code with incorrect input data: 63 | # avg.: 48242.7791579047 64 | 65 | # Exercise 6: Delete a country or add a new country. What happens? 66 | 67 | # Deleting a country: 68 | # 61.67449487559832 instead of 61.61799192059744 69 | # (very slightly different) 70 | 71 | # Solutions? 72 | 73 | # Put extra effort into validating your data! 74 | 75 | """ 76 | Discussion questions: 77 | - If we download multiple versions of this data 78 | from different sources (for example, from Wikipedia, from GitHub, 79 | etc.) are they likely to have the same countries? Why or why not? 80 | 81 | - What can be done to help validate our data has the right set 82 | of countries? 83 | 84 | - How might choosing a different set of countries affect the 85 | app we are using? 86 | 87 | Recap from today: 88 | 89 | - Python main functions (ways to run code: python3 lecture.py (main function), python3 -i lecture.py (main function + interactive), pytest lecture.py to run unit tests) 90 | - what can go wrong in a pipeline? 91 | - input data issues & validation. 92 | 93 | =============================================================== 94 | 95 | === Poll === 96 | 97 | 1. Which of the following are common problems with input data that you might encounter in the real world and in industry? 98 | 99 | - (poll options cut) 100 | 101 | 2. How many countries are there in the world? 
102 | 103 | Common answers: 104 | 105 | - 193: UN Members 106 | - 195: UN Members + Observers 107 | - 197: UN Members + Observers + Widely recognized 108 | - 200-300something: if including all partially recognized countries or territories. 109 | 110 | As we saw before, our dataset happens to have 261. 111 | - e.g.: our dataset did not include all countries with some form of 112 | limited recognition, e.g. Somaliland 113 | but it would include the 193, 195, or 197 above. 114 | 115 | Further resources: 116 | 117 | - https://en.wikipedia.org/wiki/List_of_states_with_limited_recognition 118 | 119 | - CGP Grey: How many countries are there? https://www.youtube.com/watch?v=4AivEQmfPpk 120 | 121 | In any dataset in the real world, it is common for there to be some 122 | subjective inclusion criteria or measurement choices. 123 | 124 | """ 125 | 126 | """ 127 | 2. Processing stage 128 | 129 | What could go wrong here? 130 | 131 | - Software bugs -- pipeline is not correct (gives the wrong answer) 132 | - Performance bugs -- pipeline is correct but is slow 133 | - Nondeterminism -- pipelines to produce different answers on different runs 134 | 135 | This is actually very common in the data processing world! 136 | - e.g.: your pipeline may be a stream of data and every time you run 137 | it you are running on a different snapshot, or "window" of the data 138 | - e.g.: your pipeline measures something in real time, such as timing 139 | - a calculation that requires a random subset of the data (e.g., 140 | statistical random sample) 141 | - Neural network? 142 | - Running a neural network or large language model with different versions 143 | (e.g., different version of GPT every time you call the GPT API) 144 | - ML model with stochastic components 145 | - Due to parallel and distributed computing 146 | If you decide to parallelize your pipeline, and you do it incorrectly, 147 | depending on the order in which different operations complete you 148 | might get a different answer. 149 | """ 150 | 151 | """ 152 | 3. Output stage 153 | 154 | What could go wrong here? 155 | 156 | - System errors and exceptions 157 | - Output formatting 158 | - Readability 159 | - Usability 160 | 161 | Often output might be: saving to a file or saving to a database, or even 162 | saving data to a cloud framework or cloud provider; 163 | and all of three of these cases could fail. 164 | e.g. error: you don't have permissions to the file; file already exists; 165 | not enough memory on the machine/cloud instance; etc. 166 | 167 | Summary: whenever saving output, there is the possibility that the save operation 168 | might fail 169 | 170 | Output formatting: make sure to use a good library! 171 | Things like Pandas will help here -- formatting requirements already solved 172 | 173 | When displaying output directly to the user: 174 | - Are you displaying the most relevant information? 175 | - Are you displaying too much information? 176 | - Are you displaying too little information? 177 | - Are you displaying confusing/incomprehensible information? 178 | 179 | e.g.: displaying 10 items we might have a different strategy than if 180 | we want to display 10,000 181 | 182 | example: review dataframe display function 183 | - dataframe: display header row, first 5 rows, last 5 rows 184 | - shrink the window size ==> fields get replaced by "..." 185 | 186 | There are some exercises at the bottom of the file. 187 | """ 188 | 189 | """ 190 | === Poll === 191 | 192 | 1. 
Which stage do you think is likely to be the most computationally intensive part of a data processing pipeline? 193 | 194 | 2. Which stage do you think is likely to present the biggest opportunity for failure cases, including crashes, ethical concerns or bias, or unexpected/wrong/faulty data? 195 | 196 | =============================================================== 197 | """ 198 | 199 | """ 200 | === Rewriting our pipeline one more time === 201 | 202 | Before we continue, let's rewrite our pipeline one last time as a function 203 | (I will explain why in a moment -- this is so we can easily measure its performance). 204 | """ 205 | 206 | """ 207 | === Additional exercises (skip depending on time) === 208 | """ 209 | 210 | """ 211 | Problem: input data could be missing 212 | """ 213 | 214 | # Exercise 7: Insert a row with a missing value. What happens? 215 | 216 | # Solutions? 217 | 218 | """ 219 | Problem: input data could be private 220 | """ 221 | 222 | # Exercise 8: Insert private data into the CSV file. What happens? 223 | 224 | # Solutions? 225 | 226 | """ 227 | Problem: software bugs 228 | """ 229 | 230 | # Exercise 9: Introduce a software bug 231 | 232 | # Solutions? 233 | 234 | """ 235 | Problem: performance bugs 236 | """ 237 | 238 | # Exercise 10: Introduce a performance bug 239 | 240 | # Solutions? 241 | 242 | """ 243 | Problem: order-dependent and non-deterministic behavior 244 | """ 245 | 246 | # Exercise 11: Introduce order-dependent behavior into the pipeline 247 | 248 | # Exercise 12: Introduce non-deterministic behavior into the pipeline 249 | 250 | # Solutions? 251 | 252 | """ 253 | Problem: output errors and exceptions 254 | """ 255 | 256 | # Exercise 13: Save the output to a file that already exists. What happens? 257 | 258 | # Exercise 14: Call the program from a different working directory (CWD) 259 | # (Note: CWD) 260 | 261 | # Exercise 15: Save the output to a file that is read-only. What happens? 262 | 263 | # Exercise 16: Save the output to a file outside the current directory. What happens? 264 | 265 | # (other issues: symlinks, read permissions, busy/conflicting writes, etc.) 266 | 267 | # Solutions? 268 | 269 | """ 270 | Problem: output formatting 271 | 272 | Applications must agree on a common format for data exchange. 273 | """ 274 | 275 | # Exercise 17: save data to a CSV file with a wrong delimiter 276 | 277 | # Exercise 18: save data to a CSV file without escaping commas 278 | 279 | # Solutions? 280 | 281 | """ 282 | Problem: readability and usability concerns -- 283 | too much information, too little information, unhelpful information 284 | """ 285 | 286 | # Exercise 19: Provide too much information as output 287 | 288 | # Exercise 20: Provide too little information as output 289 | 290 | # Exercise 21: Provide unhelpful information as output 291 | 292 | # Solutions? 293 | -------------------------------------------------------------------------------- /lecture2/parts/4-help-and-doing-stuff.py: -------------------------------------------------------------------------------- 1 | """ 2 | Monday, October 20 3 | 4 | Part 4: Help, Doing Stuff, anatomy of shell commands 5 | 6 | ----- 7 | 8 | Continuing with the shell! 9 | 10 | Showing two more commands before the discussion question: 11 | 12 | - cat : 13 | Prints out the contents of a file at 14 | 15 | - less : 16 | Show "less" of the contents of the file at 17 | u to go up, d to go down, q to quit 18 | 19 | - open : 20 | Open the file in your default program for that file. 
21 | 22 | (Three ways to view/open a file) 23 | 24 | Discussion Question & Poll: 25 | Which of the following are "informational" commands? 26 | 27 | 28 | https://forms.gle/XkbkUL2QxsLz6dLq7 29 | 30 | ===== 31 | 32 | Informational commands (finishing up) 33 | 34 | Information about the current state of our shell includes: 35 | - what folder we are in 36 | - what environment variables (local variables) are set (and to what values) 37 | - other system information and system data 38 | - file contents etc. 39 | 40 | A few other commands: 41 | 42 | - ls : 43 | 44 | ls .. 45 | ls ../lecture2 46 | 47 | Examples in Python: 48 | """ 49 | 50 | def cat_1(): 51 | with open("lecture.py") as f: 52 | print(f.read()) 53 | 54 | def cat_2(): 55 | subprocess.run(["cat", "lecture.py"]) 56 | 57 | def less(): 58 | subprocess.run(["less", "lecture.py"]) 59 | 60 | # cat_1() 61 | # cat_2() 62 | 63 | # less() 64 | 65 | """ 66 | This concludes the first part on "looking around" 67 | 68 | === Getting help === 69 | 70 | Recall the three-part model: Looking around, getting help, doing something 71 | 72 | Another thing that is fundamentally important -- and perhaps even more important 73 | than the last thing -- is getting help if you *don't* know what to do. 74 | 75 | One of the following 3 things usually works: 76 | - `man cmd` or `cmd --help` or `cmd -h` 77 | 78 | Examples: 79 | - ls: has a man entry, but no --help or -h 80 | - python3: has all three options 81 | 82 | Some ways to get help (examples running these from Python): 83 | """ 84 | 85 | def get_help_for_command(cmd): 86 | subprocess.run([cmd, "--help"]) 87 | subprocess.run([cmd, "-h"]) 88 | subprocess.run(["man", cmd]) 89 | 90 | # get_help_for_command("python3") 91 | 92 | """ 93 | Other ways to get help: 94 | 95 | Using Google/StackOverflow/AI can also be really useful for a number of reasons! 96 | 97 | - A more recent development: 98 | AI tools in the shell: e.g. https://github.com/ibigio/shell-ai 99 | (use at your own risk) 100 | 101 | Example: q make a new git branch -> returns the right git syntax 102 | 103 | to determine the right command to run for what you want to do. 104 | 105 | Important caveat: you need to know what it is you want to do first! 106 | """ 107 | 108 | # Example: 109 | # how to find all files matching a name unix? 110 | # https://www.google.com/search?client=firefox-b-1-d&q=how+to+find+all+files+matching+a+name+unix 111 | # https://stackoverflow.com/questions/3786606/find-all-files-matching-name-on-linux-system-and-search-with-them-for-text 112 | # find ../lecture1 -type f -name lecture.py -exec grep -l "=== Poll ===" {} + 113 | 114 | """ 115 | Some observations: 116 | Using AI doesn't obliviate the need to understand things ourselves. 117 | - we still needed to know how to modify the command for your own purposes 118 | - we still needed to know the platform we are on (Unix) 119 | - (for the AI tool) you still need to figure out how to install it (: 120 | + as some of you have noticed (especially on Windows), installing some software dev tools 121 | can seem like even more work than using/understanding the program itself. 122 | 123 | === Doing stuff === 124 | 125 | Once we know how to "look around", and how to "get help", 126 | we can make a plan for what to do. 127 | 128 | The same advice applies to all commands: knowing how to "modify" the current 129 | state relevant to your command is often the second step to get a grip on how 130 | the command works. 
131 | (In the context of a Python library such as Pandas: 132 | python3 -i to interactively "look around" 133 | the values of variables, the online documentation to see the 134 | different functions available, actually write code to do what 135 | you want.) 136 | 137 | (And, once again, this is also exactly what we would do in a text-based adventure :)) 138 | 139 | So what should we do? 140 | We need a way to move around and modify stuff: 141 | 142 | - cd -- change directory 143 | This modifies the state of the system by changing the current 144 | working directory 145 | 146 | - mkdir -- make a new (empty) directory in the current locaiton 147 | (current working directory) 148 | 149 | - cp -- copy a file from one place to another 150 | 151 | (demo: copy folder.txt to ../folder.txt) 152 | 153 | (I follow this pattern a lot -- information first, then do something, then information again) 154 | 155 | - touch -- make a new file 156 | 157 | Create a new empty file at 158 | 159 | (Another example - creating a new Python module) 160 | - mkdir subfolder 161 | - cd subfolder 162 | - touch mod.py 163 | - open mod.py 164 | 165 | - mv : 166 | Move a file from one path to another, or rename it 167 | from one file name to another. 168 | 169 | Examples of how to accomplish similar purposes in Python: 170 | """ 171 | 172 | def cd(dir): 173 | # Sometimes necessary to change the directory from which your 174 | # script was called 175 | os.chdir(dir) 176 | 177 | def touch(file): 178 | with open(file, 'w') as fh: 179 | fh.write("\n") 180 | 181 | # touch("mod-2.py") 182 | 183 | """ 184 | === Anatomy of a shell command === 185 | 186 | Commands are given arguments, like this: 187 | 188 | cmd - 189 | cmd -- 190 | 191 | Some arguments don't have values: 192 | 193 | cmd - 194 | 195 | You can chain together any number of arguments: 196 | 197 | cmd - - ... 198 | 199 | Example: 200 | git --version to get the version of git 201 | git -v : equivalent to the above 202 | 203 | (Informational commands for git) 204 | 205 | This is typical: usually we use a single dash + a single letter 206 | as a shortcut for a double dash plus a long argument name. 207 | 208 | We have seen some of these already. 209 | 210 | Commands also have "positional" arguments, which don't use - or -- flags 211 | 212 | - cd 213 | 214 | - cp 215 | 216 | (More examples in Python:) 217 | """ 218 | 219 | def run_git_version(): 220 | # Both of these are equivalent 221 | subprocess.run(["git", "--version"]) 222 | subprocess.run(["git", "-v"]) 223 | 224 | # run_git_version() 225 | 226 | def run_python3_file_interactive(file): 227 | subprocess.run(["python3", "-i", file]) 228 | 229 | # run_python3_file_interactive("subfolder/mod.py") 230 | 231 | """ 232 | === I/O & Composing Shell Commands === 233 | 234 | What about I/O? 235 | Remember that one of the primary reasons for the shell's existence is to 236 | "glue" different programs together. What does that mean? 237 | 238 | Selected list of important operators 239 | (also called shell combinators): 240 | - |, ||, &&, >, >>, <, << 241 | 242 | Most useful: 243 | - Operator > 244 | Ends the output into a file. 
245 | (This is called redirection) 246 | 247 | - Operator >> 248 | Instead of replacing the file, append new content to the end of it 249 | 250 | - || and && 251 | Behave like "or" and "and" in regular programs 252 | Useful for error handling 253 | 254 | cmd1 || cmd2 -- do cmd1, if it fails, do command 2 255 | cmd1 && cmd2 -- do cmd1, if it succeeds, do command 2 256 | 257 | These are "shortcircuiting" boolean operations, 258 | just as in most programming languages, but based 259 | on whether the command succeeds or fails. 260 | 261 | Examples: 262 | python3 lecture.py || echo "Hello" 263 | python3 lecture.py && echo "Hello" 264 | 265 | ===== Skip the following for time ===== 266 | 267 | - | 268 | Chains together two commands 269 | 270 | Exercises: 271 | 272 | - cat followed by ls 273 | 274 | Fixed example from class: cat folder.txt | xargs ls 275 | 276 | Better example (more common): 277 | Using "grep" to search for a particular pattern 278 | 279 | Example, find all polls in lecture 1: 280 | 281 | cat ../lecture1/lecture.py | grep "forms.gle" 282 | 283 | Find all packages installed with conda that contain the word "data": 284 | 285 | conda list | grep "data" 286 | 287 | Output: 288 | 289 | astropy-iers-data 0.2024.6.3.0.31.14 py312hca03da5_0 290 | datashader 0.16.2 py312hca03da5_0 291 | importlib-metadata 7.0.1 py312hca03da5_0 292 | python-tzdata 2023.3 pyhd3eb1b0_0 293 | stack_data 0.2.0 pyhd3eb1b0_0 294 | tzdata 2024a h04d1e81_0 295 | unicodedata2 15.1.0 py312h80987f9_0 296 | 297 | - ls followed by cat 298 | (equivalent to just ls) 299 | - cat followed by cd 300 | (using xargs) 301 | - ls, save the results to a file 302 | (using >) 303 | - python3, save the results to a file 304 | (using >) 305 | - (Hard) cat followed by cd into the first directory of interest 306 | 307 | Recap: 308 | 309 | Help commands: see a command usage & options 310 | 311 | Doing stuff commands: 312 | various ways of creating files, moving files, 313 | copying files, etc. 314 | 315 | Anatomy of commands: 316 | cmd ... or 317 | cmd - - etc. 318 | 319 | We saw various ways of combining and composing 320 | different commands, which can be used for 321 | advanced shell programming to write arbitrary 322 | scripts in the shell. 323 | 324 | ****** Where we ended for today ****** 325 | """ 326 | -------------------------------------------------------------------------------- /lecture5/parts/1-RDDs.py: -------------------------------------------------------------------------------- 1 | """ 2 | Lecture 5: Distributed Pipelines 3 | 4 | Part 1: Introduction to Spark: Scalable collection types and RDDs 5 | 6 | === Poll === 7 | 8 | Speedup through parallelism alone (vertical scaling) is most significantly limited by... 9 | (Select all that apply) 10 | 11 | 1. The number of lines in the Python source code 12 | 2. The version of the operating system (e.g., MacOS Sonoma) 13 | 3. The number of CPU cores on the machine 14 | 4. The number of wire connections on the computer's motherboard 15 | 5. The amount of RAM (memory) and disk space (storage) available 16 | 17 | https://forms.gle/LUsqdy7YYKy7JFVH6 18 | 19 | === Apache Spark (PySpark) === 20 | 21 | In this lecture, we will use Apache Spark (PySpark). 22 | 23 | Spark is a parallel and distributed data processing framework. 24 | 25 | (Note: Spark also has APIs in several other languages, most typically 26 | Scala and Java. The Python version aligns best with the sorts of 27 | code we have been writing so far and is generally quite accessible.) 
28 | 29 | Documentation: 30 | https://spark.apache.org/docs/latest/api/python/index.html 31 | 32 | To test whether you have PySpark installed successfully, try running 33 | the lecture now: 34 | 35 | python3 1-RDDs.py 36 | """ 37 | 38 | # Test whether import works 39 | import pyspark 40 | 41 | # All spark code generally starts with the following setup code: 42 | # (Boiler plate code - ignore for now) 43 | from pyspark.sql import SparkSession 44 | spark = SparkSession.builder.appName("SparkExample").getOrCreate() 45 | sc = spark.sparkContext 46 | 47 | """ 48 | Motivation: last lecture (over the last couple of weeks) we saw that: 49 | 50 | - Parallelism can exist in many forms (hard to identify and exploit!) 51 | 52 | - Concurrency can lead to a lot of trouble (naively trying to write concurrent code can lead to bugs) 53 | 54 | - Parallelism alone (without distribution) can only scale your compute, and only by a limited amount (limited by your CPU bandwidth, # cores, and amount of RAM on your laptop!) 55 | 56 | + e.g.: I have 800 GB available, but if I want to work with a dataset 57 | bigger than that, I'm out of luck 58 | 59 | + e.g.: I have 16 CPU cores available, but if I want more than 16X 60 | speedup, I'm out of luck 61 | 62 | We want to be able to scale pipelines automatically to larger datasets. 63 | How? 64 | 65 | Idea: 66 | 67 | - **Build our pipelines** at a higher level of abstraction -- build data sets and operators over data sets 68 | 69 | - **Deploy our pipelines** using a framework or library that will automatically scale and take advantage of parallel and distributed compute resources. 70 | 71 | Analogy: kind of like a compiler or interpreter! 72 | (A long time ago, people use to write all code in assembly language/ 73 | machine code) 74 | 75 | We say "what" we want, the distributed data processing software framework will 76 | handle the "how" 77 | 78 | So what is that higher level abstraction? 79 | 80 | Spoiler: 81 | It's dataflow graphs! 82 | 83 | (With one additional thing) 84 | """ 85 | 86 | """ 87 | === Introduction to distributed programming === 88 | 89 | What is a scalable collection type? 90 | 91 | What is a collection type? A set, a list, a dictionary, a table, 92 | a DataFrame, a database (for example), any collection of objects, rows, 93 | or data items. 94 | 95 | - Pandas DataFrame is one example. 96 | 97 | When we talk about collection types, we usually assume the whole 98 | thing is stored in memory. (Refer to 800GB limit comment above.) 99 | 100 | A: "Scalable" part means the collection is automatically distributed 101 | and parallelized over many different workers and/or computers or devices. 102 | 103 | The magic of this is that we can think of it just like a standard 104 | collection type! 
105 | 106 | If I have a scalable set, I can just think of that as a set 107 | 108 | If I have a scalable DataFrame, I can just think of that as a DataFrame 109 | 110 | Basic scalable collection types in Spark: 111 | 112 | - RDD 113 | Resilient Distributed Dataset 114 | 115 | - PySpark DataFrame API 116 | Will bear resemblance to DataFrames in Pandas (and Dask) 117 | """ 118 | 119 | # Uncomment to run 120 | # # RDD - scalable version of a Python set of integers 121 | # basic_rdd = sc.parallelize(range(0, 1_000)) 122 | 123 | # print(basic_rdd) 124 | 125 | # # --- run some commands on the RDD --- 126 | # mapped_rdd = basic_rdd.map(lambda x: x + 2) 127 | # filtered_rdd = mapped_rdd.filter(lambda x: x > 500) 128 | # result = filtered_rdd.collect() 129 | 130 | # print(result) 131 | 132 | """ 133 | We can visualize our pipeline! 134 | 135 | Open up your browser to: 136 | http://localhost:4040/ 137 | 138 | === More examples === 139 | 140 | Scalable collection types are just like normal collection types! 141 | 142 | Let's show this: 143 | 144 | Exercises: 145 | 1. 146 | Write a function 147 | a) in Python 148 | b) in PySpark using RDDs 149 | that takes an input list of integers, 150 | and finds only the integers x such that x * x is exactly 3 digits... 151 | 152 | - .map 153 | - .filter 154 | - .collect 155 | """ 156 | 157 | def ex1_python(l1): 158 | # anonymous functions with map filter! 159 | l2 = map(lambda x: x * x, l1) 160 | # ^^ equivalent to 161 | # def anon_function_square(x): 162 | # return x * x 163 | # l2 = map(anon_function_square, l1) 164 | # list comprehension syntax: 165 | # [x * x for x in l1] 166 | l3 = filter(lambda x: 100 <= x <= 999, l2) 167 | print(list(l3)) 168 | 169 | INPUT_EXAMPLE = list(range(100)) 170 | 171 | # ex1_python(INPUT_EXAMPLE) 172 | 173 | # Output: 174 | # [100, 121, 144, 169, 196, 225, 256, 289, 324, 361, 400, 441, 484, 529, 576, 625, 676, 729, 784, 841, 900, 961] 175 | # All the 3 digit square numbers! 176 | 177 | def ex1_rdd(list): 178 | l1 = sc.parallelize(list) # how you construct an RDD 179 | l2 = l1.map(lambda x: x * x) 180 | # BTW: equivalent to: 181 | # def square(x): 182 | # return x * x 183 | # l2 = l1.map(square) 184 | l3 = l2.filter(lambda x: 100 <= x <= 999) 185 | print(l3.collect()) 186 | 187 | # ex1_rdd(INPUT_EXAMPLE) 188 | 189 | """ 190 | 2. 191 | Write a function 192 | a) in Python 193 | b) in PySpark using RDDs 194 | that takes as input a list of integers, 195 | and adds up all the even integers and all the odd integers 196 | 197 | - .groupBy 198 | - .reduceBy 199 | - .reduceByKey 200 | - .partitionBy 201 | """ 202 | 203 | def ex2_python(l1): 204 | # (Skip: leave as exercise) 205 | # TODO 206 | raise NotImplementedError 207 | 208 | def ex2_rdd(l1): 209 | l2 = sc.parallelize(l1) 210 | l3 = l2.groupBy(lambda x: x % 2) 211 | l4 = l3.flatMapValues(lambda x: x) 212 | # ^^ needed for technical reasons 213 | # actually, would be easier just to run a map to (x % 2, x) 214 | # then call reduceByKey, but I wanted to conceptually separate 215 | # out the groupBy step from the sum step. 216 | l5 = l4.reduceByKey(lambda x, y: x + y) 217 | for key, val in l5.collect(): 218 | print(f"{key}: {val}") 219 | # Uncomment to inspect l1, l2, l3, l4, and l5 220 | # breakpoint() 221 | 222 | # ex2_rdd(INPUT_EXAMPLE) 223 | 224 | """ 225 | Good! But there's one thing left -- we haven't really measured 226 | that our pipeline is actually getting run in parallel. 227 | 228 | Q: Can we check that? 
229 | 230 | Test: parallel_test.py 231 | 232 | A: Tools: 233 | 234 | time (doesn't work) 235 | 236 | Activity monitor 237 | 238 | localhost:4040 239 | (see Executors tab) 240 | 241 | Q: what is localhost? What is going on behind the scenes? 242 | 243 | A: Spark is running a local cluster on our machine to schedule and run 244 | tasks (batch jobs). 245 | 246 | Q: Why do we need sc.context? 247 | 248 | A: 249 | Not locally using Python compute, so any operation we do 250 | needs to get submitted and run as a job through the cluster. 251 | 252 | Q: What does RDD stand for? 253 | 254 | RDD means Resilient Distributed Dataset. 255 | https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf 256 | 257 | === That's Just Data Parallelism! (TM) === 258 | 259 | Yes, the following is a punchline: 260 | 261 | scalable collection type == data parallelism. 262 | 263 | They are really the same thing. 264 | 265 | Task, pipeline parallelism are limited by the # of nodes in the graph! 266 | Data parallelism = arbitrary scaling, so it's what enables scalable 267 | collectiont types. 268 | 269 | Brings us to: how can we tell from looking at a dataflow graph if it can 270 | be parallelized and distributed automatically in a framework like PySpark? 271 | 272 | A: All tasks must be data-parallel. 273 | 274 | === Summary === 275 | 276 | We saw scalable collection types 277 | (with some initial RDD examples) 278 | 279 | Scalable collection types are just like normal collection types, 280 | but they behave (behind the scenes) like they work in parallel! 281 | 282 | They do this by automatically exploiting data parallelism. 283 | 284 | Behind the scenes, both vertical scaling and horizontal scaling 285 | can be performed automatically by the underlying data processing 286 | engine (in our case, Spark). 287 | 288 | This depends on the engine to do its job well -- for the most part, 289 | we will assume in this class that the engine does a better job than 290 | we do, but we will get to some limitations later on. 291 | 292 | Many other data processing engines exist... 293 | (to name a few, Hadoop, Google Cloud Dataflow, Materialize, Storm, Flink) 294 | (we will discuss more later on and the technology behind these.) 295 | 296 | === Plan for remaining parts === 297 | 298 | Overall plan for Lecture 5: 299 | 300 | - Scalable collection types 301 | 302 | - Programming over collection types 303 | 304 | - Important properties: immutability, laziness 305 | 306 | - MapReduce 307 | 308 | Simpler abstraction underlying RDDs and Spark 309 | 310 | - Partitioning in RDDs and collection types 311 | 312 | Possible topics/optional: 313 | 314 | - Distributed consistency: crashes, failures, duplicated/dropped/reorder messages 315 | 316 | - Pitfalls. 317 | """ 318 | -------------------------------------------------------------------------------- /lecture4/parts/5-quantifying.py: -------------------------------------------------------------------------------- 1 | 2 | """ 3 | Part 5: Quantifying Parallelism and Amdahl's Law. 4 | 5 | Content from Nov 7 poll was moved to end of Part 4 lecture. 6 | 7 | === Quantifying parallelism === 8 | 9 | We know how to tell *if* there's parallelism. 10 | What about *how much*? 11 | 12 | i.e.: What amount of parallelism is available in a system? 13 | 14 | Definition: 15 | **Speedup** is defined by: 16 | (running time of sequential code) / (running time of parallel code) 17 | 18 | example: 19 | 4.6s for parallel impl 20 | 9.2s for sequential impl 21 | 22 | Speedup would be 2x. 
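One way to measure this yourself, instead of using the shell's time command, is to time both versions from inside Python. A rough sketch -- it assumes average_numbers and average_numbers_parallel from 2-parallelism.py are defined in the same script:

    import time

    def timed(f):
        # run f once, return elapsed wall-clock seconds
        start = time.perf_counter()
        f()
        return time.perf_counter() - start

    # Uncomment to run:
    # t_seq = timed(lambda: average_numbers(200_000_000))
    # t_par = timed(average_numbers_parallel)
    # print(f"Speedup: {t_seq / t_par:.2f}x")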
23 | (We can run 2-parallelism.py to check; and we might get different numbers 24 | on different platforms or machines, for example, if your machine has only one CPU 25 | you might not see any speedup.) 26 | 27 | Re-running: 28 | Speedup = 9.6 / 5.2 = 1.84x speedup. 29 | 30 | You could run with four workers, and get up to a 4x speedup 31 | ... or with 8 workers, and get up to an 8x speedup ... 32 | 33 | You might wonder, how much can I keep speeding up this computation, 34 | won't this stop working at some point? 35 | 36 | At some point ... we hit a bottleneck 37 | 38 | Fundamental law of parallelism: 39 | Amdahl's law: 40 | https://en.wikipedia.org/wiki/Amdahl%27s_law 41 | 42 | Amdahl's law gives a theoretical upper bound on the amount of speedup that is possible for any task (in arbitrary code, but also applying specifically 43 | to data processing code). 44 | 45 | It's a useful way to quantify parallelism & see how useful it would be. 46 | 47 | === Amdahl's Law === 48 | 49 | We're interested in knowing: how much speedup is possible? 50 | 51 | Standard form of the law: 52 | 53 | Suppose we have a computation that I think could benefit from one or more types of parallelism. 54 | The amount of speedup in a computation is at most 55 | 56 | Speedup <= 1 / (1 - p) 57 | 58 | where: 59 | 60 | p is the percentage of the task (in running time) that can be parallelized. 61 | 62 | === Example with a simple task === 63 | 64 | We have written a complex combination of C and Python code to train our ML model. 65 | Based on profiling the code (callgrind or some other profiling tool), we believe that 66 | 95% of the code can be fully parallelized, however there is a 5% of the time of the code 67 | that is spent parsing the input model file and producing as output an output model file 68 | that we have determined cannot be parallelized. 69 | 70 | Q: What is the maximum speedup for our code? 71 | 72 | Applying Amdahl's law: 73 | 74 | p = .95 75 | 76 | Speedup <= 1 / (1 - .95) = 1 / .05 = 20x. 77 | 78 | Pretty good - but not infinite! 79 | 80 | Example: the best we can get is from 100 hours to 5 hours, a 20x speedup. 81 | 82 | How to apply this knowledge? 83 | I was considering purchasing a supercomputer server machine with 160 cores. 84 | Based on the above calculation, I realize that I'm only going to effectively 85 | be able to make use of an at most 20x speedup, 86 | so I think my 160 cores may not be useful, and I buy a smaller machine 87 | with 24 cores. 88 | 89 | === Alternate form === 90 | 91 | Here is an alternative form of the law that is equivalent, but sometimes a bit more useful. 92 | Let: 93 | - T be the total amount of time to complete a task sequentially 94 | (without any parallelism) 95 | (in our example: 100 hours) 96 | 97 | - S be the amount of time to compute some inherently sequential bottleneck 98 | --> We don't believe it's possible to do any part of S in parallel 99 | (in our example: 5 hours) 100 | 101 | Then the maximum speedup of the task is at most 102 | 103 | speedup <= (T / S) = 100 hours / 5 hours = 20x. 104 | 105 | Note: this applies to distributed computations as well! 106 | 107 | This is giving a theoretical upper bound, not taking into account 108 | other overheads (for example, it doesn't take into account 109 | communication overhead between threads, processes or distributed devices). 110 | 111 | So it's not an actual number on what speedup we will get, but it still can be a 112 | useful upper bound. 
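To make the two forms concrete, here is a tiny sketch that just evaluates the bound, using the numbers from the ML-training example above:

    def max_speedup_from_p(p):
        # standard form: Speedup <= 1 / (1 - p)
        return 1 / (1 - p)

    def max_speedup_from_times(T, S):
        # alternate form: Speedup <= T / S
        return T / S

    print(round(max_speedup_from_p(0.95), 2))    # 20.0
    print(max_speedup_from_times(100, 5))        # 20.0

Both give the same 20x cap, as expected, since p = (T - S) / T.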
113 | 
114 | Recap:
115 | 
116 | - We reviewed 3 types of parallelism in dataflow graphs
117 | 
118 | - We defined speedup
119 | 
120 | - We talked about estimating the "maximum speedup" in a pipeline, using
121 | a law called Amdahl's Law
122 | 
123 | - We saw two forms of the law:
124 | 
125 | Speedup <= 1 / (1 - p)
126 | 
127 | Speedup <= T / S
128 | 
129 | where:
130 | T is the running time of the sequential code
131 | S is the running time of a bottleneck that can't be parallelized
132 | p is the fraction of the running time that can be parallelized
133 | 
134 | p = (T - S) / T.
135 | 
136 | ---- where we ended for Nov 7 ----
137 | 
138 | Recall the formulas from last time.
139 | 
140 | === Example ===
141 | 
142 | 1. SQL query example
143 | 
144 | - imagine an SQL query where you need to match
145 | each employee name with their salary and produce a joined table
146 | (a join of name_table and salary_table)
147 | 
148 | Assume that all operations take 1 ms per row:
149 | - 1 ms to load each input row from name_table
150 | - 1 ms to load each input row from salary_table
151 | - 1 ms to join -- per row in the joined table
152 | 
153 | Also assume that there are 100 employees in name_table,
154 | 100 in salary_table, and 100 in the joined table.
155 | 
156 | Q: What is the maximum speedup here?
157 | 
158 | Dataflow graph:
159 | 
160 | (load name_table) ----|
161 |                       |---> (join tables)
162 | (load salary_table) --|
163 | 
164 | speedup <= (T / S)
165 | 
166 | What are T and S?
167 | 
168 | T = ?
169 | 300 ms =
170 | 100 ms to load the first table
171 | 100 ms to load the second table
172 | 100 ms to calculate the joined table
173 | 
174 | all with no parallelism!
175 | 
176 | S = what cannot be parallelized?
177 | 
178 | Idea: view it at the level of input rows!
179 | 
180 | Let's identify what needs to happen for some specific employee:
181 | 
182 | - I need to load the employee name before I produce the particular output row in the joined table for that employee
183 | 
184 | - I need to load the employee salary before I produce the particular output row in the joined table for that employee
185 | 
186 | x I need to load the employee name, then load the employee salary, then produce the particular output row
187 | 
188 | ^^^ not really a sequential bottleneck (the two loads can happen in parallel)
189 | 
190 | Minimum "sequential bottleneck" is 2 ms! (load one input row, then produce the corresponding output row)
191 | 
192 | Therefore:
193 | 
194 | Speedup <= T / S = 300 ms / 2 ms = 150x.
195 | 
196 | === Poll ===
197 | 
198 | Use Amdahl's law to estimate the maximum speedup in the following scenario.
199 | 
200 | As in last Monday's poll, a Python script needs to:
201 | - load a dataset into Pandas: students.csv, with 100 rows
202 | - calculate a new column which is the total course load for each student
203 | - send an email to each student with their total course load
204 | 
205 | Assume that it takes 1 ms (per row) to read in each input row, 1 ms (per row) to calculate the new column, and 1 ms (per row) to send an email.
206 | 
207 | Q: What is the theoretical bound on the maximum speedup in the pipeline?
208 | 
209 | https://forms.gle/W5NpbuZGs4Se45VCA
210 | 
211 | DFG:
212 | 
213 | (load) -> (calculate col) -> (send email)
214 | 100 ms       100 ms            100 ms
215 | 
216 | T = 100 + 100 + 100 = 300 ms
217 | 
218 | S = 3 ms -- we need to perform 3 actions in sequence for a single student; that can't be parallelized!
219 | 
220 | Speedup <= 300 ms / 3 ms = 100x.
221 | 
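For linear per-row pipelines like this one, the reasoning above can be packaged in a small helper (a hypothetical sketch, not part of the course code; it assumes every stage processes every row exactly once):

def amdahl_bound_linear_pipeline(n_rows, stage_costs_ms):
    # Every row must pass through every stage in order, so the per-row
    # critical path S is the sum of the per-row stage costs, while the
    # fully sequential time T multiplies that by the number of rows.
    T = n_rows * sum(stage_costs_ms)   # total time with no parallelism
    S = sum(stage_costs_ms)            # sequential bottleneck for one row
    return T, S, T / S                 # T, S, and the bound T / S

# Poll scenario: 100 students; 1 ms each to load, compute the column, send the email.
T, S, bound = amdahl_bound_linear_pipeline(100, [1, 1, 1])
print(T, S, bound)   # 300 3 100.0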
222 | === More examples and exercises ===
223 | (Skip - may do in discussion section)
224 | 
225 | 2. Let's take our data parallelism example:
226 | 
227 | We had an employee database, and tasks:
228 | 
229 | 1. load the employee dataset
230 | 
231 | 2. strip the spaces from employee names
232 | 
233 | 3. extract the first/given name
234 | 
235 | with dataflow graph:
236 | 
237 | (1) -> (2) -> (3)
238 | 
239 | Again assume 1 ms for each task per input row.
240 | What are T and S here?
241 | 
242 | 3. An extended version of the table join example.
243 | We have two tables, of employee names and employee salaries.
244 | We want to compute which employees make more than 1 million Euros.
245 | The employee salaries are listed in dollars.
246 | 
247 | We are given the CEO name as input.
248 | We want to get the salary associated with the CEO,
249 | convert it from USD to Euros, and keep only the rows where the
250 | result is over 1 million.
251 | Assume all basic operations take 1 unit of time per row.
252 | 
253 | === Additional notes ===
254 | 
255 | Note 1:
256 | You can think of this as the limit as the number of cores/processes
257 | goes to infinity.
258 | 
259 | T = the time it takes to complete the task with 1 worker
260 | S = the time it takes to complete the task with a theoretically infinite number of workers and no communication overhead between workers.
261 | 
262 | **Advanced topics note:**
263 | There's a version of the law that takes the number of processors into account.
264 | 
265 | - basically, that version takes the portion of the pipeline that *can* be parallelized and divides it by the
266 | # of processors: with N processors, Speedup <= 1 / ((1 - p) + p / N)
267 | 
268 | S portion: cannot be parallelized
269 | T - S portion: can be parallelized
270 | 
271 | - You don't need to know this version for this class.
272 | 
273 | Note 2:
274 | How Amdahl's law applies to aggregation cases.
275 | 
276 | average_numbers example
277 | 
278 | Our average_numbers example is slightly more complex than the above, as it involves an aggregation
279 | (group-by).
280 | Aggregation can be parallelized.
281 | (Why? What type of parallelism?)
282 | 
283 | For the purposes of Amdahl's law, let's think of aggregation as requiring at least 1 operation
284 | (1 unit of time to compute the total).
285 | 
286 | Q: What does Amdahl's law say is the maximum speedup for our simple average_numbers pipeline?
287 | 
288 | === Connection to throughput & latency ===
289 | 
290 | Let's also connect Amdahl's law back to throughput & latency.
291 | 
292 | Given T and S...
293 | 
294 | 1. Rephrase in terms of throughput:
295 | 
296 | If there are N input items, then the _maximum_ throughput is
297 | 
298 | throughput <= (N / S)
299 | 
300 | since throughput = (num input items) / (running time of the pipeline),
301 | 
302 | and Amdahl's law says that the minimum running time of the pipeline is S (the "maximum speedup" case).
303 | 
304 | 2. Rephrase in terms of latency:
305 | 
306 | Observation:
307 | In the above examples, the "sequential bottleneck" we chose
308 | is essentially the latency of a single item!
309 | 
310 | Therefore we have:
311 | 
312 | latency >= S,
313 | 
314 | if S is computed in the way that we have computed it above.
315 | """
316 | 
--------------------------------------------------------------------------------