├── lecture2 ├── parts │ ├── file2.txt │ ├── file1.txt │ ├── example_file.txt │ ├── file3.txt │ ├── subfolder │ │ └── mod.py │ ├── 6-conclusion.py │ ├── 2-commands.py │ ├── 1-introduction.py │ ├── 3-informational.py │ ├── 5-git.py │ └── 4-help-and-doing-stuff.py └── README.md ├── lecture1 ├── .gitignore ├── extras │ ├── main_test.py │ ├── module_test.py │ ├── cut.py │ └── failures.py ├── parts │ ├── life-expectancy-row1.csv │ ├── verify.py │ ├── throughput_latency.py │ ├── 1-introduction.py │ ├── 6-conclusion.py │ ├── 2-etl.py │ ├── 4-properties.py │ ├── 5-performance.py │ └── 3-dataflow-graphs.py └── README.md ├── .gitignore ├── lecture7 ├── file.txt ├── example-policy.json └── README.md ├── lecture0 ├── Lecture 0 Slides.pdf └── README.md ├── lecture5 ├── extras │ ├── narrow_wide.png │ ├── parallel_test.py │ ├── exercises.py │ ├── examples.py │ ├── cut.py │ └── dataframe.py ├── README.md └── parts │ ├── 5-dataframes.py │ ├── 6-latency-throughput.py │ └── 1-RDDs.py ├── lecture4 ├── extras │ ├── scaling-example.png │ ├── dataflow-graph-example.png │ ├── resources.md │ └── data-race-example.py ├── README.md └── parts │ ├── 6-distribution.py │ ├── 1-motivation.py │ ├── 2-parallelism.py │ └── 5-quantifying.py ├── lecture3 ├── README.md └── old │ └── README.md ├── exams ├── final.md ├── midterm_study_list.md ├── poll_answers.md └── final_study_list.md ├── LICENSE ├── lecture6 ├── parts │ ├── orders.md │ ├── 4-end-notes.py │ └── 2-microbatching.py ├── extras │ └── streaming.py └── README.md └── schedule.md /lecture2/parts/file2.txt: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /lecture2/parts/file1.txt: -------------------------------------------------------------------------------- 1 | ../lecture1 2 | -------------------------------------------------------------------------------- /lecture1/.gitignore: -------------------------------------------------------------------------------- 1 | output.csv 2 | save.txt 3 | -------------------------------------------------------------------------------- /lecture2/parts/example_file.txt: -------------------------------------------------------------------------------- 1 | Commit test 2 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | __pycache__/ 3 | notes/ 4 | -------------------------------------------------------------------------------- /lecture7/file.txt: -------------------------------------------------------------------------------- 1 | Test data 2 | Apple banana orange 3 | -------------------------------------------------------------------------------- /lecture2/parts/file3.txt: -------------------------------------------------------------------------------- 1 | ../lecture1 2 | ../lecture2 3 | ../lecture3 4 | ../lecture4 5 | -------------------------------------------------------------------------------- /lecture1/extras/main_test.py: -------------------------------------------------------------------------------- 1 | import lecture 2 | 3 | print("Hello from main_test.py") 4 | -------------------------------------------------------------------------------- /lecture0/Lecture 0 Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DavisPL-Teaching/119/HEAD/lecture0/Lecture 0 Slides.pdf 
-------------------------------------------------------------------------------- /lecture5/extras/narrow_wide.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DavisPL-Teaching/119/HEAD/lecture5/extras/narrow_wide.png -------------------------------------------------------------------------------- /lecture4/extras/scaling-example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DavisPL-Teaching/119/HEAD/lecture4/extras/scaling-example.png -------------------------------------------------------------------------------- /lecture2/parts/subfolder/mod.py: -------------------------------------------------------------------------------- 1 | """ 2 | A new module that we created from the command line 3 | """ 4 | 5 | print("Hello from submodule") 6 | -------------------------------------------------------------------------------- /lecture4/extras/dataflow-graph-example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DavisPL-Teaching/119/HEAD/lecture4/extras/dataflow-graph-example.png -------------------------------------------------------------------------------- /lecture1/parts/life-expectancy-row1.csv: -------------------------------------------------------------------------------- 1 | Entity,Code,Year,Period life expectancy at birth - Sex: all - Age: 0 2 | Afghanistan,AFG,1950,27.7275 3 | -------------------------------------------------------------------------------- /lecture0/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 0: Course Introduction 2 | 3 | This was the first lecture (on Wednesday, Sep 24). 4 | We went over an introduction to the course, syllabus, and schedule! 5 | The slides can be found in `Lecture 0 Slides.pdf`. 6 | -------------------------------------------------------------------------------- /lecture3/README.md: -------------------------------------------------------------------------------- 1 | # Spring 2025 2 | 3 | This shorter lecture was skipped for Spring 2025. 4 | 5 | Some aspects of Pandas were covered on HW1. 6 | 7 | We will move directly to Lecture 4. 8 | 9 | ## Please note 10 | 11 | - Any uncovered material will NOT be covered on the midterm. 12 | -------------------------------------------------------------------------------- /lecture4/extras/resources.md: -------------------------------------------------------------------------------- 1 | ### Resources and further reading 2 | 3 | [Parallel Computing: Theory and Practice](https://www.cs.cmu.edu/afs/cs/academic/class/15210-f15/www/tapp.html) 4 | 5 | A good textbook on parallel computing (written by Umut A. Acar at CMU). 6 | Covers the concepts, design, and implementation of parallel algorithms in more detail, work/span analysis, fork/join parallelism, etc. 7 | -------------------------------------------------------------------------------- /exams/final.md: -------------------------------------------------------------------------------- 1 | # Final details 2 | 3 | Thursday, Dec 11, 8-10am, same room as lecture 4 | 5 | Closed-book, on paper, one-sided cheat sheet allowed (handwritten or typed). 6 | 7 | Similar structure to the midterm: 8 | 10 true/false, 8 multiple choice / short answer, 2 free response 9 | 10 | Study topic list: see `final_study_list.md` 11 | 12 | Study questions: go over the in-class polls! 
13 | 14 | No questions that ask you to hand-write code 15 | (concepts, not syntax). 16 | -------------------------------------------------------------------------------- /lecture7/example-policy.json: -------------------------------------------------------------------------------- 1 | { 2 | "Id": "Policy1733525110646", 3 | "Version": "2012-10-17", 4 | "Statement": [ 5 | { 6 | "Sid": "Stmt1733525108327", 7 | "Action": "s3:*", 8 | "Effect": "Allow", 9 | "Resource": "arn:aws:s3:::119-test-bucket-2", 10 | "Principal": { 11 | "AWS": [ 12 | "arn:aws:iam::472501947158:user/caleb" 13 | ] 14 | } 15 | } 16 | ] 17 | } 18 | -------------------------------------------------------------------------------- /lecture1/extras/module_test.py: -------------------------------------------------------------------------------- 1 | """ 2 | This is a little test script to talk about 3 | Python modules and scope. 4 | 5 | We may get to it in lecture 1 or in a future lecture. 6 | """ 7 | 8 | print("Python modules and scope") 9 | 10 | import sys 11 | import types 12 | def imports(): 13 | for name, val in globals().items(): 14 | if isinstance(val, types.ModuleType): 15 | yield val.__name__ 16 | 17 | print("__name__:", __name__) 18 | print("All modules:", sys.modules.keys()) 19 | print("Local modules:", list(imports())) 20 | print("This module:", sys.modules[__name__]) 21 | print("This module:", sys.modules[__name__].__name__) 22 | -------------------------------------------------------------------------------- /lecture5/extras/parallel_test.py: -------------------------------------------------------------------------------- 1 | """ 2 | A little test to show how RDDs are parallelized. 3 | """ 4 | 5 | from pyspark.sql import SparkSession 6 | spark = SparkSession.builder.appName("DataflowGraphExample").getOrCreate() 7 | sc = spark.sparkContext 8 | 9 | # Modify as needed 10 | N = 1_000_000_000 11 | 12 | result = (sc 13 | .parallelize(range(1, N)) 14 | # Uncomment to force only a single partition 15 | # .map(lambda x: (0, x)) # first element of ordered pair is the key I want to parallelize on 16 | # .partitionBy(1) 17 | # .map(lambda x: x[1]) 18 | .map(lambda x: x ** 2) 19 | .filter(lambda x: x >= 100 and x < 1000) 20 | .collect() 21 | ) 22 | 23 | print(result) 24 | -------------------------------------------------------------------------------- /lecture5/extras/exercises.py: -------------------------------------------------------------------------------- 1 | """ 2 | Some MapReduce exercises 3 | 4 | (Skipped - will appear on the homework.) 5 | 6 | https://github.com/DavisPL-Teaching/119-hw2 7 | """ 8 | 9 | # Spark boilerplate (remember to always add this at the top of any Spark file) 10 | import pyspark 11 | from pyspark.sql import SparkSession 12 | spark = SparkSession.builder.appName("DataflowGraphExample").getOrCreate() 13 | sc = spark.sparkContext 14 | 15 | """ 16 | 1. Among the numbers from 1 to 1000, which digit is most common? 17 | the least common? 18 | 19 | 2. Among the numbers from 1 to 1000, written out in English, which character is most common? 20 | the least common? 21 | 22 | 3. Does the answer change if we have the numbers from 1 to 1,000,000? 
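(Not the HW solutions -- just a sketch of the general counting pattern these
exercises use, so the syntax is in one place. It assumes the `sc` SparkContext
from the boilerplate above: `flatMap` emits one record per digit, `reduceByKey`
totals the counts, and `max`/`min` with a key function pick out the extremes.)

    digit_counts = (sc
        .parallelize(range(1, 1001))
        .flatMap(lambda n: list(str(n)))     # e.g. 123 -> ['1', '2', '3']
        .map(lambda d: (d, 1))
        .reduceByKey(lambda a, b: a + b))

    most_common = digit_counts.max(key=lambda pair: pair[1])
    least_common = digit_counts.min(key=lambda pair: pair[1])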
23 | """ 24 | -------------------------------------------------------------------------------- /lecture1/extras/cut.py: -------------------------------------------------------------------------------- 1 | """ 2 | Additional cut material 3 | 4 | === Advantages of software view === 5 | 6 | Advantages of thinking of data processing pipelines as software: 7 | 8 | - Software *design* matters: structuring code into modules, classes, functions 9 | - Software can be *tested*: validating functions, validating inputs, unit & integration tests 10 | - Software can be *reused* and maintained (not just a one-off script) 11 | - Software can be developed collaboratively (Git, GitHub) 12 | - Software can be optimized for performance (parallelism, distributed computing, etc.) 13 | 14 | It is a little more work to structure our code this way! 15 | But it helps ensure that our work is reusable and integrates well with other teams, projects, etc. 16 | """ 17 | -------------------------------------------------------------------------------- /lecture5/extras/examples.py: -------------------------------------------------------------------------------- 1 | """ 2 | Just some simple examples for syntax reference. 3 | """ 4 | 5 | ### RDD part 6 | 7 | # Start a Spark session 8 | from pyspark.sql import SparkSession 9 | spark = SparkSession.builder.appName("DataflowGraphExample").getOrCreate() 10 | sc = spark.sparkContext 11 | 12 | data = sc.parallelize(range(1, 11)) # RDD containing integers 1 to 10 13 | 14 | mapped_data = data.map(lambda x: x ** 2) # [1, 4, 9, ..., 100] 15 | 16 | filtered_data = mapped_data.filter(lambda x: x > 50) # [64, 81, 100] 17 | 18 | ### DataFrame part 19 | 20 | # Start a Spark session 21 | from pyspark.sql import SparkSession 22 | from pyspark.sql.functions import col 23 | spark = SparkSession.builder.appName("DataFrameExample").getOrCreate() 24 | 25 | # Create a DataFrame with integers from 1 to 10 26 | data = spark.createDataFrame([(i,) for i in range(1, 11)], ["number"]) 27 | 28 | mapped_data = data.withColumn("squared", col("number") ** 2) 29 | 30 | filtered_data = mapped_data.filter(col("squared") > 50) 31 | 32 | filtered_data.show() 33 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 Caleb Stanford 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /lecture6/parts/orders.md: -------------------------------------------------------------------------------- 1 | # Orders (copy paste into `nc` window) 2 | 3 | {"order_number": 1, "item": "Apple", "timestamp": "2025-11-24 10:00:00", "qty": 2} 4 | {"order_number": 2, "item": "Banana", "timestamp": "2025-11-24 10:01:00", "qty": 3} 5 | {"order_number": 3, "item": "Orange", "timestamp": "2025-11-24 10:02:00", "qty": 1} 6 | 7 | ### More examples 8 | {"order_number": 4, "item": "Apple", "timestamp": "2025-11-24 10:03:00", "qty": 2} 9 | {"order_number": 5, "item": "Banana", "timestamp": "2025-11-24 10:04:00", "qty": 1} 10 | {"order_number": 6, "item": "Orange", "timestamp": "2025-11-24 10:05:00", "qty": 1} 11 | 12 | ### Stress testing 13 | 14 | {"order_number": 6, "item": "Orange", "timestamp": "2025-11-24 10:05:00", "qty": 100} 15 | {"order_number": 6, "item": "Orange", "timestamp": "2025-11-24 10:05:00", "qty": 10000} 16 | {"order_number": 3, "item": "Grapes", "timestamp": "2025-11-27 15:44:00", "qty": 5} 17 | {"order_number": 3, "item": "Grapes", "timestamp": "2025-11-27 15:44:00", "qty": 500} 18 | {"order_number": 3, "item": "Grapes", "timestamp": "2025-11-27 15:44:00", "qty": 50000} 19 | {"order_number": 3, "item": "Orange", "timestamp": "2025-11-27 15:44:00", "qty": 5000000} 20 | -------------------------------------------------------------------------------- /lecture7/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 7: Brief lecture on Cloud Computing 2 | 3 | **This is an old version of the lecture from the Fall 2024 iteration of the course. It has not yet been updated for Fall 2025.** 4 | 5 | ## Dec 6 6 | 7 | Announcements: 8 | https://piazza.com/class/m12ef423uj5p5/post/165 9 | 10 | - OH today and Monday 11 | 12 | - HW2, HW1 makeup, all in-class polls due Monday EOD 13 | 14 | - Final is Wednesday, 6-8pm 15 | 16 | ## Outline for today 17 | 18 | Start with the poll 19 | 20 | Cloud computing in AWS: 21 | 22 | - Getting started; what to know up front 23 | 24 | - Storing data (S3) 25 | 26 | - Running computations (EC2 and Lambda) 27 | 28 | If we have an extra 10-15 minutes at the end of class, I will reserve it for an open Q+A. 29 | 30 | Questions about HW2 or the final or anything else? 31 | 32 | ## Poll 33 | 34 | Getting the poll out of the way: 35 | 36 | https://forms.gle/3e6vJHShMJfaD2pH8 37 | 38 | A streaming system processes 20 input rows over the duration of 1 minute that the system is running. The system may use parallelism, meaning that at any given point in time, more than one input may be processed. 39 | 40 | Given this information, what is the minimum and maximum latency for individual input rows? 41 | -------------------------------------------------------------------------------- /lecture6/extras/streaming.py: -------------------------------------------------------------------------------- 1 | """ 2 | A minimal example of a streaming pipeline in PySpark 3 | using Structured Streaming. 
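The pipeline below reads newline-delimited JSON from a local TCP socket
(localhost:9999), parses each line against the schema defined in this file,
and prints the parsed rows to the console in append mode.

One way to try it out (a sketch; assumes netcat is installed and that you run
the script from this directory):

    # terminal 1: open the socket that the job reads from
    nc -lk 9999

    # terminal 2: start the streaming job
    python3 streaming.py

Then paste JSON lines such as those in lecture6/parts/orders.md, e.g.
    {"order_number": 1, "item": "Apple", "timestamp": "2025-11-24 10:00:00", "qty": 2}
into the netcat window and watch the parsed batch appear in terminal 2.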
4 | 5 | (Remember to use nc -lk 9999 to run) 6 | """ 7 | 8 | from pyspark.sql import SparkSession 9 | from pyspark.sql.functions import from_json, col 10 | from pyspark.sql.types import StructType, StructField, StringType, IntegerType 11 | 12 | # Create a Spark session 13 | spark = SparkSession.builder.appName("streaming").getOrCreate() 14 | 15 | # Define the schema of the incoming JSON data 16 | schema = StructType([ 17 | StructField("order_number", IntegerType()), 18 | StructField("item", StringType()), 19 | StructField("timestamp", StringType()), 20 | StructField("qty", IntegerType()) 21 | ]) 22 | 23 | # Use local socket as a streaming source 24 | streaming_df = spark.readStream.format("socket") \ 25 | .option("host", "localhost") \ 26 | .option("port", 9999) \ 27 | .load() 28 | 29 | # Parse the JSON data 30 | parsed_df = streaming_df.select(from_json(col("value").cast("string"), schema).alias("parsed_value")) 31 | 32 | # Start the streaming query 33 | query = parsed_df.writeStream.outputMode("append").format("console").start() 34 | 35 | # Wait for the streaming to finish 36 | query.awaitTermination() 37 | -------------------------------------------------------------------------------- /lecture1/parts/verify.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | import sys 4 | import matplotlib.pyplot as plt 5 | import pyspark.sql as sql 6 | 7 | def verify(): 8 | print() 9 | print("Python version:", sys.version) 10 | print("Numpy version:", np.__version__) 11 | print("Pandas version:", pd.__version__) 12 | print("Matplotlib version:", plt.matplotlib.__version__) 13 | print() 14 | 15 | # Simple numpy and pandas operations to verify functionality 16 | # - Create a numpy array 17 | array = np.array([1, 2, 3]) 18 | print("Numpy array:", array) 19 | 20 | # - Add stuff to the array 21 | array = array + 4 22 | print("Numpy array after addition:", array) 23 | 24 | # - Make a pandas DataFrame 25 | df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}) 26 | print("Pandas DataFrame:\n", df1) 27 | print() 28 | 29 | # A basic pyspark test 30 | spark = sql.SparkSession.builder.appName("Verify").getOrCreate() 31 | df2 = spark.createDataFrame([ 32 | sql.Row(a=1, b=2., c='string1'), 33 | sql.Row(a=2, b=3., c='string2'), 34 | sql.Row(a=4, b=5., c='string3'), 35 | ]) 36 | df2.show() 37 | print("Spark version:", spark.version) 38 | print() 39 | 40 | # Plot the first DataFrame using matplotlib 41 | plt.plot(df1) 42 | plt.show() 43 | 44 | if __name__ == "__main__": 45 | verify() 46 | -------------------------------------------------------------------------------- /lecture6/parts/4-end-notes.py: -------------------------------------------------------------------------------- 1 | """ 2 | Part 4: End notes 3 | 4 | === Poll === 5 | 6 | This is the last poll! 7 | 8 | What type of time corresponds to each of the following scenarios? 
9 | 
10 | (Real time, event time, system time, logical time)
11 | 
12 | Select all that apply
13 | 
14 | https://forms.gle/NCXfDV4J3ySWiyiT6
15 | 
16 | === Summary ===
17 | 
18 | We've seen:
19 | 
20 | - Streaming systems: differ from batch processing systems in that they process
21 |   one item (or row) in your input at a time
22 | 
23 | - This is useful for "latency-critical" applications where you want, say, sub-second
24 |   or sub-millisecond level response times
25 | 
26 | - Measuring latency at an individual item level:
27 | 
28 |     Recall formula for latency:
29 | 
30 |     (exit time item X) - (start time item X)
31 | 
32 | - Microbatching: an optimization that trades latency for higher throughput
33 | 
34 |   Microbatching can be based on different notions of time! Usually event time or system time
35 | 
36 |   Microbatching - still a streaming system!
37 | 
38 | - Time: Real, Event, System, Logical
39 | 
40 |   + Monotonic time
41 | 
42 | === Discussion and Failure Cases ===
43 | 
44 | The major advantage of streaming pipelines (e.g., in Spark Streaming)
45 | is better latency.
46 | 
47 | However, they have additional failure cases compared to their batch counterparts.
48 | Let's cover a few of these:
49 | 
50 | - Out-of-order data (late arrivals)
51 | 
52 | - Clock drift and non-monotonic clocks
53 | 
54 |   (a streaming system cares about time - a batch processing system didn't!)
55 | 
56 | - Too much data
57 | 
58 | .
59 | .
60 | .
61 | 
62 | Q: How do we deal with out-of-order data?
63 | 
64 | Q: How do we deal with clocks being wrong?
65 | 
66 | Q: How do we deal with too much data?
67 | 
68 | Q: What happens when our pipeline is overloaded with too much data, and the above techniques fail?
69 | """
--------------------------------------------------------------------------------
/schedule.md:
--------------------------------------------------------------------------------
1 | # ECS 119 Tentative Course Schedule - Fall 2025
2 | 
3 | **Important note:**
4 | This schedule is subject to change.
5 | I will try to keep it up to date, but please see Piazza for the latest information and homework deadlines.
6 | 7 | ## Section 1: Data Processing Basics 8 | 9 | | Week | Date | Topic | Readings & HW | Lecture \# and part | 10 | | --- | --- | --- | --- | --- | 11 | | 0 | Sep 24 | Introduction | | 0 | 12 | | | Sep 26 | Introduction to Data Processing Software | | 1.1 | 13 | | 1 | Sep 29 | | | 1.2 | 14 | | | Oct 1 | **No class** | HW0 Due | | 15 | | | Oct 3 | | | 1.3 | 16 | | 2 | Oct 6 | | HW1 Available | 1.4 | 17 | | | Oct 8 | | | 1.5 | 18 | | | Oct 10 | | | 1.6 | 19 | | 3 | Oct 13 | The Shell | | 2.1 | 20 | | | Oct 15 | | | 2.2 | 21 | | | Oct 17 | | HW1 Due | 2.3 | 22 | | 4 | Oct 20 | | | 2.4 | 23 | | | Oct 22 | | | 2.5, 2.6 | 24 | | | Oct 24 | Parallelism | | 4.1 | 25 | 26 | ## Section 2: Parallelism 27 | 28 | | Week | Date | Topic | Readings & HW | Lecture # | 29 | | --- | --- | --- | --- | --- | 30 | | 5 | Oct 27 | | | 4.2 | 31 | | | Oct 29 | | | 4.3 | 32 | | | Oct 31 | | | 4.3, 4.4 | 33 | | 6 | Nov 3 | Review or Overflow | | 4.4 | 34 | | | Nov 5 | **Midterm** | | | 35 | | | Nov 7 | | | 4.5 | 36 | | 7 | Nov 10 | | | 4.5, 4.6 | 37 | | | Nov 12 | Distributed Pipelines | HW2 available | 5.1 | 38 | | | Nov 14 | | | 5.2 | 39 | | 8 | Nov 17 | | | 5.3 | 40 | | | Nov 19 | | | 5.4 | 41 | | | Nov 21 | | | 5.4, 5.5, 5.6 | 42 | 43 | ## Section 3: Distributed Computing 44 | 45 | | Week | Date | Topic | Readings & HW | Lecture # | 46 | | --- | --- | --- | --- | --- | 47 | | 9 | Nov 24 | Streaming Pipelines | HW2 due | 6.1 | 48 | | | Nov 26 | | | 6.2, 6.3, 6.4 | 49 | | | Nov 28 | **No Class** (Thanksgiving) | 50 | | 10 | Dec 1 | selection of additional topics[^1] | | 7 | 51 | | | Dec 3 | selection of additional topics[^1] | | 7 | 52 | | | Dec 5 | selection of additional topics[^1] | | 7 | 53 | | 11 | Dec 11 | **Final Exam (8am)** | | | 54 | 55 | ## Notes 56 | 57 | [^1]: Possible additional topics (depending on time and preference): 58 | cloud computing; 59 | data validation and integrity; 60 | data cleaning; 61 | missing data; 62 | distributed programming: failures and consistency requirements; 63 | containerization and orchestration; 64 | cloud computing: platforms, services, and resources. 65 | -------------------------------------------------------------------------------- /lecture5/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 5: Distributed Pipelines 2 | 3 | ## Nov 12 4 | 5 | Announcements: 6 | 7 | - HW2 now available! (Early release) 8 | 9 | https://github.com/DavisPL-Teaching/119-hw2 10 | 11 | Due Monday, Nov 24 12 | 13 | Some minor changes possible before Monday, will be announced on Piazza 14 | 15 | Plan: 16 | 17 | - Start with poll 18 | 19 | - Lecture 5, part 1: introduction to distributed pipelines and PySpark. 20 | 21 | Questions? 22 | 23 | ## Friday, November 14 24 | 25 | Announcements: 26 | 27 | - HW2 due Nov 24 28 | 29 | I made a few clarifications based on fixes from last year! 30 | Please pull to get the latest. 31 | 32 | + [Diff 1](https://github.com/DavisPL-Teaching/119-hw2/commit/f81558e317fbe427367da3b4c5828265ab4085be) 33 | 34 | + [Diff 2](https://github.com/DavisPL-Teaching/119-hw2/commit/7e6beaab93cdf1f30ea1fbc55535c2db9f99208a) 35 | 36 | Plan: 37 | 38 | - Poll 39 | 40 | - Lecture 5, part 2: Properties of RDDs 41 | 42 | - (If time) continue to Lecture 6, part 3: MapReduce. 43 | 44 | Questions? 
45 | 46 | ## Monday, Nov 17 47 | 48 | Reminders: 49 | 50 | - HW2 due Nov 24 (1 week from today) 51 | 52 | OH: today 415pm, Friday 11am, Monday 24th 4:15pm 53 | 54 | Plan: 55 | 56 | - Finish loose ends from Lecture 5, part 2 57 | 58 | - Poll 59 | 60 | - Part 3: introduction to MapReduce. 61 | 62 | Questions? 63 | 64 | ## Wednesday, Nov 19 65 | 66 | Announcements and reminders: 67 | 68 | - HW2 due Monday (in 5 days) 69 | 70 | Partial autograder is now available! 71 | Try it out ahead of time 72 | 73 | It will give you a preliminary score out of a maximum of 28. 74 | (Currently 38 - will be updated after this evening, will be 28.) 75 | 76 | - ICYMI: 77 | 78 | + `github_help.md` file for setting up your own Git repo! 79 | 80 | + `hints.md` has hints for some problems. 81 | 82 | - Some extra MapReduce material in `extras/` - some of this is covered as part of your homework 83 | 84 | Plan: 85 | 86 | - Part 4: Data Partitioning 87 | 88 | - (If time) Part 5 on DataFrames 89 | 90 | Any questions? 91 | 92 | ## Friday, Nov 21st 93 | 94 | Reminders: 95 | 96 | - HW2 due on Monday 11:59pm 97 | 98 | Plan: 99 | 100 | - Narrow and wide operators (finishing up Part 4) 101 | 102 | - Poll 103 | 104 | - DataFrames (finishing up part 5 - this will be covered relatively briefly) 105 | 106 | - Part 6: end note on disadvantages of Spark. 107 | -------------------------------------------------------------------------------- /lecture3/old/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 3: Data Operators 2 | 3 | **This is an old version of the lecture from the Fall 2024 iteration of the course. It has not yet been updated for Fall 2025.** 4 | 5 | ## Oct 14 6 | 7 | Announcements: 8 | 9 | - HW1 Part 1 is available -- due a week from Friday 10 | 11 | https://piazza.com/class/m12ef423uj5p5/post/34 12 | https://github.com/DavisPL-Teaching/119-hw1 13 | 14 | - All the parts are part of the same assignment repository. 15 | Parts 2 and 3 will be released on Wednesday and Friday. 16 | 17 | Plan: 18 | 19 | - Poll 20 | 21 | - Finish loose ends from Lecture 2 22 | 23 | - Start Lecture 3 24 | 25 | ### Poll 26 | 27 | 1. Which is the correct sequence of 3 commands? 28 | 29 | 2. Which of the following are probably correct reason(s) that git requires running 3 commands to publish your code instead of just one? 30 | 31 | https://forms.gle/MpSRmPyWVJfbm3kv6 32 | https://tinyurl.com/bdh3d2zp 33 | 34 | ## Oct 16 35 | 36 | Announcements: 37 | 38 | - HW1 part 2 available! 39 | 40 | + https://github.com/DavisPL-Teaching/119-hw1 41 | 42 | + Please come to office hours! And get started early 43 | 44 | + Part 3 + instructions to submit will be released on Friday. 45 | 46 | + Due Friday, Oct 25. 47 | 48 | - Midterm will likely be moved to week of Nov 4/6/8 -- the date will 49 | be confirmed on Monday, Oct 21. 50 | 51 | Questions about HW1? 52 | 53 | Plan: 54 | 55 | - Start with the poll 56 | 57 | - Note on Python in vscode 58 | 59 | - Continue lecture 3 60 | 61 | ### Poll 62 | 63 | Which of the following are true statements about Pandas data frames? 
64 | 65 | - Every value in a row must have the same type 66 | - Every value in a column must have the same type 67 | - Data frames must have at least one row 68 | - Data frames must have at least one column 69 | - Data frames are 2-dimensional 70 | - Data frames cannot have null values 71 | - Rows must be indexed by integer values 72 | 73 | https://forms.gle/KBenqsqHD6A71Moq8 74 | https://tinyurl.com/27mtfxyn 75 | 76 | ## Oct 18 77 | 78 | Announcements: 79 | 80 | - HW1 fully available: https://piazza.com/class/m12ef423uj5p5/post/46 81 | 82 | - The "project proposal" part has been moved to a later HW; instead, Part 3 83 | is a shorter series of exercises on the shell. 84 | 85 | - To get the latest changes: git add ., git commit -m "message", git pull, **then** resolve any merge conflicts! https://piazza.com/class/m12ef423uj5p5/post/44 86 | 87 | - OH today (starting 415 -- I will stay past 5 if there are still people around 88 | asking questions) 89 | 90 | HW submission: 91 | 92 | - We are now using Gradescope for submission instead of GitHub Classroom. 93 | You should have been added; please see Piazza for the link and HW1 for details. 94 | You can submit either via GitHub or via a zip file. 95 | 96 | - Clarification to late policy: https://piazza.com/class/m12ef423uj5p5/post/47 97 | 98 | - Questions? 99 | 100 | Plan: 101 | 102 | - Poll 103 | 104 | - Continue Lecture 3: 105 | remaining SQL operators, common "gotchas", and survey a more general 106 | view of data processing operators 107 | 108 | ### Poll 109 | 110 | https://forms.gle/TTX5Tvp72AGESsxK8 111 | https://tinyurl.com/mu6e73sx 112 | -------------------------------------------------------------------------------- /lecture2/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 2: The Shell 2 | 3 | ## Monday Oct 13 4 | 5 | A couple of scheduling announcements: 6 | 7 | - **This Wednesday:** 8 | I need to switch lecture & discussion section, and 9 | lecture will be on Zoom. 10 | 11 | So: 12 | + Lecture at 11am on Zoom 13 | ^^ meeting invite is moved in Canvas so you should be able 14 | to find it there. 15 | + Discussion section at 3:10pm in the usual classroom (walker hall) 16 | 17 | - **Midterm date: moved** to Wednesday, November 5 18 | 19 | In class midterm - in this classroom (Walker Hall) 20 | 21 | Reminders: 22 | 23 | - HW1 due this Friday 24 | + Office hours today, Friday! 25 | 26 | Plan: 27 | 28 | - Start with the discussion question & poll. 29 | 30 | - Introduction to the shell! 31 | 32 | - 3-part model for interacting with the shell. 33 | 34 | ## Wednesday Oct 15 35 | 36 | Reminders: 37 | 38 | - Discussion section at 3:10pm in lecture classroom 39 | 40 | - HW1 due Friday 41 | 42 | Plan: 43 | 44 | - 3-part model for interacting with the shell. 45 | 46 | Questions? 47 | 48 | ## Friday Oct 17 49 | 50 | Reminders: 51 | 52 | - HW1 due today 53 | 54 | Good luck! 55 | Please keep the questions coming on Piazza! We will try to monitor leading 56 | up to the deadline 57 | 58 | Autograder score: 59 | 60 pts available now, remainder will be graded with 60 | the full version of the autograder after the deadline. 61 | 62 | Plan: 63 | 64 | - 3-part model for interacting with the shell. 65 | 66 | - We will aim to finish Lecture 2 today and Monday. 67 | 68 | Questions? 69 | 70 | ## Monday Oct 20 71 | 72 | Announcements: 73 | 74 | - Canvas and Piazza outage today :-) 75 | 76 | + Canvas still down, Piazza is back up 77 | 78 | - My office hours are cancelled today! 
(I have to make a different appointment) 79 | 80 | **If you need to reach me:** I can be available for Zoom hours, I plan to hold these hours either today or tomorrow evening or Wednesday late afternoon. Please email me if you would like to attend office hours at one of these times and let me know which of these times you would be available. 81 | 82 | - A quick note about uploading to Gradescope 83 | 84 | + Please do try to download your code and run it! 85 | 86 | + Part of why we are doing all of this work on the shell is to get you used to figuring out how/why 87 | programs do and don't run :-) 88 | 89 | + It is your responsibility to ensure that the autograder runs on your code. 90 | 91 | Plan: 92 | 93 | - Start with discussion question & poll 94 | 95 | - Help commands and doing stuff commands 96 | 97 | - Hope to get through most/all of remainder of lecture 2. 98 | 99 | Questions? 100 | 101 | ## Wednesday Oct 22 102 | 103 | Announcements: 104 | 105 | - Full disclosure: we are a bit behind from where I wanted to be 106 | at this point in the quarter! 107 | 108 | I know I said we would try to finish the shell lecture last time, 109 | but there are a couple of topics left that I want to cover 110 | and these are part of the material for the midterm. 111 | 112 | - I will discuss more details about the midterm on Friday. 113 | 114 | Questions? 115 | 116 | Plan: 117 | 118 | - Poll 119 | 120 | - Git 121 | 122 | - Dangers of the shell 123 | 124 | - Overview of advanced topics / things we didn't cover. 125 | -------------------------------------------------------------------------------- /lecture1/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 1: Introduction to data processing pipelines 2 | 3 | ## Friday, September 26 4 | 5 | Announcements 6 | 7 | - Homework 0 is now available, due Wednesday, October 1. 8 | https://forms.gle/XVDDVZPJNRZKhrN36 9 | 10 | - This is an installation help homework. Please come to office hours and 11 | discussion section to get help! 12 | + Monday OH after class 13 | + Wednesday discussion section will cover installation help 14 | 15 | - **There will be no class on Wednesday (October 1)** as I will be away at a conference. 16 | 17 | - Enrollment updates 18 | + Currently 90 students enrolled, 13 on waitlist 19 | + I have received some questions from those of you on the waitlist - more on this in the slides! 20 | 21 | Plan for today: 22 | 23 | 1. Finish Lecture 0 slides (syllabus overview) 24 | 25 | 2. In-class poll 26 | 27 | 3. Following along with lectures 28 | 29 | 4. Begin Lecture 1: Introduction to data processing pipelines. 30 | 31 | ## Monday, September 29 32 | 33 | Reminders: 34 | 35 | - **No class on Wednesday** - HW0 due on Wednesday 11:59pm 36 | 37 | - OH after class today, discussion section to get installation help! 38 | 39 | + Windows issues: use WSL, downgrade to Java 11 40 | https://piazza.com/class/mfvn4ov0kuc731/post/22 41 | 42 | Plan for today: 43 | 44 | - Clone repo to follow along! 45 | 46 | - Discussion question / in-class poll 47 | 48 | - Continue Lecture 1 on ETL and dataflow graphs. 49 | 50 | Questions about HW0 or plan for today? 51 | 52 | ## Friday, October 3 53 | 54 | Thanks to the TA for running installation help office hours - hopefully you have been able to get your installation issues resolved! 55 | 56 | - As of yesterday, most of you have everything working up to PySpark (with one or two still working on PySpark). You will need PySpark working for the latter part of the course. 
57 | We found that: 58 | - for Windows, Java 11, Python 3.12.3, Pyspark 3.5.3 work 59 | - for Mac/WSL: Java 21 or 22 works. 60 | 61 | Plan for today: 62 | 63 | - Start with poll 64 | 65 | - From ETL to Dataflow graphs 66 | 67 | - A more realistic example to go through 68 | 69 | - (If time) Failures and risks 70 | 71 | ## Monday, October 6 72 | 73 | Announcements: 74 | 75 | - HW1 is released! Due: Friday, October 17 76 | 77 | https://github.com/DavisPL-Teaching/119-hw1 78 | 79 | Please get started early! 80 | 81 | - I will need to end OH early today at 5pm 82 | 83 | Plan: 84 | 85 | - Practice with Dataflow Graphs :) 86 | 87 | - A little bit about data validation 88 | 89 | - Measuring performance 90 | 91 | ## Wednesday, October 8 92 | 93 | Reminders: 94 | 95 | - HW1 due Fri Oct 17 96 | 97 | Announcements: 98 | 99 | - ~~Midterm date: tenatively set for **Friday, November 7**~~ 100 | 101 | - Discussion section Zoom link/recording will be available in Canvas going forward! 102 | 103 | - Waitlist update 104 | 105 | Plan: 106 | 107 | - Start with discussion question 108 | 109 | - Talk about performance 110 | 111 | - (If time) Talk about failures and risks in pipelines 112 | 113 | - We will finish up Lecture 1 today and Friday and then move to Lecture 2 on the Shell. 114 | 115 | ## Friday, October 10 116 | 117 | Announcements: 118 | 119 | - Waitlist update: 120 | They were able to admit a few students off the waitlist. I was told that they emailed out PTAs. 121 | 122 | - Reminder: HW1 due Fri Oct 17 123 | 124 | - Questions on Piazza -- also, join the student-run Discord! 125 | 126 | - HW1 prelminary autograder available! 127 | 128 | Plan: 129 | 130 | - Poll (and a quick check-in on pacing) 131 | 132 | - Finish performance discussion. 133 | 134 | Questions? 135 | -------------------------------------------------------------------------------- /lecture6/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 6: Streaming Pipelines 2 | 3 | This is the last "full" lecture! 4 | 5 | ## Monday, November 24 6 | 7 | HW2 due today! 8 | 9 | - Time limit for part 3: 30 minutes total 10 | (autograder will run for 40 minutes on the whole submission) 11 | 12 | Try it out and please let us know if you are having any difficulties! 13 | 14 | - Max autograder score: 28/28 15 | 16 | - Also check out: https://piazza.com/class/mfvn4ov0kuc731/post/126 17 | if you are running into a runtime error related to the Spark Context. 18 | 19 | I have OH after class today 20 | 21 | Plan: 22 | 23 | - Introduction to streaming pipelines. 24 | 25 | Questions about HW2? 26 | 27 | ## Wednesday, November 26 28 | 29 | Happy Thanksgiving! 30 | 31 | Announcements 32 | (see [Piazza](https://piazza.com/class/mfvn4ov0kuc731/post/145)): 33 | 34 | 1. No homework 3 35 | 36 | 2. Homework make-up option 37 | 38 | 3. Course evaluations open this Friday 39 | 40 | I will need to end lecture 5-10 minutes early today. 41 | 42 | Lecture today: 43 | 44 | - Part 2 on microbatching. 45 | 46 | - Part 3 on time. 47 | 48 | Questions? 49 | 50 | ## Wednesday, Dec 3 51 | 52 | Announcements: 53 | 54 | - ALL in-class polls due this Friday 11:59pm 55 | 56 | - Course eval due this Friday 11:59pm - you should have received by email 57 | 58 | - HW2 grades were released 59 | 60 | + Autograder timeout issue - please submit on Gradescope! 
61 | 62 | + Part 2 part2.png naming issue - please submit on Gradescope also (will reduce to a -5 pt deduction) 63 | 64 | - HW make-up option: 65 | 66 | + HW1 make-up option due by Friday 11:59pm 67 | 68 | + HW2 make-up option and regrades due by Monday 11:59pm 69 | 70 | + Pick one of the two if you received < 90 score, 2/3 of the points back. 71 | 72 | - Final study list: see `exams/final_study_list.md`. 73 | 74 | - Final details: `exams/final.md` 75 | 76 | - OH after class today on Zoom. 77 | 78 | Plan: 79 | 80 | - Poll 81 | 82 | - Go over different notions of time (Part 3) 83 | 84 | - Go over details for the final and study guide. 85 | 86 | Questions? 87 | 88 | ## Friday, Dec 5 89 | 90 | Last day of class! 91 | 92 | Announcements: 93 | 94 | - All in-class polls are due by EOD today! The final poll will occur in class today and then I will sync the polls to Canvas one more time (after class) prior to the deadline. 95 | 96 | - Final: next Thursday (Dec 11) at 8-10am! 97 | See `exams/` and Piazza for final study materials and the practice final. 98 | 99 | - I hope that we have resolved the HW2 grading issues! 100 | I apologize for the stress. :-) 101 | Please fill out a regrade request if we haven't resolved your case. 102 | HW make-up option: HW1 due today 11:59pm, HW2 due Monday 11:59pm. 103 | 104 | - I will hold OH either on 105 | + Monday at 415pm - on Zoom 106 | + Tuesday at 445pm - in person 107 | (or both) 108 | If you plan to attend, please let me know if you plan to attend one of these two times! 109 | 110 | Plan: 111 | 112 | - Review & finish Part 3 on time 113 | 114 | - Poll on different types of time 115 | 116 | - End notes & wrapping up Lecture 6. 117 | 118 | If there is time: 119 | 120 | - A note on HW2 latency & throughput graphs! 121 | And an easy mistake to make in Python 122 | 123 | - Since I am skipping Lecture 7, we will likely have some extra time at the end, I can answer questions or go over practice questions from the practice final. 124 | 125 | - Lecture 7 will not appear on the final, 126 | but you are welcome to review the materials in lecture7/ on your own time! 127 | It is a crash course covering the basics of AWS services: 128 | 129 | - AWS S3, 130 | - AWS EC2, and 131 | - AWS Lambda. 132 | 133 | Questions? 134 | -------------------------------------------------------------------------------- /lecture1/parts/throughput_latency.py: -------------------------------------------------------------------------------- 1 | """ 2 | Throughput and latency calculation example. 3 | 4 | Throughput: Number of items processed per unit time 5 | Latency: Time taken to process a single item 6 | """ 7 | 8 | from lecture import pipeline, get_life_expectancy_data 9 | 10 | """ 11 | Timeit: 12 | https://docs.python.org/3/library/timeit.html 13 | 14 | Library that allows us to measure the running time of a Python function 15 | 16 | Example syntax: 17 | timeit.timeit('"-".join(str(n) for n in range(100))', number=10000) 18 | """ 19 | import timeit 20 | 21 | IN_FILE_THROUGHPUT = "life-expectancy.csv" 22 | IN_FILE_LATENCY = "life-expectancy-row1.csv" 23 | OUT_FILE = "output.csv" 24 | 25 | def throughput(num_runs): 26 | """ 27 | Measure the throughput of our example pipeline, in data items per second. 
28 | 29 | We need to run it a bunch of times to get an accurate number, 30 | that's why we're taking a num_runs parameter and passing it to 31 | timeit 32 | 33 | Updated formula if we run multiple times 34 | N = input dataset size 35 | T = total running time 36 | num_runs = number of times I ran the pipeline 37 | Throughput = 38 | N / (T / num_runs) 39 | """ 40 | 41 | print(f"Measuring throughput over {num_runs} runs...") 42 | 43 | # Number of items 44 | num_items = len(get_life_expectancy_data(IN_FILE_THROUGHPUT)) 45 | 46 | # ^^ Notice I'm doing this cmoputation outside of the actual measurement 47 | # (timeit) code - if you start measuring things as you're computing 48 | # them, this leads to distortion of the results 49 | # (Kind of like Quantum Mechanics - if you measure it, it changes) 50 | 51 | # Function to run the pipeline 52 | def f(): 53 | pipeline(IN_FILE_THROUGHPUT, "output.csv") 54 | 55 | # Measure execution time 56 | execution_time = timeit.timeit(f, number=num_runs) 57 | 58 | # Print and return throughput 59 | # Display to the nearest integer 60 | throughput = num_items * num_runs / execution_time 61 | print(f"Throughput: {int(throughput)} items/second") 62 | return throughput 63 | 64 | def latency(num_runs): 65 | """ 66 | Measure the average latency of our example pipeline, in seconds. 67 | 68 | (In today's poll: we saw an example where we measured latency 69 | not just with one input row) 70 | 71 | *Key point:* 72 | We use a one-row version of the pipeline to measure latency 73 | """ 74 | 75 | print(f"Measuring latency over {num_runs} runs...") 76 | 77 | # Function to run the pipeline 78 | def f(): 79 | pipeline(IN_FILE_LATENCY, "output.csv") 80 | 81 | # Measure execution time 82 | execution_time = timeit.timeit(f, number=num_runs) 83 | 84 | # Print and return latency (in milliseconds) 85 | # Display to 5 decimal places 86 | latency = execution_time / num_runs * 1000 87 | print(f"Latency: {latency:.5f} ms") 88 | return latency 89 | 90 | if __name__ == "__main__": 91 | throughput(1000) 92 | latency(1000) 93 | 94 | """ 95 | Observations? 96 | 97 | ~4.1M items/second 98 | Pandas is very fast! 99 | (We are just doing max/min/avg, probably things would slow 100 | down for something more complicated) 101 | 102 | Somewhat stable across runs - this is because we run the 103 | pipeline always multiple times 104 | 105 | (Lesson: always run multiple times for best practice) 106 | 107 | Is throughput always constant with the number of input items? 108 | No! 109 | 110 | (Try deleting items from input dataset - what happens?) 111 | 112 | Latency: about ~0.7 ms 113 | 114 | Latency != 1 / Throughput 115 | 116 | Latency is much greater for a dataset with one row. 117 | """ 118 | -------------------------------------------------------------------------------- /lecture4/extras/data-race-example.py: -------------------------------------------------------------------------------- 1 | """ 2 | A more extended data race example - related to the Oct 31 in-class poll - 3 | that discusses the relationship between data races and UB. 4 | 5 | This is a subtle topic! 6 | TL;DR: The following are both correct: 7 | 8 | - x will always between 2 and 200 in Python using multiprocess (multiprocess.Array and multiprocess.Value) 9 | 10 | - x could be any integer value making no assumptions about the programming language implementation of += or the underlying architecture. 
11 | 12 | The latter bullet (any value) is also the technically correct answer in C/C++, where data races are something called "undefined behavior", meaning that the compiler is allowed to compile your code to do something entirely different than what you said. 13 | (And it may depend on the compiler!) 14 | 15 | In a language with undefined behavior: 16 | 17 | *if there is a read and a write concurrently to the same data (or two concurrent writes), the value of that data is indeterminate.* 18 | 19 | Longer discussion: 20 | 21 | Try running the example! What happens? 22 | 23 | As some of you suspected, in our Python implementation (using multiprocess), it happens to be the case that: 24 | 25 | - when run, the result is a value between 100 and 200 in all test runs 26 | (often 10 or 20 for the N = 10 case -- sometimes some value in between!) 27 | 28 | - we do observe a data race 29 | (if there were no data race, all += 1s would happen, and the value would always end up at 200) 30 | 31 | So why did I say that x can be *any* value? 32 | 33 | This has to do with what assumptions we are making about how Python represents integers 34 | -- for example, assuming that it consistently handles reads and writes to integers (+= consistently). 35 | This assumption is very dangerous! It is generally *not* true when using operations that aren't designed to be safely called in a concurrent programming context, such as big integers. 36 | (A Python integer is not just a single byte, it contains multiple bytes! That makes it something called a big integer. It may be the case that a read and a write to a Python integer at the same time don't just execute one after the other, but actually completely invalidate that integer, by modifying the different bytes in different ways -- or even, moving the integer somewhere else in memory.) 37 | 38 | This actually occurs and is not just a hypothetical: 39 | 40 | - It will occur in Python when using data structures more than just a single byte, if not protected by a shared lock 41 | 42 | - It will occur in any data structure implementation in C, C++, or Java or any implementation that stores raw pointers in memory. 43 | 44 | That's why the way I want you to think about data races in this class 45 | is with the "simplified view" above. 46 | That is, data races cause data to be invalidated and it could represent any value after the race occurs. 47 | 48 | Further reading: 49 | 50 | - Why += 1 is so-called "undefined behavior" (UB) in C: https://stackoverflow.com/a/39396999/2038713 51 | 52 | - More on "undefined behavior" and why: 53 | https://news.ycombinator.com/item?id=16247958 54 | https://davmac.wordpress.com/2018/01/28/understanding-the-c-c-memory-model/ 55 | 56 | - Data races in Python: https://verdagon.dev/blog/python-data-races 57 | which does not discuss the UB issue. 
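Side note (a sketch, not part of the demo code below, which deliberately keeps
the race): the standard multiprocessing fix is to use a synchronized Value and
hold its lock around the read-modify-write:

    from multiprocessing import Value
    x = Value(ctypes.c_uint64, 0)    # unlike RawValue, this carries a lock

    def worker(x):
        for i in range(N):
            with x.get_lock():       # the += is now done atomically
                x.value += 1

With both workers written like this, the data race goes away and the final
result is always 2 * N (200 for the N = 100 default).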
58 | """ 59 | 60 | import ctypes 61 | from multiprocessing import RawValue, Process, freeze_support 62 | 63 | N = 100 64 | 65 | # Comment out other examples to try: 66 | # N = 1 67 | # N = 10 68 | # N = 10_000_000 69 | 70 | def worker1(x): 71 | for i in range(N): 72 | x.value += 1 73 | 74 | def worker2(x): 75 | for i in range(N): 76 | x.value += 1 77 | 78 | if __name__ == "__main__": 79 | # Guard to ensure only one worker runs the main code 80 | freeze_support() 81 | 82 | # Set up the shared memory 83 | start = 0 84 | x = RawValue(ctypes.c_uint64, start) 85 | print(f"Start value: {x.value}") 86 | 87 | # Run the two workers 88 | p1 = Process(target=worker1, args=(x,)) 89 | p2 = Process(target=worker2, args=(x,)) 90 | p1.start() 91 | p2.start() 92 | p1.join() 93 | p2.join() 94 | 95 | # Get the result 96 | print(f"Final result: x = {x.value}") 97 | -------------------------------------------------------------------------------- /lecture5/extras/cut.py: -------------------------------------------------------------------------------- 1 | 2 | """ 3 | === Cut material === 4 | 5 | ===== Cut material on MapReduce ===== 6 | 7 | Some of this material will appear as exercises on HW2. 8 | 9 | The above is written very abstractly, what does it mean? 10 | 11 | Let's walk through each part: 12 | 13 | Keys: 14 | 15 | The first thing I want to point out is that all the data is given as 16 | (key, value) 17 | 18 | pairs. (K1 and K2) 19 | 20 | Generally speaking, we use the first coordinate (key) for partitioning, 21 | and the second one to compute values. 22 | 23 | Ignore the keys for now, we'll come back to that. 24 | 25 | Map: 26 | map: (K1, T1) -> list((K2, T2)) 27 | 28 | - we might want to transform the data into a different type 29 | T1 and T2 30 | - we might want to output zero or more than one output -- why? 31 | list(K2, T2) 32 | 33 | Examples: 34 | (write pseudocode for the corresponding lambda function) 35 | 36 | - Compute a list of all Carbon-Fluorine bonds 37 | 38 | - Compute the total number of Carbon-Fluorine bonds 39 | 40 | - Compute the average of the ratio F / C for every molecule that has at least one Carbon 41 | (our original example) 42 | 43 | In Spark: 44 | https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.flatMap.html 45 | """ 46 | 47 | def map_general_ex(rdd, f): 48 | # TODO 49 | raise NotImplementedError 50 | 51 | # Uncomment to run 52 | # map_general_ex() 53 | 54 | """ 55 | What about Reduce? 56 | 57 | reduce: (K2, list(T2)) -> list(T2) 58 | 59 | The following is a common special case: 60 | 61 | reduce_by_key: (T2, T2) -> T2 62 | 63 | Reduce: 64 | - data has keys attached. Keys are used for partitioning 65 | - we aggregate the values *by key* instead of over the entire dataset. 66 | 67 | Examples: 68 | (write the corresponding Python lambda function) 69 | (you can use the simpler reduce_by_key version) 70 | 71 | - To compute a total for each key? 72 | 73 | - To compute a count for each key? 74 | 75 | - To compute an average for each key? 76 | 77 | - To compute an average over the entire dataset? 78 | 79 | Important note: 80 | K1 and K2 are different! Why? 
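One way to answer (a sketch, with a made-up data shape): the map stage is free
to *re-key* the data, so the reduce stage can group by a different key than
the one the input arrived with. For example, if the input is keyed by molecule
ID (K1) and the value is a list of element symbols, the map can re-key by
element name (K2) so that the reduce stage produces per-element totals:

    counts = (rdd
        .flatMap(lambda kv: [(elem, 1) for elem in kv[1]])
        .reduceByKey(lambda a, b: a + b))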
81 | 82 | In Spark: 83 | 84 | https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.reduceByKey.html 85 | """ 86 | 87 | def reduce_general(rdd, f): 88 | # TODO 89 | raise NotImplementedError 90 | 91 | """ 92 | Finally, let's use our generalized map and reduce functions to re-implement our original task, 93 | computing the average Fluorine-to-Carbon ratio in our chemical 94 | dataset, among molecules with at least one Carbon. 95 | """ 96 | 97 | def fluorine_carbon_ratio_map_reduce(data): 98 | # TODO 99 | raise NotImplementedError 100 | 101 | """ 102 | ===== Understanding latency (abstract) ===== 103 | (review if you are interested in a more abstract view) 104 | 105 | Why isn't optimizing latency the same as optimizing for throughput? 106 | 107 | First of all, what is latency? 108 | Imagine this. I have 10M inputs, my pipeline processes 1M inputs/sec (pretty well parallelized. 109 | 110 | (input) --- (process) 111 | 10M 1M / sec 112 | 113 | Latency is always about response time. 114 | To have a latency you have to know what change or input you're making to the system, 115 | and what result you want to observe -- the latency is the difference (in time) 116 | between the two. 117 | 118 | Latency is always measured in seconds; it's not a time "per item" or dividing the total time by number 119 | of items (that doesn't tell us how long it took to see a response for each individual item!) 120 | 121 | So, what's the latency if I just put 1 item into the pipeline? 122 | (That is, I run on an input of just 1 item, and wait for a response to come back)? 123 | 124 | We can't say! We don't know for example, whether the 125 | 1M inputs/sec means all 1M are processed in parallel, and they all take 1 second, 126 | or the 1M inputs/sec means that 127 | it's really 100K inputs every tenth of a second at a time. 128 | 129 | This is why: 130 | Latency DOES NOT = 1 / throughput 131 | in general and it's also why optimizing for throughput doesn't always benefit latency 132 | (or vice versa). 133 | We will get to this more in Lecture 6. 134 | """ 135 | -------------------------------------------------------------------------------- /exams/midterm_study_list.md: -------------------------------------------------------------------------------- 1 | # Midterm Study List 2 | 3 | Study list of topics for the midterm. 4 | 5 | **The midterm will cover Lecture 1, Lecture 2, and Lecture 4 up through Data Parallelism (in Part 4).** 6 | 7 | You should know all of the following concepts, but I won't test you on syntax. 8 | - For example, you won't be asked to write code on the exam, 9 | but you might be asked to explain how you might to dome task in words 10 | or asked to calculate how much time a task would take given some 11 | assumptions about how long the parts take. 
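  (A hypothetical example of that kind of calculation: if each item takes 2 ms
  to process and 10,000 items are split evenly across 4 workers running in
  parallel, the work takes roughly (10,000 x 2 ms) / 4 = 5 seconds, versus
  about 20 seconds on a single worker.)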
12 | 13 | ## Lecture 1 (Introduction to data processing) 14 | 15 | - Data pipelines: the ETL model 16 | 17 | + Know each stage and why it exists 18 | 19 | - Dataflow Graphs 20 | 21 | + Know how to draw a dataflow graph 22 | 23 | + Definition of edges & dependencies 24 | 25 | + Definition and application of: sources, data operators, sinks 26 | 27 | + Definition of an operator 28 | 29 | + Relation between ETL and dataflow graphs 30 | 31 | - Coding practices: Python classes, functions, modules, unit tests 32 | 33 | - Performance: 34 | 35 | Latency and throughput: 36 | concepts, definitions, formulas and how to calculate them 37 | 38 | - Pandas: what a DataFrame is and basic properties 39 | 40 | (some number of rows and columns) 41 | 42 | (DataFrame = table) 43 | 44 | - Also know: 45 | 46 | + Know a little bit about data validation: things that can go wrong in your input! 47 | 48 | (check for null values, check that values are positive) 49 | 50 | + Some things about exploration time vs. development time: 51 | I might ask you what step(s) you might do to explore a dataset 52 | after reading it in to Python or Pandas 53 | 54 | + Know the following term: vectorization 55 | 56 | ## Lecture 2 (the Shell) 57 | 58 | - "Looking around, getting help, doing stuff" model 59 | 60 | + 3 types of commands 61 | 62 | + Underlying state of the system 63 | 64 | (current working directory, file system, environment variables) 65 | 66 | - Basic command purposes (I won't test you on syntax!) 67 | ls, cd, cat, git, python3, echo, man, mkdir, touch, open, rm, cp 68 | $PATH, $PWD 69 | 70 | - shell commands can be run from Python and vice versa 71 | 72 | - hidden files and folders, environment variables 73 | 74 | - platform dependence; definition of a "platform" 75 | 76 | platform = operating system, architecture, any installed software requirements or depdencies 77 | 78 | - named arguments vs. positional arguments 79 | 80 | - Git, philoslophy of git 81 | 82 | - dangers of the shell, file deletion, concept of "ambient authority" 83 | 84 | ## Lecture 4 (Parallelism) 85 | 86 | Part 1: 87 | 88 | - Scaling: scaling in input data or # of tasks; Pandas does not scale 89 | 90 | Part 2: 91 | 92 | - Definitions of parallelism, concurrency, and distribution; 93 | how to identify these in tasks; conveyer belt analogy 94 | 95 | - Should be able to give an example scenario using these terms, or an example 96 | scenario satisfying soem of parallelism, concurrency, and distribution but not 97 | the others 98 | 99 | - How parallelism speeds up a pipeline; be able to estimate how fast a program would take given some assumptions about how long it takes to process each item and how 100 | it is parallelized 101 | 102 | Part 3: 103 | 104 | - Know terminology: Concurrency, race condition, data race, undefined behavior, contention, deadlock, consistency, nondeterminism 105 | 106 | - "Conflicting operations": for example, a read/write are conflicting, two writes are conflicting, two reads are not conflicting as they can both occur simultaneously (this is also part of the definition of a data race) 107 | 108 | - Know possible concurrent executions of a program: you won't be asked a scenario as complex as the poll on Oct 31. However, you might be asked about a simpler program, like one with two workers, one increments x += 1, the other implements x += 1 twice, etc. 109 | 110 | - Why we avoid concurrency in data pipelines: difficult to program with; no one 111 | should write any program with data races in it. 
112 | 113 | Part 4: 114 | 115 | - Motivation: is parallelism present? how much parallelism? 116 | Want to parallelize without writing concurrent code ourselves 117 | 118 | - Types of parallelism: task parallelism, data parallelism 119 | 120 | - How to identify each of these in the dataflow graph. 121 | -------------------------------------------------------------------------------- /exams/poll_answers.md: -------------------------------------------------------------------------------- 1 | # In-class poll answers 2 | 3 | Sep 24: 4 | Tools required and characteristics and needs of your application will change drastically with the size of the dataset. 5 | 6 | Sep 26: 7 | N/A 8 | 9 | Sep 29: 10 | 1) One possible answer: input file does not exist 11 | 2) No, because the maximum row of a dataset is not always unique. 12 | 13 | Oct 3: 14 | All except B ("Will speed up the development of a one-off script") 15 | 16 | Oct 6: 17 | 1) Edges from: 18 | read -> max, min, and avg 19 | max, min, and avg -> print 20 | max, min, and avg -> save 21 | 22 | 2) read -> print or read -> save (give a specific example) 23 | 24 | Oct 8: 25 | True, False, False. 26 | 27 | Oct 10: 28 | 1) Throughput = 1,000 records/hour assuming the full pipeline is measured from 9am to 9pm. 29 | 2) 30 minutes on average 30 | 3) From the perspective of the patient (individual row level): uniformly distributed between 0min and 60min delay. 31 | 32 | Oct 13: 33 | 1) hrs, ms, s, ns 34 | 2) F F F T T F 35 | 36 | Oct 15: 37 | Correct answers: 1, 2, 3, 4, 5, and 7 (all except 6: "To load & use pandas to calculate the max and average of a DataFrame") 38 | 39 | Oct 17: 40 | 1) B, C, and D 41 | 2) B, C, D, E, and F (all except "A: A python3 'Hello, world!' program works only on certain operating systems") 42 | 43 | Oct 20: 44 | ls, ls -alh, echo $PATH, python3 --version, conda list, git status, cat, less 45 | 46 | Oct 22: 47 | 1) Some possible answers: `cd folder/`, `cp file1.txt file2.txt` 48 | 2) Some possible answers: `ls -alh`, `python3 --version` 49 | 50 | Oct 24: 51 | B, E, and F. 52 | 53 | Oct 27: 54 | 1, 2: no one correct answer, most answers were clustered around 8 GB or 16 GB 55 | 3: Pandas requires 5-10x the amount of RAM as your dataset, so for 16 GB you should have gotten 1.6 GB to 3.2 GB for the largest dataset you can handle. 56 | 57 | Oct 29: 58 | 1. Parallel 59 | 2. Parallel + Concurrent 60 | 3. Parallel + Concurrent 61 | 4. Concurrent + Distributed 62 | 5. Parallel + Distributed 63 | 6. Parallel + Concurrent + Distributed 64 | 65 | Oct 31: 66 | 1. Intended answer was between 0 and 200; 67 | to be fully precise, the correct answer should be any value between 2 and 200 (inclusive). 68 | 69 | 2. ABDEFG 70 | Concurrency, Parallelism, Contention, Race Condition, Data Race - 71 | and Spooky Halloween Vibes, because data races are scary!! :-) 72 | 73 | Nov 3: 74 | Data parallelism (at tasks 1 and 2) 75 | No task parallelism 76 | Pipeline parallelism from 1 -> 2 and 2 -> 3. 77 | (Pipeline parallelism will not appear on midterm) 78 | 79 | Nov 7: 80 | 1. Data parallelism: **single node** 81 | Task parallelism: **between a pair of nodes** 82 | Pipeline parallelism: **between a pair of nodes** 83 | 84 | 2. Answer is yes. For instance, splitting one node "task" into two separate tasks could reveal additional pipeline and task parallelism that would not be present in the graph. 85 | 86 | Nov 10: 87 | T = 300 ms 88 | S = 3 ms 89 | Speedup <= T / S = 100x. 90 | Maximum speedup is 100x (same as the # of data items - this is not a coincidence).
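(Not part of the original answer key: a quick arithmetic check of the Nov 10 numbers, with T and S as given above -- the total time vs. the longest fully-sequential part.)

```python
T_ms = 300  # total work/time
S_ms = 3    # sequential (critical-path) part
print(T_ms / S_ms)  # 100.0 -> the speedup is bounded by 100x
```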
91 | 92 | Nov 12: 93 | C and E: CPU cores & RAM available 94 | 95 | Nov 14: 96 | 5, 6, and 8. 97 | Note that, depending on assumptions about how regular the timestamps are or if the input data is sorted by row #, it may be possible to make these data-parallel also. 98 | It is just not quite as straightforward as the others. 99 | 100 | Nov 17: 101 | 1: Lazy, 2: Lazy, 3: Not Lazy 102 | Bonus: 5ms + 5ms + 5ms = 15 ms. 103 | 104 | Nov 19: 105 | Multiple solutions are possible 106 | Map stage should describe a map on each input row (T1) to an output row (T2) 107 | Reduce stage should describe how to combine two output rows (two T2s, get a single T2) 108 | Example solution: 109 | Map stage: map each row to (city name, avg_temp / population) 110 | Reduce stage: for (city1, ratio1), (city2, ratio2), return (city3, ratio3) where ratio3 = max(ratio1, ratio2) and city3 is the corresponding city. 111 | 112 | Nov 21: 113 | Narrow: 1, 4, 5, 7 114 | Wide: 2, 3, 6 115 | 116 | Nov 24: 117 | C and I only: "They are not optimized for latency" and "Reduce functions may be applied out-of-order" 118 | 119 | Nov 26: 120 | B (serving GPT), D (high frequency trading), E (login), F (order qty) 121 | 122 | Dec 3: 123 | 1. 2ms 124 | 3. Yes, the latency would be higher (specifically, around 4.5ms on average). 125 | 126 | Dec 5: 127 | 1. Event Time 128 | 2. Logical Time 129 | 3. System Time 130 | 4. Real Time 131 | 5. Event Time 132 | 6. Logical Time + System Time 133 | 7. Logical Time + System Time 134 | 8. Real Time + Logical Time 135 | 9. Event Time 136 | 10. Event Time + Logical Time 137 | -------------------------------------------------------------------------------- /lecture2/parts/6-conclusion.py: -------------------------------------------------------------------------------- 1 | """ 2 | Part 6: Dangers of the Shell (and a few Loose Ends) 3 | 4 | We covered this part briefly at the end of Wednesday, October 22. 5 | 6 | Finishing up the shell: 7 | 8 | === Dangers of the shell === 9 | 10 | The shell has something called "ambient authority" 11 | which is a term from computer security basically meaning that 12 | you can do anything that you want to, if you just ask. 13 | 14 | Be aware! 15 | 16 | - rm -f part1.py -- permanently delete your code (and changes), 17 | no way to recover 18 | rm -- remove 19 | -f: force removal (don't ask first) 20 | -r: remove all subfiles and subdirectories 21 | 22 | - rm -rf "/" 23 | 24 | removes all files on the system. 25 | 26 | Many modern systems will actually complain if you try to do this. 27 | 28 | - bash fork bomb :-) 29 | 30 | :(){ :|:& };: 31 | 32 | """ 33 | 34 | def rm_rf_slash(): 35 | raise RuntimeError("This command is very dangerous! If you are really sure you want to run it, you can comment out this exception first.") 36 | 37 | # Remove the root directory on the system 38 | subprocess.run(["rm", "-rf", "/"]) 39 | 40 | # rm_rf_slash() 41 | 42 | """ 43 | sudo: run a command in true "admin" mode 44 | 45 | sudo rm -rf / 46 | ^^^^^^^^^^^^^ Delete the whole system, in administrator mode 47 | """ 48 | 49 | """ 50 | Aside: This is part of what makes the shell so useful, but it is also 51 | what makes the shell so dangerous! 52 | 53 | All shell commands are assumed to be executed by a "trusted" user. 54 | It's like the admin console for the computer. 55 | 56 | Example: 57 | person who gave an LLM agent access to their shell: 58 | https://twitter.com/bshlgrs/status/1840577720465645960 59 | 60 | "At this point I was amused enough to just let it continue. 
Unfortunately, the computer no longer boots." 61 | """ 62 | 63 | # sudo rm -rf "/very/important/operating-system/file" 64 | 65 | """ 66 | =============== Closing material (discusses advanced topics and recap; feel free to review on your own time!) =============== 67 | 68 | === What is the Shell? (revisited) === 69 | 70 | The shell IS: 71 | 72 | - the de facto standard for interacting with real systems, 73 | including servers, supercomputers, and even your own operating system. 74 | 75 | - a way to "glue together" different programs, by chaining them together 76 | 77 | The shell is NOT (necessarily): 78 | 79 | - a good way to write complex programs or scripts (use Python instead!) 80 | 81 | - free from errors (it is often easy to make mistakes in the shell) 82 | 83 | - free from security risks (rm -rf /) 84 | 85 | === Q+A === 86 | 87 | Q: How is this useful for data processing? 88 | 89 | A: Many possible answers! In decreasing order of importance: 90 | 91 | - Interacting with software dev tools (like git, Docker, and package managers) 92 | -- many tools are built to be accessed through the shell. 93 | 94 | - Give us a better understanding of how programs run "under the hood" 95 | and how the filesystem and operating system work 96 | (this is where almost all input/output happens!) 97 | 98 | - Gives you another option to write more powerful functions in Python 99 | by directly calling into the shell (subprocess) 100 | (e.g. fetching data with git; connecting to a 101 | database implementation or a network API) 102 | 103 | - Writing quick-and-dirty data processing scripts direclty in the shell 104 | (Common but we will not be doing this in this class). 105 | 106 | Example: Input as a CSV, filter out lines that are not relevant, and 107 | add up the results to sort by most common keywords or labels. 108 | 109 | Q: How is the shell similar/different from Python? 110 | 111 | A: Both of these are useful "glue" languages -- ways to 112 | connect together different programs. 113 | 114 | Python is more high-level, and the shell is more like what happens 115 | under the hood. 116 | 117 | Knowing the shell can improve your Python scripts and vice versa. 118 | 119 | === Some skipped topics === 120 | 121 | Things we didn't cover: 122 | 123 | - Using the shell for cleaning, filtering, finding, and modifying files 124 | 125 | + cf.: grep, find, sed, awk 126 | 127 | - Regular expressions for pattern matching in text 128 | 129 | === Miscellaneous further resources === 130 | 131 | Future of the shell paper: 132 | 133 | - https://dl.acm.org/doi/pdf/10.1145/3458336.3465296 134 | 135 | Regular expressions 136 | (for if you are using grep or find): 137 | 138 | - Regex debugger: https://regex101.com/ 139 | 140 | - Regex explainer: https://regexr.com/ 141 | 142 | Example to try for a URL: [a-zA-Z]+\\.[a-z]+( |\\.|\n) 143 | 144 | End. 145 | """ 146 | -------------------------------------------------------------------------------- /lecture5/parts/5-dataframes.py: -------------------------------------------------------------------------------- 1 | """ 2 | Part 5: DataFrames 3 | 4 | In the interest of time, we will cover this part relatively briefly. 5 | 6 | === Discussion Question & Poll === 7 | 8 | This was the poll I accidentally shared last time :-) 9 | 10 | https://forms.gle/TB823v4HSWqYadP88 11 | 12 | Consider the following scenario where a temperature dataset is partitioned in Spark across several locations. 
Which of the following tasks on the input dataset can be done with a narrow operator, and which will require a wide operator? 13 | 14 | Assume the input dataset consists of locations: 15 | US state, city, population, avg temperature 16 | 17 | It is partitioned into one dataset per US state (50 partitions total). 18 | 19 | 1. Add one to each temperature 20 | 21 | 2. Compute a 5-number summary 22 | 23 | 3. Throw out duplicate city names (multiple cities in the US with the same name) 24 | 25 | 4. Throw out cities that are below 100,000 residents 26 | 27 | 5. Throw out "outlier" temperatures below -50 F or above 150 F 28 | 29 | 6. Throw out "outlier" temperatures 3 std deviations above or 3 std deviations below the mean 30 | 31 | 7. Filter the dataset to include only California cities 32 | 33 | . 34 | . 35 | . 36 | . 37 | . 38 | 39 | ================== 40 | 41 | We said that PySpark supports at least two scalable collection types. 42 | 43 | Our first example was RDDs. 44 | 45 | Our second example of a collection type is DataFrame. 46 | 47 | A DataFrame is like a DataFrame in Pandas - but it's scalable :-) 48 | 49 | Here is an example: 50 | """ 51 | 52 | # Boilerplate and dataset from previous part 53 | import pyspark 54 | from pyspark.sql import SparkSession 55 | spark = SparkSession.builder.appName("SparkExample").getOrCreate() 56 | sc = spark.sparkContext 57 | 58 | CHEM_NAMES = [None, "H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne"] 59 | CHEM_DATA = { 60 | # H20 61 | "water": [0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0], 62 | # N2 63 | "nitrogen": [0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0], 64 | # O2 65 | "oxygen": [0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0], 66 | # F2 67 | "fluorine": [0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0], 68 | # CO2 69 | "carbon dioxide": [0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0], 70 | # CH4 71 | "methane": [0, 4, 0, 0, 0, 0, 1, 0, 0, 0, 0], 72 | # C2 H6 73 | "ethane": [0, 6, 0, 0, 0, 0, 2, 0, 0, 0, 0], 74 | # C8 H F15 O2 75 | "PFOA": [0, 1, 0, 0, 0, 0, 8, 0, 2, 15, 0], 76 | # C H3 F 77 | "Fluoromethane": [0, 3, 0, 0, 0, 0, 1, 0, 0, 1, 0], 78 | # C6 F6 79 | "Hexafluorobenzene": [0, 0, 0, 0, 0, 0, 6, 0, 0, 6, 0], 80 | } 81 | 82 | """ 83 | DataFrame is like a Pandas DataFrame. 84 | 85 | https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html 86 | 87 | The main difference with RDDs is we need to create the dataframe 88 | using a tuple or dictionary. 89 | 90 | We can also create one from an RDD by doing 91 | 92 | .map(lambda x: (x,)).toDF() 93 | 94 | For more examples of creating dataframes from RDDs, see extras/dataframe.py. 95 | """ 96 | 97 | def ex_dataframe(data): 98 | # What we need (similar to Pandas): list of columns, iterable of rows. 99 | 100 | # For the columns, use our CHEM_NAMES list 101 | columns = ["chemical"] + CHEM_NAMES[1:] 102 | 103 | # For the rows: any iterable -- i.e. any sequence -- of rows 104 | # For the rows: can use [] or a generator expression () 105 | rows = ((name, *(counts[1:])) for name, counts in CHEM_DATA.items()) 106 | 107 | # Equiv: 108 | # rows = [(name, *(counts[1:])) for name, counts in CHEM_DATA.items()] 109 | # Also equiv: 110 | # for name, counts in CHEM_DATA.items(): 111 | # ... 
112 | 113 | df3 = spark.createDataFrame(rows, columns) 114 | 115 | # Breakpoint for inspection 116 | # breakpoint() 117 | 118 | # Adding a new column: 119 | from pyspark.sql.functions import col 120 | df4 = df3.withColumn("H + C", col("H") + col("C")) 121 | df5 = df4.withColumn("H + F", col("H") + col("F")) 122 | 123 | # This is the equiv of Pandas: df3["H + C"] = df3["H"] + df3["C"] 124 | 125 | # Uncomment to debug: 126 | # breakpoint() 127 | 128 | # We could continue this example further (showing other Pandas operation equivalents). 129 | 130 | # Uncomment to run 131 | # ex_dataframe(CHEM_DATA) 132 | 133 | """ 134 | Notes: 135 | 136 | - We can use .show() to print out - nicer version of .collect()! 137 | Only available on dataframes. 138 | 139 | - DataFrames are based on RDDs internally. 140 | A little picture: 141 | 142 | DataFrames 143 | | 144 | RDDs 145 | | 146 | MapReduce 147 | 148 | - Web interface gives us a more helpful dataflow graph this time: 149 | 150 | localhost:4040/ 151 | 152 | (see under Stages and click on a "collect" job for the dataflow graph) 153 | 154 | - DataFrames are higher level than RDDs! They are "structured" data -- 155 | we can work with them using SQL and relational abstractions. 156 | """ 157 | -------------------------------------------------------------------------------- /lecture5/extras/dataframe.py: -------------------------------------------------------------------------------- 1 | """ 2 | This is an example which shows how to create a data frame from a Python dict. 3 | """ 4 | 5 | # Boilerplate 6 | import pyspark 7 | from pyspark.sql import SparkSession 8 | spark = SparkSession.builder.appName("SparkExample").getOrCreate() 9 | sc = spark.sparkContext 10 | 11 | # Dataset 12 | people = spark.createDataFrame([ 13 | {"deptId": 1, "age": 40, "name": "Hyukjin Kwon", "gender": "M", "salary": 50}, 14 | {"deptId": 1, "age": 50, "name": "Takuya Ueshin", "gender": "M", "salary": 100}, 15 | {"deptId": 2, "age": 60, "name": "Xinrong Meng", "gender": "F", "salary": 150}, 16 | {"deptId": 3, "age": 20, "name": "Haejoon Lee", "gender": "M", "salary": 200} 17 | ]) 18 | 19 | people_filtered = people.filter(people.age > 30) 20 | 21 | people_filtered.show() 22 | 23 | people2 = sc.parallelize([ 24 | {"deptId": 1, "age": 40, "name": "Hyukjin Kwon", "gender": "M", "salary": 50}, 25 | {"deptId": 1, "age": 50, "name": "Takuya Ueshin", "gender": "M", "salary": 100}, 26 | {"deptId": 2, "age": 60, "name": "Xinrong Meng", "gender": "F", "salary": 150}, 27 | {"deptId": 3, "age": 20, "name": "Haejoon Lee", "gender": "M", "salary": 200} 28 | ]) 29 | 30 | people2_filtered = people2.filter(lambda x: x["age"] > 30) 31 | 32 | result = people2_filtered.collect() 33 | 34 | print(result) 35 | 36 | """ 37 | More ways to create a DataFrame: 38 | """ 39 | 40 | 41 | CHEM_NAMES = [None, "H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne"] 42 | CHEM_DATA = { 43 | # H20 44 | "water": [0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0], 45 | # N2 46 | "nitrogen": [0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0], 47 | # O2 48 | "oxygen": [0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0], 49 | # F2 50 | "fluorine": [0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0], 51 | # CO2 52 | "carbon dioxide": [0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0], 53 | # CH4 54 | "methane": [0, 4, 0, 0, 0, 0, 1, 0, 0, 0, 0], 55 | # C2 H6 56 | "ethane": [0, 6, 0, 0, 0, 0, 2, 0, 0, 0, 0], 57 | # C8 H F15 O2 58 | "PFOA": [0, 1, 0, 0, 0, 0, 8, 0, 2, 15, 0], 59 | # C H3 F 60 | "Fluoromethane": [0, 3, 0, 0, 0, 0, 1, 0, 0, 1, 0], 61 | # C6 F6 62 | "Hexafluorobenzene": [0, 0, 0, 0, 0, 0, 6, 0, 0, 6, 
0], 63 | } 64 | 65 | def ex_dataframe_methods(data): 66 | # Load the data (CHEM_DATA) and turn it into a DataFrame 67 | 68 | # A few ways to do this 69 | 70 | """ 71 | Method 1: directly from the RDD 72 | """ 73 | rdd = sc.parallelize(data.values()) 74 | 75 | # RDD is just a collection of items where the items can have any Python type 76 | # a DataFrame requires the items to be rows. 77 | 78 | df1 = rdd.map(lambda x: (x,)).toDF() 79 | 80 | # Breakpoint for inspection 81 | # breakpoint() 82 | 83 | # Try: df1.show() 84 | 85 | # What happened? 86 | 87 | # Not very useful! Let's try a different way. 88 | # Our lambda x: (x,) map looks a bit sus. Does anyone see why? 89 | 90 | """ 91 | Method 2: unpack the data into a row more appropriately by constructing the row 92 | """ 93 | # don't need to do the same thing again -- RDDs are persistent and immutable! 94 | # rdd = sc.parallelize(data.values()) 95 | 96 | # In Python you can unwrap an entire list as a tuple by using *x. 97 | df2 = rdd.map(lambda x: (*x,)).toDF() 98 | 99 | # Breakpoint for inspection 100 | # breakpoint() 101 | 102 | # What happened? 103 | 104 | # Better! 105 | 106 | """ 107 | Method 3: create the DataFrame directly with column headers 108 | (the correct way) 109 | """ 110 | 111 | # What we need (similar to Pandas): list of columns, iterable of rows. 112 | 113 | # For the columns, use our CHEM_NAMES list 114 | columns = ["chemical"] + CHEM_NAMES[1:] 115 | 116 | # For the rows: any iterable -- i.e. any sequence -- of rows 117 | # For the rows: can use [] or a generator expression () 118 | rows = ((name, *(counts[1:])) for name, counts in CHEM_DATA.items()) 119 | 120 | # Equiv: 121 | # rows = [(name, *(counts[1:])) for name, counts in CHEM_DATA.items()] 122 | # Also equiv: 123 | # for name, counts in CHEM_DATA.items(): 124 | # ... 125 | 126 | df3 = spark.createDataFrame(rows, columns) 127 | 128 | # Breakpoint for inspection 129 | # breakpoint() 130 | 131 | # What happened? 132 | 133 | # Now we don't have to worry about RDDs at all. We can use all our favorite DataFrame 134 | # abstractions and manipulate directly using SQL operations. 135 | 136 | # Adding a new column: 137 | from pyspark.sql.functions import col 138 | df4 = df3.withColumn("H + C", col("H") + col("C")) 139 | df5 = df4.withColumn("H + F", col("H") + col("F")) 140 | 141 | # This is the equiv of Pandas: df3["H + C"] = df3["H"] + df3["C"] 142 | 143 | # Uncomment to debug: 144 | # breakpoint() 145 | 146 | # We could continue this example further (showing other Pandas operation equivalents). 147 | 148 | # Uncomment to run 149 | # ex_dataframe(CHEM_DATA) 150 | -------------------------------------------------------------------------------- /lecture5/parts/6-latency-throughput.py: -------------------------------------------------------------------------------- 1 | """ 2 | Part 6: End notes 3 | 4 | Latency and throughput, revisited 5 | and 6 | Disadvantages of Spark 7 | 8 | === Latency and throughput === 9 | 10 | So, we know how to build distributed pipelines. 11 | 12 | The **only** change from a sequential pipline is that 13 | tasks work over scalable collection types, instead of regular data types. 14 | Tasks are then interpreted as operators over the scalable collections. 15 | In other words, data parallelism comes for free! 16 | 17 | Scalable collections are a good way to think about parallel AND/OR distributed 18 | pipelines. 
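A tiny sketch of that "only change" (a hypothetical example; `sc` is the SparkContext from the boilerplate in the earlier parts of this lecture):

    # Sequential version: an ordinary Python list
    data = [1, 2, 3, 4, 5]
    total = sum(x * x for x in data)

    # Scalable version: the same logic over an RDD
    rdd = sc.parallelize(data)
    total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)

Same operators (a map and a reduce), different collection type.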
Operators/tasks can be: 19 | - lazy or not lazy (how they are evaluated) 20 | - wide or narrow (how data is partitioned) 21 | 22 | But there is just one problem with what we have so far :) 23 | Spark and MapReduce are optimized for throughput. 24 | 25 | It's what we call a *batch processing engine.* That means? 26 | 27 | A: It processes all the data "as one batch", or as "one job", 28 | then comes back with an answer 29 | 30 | But doesn't optimizing for throughput always optimize for latency? Not necessarily! 31 | 32 | Let's talk a little bit about latency... 33 | 34 | === Understanding latency (intuitive) === 35 | 36 | A more intuitive real-world example: 37 | imagine a restaurant that has to process lots of orders. 38 | 39 | - Throughput is how many orders we were able to process per hour. 40 | 41 | - Latency is how long *one* person waits for their order. 42 | 43 | Some of you wrote on the midterm: throughput == 1 / latency 44 | 45 | That's not true! 46 | throughput != 1 / latency 47 | 48 | These are not the same! Why not? Two extreme cases: 49 | 50 | ---- 51 | 52 | Suppose the restuarant processes 10 orders over the course of being open for 1 hour 53 | 54 | Throughput = 55 | 10 orders / hour 56 | 57 | Latency is not the same as 1 / throughput! Two extreme cases: 58 | 59 | 1. 60 | Every customer waits for the entire hour! 61 | Every customer submitted their order at the start of the hour, 62 | and got it back at the end. 63 | 64 | Latency = 1 hour 65 | 66 | - customers are not very happy 67 | 68 | - BUT the restaurant can do things very efficiently! 69 | 70 | 2. 71 | One order submitted every 6 minutes, 72 | and completed 6 minutes later. 73 | 74 | Latency = 6 minutes 75 | 76 | - customers are happy 77 | 78 | - BUT the restaurant may have a harder time optimizing their process 79 | as they have to make each order individually. 80 | 81 | (A more abstract example of this is given below in the "Understanding latency (abstract)" section below.) 82 | 83 | Throughput is the same in both cases! 84 | 10 events / 1 hour 85 | 86 | Throughput = N / T 87 | 88 | In first case, latency = T 89 | 90 | In second case, latency = T / N 91 | 92 | (The first scenario is similar to a parallel execution, 93 | the second scenario more similar to a sequential execution.) 94 | 95 | The other formula which is not true: 96 | 97 | Latency != total time / number of orders 98 | True in the second scenario but not in the first scenario. 99 | 100 | How can we visualize this? 101 | 102 | (Draw a timeline from 0 to 1 hour and draw a line for each order) 103 | 104 | So, optimizing latency can look very different from optimizing throughput. 105 | 106 | In a batch processing framework like Spark, 107 | it waits until we ask, and then collects *all* results at once! 108 | So we always get the worst possible latency, in fact we get the maximum latency 109 | on each individual item. We don't get some results sooner and some results later. 110 | 111 | Grouping together items (via lazy transformations) helps optimize the pipeline, but it 112 | *doesn't* necessarily help get results as soon as possible when they're needed. 113 | (Remember: laziness poll/example) 114 | That's why there is a tradeoff between throughput and latency. 115 | 116 | Data processing for low-latency applications is known as "streaming" or "stream processing" 117 | and systems for this case are known as "stream processing applications". 
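Before the quote below, here is a small numeric sketch of the two restaurant scenarios above (hypothetical arrival/completion times, in minutes):

    # (arrival, completion) times for 10 orders over one hour
    case_1 = [(0, 60)] * 10                              # everyone waits the full hour
    case_2 = [(6 * i, 6 * (i + 1)) for i in range(10)]   # one order every 6 minutes

    for orders in (case_1, case_2):
        latencies = [done - arrived for arrived, done in orders]
        # throughput (orders/hour, T = 1 hour), average latency (minutes)
        print(len(orders) / 1.0, sum(latencies) / len(latencies))

Both cases print a throughput of 10 orders/hour, but the average latency is 60 minutes in the first case and 6 minutes in the second.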
118 | 119 | "To achieve low latency, a system must be able to perform 120 | message processing without having a costly storage operation in 121 | the critical processing path...messages should be processed “in-stream” as 122 | they fly by." 123 | 124 | From "The 8 Requirements of Real-Time Stream Processing": 125 | Mike Stonebraker, Ugur Çetintemel, and Stan Zdonik 126 | https://cs.brown.edu/~ugur/8rulesSigRec.pdf 127 | 128 | Another term for the applications that require low latency requirements (typically, sub-second, sometimes 129 | milliseconds) is "real-time" applications or "streaming" applications. 130 | 131 | === Summary: disadvantages of Spark === 132 | 133 | So that's where we're going next, 134 | talking about applications where you might want your pipeline to respond in real time to data that 135 | is coming in. 136 | We'll use a different API in Spark called Spark Structured Streaming. 137 | """ 138 | -------------------------------------------------------------------------------- /lecture1/parts/1-introduction.py: -------------------------------------------------------------------------------- 1 | """ 2 | Lecture 1: Introduction to data processing pipelines 3 | 4 | Part 1: Introduction 5 | 6 | This lecture will cover the required background for the rest of the course. 7 | 8 | Please bear with us if you have already seen some of this material before! 9 | I will use the polls to get a sense of your prior background and adjust the pacing accordingly. 10 | 11 | === Note on materials from prior iteration of the course === 12 | 13 | The GitHub repository contains the lecture notes from a prior iteration of the course (Fall 2024). 14 | You are welcome to look ahead of the notes, but please note that most content will change as I revise each lecture. 15 | I will generally post the revised lecture before and after each class period. 16 | 17 | **Changes from last year:** 18 | I plan to skip or condense Lecture 3 (Pandas) based on feedback and your prior background. 19 | I will also cover some Pandas in Lecture 1. 20 | (I will confirm this after the responses to HW0.) 21 | 22 | === Poll === 23 | 24 | Today's poll is to help me understand the overall class background in command line/Git. 25 | (I will also ask about your background in more detail on HW0.) 26 | 27 | https://forms.gle/2eYFVpxT1Q8JJRaMA 28 | 29 | ^^ find the link in the lecture notes on GitHub 30 | 31 | Piazza -> GitHub -> lecture1 -> lecture.py 32 | https://piazza.com/ 33 | 34 | === Following along with the lectures === 35 | 36 | Try this! 37 | 38 | 1. You will need to have Git installed (typically installed with Xcode on Mac, or with Git for Windows). Follow the guide here: 39 | 40 | https://www.atlassian.com/git/tutorials/install-git 41 | 42 | Feel free to work on this as I am talking and to get help from your neighbors. 43 | I can help with any issues after class. 44 | 45 | (Note on Mac: you can probably also just `brew install git`) 46 | 47 | 2. You will also need to create an account on GitHub and log in. 48 | 49 | 3. Go to: https://github.com/DavisPL-Teaching/119 50 | 51 | 4. If that's all set up, then click the green "Code" button, click "SSH", and click to copy the command: 52 | 53 | git@github.com:DavisPL-Teaching/119.git 54 | 55 | 5. Open a terminal and type: 56 | 57 | git clone git@github.com:DavisPL-Teaching/119.git 58 | 59 | 6. Type `ls`. 60 | 61 | You should see a new folder called "119" in your home folder. This contains the lecture notes and source files for the class. 62 | 63 | 7. 
Type `cd `119/lecture1`, then type `ls`. 64 | 65 | 8. Lastly type `python3 lecture.py`. You should see the message below. 66 | """ 67 | 68 | print("Hello, ECS 119!") 69 | 70 | """ 71 | Let's see if that worked! 72 | 73 | If some step above didn't work, you may be missing some of the software we 74 | need installed. Please complete HW0 first and then let us know if you 75 | are still having issues. 76 | 77 | === The basics === 78 | 79 | I will introduce the class through a basic model of what a data processing 80 | pipeline is, that we will use throughout the class. 81 | 82 | We will also see: 83 | - Constraints that data processing pipelines have to satisfy 84 | - How they interact with one another 85 | - Sneak peak of some future topics covered in the class. 86 | 87 | To answer these questions, we need a basic model of "data processing pipeline" - Dataflow Graphs. 88 | 89 | Recall discussion question from last lecture: 90 | 91 | EXAMPLE: 92 | You have compiled a spreadsheet of website traffic data for various popular websites (Google, Instagram, chatGPT, Reddit, Wikipedia, etc.). You have a dataset of user sessions, each together with time spent, login sessions, and click-through rates. You want to put together an app which identifies trends in website popularity, duration of user visits, and popular website categories over time. 93 | 94 | What are the main "abstract" components of the pipeline in this scenario? 95 | 96 | - A dataset 97 | - Processing steps 98 | - Some kind of user-facing output 99 | 100 | closely related: 101 | "Extract, Transform, Load" model (ETL) 102 | 103 | What is an ETL job? 104 | 105 | - **Extract:** Load in some data from an input source 106 | (e.g., CSV file, spreadsheet, a database) 107 | 108 | - **Transform:** Do some processing on the data 109 | 110 | - **Load:** (perhaps a confusing name) 111 | we save the output to an output source. 112 | (e.g. CSV file, spreadsheet, a database) 113 | 114 | """ 115 | 116 | data = { 117 | "User": ["Alice", "Alice", "Charlie"], 118 | "Website": ["Google", "Reddit", "Wikipedia"], 119 | "Time spent (seconds)": [120, 300, 240], 120 | } 121 | 122 | # As dataframe: 123 | import pandas as pd 124 | df = pd.DataFrame(data) 125 | 126 | # print(data) 127 | # print(df) 128 | 129 | """ 130 | Recap: 131 | 132 | - We spent some time getting everyone up to speed: 133 | After completing HW0, you should be able to follow along with the lectures 134 | locally on your laptop device 135 | 136 | - We started to introduce the abstract model that we will use throughout the class 137 | for data processing pipelines - this will be called the Dataflow Graph model 138 | 139 | - We began by introducing a simpler concept called Extract, Transform, Load (ETL). 140 | 141 | ***** Where we ended for Friday ***** 142 | """ 143 | -------------------------------------------------------------------------------- /lecture4/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 4: Parallelism 2 | 3 | ## Friday, October 24 4 | 5 | Changing gears today! 6 | 7 | Announcements: 8 | 9 | - A few details about the midterm (scheduled for Wed, Nov 5): 10 | 11 | Format: 12 | 13 | + Cheat sheet, single sided 14 | 15 | + Roughly: 10 questions, 8 MC/short answer, 2 free response 16 | 17 | The free response questions are longer! Each ~1-2 pages 18 | 19 | + Time length is limited to class time 20 | 21 | Please try to arrive early! 
22 | 23 | + A sample exam (from last year) will be released sometime next week 24 | 25 | Studying: 26 | 27 | + Review the polls! `exams/poll_answers.md` 28 | 29 | + Study guide: `exams/midterm_study_list.md` 30 | 31 | Contains up to Lecture 1, 2, so far 32 | 33 | More topics will be added after next week! 34 | 35 | + Please ask questions at the OHs and discussion section and on Piazza! 36 | 37 | - I have to leave class right at 4pm today - I will end class 5 minutes early 38 | (3:55) in case there are questions after class. 39 | 40 | Plan for today: 41 | 42 | - Study guide 43 | 44 | - Poll 45 | 46 | - Motivation: scaling your application 47 | 48 | - Key definitions: parallel, concurrent, and distributed computing 49 | 50 | Questions? 51 | 52 | ## Monday, October 27 53 | 54 | Reminders: 55 | 56 | - OH after class today, Friday 11am 57 | 58 | Plan for today: 59 | 60 | I am trying a slightly new format for the lecture notes by dividing them into 61 | parts, roughly one per class period. 62 | For today, see `parts/1-motivation.py` and `parts/2-definitions.py` 63 | 64 | Plan: 65 | 66 | - Finish small activity on RAM and Pandas + poll 67 | 68 | - Example pipeline and running it with parallel workers 69 | 70 | - Key definitions: parallel/concurrent/distributed distinction 71 | 72 | Questions? 73 | 74 | ## Wednesday, October 29 75 | 76 | Announcements: 77 | 78 | - First part of practice midterm released; second part released late this week 79 | 80 | - HW1 grading underway; please stay tuned for a Piazza post when official grades are released. 81 | 82 | - I have sorted the rest of the lecture material into parts 83 | 84 | Plan: 85 | 86 | - Finish parallel/concurrent/distributed distinction (`2-parallelism.py`) 87 | 88 | - Poll 89 | 90 | - Concurrency. 91 | 92 | Questions? 93 | 94 | ## Friday, October 31 (Happy Halloween!) 95 | 96 | Announcements: 97 | 98 | - HW1 is graded -- please submit any grade complaints through Gradescope 99 | 100 | ** Please reserve grade complaints for cases where you disagree with the grade according to the 101 | rubric, NOT where you disagree with the rubric! 102 | 103 | Grade complaints due Nov 5 3pm (before midterm) 104 | 105 | - Mid-quarter survey is available! (See Piazza) 106 | 107 | 1 pt extra credit 108 | 109 | Due Nov 5 3pm (before midterm) 110 | 111 | Plan: 112 | 113 | - Wrap up concurrency with a little bit of terminology 114 | 115 | - Poll 116 | 117 | - Part 4: Types of parallelism. 118 | 119 | - End of class: midterm topics; will post full list + practice midterm on Piazza. 120 | 121 | Questions? 122 | 123 | ## Monday, November 3 124 | 125 | Announcements: 126 | 127 | - The midterm is this Wednesday in class! 128 | 129 | Reminders: In class, single-sided cheat sheet (typed or handwritten) 130 | Topics: see Piazza, midterm_study_list (Lecture 4, up to data parallelism in part 4) 131 | 132 | Help with midterm: 133 | 134 | + **Today** OH after class: I will go over any requested topic, any in-class poll, or any practice exam question 135 | 136 | + **Wednesday** discussion section will be a midterm review 137 | 138 | + **Piazza** for all other questions 139 | 140 | - Last chance to submit mid-quarter survey! (Due Wednesday before class) 141 | 142 | - Next week, Nov 10 and Nov 12 class will be remote on Zoom 143 | 144 | Plan: 145 | 146 | - Finish Part 4: Pipeline parallelism 147 | (Note that pipeline parallelism will not appear on the midterm, 148 | but data+task parallelism will.) 
149 | 150 | - Discussion question & poll 151 | 152 | - (~about 3:45) 153 | I'll reserve the last 15 minutes of class in case there are 154 | any questions from the midterm study list, polls, or practice exam 155 | that anyone wants to go over. 156 | (Else, we can begin part 5.) 157 | 158 | Questions? 159 | 160 | ## Friday, November 7 161 | 162 | Announcements: 163 | 164 | - Next week, Monday (Nov 10) and Wed (Nov 12) class will be remote on Zoom 165 | 166 | OH also remote on Zoom 167 | 168 | - Midterm grading is underway 169 | 170 | Plan: 171 | 172 | Midterm review: 173 | 174 | - Poll with a review question 175 | 176 | Continuing lecture 4: 177 | 178 | - Part 5: Quantifying parallelism and Amdahl's Law 179 | 180 | - If time: Part 6 on distribution and conclusions. 181 | 182 | Questions? 183 | 184 | ## Monday, November 10 185 | 186 | Announcements: 187 | 188 | - Midterm grades released! 189 | 190 | Statistics & midterm answers on Piazza: https://piazza.com/class/mfvn4ov0kuc731/post/112 191 | 192 | Regrade requests due one week from today 193 | 194 | Suggested questions to review: q3, q7, q8, q9, q10 195 | 196 | - OH today after class will be on Zoom. (I will share link on Piazza) 197 | 198 | - We have time to go over one midterm question; if you would like to go over any other midterm questions or answers, please post a note on Piazza or come to office hours! 199 | 200 | Plan: 201 | 202 | - Finish Part 5: Quantifying parallelism and Amdahl's Law 203 | 204 | - Poll about Amdahl's law 205 | 206 | - Part 6 on distribution and conclusions. 207 | 208 | Plan to begin Lecture 5 on Wednesday. 209 | -------------------------------------------------------------------------------- /lecture1/parts/6-conclusion.py: -------------------------------------------------------------------------------- 1 | """ 2 | October 10 3 | 4 | Part 6: 5 | Recap on Throughput & Latency and Conclusion 6 | 7 | Recap from last time: 8 | 9 | Throughput: 10 | Measured in number items processed / second 11 | 12 | N = number of input items (size of input dataset(s)) 13 | T = running time of your full pipeline 14 | Formula = 15 | N / T 16 | 17 | Latency: 18 | Measured for a specific input item and specific output 19 | 20 | Formula = 21 | (time output is produced) - (time input is received) 22 | 23 | Often (but not always) measured for a pipeline with just 24 | one input item. 25 | 26 | Discussion question: 27 | 28 | A health company's servers process 12,000 medical records per day. 29 | The medical records come in at a uniform rate between 9am and 9pm every day (1,000 records per hour). 30 | The company's servers submit the records to a back-end service that collects them throughout the hour, and then 31 | processes them at the end of each hour to update a central database. 32 | 33 | What is the throughput of the pipeline? 34 | 35 | What number would best describe the *average latency* of the pipeline? 36 | Describe the justification for your answer. 37 | 38 | https://forms.gle/AFL2SrBr5MhwVV3h7 39 | """ 40 | 41 | """ 42 | ... 43 | 44 | Let's see an example 45 | 46 | We need a pipeline so that we can measure the total running time & the throughput. 47 | 48 | I've taken the pipeline from earlier for country data and rewritten it below. 49 | 50 | see throughput_latency.py 51 | """ 52 | 53 | def get_life_expectancy_data(filename): 54 | return pd.read_csv(filename) 55 | 56 | # Wrap up our pipeline - as a single function! 57 | # You will do a similar thing on the HW to measure performance. 
58 | def pipeline(input_file, output_file): 59 | df = get_life_expectancy_data(input_file) 60 | min_year = df["Year"].min() 61 | max_year = df["Year"].max() 62 | # (Commented out the print statements) 63 | # print("Minimum year: ", min_year) 64 | # print("Maximum year: ", max_year) 65 | avg = df["Period life expectancy at birth - Sex: all - Age: 0"].mean() 66 | # print("Average life expectancy: ", avg) 67 | # Save the output 68 | out = pd.DataFrame({"Min year": [min_year], "Max year": [max_year], "Average life expectancy": [avg]}) 69 | out.to_csv(output_file, index=False) 70 | 71 | # SEE throughput_latency.py. 72 | 73 | # import timeit 74 | 75 | def f(): 76 | pipeline("life-expectancy.csv", "output.csv") 77 | 78 | # Run the pipeline 79 | # f() 80 | 81 | """ 82 | === Latency (additional notes - SKIP) === 83 | 84 | What is latency? 85 | 86 | Sometimes, we care about not just the time it takes to run the pipeline... 87 | but the time on each specific input item. 88 | 89 | Why? 90 | - Imagine crawling the web at Google. 91 | The overall time to crawl the entire web is... 92 | It might take a long time to update ALL websites. 93 | But I might wonder, 94 | what is the time it takes from when I update my website 95 | ucdavis-ecs119.com 96 | to when this gets factored into Google's search results. 97 | 98 | This "individual level" measure of time is called latency. 99 | 100 | *Tricky point* 101 | 102 | For the pipelines we have been writing, the latency is the same as the running time of the entire pipeline! 103 | 104 | Why? 105 | 106 | Let's measure the performance of our toy pipeline. 107 | """ 108 | 109 | """ 110 | === Memory usage (also skip :) ) === 111 | 112 | What about the equivalent of memory usage? 113 | 114 | I will not discuss this in detail at this point, but will offer a few important ideas: 115 | 116 | - Input size: 117 | 118 | - Output size: 119 | 120 | - Window size: 121 | 122 | - Distributed notions: Number of machines, number of addresses on each machine ... 123 | 124 | Which of the above is most useful? 125 | 126 | How does memory relate to running time? 127 | For traditional programs? 128 | For data processing programs? 129 | """ 130 | 131 | """ 132 | === Overview of the rest of the course === 133 | 134 | Overview of the schedule (tentative), posted at: 135 | https://github.com/DavisPL-Teaching/119/blob/main/schedule.md 136 | 137 | === Closing quotes === 138 | 139 | Fundamental theorem of computer science: 140 | 141 | "Every problem in computer science can be solved by another layer of abstraction." 142 | 143 | - Based on a statement attributed to Butler Lampson 144 | https://en.wikipedia.org/wiki/Fundamental_theorem_of_software_engineering 145 | 146 | A dataflow graph is an abstraction (why?), but it is a very useful one. 147 | It will help put all problems about data processing into context and help us understand how 148 | to develop, understand, profile, and maintain data processing jobs. 149 | 150 | It's a good human-level way to understand pipelines, and 151 | it will provide a common framework for the rest of the course. 152 | """ 153 | 154 | # Main function: the default thing that you run when running a program. 155 | 156 | # print("Hello from outside of main function") 157 | 158 | if __name__ == "__main__": 159 | # Insert code here that we want to be run by default when the 160 | # program is executed. 161 | 162 | # print("Hello from inside of main function") 163 | 164 | # What we can do: add additional code here 165 | # to test various functions. 
166 | # Simple & convenient way to test out your code. 167 | 168 | # Call our pipeline 169 | # pipeline("life-expectancy.csv", "output.csv") 170 | 171 | pass 172 | 173 | # NB: If importing lecture.py as 174 | # a library, the main function (above) doesn't get run. 175 | # If running it directly from the terminal, 176 | # the main function does get run. 177 | # See test file: main_test.py 178 | -------------------------------------------------------------------------------- /exams/final_study_list.md: -------------------------------------------------------------------------------- 1 | # Final Study List 2 | 3 | Study list of topics for the final. 4 | 5 | **The final will cover Lectures 1-6.** 6 | 7 | ## Lectures 1-4 8 | 9 | For Lectures 1-4, please refer to the midterm study list `midterm_study_list.md`. 10 | 11 | ## Review topics from the midterm 12 | 13 | Suggested review topics based on the midterm: 14 | 15 | - How to draw and interpret a dataflow graph 16 | 17 | + I'm looking for a conceptual understanding of what happens when 18 | you "run" the pipeline, what tasks need to be completed in what order. 19 | 20 | - Understanding throughput and latency conceptually given a dataflow graph 21 | 22 | + Estimating running time in the sequential case, parallel cases; applying formulas 23 | 24 | - Concurrency problems: two concurrent executions like (x += 1; x += 1) vs. x += 1 25 | 26 | - Data validation: put the concepts in context: 27 | If asked about what you would do on a dataset or a specific 28 | real-world example, we're really looking for things specific to that 29 | dataset or real-world example. 30 | 31 | - Not covered on the midterm, but will be covered on the final: 32 | Amdahl's law (throughput <= T / S) and maximum speedup case (S) 33 | 34 | ## Lecture 5 35 | 36 | - Scalable collection types 37 | 38 | + Differences from normal Python collections 39 | + Types of scaling - vertical & horizontal scaling 40 | + Benefits/drawbacks 41 | + Examples (RDDs, PySpark DataFrames) and their properties 42 | 43 | - Operators 44 | 45 | + Map 46 | + Filter 47 | + Reduce 48 | 49 | - Operator concepts 50 | 51 | + Immutability 52 | + Evaluation: Lazy vs. not-lazy (transormation vs. action) 53 | - why laziness matters / why it is useful 54 | + Partitoning: Wide vs. 
narrow 55 | - What operators should be wide vs narrow 56 | + How partitioning works, what it means, how it affects performance 57 | + Key-based partitioning (see MapReduce, HW2) 58 | 59 | - MapReduce 60 | 61 | + For the purposes of the final, we will use either 62 | the simple version of MapReduce from class, 63 | or the generalized one from HW2 64 | (I will remind you of the type of map/reduce for the exam) 65 | 66 | + simplified model (map and reduce, conceptually) 67 | + general model (that we saw on HW2) assuming I give you 68 | the actual types for map and reduce for reference 69 | + you may be asked to describe how to do a computation as a MapReduce 70 | pipeline - describe the precise function 71 | 72 | for map: function that takes 1 input row, produces 1 output row 73 | for reduce: function that takes 2 output rows, returns 1 output row 74 | 75 | - Implementation details: In general, you do not need to know implementation details of Spark, but you should know: 76 | + Number of partitions and how it affects performance 77 | * too low, too high 78 | + Running on a local cluster, running on a distributed computing cluster 79 | + Fault tolerance: you may assume that Spark tolerates node failures (RDDs can recover from a computer or worker crash) 80 | 81 | - Drawing a PySpark or MapReduce computation as a dataflow graph 82 | 83 | - Limitations of Spark 84 | 85 | ## Lecture 6 86 | 87 | - Understanding latency 88 | 89 | + Intuitive: for example, given 10 orders in a 1 s time interval are 90 | processed, what can you say about the latency of each order 91 | 92 | + Refined def of latency: 93 | latency of item X = (end or exit time X) - (start or arrival time X) 94 | 95 | - List of summary points: 96 | + Latency = Response Time 97 | + Latency can only be measured by focusing on a single item or row. (response time on that row) 98 | + Latency-critical, real-time, or streaming applications are those for which we are looking for low latency (typically, sub-second or even millisecond response times). 99 | + Latency is NOT the same as 1 / Throughput 100 | * If it were, we wouldn't need two different words! 101 | + Latency is NOT the same as processing time 102 | * It's processing time for a specific event 103 | + If throughput is about quantity (how many orders processed), latency is about quality (how fast individual orders processed). 104 | 105 | - Batch vs. streaming pipelines 106 | 107 | + When streaming is useful (application scenarios) 108 | 109 | + Latency in both cases 110 | 111 | + How to derive latency given the dataflow graph 112 | 113 | + Batch/stream analogy 114 | 115 | - Implementation details of streaming pipelines: 116 | 117 | + Microbatching and possible microbatching strategies 118 | 119 | + Spark timestamp (assigned to all members of a microbatch) 120 | 121 | - Time 122 | 123 | + Why it matters: measuring latency, measuring progress in the system, assigning microbatches 124 | 125 | + Reasons that time is complicated (time zones, clock resets) 126 | 127 | + Kinds of time: Real time, event time, system time, logical time 128 | 129 | + Monotonic time 130 | * which of the above or monotonic 131 | 132 | + Measuring time: entrance time, processing time, exit time (These are all versions of system time.) 133 | 134 | ## Lecture 7 135 | 136 | Will not be covered on the final. 137 | 138 | TBD: the lecture is very brief and the last 1-2 days of class. 139 | 140 | Example multiple choice question: 141 | 142 | Match each of the following cloud provider services to its use case. 
143 | 144 | Major AWS cloud services: S3, EC2, Lambda. 145 | 146 | S3: useful for data storage 147 | 148 | EC2: useful for purchasing compute (basically, cloud computers that you can log into and run via the terminal) 149 | 150 | Lambda: useful for triggering events and running asynchronous code. 151 | 152 | ## Notes 153 | 154 | Some things you do **not** need to know: 155 | Python, Pandas, and PySpark syntax. 156 | Implementation details of PySpark and Spark Streaming, except where mentioned above. 157 | Lecture 7. 158 | -------------------------------------------------------------------------------- /lecture1/parts/2-etl.py: -------------------------------------------------------------------------------- 1 | """ 2 | Monday, September 29 3 | 4 | Part 2: Extract, Transform, Load (ETL) 5 | 6 | === REMINDER: FOLLOWING ALONG === 7 | 8 | https://github.com/DavisPL-Teaching/119 9 | 10 | - Open terminal (Cmd+Space Terminal on Mac) 11 | 12 | - `git clone ` 13 | 14 | + if you have already cloned, do a `git stash` or `git reset .` 15 | 16 | - `git pull` 17 | 18 | - **Why use the command line?** 19 | 20 | Short answer: it's an important skill! 21 | 22 | Long answer: 23 | I do require learning how to use the command line for this course. 24 | More in Lecture 2. 25 | GUI tools only work if someone else already wrote them (they used the command line to write the tool) 26 | You'll find that it is SUPER helpful to know the basics of the command line for stuff like installing software, managing dependencies, and debugging why installation didn't work. 27 | The command prompt is how all internal commands work on your computer - and it's an important skill for data engineering in practice. 28 | 29 | === Continuing our example === 30 | 31 | Recall from last time: 32 | 33 | - Want: a general model of data processing pipelines 34 | 35 | - First-cut model: Extract Transform Load (ETL) 36 | 37 | Any data process job can be split into three stages, 38 | input, processing, output 39 | (extract, transform, load) 40 | 41 | Example on finding popular websites: 42 | """ 43 | 44 | # (Re-copying from above) 45 | data = { 46 | "User": ["Alice", "Alice", "Charlie"], 47 | "Website": ["Google", "Reddit", "Wikipedia"], 48 | "Time spent (seconds)": [120, 300, 240], 49 | } 50 | df = pd.DataFrame(data) 51 | 52 | # Some logic to compute the maximum length of time website sessions 53 | u = df["User"] 54 | w = df["Website"] 55 | t = df["Time spent (seconds)"] 56 | # Max of t 57 | max = t.max() 58 | # Filter 59 | max_websites = df[df["Time spent (seconds)"] == max] 60 | 61 | # Let's print our data and save it to a file 62 | with open("save.txt", "w") as f: 63 | print(max_websites, file=f) 64 | 65 | """ 66 | Running the code 67 | 68 | It can be useful to have open a Python shell while developing Python code. 69 | 70 | There are two ways to run Python code from the command line: 71 | - python3 lecture.py 72 | - python3 -i lecture.py 73 | 74 | Let's try both. 75 | """ 76 | 77 | """ 78 | First step: can we abstract this as an ETL job? 
79 | """ 80 | 81 | def extract(): 82 | data = { 83 | "User": ["Alice", "Alice", "Charlie"], 84 | "Website": ["Google", "Reddit", "Wikipedia"], 85 | "Time spent (seconds)": [120, 300, 240], 86 | } 87 | df = pd.DataFrame(data) 88 | return df 89 | 90 | def transform(df): 91 | u = df["User"] 92 | w = df["Website"] 93 | t = df["Time spent (seconds)"] 94 | # Max of t 95 | max = t.max() 96 | # Filter 97 | # This syntax in Pandas for filtering rows 98 | # df[colname] 99 | # df[row filter] (row filter is some sort of Boolean condition) 100 | return df[df["Time spent (seconds)"] == max] 101 | 102 | def load(df): 103 | # Save the dataframe somewhere 104 | with open("save.txt", "w") as f: 105 | print(df, file=f) 106 | 107 | # Uncomment to run 108 | df = extract() # get the input 109 | df = transform(df) # process the input 110 | # print(df) # printing (optional) 111 | load(df) # save the new data. 112 | 113 | """ 114 | We have a working pipeline! 115 | But this may seem rather silly ... why rewrite the pipeline 116 | to achieve the same behavior? 117 | 118 | === Tangent: advantages of abstraction === 119 | 120 | Q: why abstract the steps into Python functions? 121 | 122 | (instead of just using a plain script) 123 | 124 | ETL steps are not done just once! 125 | 126 | A possible development lifecycle: 127 | 128 | - Exploration time: 129 | Thinking about my data, thinking about what I might 130 | want to build, exploring insights 131 | -> there is no pipeline yet, we're just exploring 132 | 133 | - Development time: 134 | Building or developing a working pipeline 135 | -> a script or abstracted functions would both work! 136 | 137 | - Production time: 138 | Deploying my pipeline & reusing it for various purposes 139 | (e.g., I want to run it like 5x per day) 140 | -> pipeline needs to be reused multiple times 141 | -> we could even think about more stages, like 142 | maintaining the pipeline as separate items after production time. 143 | 144 | In general, for this class we will think most about production time, 145 | because we are ultimately interested in being able to fully automate and 146 | maintain pipelines (not just one-off scripts). 147 | 148 | Some of you may have used tools like Jupyter notebooks; 149 | (very good for exploration time!) 150 | while excellent tools, 151 | I will generally be working directly in Python in this course. 152 | 153 | Reasons: I want to get used to thinking of processing directly "as code", 154 | good abstractions via functions and classes, and follow good practices like 155 | unit tests, etc. to integrate the code into a larger project. 156 | 157 | Abstractions mean we can test the code: 158 | """ 159 | 160 | import pytest 161 | 162 | # Unit test example 163 | # @pytest.mark.skip # uncomment to skip this test 164 | def test_extract(): 165 | df = extract() 166 | # What do we want to test here? 167 | # Test that the result has the data type we expect 168 | assert type(df) is not None 169 | assert type(df) == pd.DataFrame 170 | # check the dimensions (I'll skip this) 171 | # Sanity check - check that the values are the correct type! 172 | 173 | # @pytest.mark.skip # uncomment to skip this test 174 | def test_transform(): 175 | df = extract() 176 | df = transform(df) 177 | # check that there is exactly one output 178 | assert df.count().values[0] == 1 179 | 180 | # Run: 181 | # - pytest lecture.py 182 | 183 | """ 184 | Discussion Question / Poll: 185 | 186 | 1. Can you think of any scenario where test_extract() will fail? 187 | 188 | 2. 
Will test_transform() always pass, no matter the input data set? 189 | 190 | https://forms.gle/j99n5ZN7jsJ6gHB2A 191 | 192 | ********** Where we ended for September 29 ********** 193 | """ 194 | -------------------------------------------------------------------------------- /lecture4/parts/6-distribution.py: -------------------------------------------------------------------------------- 1 | """ 2 | Part 6: Distribution and concluding thoughts 3 | 4 | === Distributed computing: some examples === 5 | 6 | === What is distribution? === 7 | 8 | Distribution means that we have multiple workers and belts 9 | **in different physical warehouses** 10 | can process and fail independently. 11 | 12 | The workers must be on different physical computers or physical devices. 13 | (Why does it matter?) 14 | 15 | Running on the same physical device, both workers have 16 | access to the same resources; 17 | Running on two different devices, they access different resources, and one worker could crash even if the other 18 | one doesn't. 19 | So, it introduces new challenges. 20 | 21 | For this one, it's more difficult to simulate in Python directly. 22 | We can imagine that our workers are computed by an external 23 | server, rather than being computed locally on our machine. 24 | 25 | To give a simple instance of this, let's use ssh to connect to a remote 26 | server, then use the server to compute the sum of the numbers. 27 | 28 | (You won't be able to use this code; it's connecting to my own SSH server!) 29 | """ 30 | 31 | # for os.popen to run a shell command (like we did on HW1 part 3) 32 | import os 33 | 34 | def ssh_run_command(cmd): 35 | result = os.popen("ssh cdstanfo@set.cs.ucdavis.edu " + cmd).read() 36 | # Just print the result for now 37 | print(f"result: {result.strip()}") 38 | 39 | def worker1_distributed(): 40 | ssh_run_command("seq 1 1000000 | awk '{s+=$1} END {print s}'") 41 | print("Worker 1 finished") 42 | 43 | def worker2_distributed(): 44 | ssh_run_command("seq 1000001 2000000 | awk '{s+=$1} END {print s}'") 45 | print("Worker 2 finished") 46 | 47 | def average_numbers_distributed(): 48 | worker1_distributed() 49 | worker2_distributed() 50 | print("Distributed computation complete") 51 | 52 | # Uncomment to run 53 | # This won't work on your machine! 54 | average_numbers_distributed() 55 | 56 | # This waits until the first connection finishes before 57 | # starting the next connection; but we could easily modify 58 | # the code to make them both run in parallel. 59 | 60 | """ 61 | Questions: 62 | 63 | Q1: can we have distribution without parallelism? 64 | 65 | A: Yes, we just did 66 | 67 | Q2: can we have distribution with parallelism? 68 | 69 | A: Yes, we could allow the server to run and compute 70 | an answer while we continue to compute other stuff, 71 | or while we run a separate connection to a second 72 | server. 73 | 74 | Q3: can we have distribution without concurrency? 75 | 76 | A: Yes, for example: we have two databases or database 77 | partitions running separately (and they don't interact) 78 | 79 | Q4: can we have distribution with concurrency? 80 | 81 | Yes, we often do, for example when distributed workers 82 | communicate via passing messages to each other 83 | """ 84 | 85 | """ 86 | === Parallelizing our code in Pandas? (Skip) === 87 | 88 | We don't want to parallelize our code by hand. 89 | (why? See problems with concurrency from last week!) 
90 | 91 | Dask is a simple library that works quite well for parallelizing datasets 92 | on a single machine as a drop-in replacement for Pandas. 93 | """ 94 | 95 | # conda install dask or pip3 install dask 96 | import dask 97 | 98 | def dask_example(): 99 | # Example dataset 100 | df = dask.datasets.timeseries() 101 | 102 | # Dask is "lazy" -- it only generates data when you ask it to. 103 | # (More on laziness later). 104 | print(type(df)) 105 | print(df.head(5)) 106 | 107 | # Use a standard Pandas filter access 108 | df2 = df[df.y > 0] 109 | print(type(df2)) 110 | print(df2.head(5)) 111 | 112 | # Do a group by operation 113 | df3 = df2.groupby("name").x.mean() 114 | print(type(df3)) 115 | print(df3.head(5)) 116 | 117 | # Compute results -- this processes the whole dataframe 118 | print(df3.compute()) 119 | 120 | # If you just want parallelism on a single machine, 121 | # Dask is a great lightweight solution. 122 | 123 | # Uncomment to run. 124 | # dask_example() 125 | 126 | """ 127 | === A final definition and end note: Vertical vs. horizontal scaling === 128 | 129 | - Vertical: scale "up" resources at a single machine (hardware, parallelism) 130 | - Horizontal: scale "out" resources over multiple machines (distribution) 131 | 132 | This lecture, we have only seen *vertical scaling*. 133 | But vertical scaling has a limit! 134 | Remember that we are still limited in the size of the dataset we can 135 | process on a single machine 136 | (recall Wes McKinney estimate of how large a table can be). 137 | Even without Pandas overheads, 138 | we still can't process data if we run out of memory! 139 | 140 | So, to really scale we may need to distribute our dataset over many 141 | machines -- which we do using a distributed data processing framework 142 | like Spark. 143 | This also gives us a convenient way to think about data pipelines 144 | in general, and visualize them. 145 | We will tour PySpark in the next lecture. 146 | 147 | Everything we have said about identifying and quantifying parallelism also applies to 148 | distributing the code (for the most part -- we will only see exceptions to this if we cover 149 | distributed consistency issues and faults and crashes, this is an optional topic that we will 150 | get to only if we have time.) 151 | 152 | In addition to scaling even further, distribution + parallelism can offer an even 153 | more seamless performance compared to parallelism alone as it can eliminate 154 | many coordination overheads and contention between workers 155 | (see partitioning: different partitions of the database are operated entirely independently by different machines). 156 | 157 | Recap: 158 | 159 | - Finished Amdahl's law, did an example, and a practice example with the poll 160 | 161 | - Connected Amdahl's law back to latency & throughput (with two formulas) 162 | 163 | - We talked about distribution; ran our same running example as a distributed pipeline over ssh 164 | 165 | - We talked about vertical vs horizontal scaling 166 | 167 | - We contrasted parallelism with distributed scaling - where we will be going next in Lecture 5. 168 | """ 169 | -------------------------------------------------------------------------------- /lecture1/parts/4-properties.py: -------------------------------------------------------------------------------- 1 | """ 2 | Monday, October 6 3 | 4 | Part 4: Proeprties of Dataflow Graphs 5 | 6 | On Friday, I introduced the concept of dataflow graphs. 
7 | Recall: 8 | To build a dataflow graph, we divide our pipeline into a series of "stages" 9 | To build the graph, we draw: 10 | - One node per stage of the pipeline 11 | - An edge from node A to B (A -> B) if node B directly uses the output of node A. 12 | 13 | === Practice with dataflow graphs === 14 | 15 | At the end of last class period, we introduced a dataset for life expectancy. 16 | We saw a simple data pipeline for this dataset. 17 | Let's separate it into stages as follows: 18 | 19 | (read) = load the CSV input 20 | (max) = compute the max 21 | (min) = compute the min 22 | (avg) = compute the avg 23 | (print) = Print the max, min, and avg 24 | (save) = Save the max, min, and avg in a dataframe to a file 25 | 26 | === Discussion Question and Poll === 27 | 28 | Suppose we draw a dataflow graph with the above nodes. 29 | 30 | 1. What edges will the graph have? 31 | (draw/write all edges) 32 | 33 | 2. Give an example of two stages A and B, where the output for B depends on A, but there is no edge from A to B. 34 | 35 | https://forms.gle/6FB5hhwKpokTHhit9 36 | 37 | Answer: 38 | 39 | -> (max) ----|--> (print) 40 | (read) -> (min) ----| 41 | -> (avg) ----|--> (save) 42 | 43 | Key points: 44 | 45 | Two "independent" computations will not have an edge one way or the other 46 | (printing produces output to the terminal, save produces output to a file, 47 | neither one is used by the other) 48 | 49 | We can read off dependence information from the graph! If there is a path 50 | from A to B, then B depends (either directly or indirectly) on A. 51 | 52 | What graph we get depends on the precise details of our stages. 53 | Ex.: if we load the input three different times, once for the max, once for the min, 54 | once for the avg (and this is listed in our description of the computation), 55 | we would get a different graph with 8 nodes instead of 6. 56 | 57 | In order to draw this thing, we should refer to the particular way that we wrote out 58 | our computation. 59 | 60 | === A few more things === 61 | 62 | A couple of more definitions: 63 | 64 | - A stage B *depends on* a stage A if... 65 | 66 | there is a path from A to B 67 | 68 | point: The dataflow graph reveals exactly which computations depend on which others! 69 | 70 | - A *source* is a node without any input edges 71 | (typically, a node which loads data from an external source) 72 | (corresponds to the E stage of the ETL model) 73 | 74 | - A *sink* is a node without any output edges 75 | (typically, a node which saves data to an external source) 76 | (corresponds to the L stage of the ETL model) 77 | 78 | - A small correction from last time: let's define 79 | an *operator* is any node that is not a source or a sink. 80 | Operators take input data, and produce output data 81 | (corresponds to the T stage of the ETL model) 82 | 83 | Points: 84 | 85 | Every node in the dataflow graph is one of the above 3 types 86 | 87 | The dataflow graph reveals exactly where the I/O operations are for your pipeline. 88 | 89 | We can use the dataflow graph to reveal (visually and conceptually) many useful features of our pipeline. 90 | 91 | In Python: 92 | We could write each node as a stage, as we have been doing before. 
93 | 94 | Let's just write one example, in the interest of time 95 | """ 96 | 97 | def max_stage(df): 98 | return df["Year"].max() 99 | 100 | """ 101 | Reminders for why this helps: 102 | 103 | (Maybe it's overkill for a one-liner example like this) 104 | 105 | - Better code re-use 106 | - Better ability to write unit tests 107 | - Separation of concerns between different features, developers, or development efforts 108 | - Makes the software easier to maintain (or modify later) 109 | - Makes the software easier to debug 110 | 111 | Zooming in on one of these... 112 | (pick one) 113 | 114 | Q: How does this correspond to ETL model? 115 | 116 | ETL is basically a dataflow graph with 3 nodes. 117 | """ 118 | 119 | """ 120 | === Data validation === 121 | 122 | We will talk more about data validation at some point, most likely as part of Lecture 3. 123 | (See failures.py for a further discussion) 124 | 125 | Where in a pipeline is data validation most important? 126 | 127 | (There is more than one place where validation could help, but what's the most obvious place to start?) 128 | 129 | A: Right before transformations 130 | (After sources) 131 | 132 | Why? 133 | - Most common problem: malformed input 134 | - I might want to validate that all of my rows have the type that I'm expecting before 135 | I move to any further processing. 136 | - This might even simplify or speed up the later stages as in those stages I'm allowed 137 | to assume that the data is well-formed. 138 | 139 | NB: You can validate at any point in the graph. (And it can be useful!) 140 | 141 | Validation in a dataflow graph: 142 | we may view each edge as having some "constraints" that are validated by the previous stage, 143 | and assumed by the next. 144 | 145 | === Performance === 146 | 147 | Let's touch on one other thing that we can do with dataflow graphs: 148 | we can use them to think about performance. 149 | 150 | Dataflow graphs are basically the "data processing" equivalent of programs. 151 | 152 | For traditional programs, there are two notions of performance that matter: 153 | 154 | - Runtime or time complexity 155 | - Memory usage or space complexity 156 | 157 | For data processing programs? 158 | 159 | We'll care about the most: 160 | - Running time corresponds to: Throughput & Latency 161 | - Memory usage: you can also measure, we'll talk briefly about ways of thinking about this. 162 | 163 | ********** 164 | 165 | Recap: 166 | 167 | We reviewed the definition of dataflow graph 168 | - divided into sources, operators, and sinks 169 | - def of when to draw an edge 170 | 171 | We practiced drawing dataflow graphs 172 | 173 | We used dataflow graphs to explore various features of a data processing computation 174 | 175 | We argued that analagous to regular computer programs for the traditional computing world, 176 | dataflow graphs are the right notion of computer programs for the data processing world. 177 | 178 | ********** where we ended for today ********** 179 | """ 180 | -------------------------------------------------------------------------------- /lecture2/parts/2-commands.py: -------------------------------------------------------------------------------- 1 | """ 2 | Wednesday, Oct 15 3 | 4 | Part 2: Commands and Platform Dependence 5 | 6 | Continuing the shell. 7 | 8 | Poll: 9 | Which of the following are reasons you might want to use the shell? 
(Select all that apply) 10 | 11 | 12 | 13 | https://forms.gle/YrsjyyXe5Ve1aqEM7 14 | 15 | ----- 16 | 17 | Last time we saw: ls, cd, python3 18 | 19 | (btw: ls is short for "list") 20 | (cd: . = current folder, .. = parent folder) 21 | 22 | (autocomplete; up/down arrow) 23 | 24 | Remaining commands: 25 | 26 | - pytest .py: 27 | Run pytest (Python unit testing framework) on a Python program 28 | 29 | - conda install : 30 | Conda = package manager for various data science libaries & frameworks 31 | This command installs a software package using Conda 32 | 33 | - pip3 install : 34 | Somewhat deprecated nowadays in favor of better package managers 35 | Install python libraries / packages 36 | 37 | Better? 38 | + Use conda 39 | + Use your package manager through your operating system 40 | brew for macOS 41 | apt for Linux 42 | 43 | For modern Python projects: 44 | You should be using venv - makes a virtual package environment per-Python project 45 | 46 | If you ever see a file like .venv in a GitHub repository, that's what that is 47 | 48 | What do all of these programs have in common? 49 | 50 | Commonalities: 51 | They all involved working with system resources in some way. 52 | 53 | Differences: 54 | 55 | - ls: mostly was "informational" command - just figuring out what folder we're 56 | curently inside 57 | 58 | - cd, conda, pip3, python3 - "doing stuff" commands - we're actually modifying the 59 | state of the system when running these. 60 | 61 | Other answers (skip): 62 | 63 | - Different programs may have been developed by different people, in different 64 | teams, in different languages, etc. 65 | 66 | - We can't assume someone wrote a nice GUI for us to connect these programs 67 | or pieces together! (Sadly, often they didn't.) 68 | 69 | Some examples of running these: 70 | """ 71 | 72 | # Try: 73 | # python3, ls, pytest, conda 74 | 75 | # ls: doesn't show hidden folders and files 76 | # On Mac: anything starting with a . is hidden 77 | # Hidden files are used for many important purposes, 78 | # e.g., storing program data, caching information, writing 79 | # configuration for tools like Git, etc. 80 | 81 | """ 82 | when submitting code to others: 83 | best to remove hidden files & folders! 84 | 85 | These can clutter up a project, resulting in a large 86 | .zip file with lots of extra junk/files. 87 | 88 | (Similarly, they can also clutter up a Git repository 89 | - which is why we use .gitignore to tell Git to ignore 90 | certain stuff.) 91 | """ 92 | 93 | # To show hidden + other metadata 94 | # ls -alh 95 | # ^^^^^^^ TL;DR use this to show all the stuff in a folder 96 | 97 | """ 98 | Observations: 99 | 100 | - You can run shell commands in Python 101 | 102 | - You can run Python programs from the shell 103 | (we've already seen how to do this) 104 | 105 | Let's see an example 106 | """ 107 | 108 | # 1. Using the built-in os library 109 | 110 | # os, sys - Python libraries for interacting with 111 | # system resources 112 | 113 | # os is how Python interacts with the operating system 114 | import os 115 | 116 | def ls_1(): 117 | # Listdir: input a folder, show me all the files 118 | # and folders inside it 119 | # The . refers to the current directory 120 | # Also does not include hidden files/folders. 121 | print(os.listdir(".")) 122 | 123 | # ls_1() 124 | 125 | # 2. Running general/arbitrary commands 126 | 127 | # Library for running other commands 128 | import subprocess 129 | 130 | def ls_2(): 131 | subprocess.run(["ls", "-alh"]) 132 | # Equivalent: ls -alh in the shell! 
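
# (An aside, not shown in lecture: subprocess.run can also *capture* the command's
# output as a Python string instead of letting it print straight to the terminal.
# `capture_output=True` and `text=True` are standard arguments in Python 3.7+;
# the name ls_3 is just mine, following the ls_1/ls_2 pattern above.)
def ls_3():
    result = subprocess.run(["ls", "-alh"], capture_output=True, text=True)
    # result.stdout is now an ordinary string we can parse or process further
    print(result.stdout)

# ls_3()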
133 | 134 | # ls_2() 135 | # ^^^^ same output as if I ran the command line directly 136 | 137 | """ 138 | Re: Q in chat 139 | When working with the shell, you are often doing very 140 | platform-specific stuff (platform = operating system, architecture, etc.). 141 | 142 | Example differences across platforms: 143 | - syntax for arguments in Mac/Linux vs Windows Powershell 144 | - capitalization of folders 145 | example: on Mac I run ls Subfolder - not case-sensitive - 146 | to get the files inside "subfolder" 147 | 148 | Won't work on another platform! 149 | 150 | "Works on my machine" 151 | But doesn't work on someone else's. 152 | 153 | Summary points: 154 | 155 | Sometimes in Python we just directly call into 156 | commands, and knowing shell syntax is useful as it 157 | gives a very powerful way for Python programs to interact 158 | with the outside world. 159 | 160 | Everything that can be done in the shell can be done in a Python script 161 | (Why?) 162 | 163 | Everything that can be done in Python can be done in the shell 164 | (Why?) 165 | 166 | So knowing shell stuff might help you with running 167 | systems-level stuff in Python, and vice versa. 168 | 169 | ===== A model for interacting with the shell: 3 types of command ===== 170 | 171 | We saw how to run basic commands in the shell and what it means. 172 | 173 | Three types of commands: 174 | 175 | 1. Information 176 | 2. Getting help 177 | 3. Doing something 178 | 179 | === Informational commands: looking around === 180 | 181 | An analogy: 182 | There used to be a whole genre of text-based adventure games. 183 | the shell is kind of like this. 184 | 185 | e.g. 186 | - Zork (1977): 187 | https://textadventures.co.uk/games/play/5zyoqrsugeopel3ffhz_vq 188 | - Peasant's Quest (2004): 189 | https://homestarrunner.com/disk4of12 190 | 191 | Back in the day you would then open up the game (and be provided no information to help. :-) ) 192 | What would you do first? 193 | 194 | If you like, play around with Zork offline, it can be a fun game/distraction 195 | (bit of a blast from the past) 196 | 197 | https://textadventures.co.uk/games/play/5zyoqrsugeopel3ffhz_vq 198 | 199 | If you know how to play Zork then you know how to work with the shell. 200 | 201 | Recap: 202 | 203 | - We saw some stuff about hidden files/folders (starting with .) 204 | 205 | - We talked about running shell commands from Python using subprocess, and Python system libraries 206 | 207 | - We talked about platform differences and how these can be an issue when working 208 | with the shell 209 | 210 | - We introduced a 3-category analogy for shell commands: Info commands, Help commands, and "Do something" commands. 211 | 212 | ********** Where we ended for today ********** 213 | """ 214 | -------------------------------------------------------------------------------- /lecture1/parts/5-performance.py: -------------------------------------------------------------------------------- 1 | """ 2 | October 8 3 | 4 | Part 5: Performance 5 | 6 | Let's talk about performance! 7 | 8 | But first, the poll. 9 | 10 | === Poll / discussion question === 11 | 12 | True or false: 13 | 14 | 1. Two different ways of writing the same overall computation can have two different dataflow graphs. 15 | 16 | 2. Operators always take longer to run than sources and sinks. 17 | 18 | 3. Typically, every node in a dataflow graph takes the same amount of time to run. 19 | 20 | https://forms.gle/wAAyXqbJaCkEzyZP9 21 | 22 | Correct answers: T, F, F 23 | 24 | For 1: 25 | Max and min? 
26 | Say I have a dataset in a dataframe df with fields x and y 27 | And I want to do df["x"].max() 28 | and df["x"].min() 29 | """ 30 | 31 | # df = load_input_dataset() 32 | # min = df["x"].min() 33 | # max = df["x"].max() 34 | 35 | """ 36 | Dataflow graph with nodes 37 | (load_input_dataset) node 38 | (max) node 39 | (min) node 40 | 41 | --> (max) 42 | (load) 43 | --> (min) 44 | 45 | --> (min) 46 | (load) 47 | --> (max) 48 | 49 | Same graph! Has the same nodes, and has the same edges. 50 | 51 | This is actually a great example of a slightly different phenomenon: 52 | 53 | 1b. Two different ways of writing the same overall computation can have *the same* dataflow graph. 54 | 55 | A different example? 56 | 57 | If one operator does depend on the other, BUT the answer doesn't depend on the order, you could rearrange them to get an example where 58 | - the overall computation was the same, but 59 | - the dataflow graph was different 60 | 61 | example: 62 | - Get row with values x, y, and z 63 | - First, we compute a = x + y 64 | - Then we compute a + z = x + y + z. 65 | 66 | OR, we could 67 | - First, compute b = x + z 68 | - Then we compute b + y = x + y + z. 69 | 70 | We could get the same answer in two different ways. 71 | 72 | And in this example, the dataflow graph is also different: 73 | 74 | (input) -> (compute x + y) => (compute a + z) 75 | (input) -> (compute x + z) => (compute b + z). 76 | 77 | An easier example is .describe() from last time. 78 | """ 79 | 80 | """ 81 | Last time, we reviewed the notions of performance for traditional programs. 82 | 83 | There's two types of performance that matter: time & space. 84 | 85 | For data processing programs? 86 | 87 | ===== Running time for data processing programs ===== 88 | 89 | Running time has two analogs. We will see how these are importantly different! 90 | 91 | - Throughput 92 | 93 | What is throughput? 94 | 95 | Most pipelines run slower the more input items you have! 96 | 97 | Think about how long it will take to run an application that 98 | processes a dataset of university rankings, and finds the top 10 99 | universities by ranking. 100 | 101 | You will find that if measuring the running time of such an application, 102 | a single variable dominates ... 103 | the number of rows in your dataset. 104 | 105 | Example: 106 | 1000 rows => 1 ms 107 | 10,000 rows => ~10 ms 108 | 100,000 rows => ~100 ms 109 | 110 | - Often linear: the more input items, the longer it will take to run 111 | 112 | So it makes sense to measure the performance in a way that takes this 113 | into account: 114 | 115 | running time = (number of input items) * (running time per item) 116 | 117 | (running time per item) = (running time) / (number of input items) 118 | 119 | Throughput is the inverse of this: 120 | Definition / formula: 121 | (Number of input items) / (Total running time). 122 | 123 | Intuitively: how many rows my pipeline is capable of processing, 124 | per unit time 125 | 126 | There's many real-world examples of this concept: 127 | 128 | -> the number of electrons passing through a wire per second 129 | 130 | -> the number of drops of water passing through a stream per second 131 | 132 | -> the number of orders processed by a restaurant per hour 133 | 134 | "number of things done per unit time" 135 | 136 | Is this the only way to measure performance? 137 | 138 | We also care about the individual level view: how long it takes to process 139 | a *specific* item or order. 
140 | 141 | We also might measure, for an individual order, how long it takes for 142 | results for that order to come out of our pipeline. 143 | 144 | Latency = 145 | (time at which output is produced) - (time at which input is received) 146 | 147 | This is called latency. 148 | 149 | It almost seems like we've defined the same thing twice? 150 | 151 | But these are not the same. 152 | Simplest way to see this is that we might process more than one item at 153 | the same time. 154 | 155 | Ex: 156 | Restaurant processes 60 orders per hour 157 | 158 | Scenario I: 159 | Process 5 orders every 5 minutes, get those done, and move on to 160 | the next batch 161 | 162 | Scenario II: 163 | Process 1 order every 1 minute, get it done, and then move on to 164 | the next order. 165 | 166 | In either case, at the end of the hour, I've processed all 60 orders! 167 | 168 | Throughput in Scenario I? In Scenario II? 169 | I: 170 | Throughput = (Number of items processed) / (Total running time) 171 | 172 | 60 orders / 60 minutes = 1 order / minute. 173 | II: 174 | Throughput = (Number of items processed) / (Total running time) 175 | 176 | 60 orders / 60 minutes = 1 order / minute 177 | 178 | What about latency? 179 | 180 | I: 181 | (time at which output is produced) - (time at which input is received) 182 | 183 | = roughly 5 minutes 184 | 185 | II: 186 | (time at which output is produced) - (time at which input is received) 187 | 188 | = roughly 1 minutes 189 | 190 | Both measures of running time at a "per item" or "per row" level, 191 | but they can be very different. 192 | 193 | It is NOT always the case that Throughput = 1 / Latency 194 | or that Throughput and Latency are directly correlated (or inversely correlated). 195 | 196 | ===== Recap ===== 197 | 198 | We talked about how computations are represented as dataflow graphs 199 | to illustrate some important points: 200 | - The same computation (computed in different ways) can have two different dataflow graphs 201 | - The same computation (computed in different ways) could have two of the same dataflow graph 202 | 203 | We introduced throughput + latency 204 | - Restaurant analogy 205 | - We saw formulas for each 206 | - Both measures of performance in terms of running time at an "individual row" level, but throughput is an aggregate measure and latency is viewed at the level of an individual row. 207 | 208 | ********** Where we ended for today ********** 209 | """ 210 | -------------------------------------------------------------------------------- /lecture4/parts/1-motivation.py: -------------------------------------------------------------------------------- 1 | """ 2 | Lecture 4: Parallelism 3 | 4 | Part 1: Motivation 5 | (Oct 24) 6 | 7 | === Discussion Question and Poll === 8 | 9 | Which of the following is an accurate statement about Git's core philosophies? 10 | 11 | https://forms.gle/zB1qhdrP2xXswHMX8 12 | 13 | ===== Introduction ===== 14 | 15 | So far, we know how to set up a basic working data processing pipeline 16 | for our project: 17 | 18 | - We have a prototype of the input, processing, and output stages 19 | and we can think about our pipeline as a dataflow graph (Lecture 1) 20 | 21 | - We have used scripts and shell commands to download any necessary 22 | data, dependencies, and set up any other system configuration (Lecture 2) 23 | (running these automatically if needed, see HW1 part 3) 24 | 25 | How did we build our pipeline? 
26 | 27 | So far (in Lecture 1 / HW1), we have been building our pipelines in Pandas 28 | 29 | - Pandas can efficiently representing data in memory as a DataFrame 30 | 31 | - Pandas uses vectorization - you studied this a little in HW1 part 2 32 | 33 | The next thing we need to do is **scale** our application. 34 | 35 | === What is scaling? === 36 | 37 | Scalability is the ability of a system to handle an increasing amount of work. 38 | 39 | === Why scaling? === 40 | 41 | Basically we should scale if we want one of two things: 42 | 43 | 1. Running on **more** input data 44 | e.g.: 45 | + training your ML model on the entire internet instead of a dataset of 46 | Wikipedia pages you downloaded to a folder) 47 | + update our population pipeline to calculate some analytics by individual city, 48 | instead of just at the country or continent level 49 | 50 | 2. Running **more often** or on **more up-to-date** data 51 | e.g.: 52 | + re-running your pipeline once a day on the latest analytics; 53 | + re-running every hour to ensure the model or results stay fresh; or even 54 | + putting your application online as part of a live application 55 | that responds to input from users in real-time 56 | (more on this later in the quarter!) 57 | 58 | Questions we might ask: 59 | 60 | - How likely would it be that you want to scale for a toy project? 61 | For an industry project? 62 | 63 | A: probably more likely for an industry project. 64 | 65 | - What advantages would scaling have on an ML training pipeline? 66 | 67 | === An example: GPT-4 === 68 | 69 | Some facts: 70 | + trained on roughly 13 trillion tokens 71 | + 25,000 A100 processors 72 | + span of 3 months 73 | + over $100M cost according to Sam Altman 74 | https://www.kdnuggets.com/2023/07/gpt4-details-leaked.html 75 | https://en.wikipedia.org/wiki/GPT-4 76 | 77 | (And the next generation of models have taken even more) 78 | 79 | Contrast: 80 | 81 | our population dataset in HW1 is roughly 60,000 lines and roughly 1.6 MB on disk.' 82 | 83 | Over 1 million times less data than the amount of tokens for GPT-4! 84 | 85 | Conclusion: scaling matters. 86 | NVIDIA stock: 87 | 88 | https://www.google.com/finance/quote/NVDA:NASDAQ?window=5Y 89 | 90 | === Thinking about scalability === 91 | 92 | Points: 93 | 94 | - We can think of scaling in terms of throughput and latency 95 | 96 | See extras/scaling-example.png for an example! 97 | 98 | If your application is scaling successfully, 99 | double the # of workers or processors => double the throughput 100 | (best case scenario) 101 | 102 | double the # of processors => half the latency? (Y or N) 103 | 104 | A: Not necessarily 105 | 106 | Often latency is not affected in the same way. 107 | 108 | If we can scale our application successfully, 109 | we are typically looking to increase the throughput of the application. 110 | 111 | === A note about Pandas === 112 | 113 | Disdavantage of Pandas: does not scale! 114 | 115 | - Wes McKinney, the creator of Pandas: 116 | "my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset" 117 | 118 | https://wesmckinney.com/blog/apache-arrow-pandas-internals/ 119 | 120 | Exercise: 121 | 122 | Check how much RAM your personal laptop has. 123 | According to McKinney's estimate, how much how large of a dataset in population.csv 124 | could your laptop handle? 
125 | 126 | (Let me show how to do this) 127 | 128 | My answer: 18 GB 129 | 130 | Next time at the start of class, we'll poll the class for various 131 | answers and figure out how large of a dataset we could handle in Pandas. 132 | 133 | Recap: 134 | 135 | - We went over midterm topics 136 | - We saw some motivation for why you might want to scale your application 137 | - We defined scalability and types of scalability (running on more data vs. running more often) 138 | - Pandas does not scale in this sense. 139 | 140 | ----- Where we ended for today ----- 141 | 142 | (Finishing up) 143 | 144 | Recall: statement from Wes McKinney 145 | 146 | Uncomment and run this code to find out your RAM 147 | """ 148 | 149 | import subprocess 150 | import platform 151 | 152 | def get_ram_1(): 153 | system = platform.system() 154 | if system == "Darwin": 155 | subprocess.run(["sysctl", "-nh", "hw.memsize"]) 156 | elif system == "Linux": 157 | subprocess.run(["grep", "MemTotal", "/proc/meminfo"]) 158 | elif system == "Windows": 159 | print("Windows unsupported, try running in WSL?") 160 | else: 161 | print(f"Unsupported platform: {system}") 162 | 163 | # Run the above 164 | print("Amount of RAM on this machine (method 1):") 165 | get_ram_1() 166 | 167 | import psutil 168 | def get_ram_2(): 169 | ram_bytes = psutil.virtual_memory().total 170 | ram_gb = ram_bytes / (1024 ** 3) 171 | print(f"{ram_gb:.2f} GB") 172 | 173 | print("Amount of RAM on this machine (method 2):") 174 | get_ram_2() 175 | 176 | """ 177 | Poll: 178 | 179 | Answers: 180 | 181 | Method 1: 182 | 8.5, 8.6, 17.1, 15, 17.1, 16.5, 17.1 183 | 184 | Method 2: 185 | 8, 8, 16, 7.62, 15.61, 14.45, 15.4, 7.82, 16, 15.3, 31.6 186 | """ 187 | 188 | method1 = [8.5, 8.6, 17.1, 15, 17.1, 16.5, 17.1] 189 | method2 = [8, 8, 16, 7.62, 15.61, 14.45, 15.4, 7.82, 16, 15.3, 31.6] 190 | 191 | average1 = sum(method1) / len(method1) 192 | average2 = sum(method2) / len(method2) 193 | 194 | print(f"method 1 avg: {average1}", f"method 2 avg: {average2}") 195 | 196 | ram_needed = average2 / 10 197 | 198 | print(f"Using method 2: we can handle a dataset up to size: {ram_needed} GB") 199 | 200 | """ 201 | Please fill out your answers in the poll: 202 | 203 | https://forms.gle/sqGrHBdQBrykDoSdA 204 | 205 | Roughly, we can process like _____ as much data 206 | and then we'll a bottleneck. 207 | 208 | population.csv from HW1: 1.5 MB 209 | 210 | According to the class average, we could go up to 1000x the population 211 | and still handle it with Pandas, beyond that we will run out of space, 212 | according to McKinney's statement. 213 | """ 214 | -------------------------------------------------------------------------------- /lecture2/parts/1-introduction.py: -------------------------------------------------------------------------------- 1 | """ 2 | Lecture 2: The Shell 3 | 4 | Part 1: Introduction and Motivation 5 | 6 | This lecture will cover the shell (AKA: terminal or command line), 7 | including: 8 | 9 | - command line basics: cd, ls, cat, less, touch, text editors 10 | 11 | - calling scripts from the command line 12 | 13 | - calling command line functions from a Python script 14 | 15 | - git basics: status, add, push, commit, pull, rebase, branch 16 | 17 | - (if time) searching / pattern matching: find, grep, awk, sed 18 | 19 | Background: 20 | I will assume no background on the shell other than the 21 | one or two basic commands we have been typing in class. 
22 | (Like `python3 lecture.py`) 23 | 24 | === Discussion Question & Poll === 25 | 26 | Review from last time 27 | 28 | 1. Which of the following are valid units of latency? 29 | Hours 30 | Items / second 31 | Milliseconds 32 | Nanoseconds 33 | Rows / item 34 | Rows / minute 35 | Seconds 36 | Seconds / item 37 | 38 | 2. True or false? 39 | 40 | Throughput always (increases, decreases, is constant) with the size of the dataset 41 | Running time generally increases with the size of the dataset 42 | Latency is often measured using a dataset with only one item 43 | Latency always decreases if more than one item is processed at the same time 44 | 45 | https://forms.gle/JE4R1bMU13JAvAE36 46 | 47 | Running time generally increases - True 48 | 49 | Throughput = N / T 50 | 51 | N = number of input items 52 | T = total running time 53 | 54 | Sometimes throughput goes up, sometimes it goes down 55 | 56 | N / T - often roughly linear, but not exactly linear! 57 | 58 | - if N = 1, often the system can't benefit from "scale", 59 | so throughput will be quite low 60 | 61 | - is N increases (10, 100, 1000, ...) the system will start to 62 | benefit from scale, so throughput will increase 63 | 64 | - if N -> infinity (more data than the laptop/machine can handle at all), throughput will again tank because the system will 65 | just completely crash or lag / be unable to do things. 66 | 67 | ===== Introduction ===== 68 | 69 | === Scripting and the "Glue code" problem === 70 | 71 | Programs are not very useful on their own. 72 | We need a way for programs to talk to each other! 73 | That is, we need to "glue" programs or program components together. 74 | 75 | Examples: 76 | 77 | - Our data processing pipeline talks to the operating system when it 78 | asks for the input file life-expectancy.csv. 79 | 80 | - Python module example: If another script wants to use our code, it must import it 81 | which requires the Python runtime to find the code on the system 82 | (see `module_test.py` for example) 83 | 84 | - Much of Pandas and Numpy are written in C. So we need our Python 85 | code to call into C code. 86 | 87 | What tools do people use to "glue" programs together? 88 | 89 | 1. Using system libraries (like os and sys in Python) 90 | 91 | 2. Module systems within a programming language (`import` in Python) 92 | 93 | 3. Shell: To talk to a a C program from Python, one way would be to run commands through the shell 94 | 95 | Other solutions: 96 | 97 | 4. Scripting languages: Python, others (e.g. Ruby, Perl) 98 | 99 | For the most part, we can assume in this class that much of this interaction 100 | happens in Python 101 | (In fact, we will see how to do most things in today's lecture both 102 | in the shell, AND in Python!) 103 | But it is still useful to know how this program 104 | interaction happens "under the hood": 105 | 106 | When Python interacts with the operating system and with programming languages other than Python, *internally* a common way to do this is through the shell. 107 | 108 | ---- 109 | 110 | Let's open up the shell now. 111 | """ 112 | 113 | # Mac: Cmd+Space Terminal 114 | # VSCode: Ctrl+` (backtick) 115 | # GitHub Codespaces: Bottom part of the screen 116 | 117 | # Once we have a shell open, we have a "command prompt" where 118 | # we can run arbitrary commands/programs on the computer (so it's 119 | # like an admin window into your machine.) 120 | 121 | """ 122 | Questions: 123 | 124 | + If I can run commands from Python (we'll see that you can), then why should I use the shell? 
125 | 126 | + If I can use a well-designed GUI app (such as my Git installation), why should I use the shell? 127 | 128 | Examples where programmers and data scientists regularly use the shell: 129 | 130 | - You have bought a new server machine from Dell, and you want to connect to 131 | it to install some software on it. 132 | 133 | - You bought a virtual machine instance on AWS, and you want to connect to it 134 | to run your program. 135 | 136 | - You want to set up a Docker container with your application so that anyone 137 | can run and interact with it. You need to write a Dockerfile to do this. 138 | 139 | - Debugging software installations - missing dependencies, missing libraries 140 | 141 | (I have Python3 installed, but my program isn't recognizing it) 142 | Where is the software? Where is it expected to be? 143 | ---> move it to the correct location 144 | 145 | - You want to compile and run an experimental tool that was published on GitHub 146 | 147 | Or even, simply: 148 | 149 | - You have written some code, you want to send it to me so I can try it out. 150 | 151 | Shell on different operating systems? 152 | 153 | - Mac, Linux: Terminal app 154 | - Windows is a bit different, commands by default are very different 155 | option 1: 156 | recommend the most: WSL (Windows Subsystem for Linux) 157 | dropdown next to your shell window -> choose which type of 158 | shell you want 159 | With WSL, should be able to select a Ubuntu shell. 160 | 161 | option 2: 162 | Use the shell built into VSCode 163 | 164 | (don't recommend powershell) 165 | """ 166 | 167 | # python3 lecture.py 168 | print("Welcome to ECS 119 Lecture 2.") 169 | 170 | # python3 -i lecture.py 171 | # Quitting: Ctrl-C, Ctrl-D 172 | 173 | """ 174 | === What is the Shell? === 175 | 176 | The shell is a way to talk to your computer, run arbitrary commands, 177 | and connect those commands together. 178 | 179 | Examples we have seen: 180 | 181 | - ls (stands for "list") 182 | Show all files/folders in the current folder 183 | 184 | - cd: change directory 185 | 186 | (ls/cd often work together) 187 | 188 | - python3 .py: Run the python code found in .py 189 | 190 | NOTE: tab-autocomplete: very useful 191 | (saves keystrokes) 192 | (will cycle through theh options if there's more than one.) 193 | 194 | Very quick recap: 195 | We introduced the shell/terminal/command prompt as a way to solve the "glue code" problem 196 | 197 | We went through some motivation for when data scientists might need to use the shell (esp. to interact with things like remote servers), and saw some basic commands. 198 | 199 | We'll pick this up on Wednesday, and remember that we will be at 200 | 11am on Zoom, with discussion section in the usual classroom/class time. 201 | 202 | ***** Where we ended for today. ***** 203 | """ 204 | -------------------------------------------------------------------------------- /lecture2/parts/3-informational.py: -------------------------------------------------------------------------------- 1 | """ 2 | Friday, October 17 3 | 4 | Part 3: Informational Commands 5 | 6 | Discussion Question & Poll: 7 | 8 | 1. Which of the following are a good use cases for things to list in .gitignore? (Select all that apply) 9 | 10 | 11 | 2. Platform-specific things to be aware of could include... (Select all that apply) 12 | 13 | 14 | https://forms.gle/HmmT8BjXtiBvferRA 15 | 16 | Some notes: 17 | 18 | Q1: 19 | - Not all hidden files are unimportant! 
20 | Some, like .gitignore may be important, or may be useful to track 21 | with the repository. 22 | 23 | Other hidden files, like .DS_Store are not important and can be ignored. 24 | 25 | Q2: 26 | - Python is cross-platform (at least for things like a Hello World! program) 27 | and will work on any operating system 28 | 29 | - definition: what is a Platform anyway? 30 | 31 | Platform = The operating system + the architecture + any libraries or other environment packages that are installed 32 | 33 | Review poll answers: exams/poll_answers.md 34 | 35 | Continuing the shell: 36 | 37 | Last time: 38 | 39 | I introduced a model for interacting with the shell which I called the 40 | 3-part model: 41 | - Informational commands 42 | - Help commands 43 | - Doing stuff commands 44 | 45 | Analogy: 46 | I mentioned this is kind of like playing a text-based adventure game 47 | like "Zork" (1970s), many other old games 48 | 49 | === Informational commands === 50 | 51 | Just as in a text-based adventure, 52 | the most important thing you need to know when opening a shell is 53 | how to "look" around. What do you see? 54 | 55 | Key features of such commands: 56 | - Don't modify your system state at all 57 | - Might tell you some information about your system and things around you, 58 | and what you might want to do next. 59 | 60 | The same approach to progressing the game in Zork applies to the shell! 61 | Including external tools people have built, and even commands outside of the shell, like 62 | functions in Python: 63 | knowing how to "see" the current 64 | state relevant to your command is often the first step to get more comfortable with the command. 65 | 66 | So how do we "look around"? 67 | 68 | - ls 69 | we have already seen - list files/directories in the current location 70 | 71 | "current working directory" 72 | 73 | - echo $PWD -- get our current location 74 | PWD = Print Working Directory 75 | echo = Repeat whatever I said 76 | echo "text" -- repeat text 77 | 78 | $VAR - means a variable with name VAR 79 | 80 | These are called "environment variables" 81 | 82 | - echo $PATH -- get other locations where the shell looks for programs 83 | If you've had any difficulties installing software, you may have heard of 84 | the path! 85 | 86 | In order for software to actually run, you need to add it to your PATH. 87 | python3 -- "command not found"??? 88 | conda -- "command not found"??? 89 | It's possible that you need to add something to your PATH. 90 | 91 | When we do something like `python3 --version` -- we're checking that you 92 | have the software installed, AND that it's been added to your path. 93 | 94 | It's a common source of confusion to have multiple installations of the same 95 | software on your machine -- this is common, for example, with Python 96 | You need to add all installations, or, only the most relevant/recent installation 97 | to your PATH to ensure that that's the one that gets run. 98 | 99 | Programs in $PATH are available both programmatically (to other programs), 100 | and to the user. 101 | 102 | - echo $SHELL -- get the current shell we are using 103 | 104 | Point: 105 | There are different shells and terminal implementations. 106 | The default one on MacOS these days is zsh 107 | bash is another common/very well-used shell (for example, on Linux systems) 108 | 109 | If you're on Windows I recommend using WSL so that you have access to a 110 | similar shell (usually bash) 111 | 112 | I can run one shell from another shell. 
Try: 113 | - bash 114 | - zsh 115 | - Kind of like the Python3 command prompt 116 | 117 | Are there advantages to one shell over another? 118 | 119 | Yes, there are also some advanced/modern shells that you can install 120 | 121 | - maybe with some interesting graphical interface 122 | - maybe with some interesting color coding 123 | - maybe with some AI support 124 | 125 | Most shells try to support a similar syntax so that people don't get 126 | confused going from one shell to another. 127 | 128 | ---- other possible answers (skipped) ---- 129 | 130 | Usability: Some modern shells have fancy things like syntax highlighting, 131 | GUIs that you can click around in, etc. 132 | 133 | Portability: You'll want a shell that sort of behaves as a Unix-like shell 134 | Avoid: PowerShell (Windows syntax) 135 | 136 | (Mac and Linux systems are Unix-based. Windows is based on a totally different 137 | OS architecture.) 138 | 139 | A system will come with a built-in shell that you would start out with 140 | If you want a different one you could use that shell to install another shell. 141 | 142 | === Environment variables === 143 | 144 | The $ are called "environment variables" -- there are others! 145 | These represent the current state of our shell environment. 146 | 147 | When I write x = 3 in Python, x becomes a local variable assigned to the integer "3" 148 | Similarly, $PATH and $PWD are local variables in the shell. 149 | They're assigned to some values, and when written the shell will expand them out 150 | to whatever values they're assigned to. 151 | 152 | When we run `echo $PWD` what's actually happening: 153 | - $PWD gets expanded out to its value (/Users/..../119/lecture2) 154 | - This value gets printed back out to the shell output by `echo` 155 | 156 | You can also define and set your own environment variables. 157 | 158 | Are environment variables local? Will they persist after the shell session terminates? 159 | 160 | A: 161 | No, they won't 162 | But there is a way to make things persist and these are the shell config files like 163 | - .bash_profile, .bashrc, .zshrc, ... 164 | - These are files with random code in them that gets executed whenever you open a shell. 165 | + For zsh, every time I open a shell, .zshrc is executed 166 | - This is why we don't have to keep adding Python, conda, etc. to the $PATH every 167 | time we open a new shell. 168 | 169 | This system - of environment variables and $PATH and .zshrc, etc. is 170 | the precarious fabric on which all software installation is working under the 171 | hood. 172 | 173 | In case you need to access similar functionality from a Python script: 174 | """ 175 | 176 | # with a built-in Python library 177 | def pwd_1(): 178 | print(os.getcwd()) 179 | 180 | # pwd_1() 181 | 182 | # with subprocess (run an arbitrary command) 183 | # This one is a bit harder 184 | def pwd_2(): 185 | # os.environ is the Python equivalent of the shell $ indicator. 186 | subprocess.run(["echo", os.environ["PWD"]]) 187 | 188 | # pwd_2() 189 | 190 | # In fact, we could just use this directly as well, and this offers a third way 191 | def pwd_3(): 192 | print(os.environ["PWD"]) 193 | 194 | # pwd_3() 195 | 196 | # Q: What happens when we run this from a different folder? 197 | 198 | # It matters what folder you run a program or command from! 
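
# (One extra illustration, my own and not from lecture: the working directory is a
# property of the *running process*, and it can change while a program runs.
# os.chdir updates what os.getcwd() reports, but it does NOT rewrite the $PWD
# environment variable inherited from the shell -- another way the two can disagree.)
import os  # harmless if it was already imported above

def pwd_4():
    print("Before:", os.getcwd())
    os.chdir("..")  # move up to the parent folder
    print("After: ", os.getcwd())
    print("$PWD still says:", os.environ.get("PWD", "(not set)"))

# pwd_4()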
199 | 200 | """ 201 | Recap: 202 | 203 | - We talked about informational commands - ways to get the state of the system 204 | 205 | + Current working directory 206 | + Files/folders in the system (or in the current woroking directory) 207 | + The shell that's running 208 | 209 | - We talked about environment variables 210 | 211 | $PWD, $PATH 212 | 213 | These are important pieces of system information 214 | 215 | - We talked a little bit about .zshrc, .bash_profile, etc. which are 216 | shell configuration files 217 | 218 | + Lists of shell commands that run when you open a shell. 219 | 220 | BTW, virtual machines and things like Docker also have similar such config files 221 | 222 | Dockerfile -- list of shell commands that gets run. 223 | 224 | Next time we will talk about: 225 | - help commands, doing stuff commands 226 | """ 227 | -------------------------------------------------------------------------------- /lecture1/parts/3-dataflow-graphs.py: -------------------------------------------------------------------------------- 1 | """ 2 | Friday, October 3 3 | 4 | Part 3: 5 | From ETL to Dataflow Graphs. 6 | 7 | === Poll === 8 | 9 | Which of the following are most likely advantages of writing or rewriting 10 | a data processing pipeline as well-structured software (separated into modules, classes, and functions)? 11 | 12 | https://forms.gle/akNYHe8SY1CSU5KT9 13 | 14 | === Dataflow graphs === 15 | 16 | We can view all of the above steps as something called a dataflow graph. 17 | 18 | ETL jobs can be thought of visually like this: 19 | 20 | (Extract) -> (Transform) -> (Load) 21 | 22 | This is a Directed Acyclic Graph (DAG) 23 | 24 | A graph is set a nodes and a set of edges 25 | 26 | () () -> () 27 | () -> () 28 | 29 | Nodes = points 30 | Edges = arrows between points 31 | 32 | We're going to generalize the ETL model to allow an arbitrary 33 | number of nodes and edges. 34 | 35 | Q: What are the nodes? What are the edges? 36 | 37 | - Nodes: 38 | Each node is a function that takes input data and produces output data. 39 | 40 | Such a function is called an *operator.* 41 | 42 | In Python: 43 | """ 44 | 45 | # example: operator with 3 inputs: 46 | def stage(input1, input2, input3): 47 | # do some processing 48 | # return the output 49 | return output 50 | 51 | # example: operator with 0 inputs: 52 | def stage(): 53 | # do some processing 54 | # return the output 55 | return output 56 | 57 | """ 58 | - Edges: 59 | Edges are datasets! 60 | They can either be: 61 | - individual rows of data, or 62 | - multiple rows of data... 63 | 64 | - More specifically, we draw an edge 65 | from a node (A) to a node (B) if 66 | the operator B directly uses the output of operator (A). 67 | 68 | In our previous example? 69 | 70 | (1) Extract = loading the set of websites sessions into a Pandas dataframe 71 | + Input: None (because we loaded from a file) 72 | + Output: A Pandas DataFrame 73 | 74 | (2) Transform = taking in the Pandas DataFrame and returning the session with the 75 | maximum time spent 76 | + Input: The pandas DataFrame from the Extract stage 77 | + Output: the maximum session 78 | 79 | (3) Load = taking the maximum session and saving that to a file (in our case, save.txt) 80 | + Input: the maximum session from the Transform stage 81 | + Output: None (because we saved to a file) 82 | 83 | Graph: 84 | 85 | (1) -> (2) -> (3) 86 | 87 | Questions: 88 | 89 | - Why is there an edge from (1) to (2)? 
90 | Because the Transform stage uses the output from the Extract stage 91 | 92 | - Why is there NOT an edge from (2) to (1)? 93 | Stage (1) doesn't use the output from stage (2) 94 | 95 | - Why is there NOT an edge from (1) to (3)? 96 | Stage (3) doesn't directly use the output from stage (1). 97 | 98 | The graph is **acyclic,** meaning it does not have loops. 99 | 100 | (This is why it's a "directed acyclic graph" (DAG)). 101 | 102 | - (Why can we assume this?) 103 | + It doesn't appear to make sense for stage (1) to use stage (2)'s output, 104 | and stage (2) to use stage (1)'s output. 105 | + Generalizations of this are possible, but we will not get into this now. 106 | 107 | === Q: Do all data processing pipelines look like series of stages? === 108 | 109 | A "very complicated" data processing job: 110 | a long series of stages 111 | 112 | (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> ... -> (10) 113 | 114 | But not all data processing dataflow graphs have this form! 115 | Let's do a quick example 116 | 117 | Suppose that in addition to the maximum session, we want the minimum session. 118 | """ 119 | 120 | def stage1(): 121 | data = { 122 | "User": ["Alice", "Alice", "Charlie"], 123 | "Website": ["Google", "Reddit", "Wikipedia"], 124 | "Time spent (seconds)": [120, 300, 240], 125 | } 126 | df = pd.DataFrame(data) 127 | return df 128 | 129 | def stage2(df): 130 | t = df["Time spent (seconds)"] 131 | # Max of t 132 | max = t.max() 133 | # Filter 134 | # This syntax in Pandas for filtering rows 135 | # df[colname] 136 | # df[row filter] (row filter is some sort of Boolean condition) 137 | return df[df["Time spent (seconds)"] == max] 138 | 139 | def stage3(df): 140 | # Save the dataframe somewhere 141 | with open("save.txt", "w") as f: 142 | print(df, file=f) 143 | 144 | # New stage: 145 | # Also compute the minimum session 146 | def stage4(df): 147 | t = df["Time spent (seconds)"] 148 | min = t.min() 149 | return df[df["Time spent (seconds)"] == min] 150 | 151 | # Finally, print the output from stage4 152 | def stage5(df): 153 | # print: the .head() of the dataframe, which will give you the first 154 | # few rows. 155 | print(df.head()) 156 | 157 | # Try running the pipeline 158 | df1 = stage1() 159 | df2 = stage2(df1) 160 | df3 = stage3(df2) # uses output from stage 2 161 | df4 = stage4(df1) 162 | df5 = stage5(df4) # uses output from stage 4 163 | 164 | """ 165 | As a dataflow graph: 166 | 167 | (1) -> (2) -> (3) 168 | -> (4) -> (5) 169 | 170 | This is our dataflow graph for this example. 171 | 172 | Seems like a simple idea, but this can be done for any data processing pipeline! 173 | 174 | We will see that this is a very powerful abstraction. 175 | 176 | === Why is this useful? === 177 | 178 | - We'll use this to think about development, testing, and validation 179 | - We'll use this to think about parallelism 180 | - We'll use this to think about performance. 181 | """ 182 | 183 | """ 184 | A slightly more realistic example 185 | 186 | What I want to practice: 187 | 1. Thinking about the stages involved in a data processing computation as separate stages 188 | (List out all of the data processing stages) 189 | 2. Writing down the dataflow graph 190 | 3. Translating that to Python code using PySpark by writing a separate Python function 191 | for each stage 192 | 193 | Let's consider how to write a minimal data processing pipeline 194 | as a more structured software pipeline. 195 | 196 | First thing we need is a dataset! 
197 | 198 | Useful sites: 199 | - https://ourworldindata.org/data 200 | - https://datasetsearch.research.google.com/ 201 | - sklearn: https://scikit-learn.org/stable/api/sklearn.datasets.html 202 | 203 | The dataset we are going to use: 204 | life-expectancy.csv 205 | 206 | We will also use Pandas, as we have been using: 207 | 208 | Useful tutorial on Pandas: 209 | https://pandas.pydata.org/docs/user_guide/10min.html 210 | https://pandas.pydata.org/docs/user_guide/indexing.html 211 | """ 212 | 213 | # Step 1: Getting a data source 214 | # creating a DataFrame 215 | # DataFrame is just a table: it has rows and columns, and importantly, 216 | # each column has a type (all items in the column must share the same 217 | # type, e.g., string, number, etc.) 218 | df = pd.read_csv("life-expectancy.csv") 219 | 220 | # To play around with our dataset: 221 | # python3 -i lecture.py 222 | 223 | # We find that our data contains: 224 | # - country, country code, year, life expectancy 225 | 226 | # What should we compute about this data? 227 | 228 | # # A simple example: 229 | # min_year = df["Year"].min() 230 | # max_year = df["Year"].max() 231 | # print("Minimum year: ", min_year) 232 | # print("Maximum year: ", max_year) 233 | # avg = df["Period life expectancy at birth - Sex: all - Age: 0"].mean() 234 | # print("Average life expectancy: ", avg) 235 | 236 | # # Tangent: 237 | # # We can do all of the above with df.describe() 238 | 239 | # # Save the output 240 | # out = pd.DataFrame({"Min year": [min_year], "Max year": [max_year], "Average life expectancy": [avg]}) 241 | # out.to_csv("output.csv", index=False) 242 | 243 | """ 244 | Q for next time: rewrite this as a Dataflow graph using the steps above 245 | 246 | === Recap === 247 | 248 | We learned that ETL jobs are a special case of dataflow graphs, 249 | where we have a set of nodes (operators/stages) and edges (which are drawn when the output 250 | of one operator or stage depends on the output of the previous operator or stage) 251 | 252 | Revisiting the steps above: 253 | 1. Write down all the stages in our pipeline 254 | 2. Draw a dataflow graph (one node per stage) 255 | 3. Implement the code (one Python function per stage) 256 | 257 | We have done 1 and (sort of) 3, we will do 2 at the start of class next time. 258 | 259 | ********* Where we ended for today ********** 260 | """ 261 | -------------------------------------------------------------------------------- /lecture2/parts/5-git.py: -------------------------------------------------------------------------------- 1 | """ 2 | Wednesday, October 22 3 | 4 | Part 5: Git 5 | 6 | Poll quesiton: 7 | 8 | 1. Give an example of a command that uses a positional argument 9 | 10 | 2. Give an example of a command that uses a named argument 11 | 12 | 3. Why do you think that commands have both positional and named arguments? 
13 | 14 | A) There is no reason for this, it's a historical accident 15 | B) Positional arguments are more often optional, named arguments are more often required 16 | C) Named arguments are more often optional, positional arguments are more often required 17 | D) Named arguments can be combined with positional arguments to specify options or modify command behavior 18 | E) Named arguments emphasize the intended purpose of the argument for readability purposes 19 | F) Named arguments allow easily specifying Boolean flags (like turning debug mode on or off) 20 | 21 | https://forms.gle/UNCmxWcRE53MkLNv7 22 | 23 | Comments: 24 | 25 | Analogy: 26 | positional argument 27 | cd dir 28 | def cd(dir) 29 | 30 | named argument 31 | python --version 32 | def python(version=True) 33 | 34 | Correct answers: C, D, E, F 35 | 36 | Named arguments are more often optional 37 | 38 | Named arguments are often used as configuration flags 39 | 40 | python3 -i <--- modifies the "way" that we run Python 41 | 42 | Another difference that wasn't mentioned: 43 | 44 | Typically the order does not matter in named arguments! 45 | 46 | ls -alh 47 | ls -lah 48 | ls -a -l -h 49 | 50 | If the user might not want to remember which order to call 51 | the args in, another good use case for named. 52 | 53 | ===== Git ===== 54 | 55 | Git follows the same model as other shell commands! 56 | 57 | Informational commands: 58 | - git status 59 | 60 | Info returned by git status? 61 | + "On branch" main 62 | + Whether my branch is up to date 63 | + Info about modified files 64 | + Do I have any changes to commit 65 | 66 | - git log 67 | 68 | (down arrow, up arrow, q to quit) 69 | 70 | Shows a list of commits that have been made to the repository. 71 | 72 | Mental model of git: it's a tree 73 | 74 | root: 75 | (Instructor initialized the repository) 76 | 77 | Every time a change is made to the code, it grows the tip of the tree 78 | After instructor posts lecture 1 and 2, 79 | 80 | root 81 | | 82 | lecture1 83 | | 84 | lecture2 85 | 86 | If the TA simultaneously makes changes to lecture1 87 | Two diverging branches: 88 | 89 | root 90 | | 91 | lecture1 92 | | \ 93 | lecture2 (instructor's changes) lecture1 (TA's changes) 94 | 95 | When you check out the code, you're at a particular point in 96 | the tree. 97 | 98 | Git is sort of the opposite of keeping everything up to date :-) 99 | 100 | Modern webapp philosophy: everything should sync automatically! 101 | 102 | Git philosophy: nothing should sync automatically! 103 | 104 | That means that everyone opts in to what changes they do/do not 105 | want for the code. 106 | 107 | Side note: interesting questions about philosophy of collaboration 108 | and why we may or may not want to share our work/progress with others. 109 | 110 | Imagine two people working on the same branch at once, 111 | why would that be a problem? 112 | One person's work could break another person's work :-( 113 | 114 | + This gets more common the more people are working on a shared 115 | project. 116 | 117 | + Unit tests fail, code fails to compile, etc. 118 | 119 | + In this scenario, Git saves you: it says, you get to "check 120 | out" your copy of the code and be assured that your copy is 121 | yours to play around with. 
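(A concrete way to see this tree on your own machine -- we did not run this in class, but the flags are standard git: `git log --graph --oneline --all` draws the commit history as ASCII art, with diverging branches shown as diverging lines. As with the other commands, you could also call it from Python, e.g. with a small helper like this sketch:)

def git_tree():
    # Draw the commit tree for all branches as ASCII art (assumes subprocess is imported)
    subprocess.run(["git", "log", "--graph", "--oneline", "--all"])

# git_tree()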
122 | 123 | Corollaries: 124 | 125 | - Everyone can be at a different point of the tree 126 | 127 | - Different people can work on different branches 128 | 129 | - If two people try to "push" the code - publishing it 130 | for others, we need to have some way of determining whose 131 | code wins the race, or how to combine the different changes 132 | to the code. 133 | 134 | ===> "merge conflict problem" 135 | 136 | - Each individual working on a branch may not need to see 137 | the entire tree at once to do their work. 138 | 139 | ======> You need the list of changes up until your point in 140 | the tree 141 | 142 | ======> You only need a "local view" or local window into 143 | the tree, which contains a copy of some of the changes. 144 | 145 | When you git clone, or "git stash, git pull" the lecture notes, 146 | you are creating an instance of this philosophy - essentially, 147 | you're working on your own little branch of the tree. 148 | 149 | In that context: 150 | 151 | First two lines of git status: where I am in the tree 152 | 153 | Changes: changes I've made on my local copy of the tree. 154 | 155 | - git log --oneline 156 | 157 | List of changes, one per line 158 | 159 | - git branch -v 160 | 161 | More information about the branch you're on 162 | 163 | - git diff 164 | 165 | Most useful second to git status -- what changes you've made 166 | to the code on your local copy. 167 | 168 | What about help commands? Try: 169 | - man git 170 | - git status --help 171 | - git log --help 172 | - git add --help 173 | - git commit --help 174 | 175 | BTW: log, add, commit -- "subcommands" 176 | You can think of them like a special type of positional argument 177 | 178 | Finally, doing stuff: 179 | 180 | For getting others' changes: 181 | - git pull 182 | 183 | Pull the latest code from the "published" version of the branch 184 | 185 | - git push 186 | 187 | Attempt to push your code to the "published" version of the branch 188 | 189 | If one branch is behind the other, great! we can just grow that 190 | branch to make it equal to the other one 191 | 192 | We're gonna have a problem if they are on diverging or two separate 193 | branches, git does a "merge" 194 | and try to shove the two things together. 195 | 196 | Sometimes it works, sometimes it doesn't and you have to debug. 197 | 198 | Two things could fail: 199 | 200 | 1. Git could not know how to merge the changes 201 | (usually happens if both people modified) 202 | Results in: "Merge conflict" 203 | 204 | 2. Merge succeeds, but the code breaks 205 | 206 | Two conflicting features, one feature breaks someone else's 207 | unit test, etc. 208 | 209 | (Related commands -- not as worried about:) 210 | - git fetch 211 | - git checkout 212 | 213 | For sharing/publishing your own changes 214 | (a common sequence of three to run): 215 | - git add . 216 | 217 | After a git add, I usually do a: 218 | git status 219 | git diff --staged 220 | 221 | AND: 222 | Run the code again just to make sure everything looks good 223 | 224 | - git commit -m "Commit message" 225 | 226 | Modify what you just did: 227 | git commit --amend 228 | 229 | Then I would do a git status again 230 | 231 | **Most important:** 232 | To publish your code to the main branch 233 | 234 | Magic sequence of 3 commands: 235 | git add, git commit, git push. 236 | 237 | This is a multi-step process because git wants you to 238 | be deliberate about all changes. 239 | 240 | git add = what changes you want 241 | 242 | git commit = why? 243 | 244 | git push = publish it. 
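(Aside: like the other shell commands in this lecture, the three-step sequence can also be scripted from Python with subprocess. A rough sketch, not the course's code -- it assumes `import subprocess` as in the earlier parts, and check=True just stops the sequence if any step fails:

    def publish_changes(message):
        # git add = what changes you want
        subprocess.run(["git", "add", "."], check=True)
        # git commit = why
        subprocess.run(["git", "commit", "-m", message], check=True)
        # git push = publish it
        subprocess.run(["git", "push"], check=True)

In practice you would still want to run git status / git diff --staged and re-run your code in between, as described above.)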
245 | 246 | git add, git commit = your local branch only 247 | 248 | git push = shares it publicly. 249 | 250 | ========================================= 251 | 252 | ===== Other miscellaneous things ===== 253 | 254 | i) Text editors in the shell 255 | 256 | Running `git commit` without the `-m` option opens up a text 257 | editor! 258 | 259 | Vim: dreaded program for many new command line users 260 | 261 | Get stuck -- don't know how to quit vim! 262 | 263 | :q + enter 264 | 265 | The most "accessible" of these is probably nano. 266 | 267 | Sometimes files open by default in vim and you have to 268 | know how to close the file. 269 | 270 | Use nano (most accessible), don't use vim and emacs. 271 | """ 272 | 273 | def edit_file(file): 274 | subprocess.run(["nano", file]) 275 | 276 | # Let's edit the lecture file and add something here. 277 | # print("Hello, world!") 278 | 279 | """ 280 | Text editors get opened when you run git commands 281 | like git commit without a message. 282 | 283 | ii) Variations of git diff 284 | 285 | git diff 286 | git diff --word-diff (word level diff) 287 | git diff --word-diff-regex=. (character level diff) 288 | 289 | git diff --staged -- after you do a git add, shows diff from green 290 | changes 291 | 292 | iii) Other git commands (selected most useful): 293 | 294 | - git merge -- merge together different conflicting versions of the code 295 | - git rebase 296 | - git rebase -i -- often useful for modifying commit messages 297 | - git branch -- create a new branch, often useful for developing new features. 298 | 299 | Just like before, we can also run these commands in Python. 300 | """ 301 | 302 | def git_status(): 303 | # TODO 304 | raise NotImplementedError 305 | -------------------------------------------------------------------------------- /lecture4/parts/2-parallelism.py: -------------------------------------------------------------------------------- 1 | """ 2 | Part 2: Definitions and Parallel/Concurrent Distinction 3 | 4 | Parallel computing: speeding up our pipeline by doing more than 5 | one thing at a time! 6 | 7 | === Getting Started === 8 | 9 | Parallel, concurrent, and distributed computing 10 | 11 | They're different, but often get (incorrectly) used synonymously! 12 | 13 | What is the difference between the three? 14 | 15 | Let's make a toy example. 16 | 17 | We will forgo Pandas for this lecture 18 | and just work in plain Python for the sake of clarity, 19 | but all of this applies to a data processing pipeline written using 20 | vectorized operations as well (as we will see soon). 21 | 22 | Baseline: 23 | (sequential pipeline) 24 | 25 | Sequential means "not parallel", i.e. one thing happening at a time, 26 | run on a single worker. 27 | 28 | (Rule 0 of parallel computing: 29 | Any time we're measuring parallelism we want to start 30 | with a sequential version of the code!) 31 | 32 | Scalability! But at what COST? 33 | https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf 34 | 35 | Q: can I think of sequential computing like a monolithic application, 36 | and parallel computing like microservices? 37 | 38 | That's related to the distibuted computing part - we'll talk more about 39 | the difference between parallel/distributed soon. 
40 | 41 | For now think of: 42 | 43 | Sequential baseline = 1 machine, only 1 CPU runs (1 worker) 44 | 45 | Parallel = multiple workers (machines or CPUs) 46 | """ 47 | 48 | def average_numbers(N): 49 | sum = 0 50 | count = 0 51 | for i in range(N): 52 | # Busy loop with some computation in it 53 | sum += i 54 | count += 1 55 | return sum / count 56 | 57 | # Uncomment to run: 58 | # N = 200_000_000 59 | # result = average_numbers(N) 60 | # print(f"Result: {result}") 61 | 62 | # baseline (Sequential performance) is at 9.07s 63 | 64 | # From the command line: 65 | # time python3 lecture.py 66 | 67 | # How are we doing on CPU usage: 68 | # Activity Monitor in MacOS (Window -> CPU Usage) 69 | 70 | """ 71 | Task gets moved from CPUs from time to time, making it a little difficult to see, 72 | but at any given time one CPU is being used to run our program. 73 | 74 | What if we want to do more than one thing at a time? 75 | 76 | Let's say we want to make our pipeline twice as fast. 77 | 78 | We're adding the numbers from 1 to N... so we could: 79 | 80 | - Have worker1 add up the first half of the numbers 81 | 82 | - Have worker2 add up the second half 83 | 84 | At the end, combine worker1's and worker2's results 85 | 86 | Our hope: we take about half the time to complete the computation. 87 | 88 | """ 89 | 90 | # ************************************** 91 | # ********** Ignore this part ********** 92 | # ************************************** 93 | 94 | # NOTE: Python has something called a global interpreter lock (GIL) 95 | # which often prevents code from running in parallel (via threads) 96 | # We are using this purely for illustration, but Python is generally 97 | # not a good fit for parallel and concurrent code. 98 | 99 | from multiprocessing import Process, freeze_support 100 | 101 | def run_in_parallel(*tasks): 102 | running_tasks = [Process(target=task) for task in tasks] 103 | for running_task in running_tasks: 104 | running_task.start() 105 | for running_task in running_tasks: 106 | result = running_task.join() 107 | 108 | # ************************************** 109 | # ************************************** 110 | # ************************************** 111 | 112 | def worker1(): 113 | sum = 0 114 | count = 0 115 | for i in range(N // 2): 116 | sum += i 117 | count += 1 118 | print(f"Worker 1 result: {sum} {count}") 119 | # return (sum, count) 120 | 121 | def worker2(): 122 | sum = 0 123 | count = 0 124 | for i in range(N // 2, N): 125 | sum += i 126 | count += 1 127 | print(f"Worker 2 result: {sum} {count}") 128 | # return (sum, count) 129 | 130 | def average_numbers_parallel(): 131 | results = run_in_parallel(worker1, worker2) 132 | print(f"Computation finished") 133 | 134 | # Uncomment to run 135 | # N = 200_000_000 136 | # if __name__ == '__main__': 137 | # freeze_support() # another boilerplate line to ignore 138 | # average_numbers_parallel() 139 | 140 | # time python3 lecture.py: 5.2s 141 | # CPU usage 142 | 143 | """ 144 | New result: roughly half the time! (Twice as fast) 145 | 146 | Not exactly twice as fast - why? 147 | 148 | One reason is because of the additional boilerplate required 149 | to run multiple workers and combine the results. 150 | 151 | We actually didn't combine the results! 152 | (We should add the results from worker1 and worker2) 153 | This would also add a small amount of overhead 154 | 155 | We've successfully achieved parallelism! 156 | We have two workers running at the same time. 157 | 158 | === What is parallelism? 
=== 159 | 160 | Imagine a conveyor belt, where our numbers are coming in on the belt... 161 | 162 | ASCII art: 163 | 164 | ==> | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | ==> 165 | ... ========================================== worker1 166 | worker2 167 | 168 | Our worker takes the items off the belt and 169 | adds them up as they come by. 170 | 171 | Worker: 172 | (sum, count) 173 | (0, 0) -> (1, 1) -> (3, 2) -> (6, 3) -> (10, 4) -> ... 174 | 175 | When is this parallel? 176 | 177 | There are multiple workers working at the same time. 178 | 179 | The workers could be working on the same conveyor belt 180 | or two different conveyor belts 181 | 182 | Worker1 and worker2 are working on separate conveyer belts! 183 | 184 | ==> | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | ==> 185 | ... ========================================== worker1 186 | 187 | ==> | | ... | 1000002 | 1000001 | 1000000 | ==> 188 | ... ========================================== worker2 189 | 190 | === What is concurrency? === 191 | 192 | Concurrency is when there are multiple tasks happening that might overlap or conflict. 193 | 194 | - If the workers are working on the same conveyer belt ... then operations might conflict 195 | 196 | - If the workers are working on different conveyer belts ... then operations won't conflict! 197 | 198 | ----- Where we ended for today ----- 199 | 200 | ----------------------------- 201 | 202 | Oct 29 203 | (Finishing up a few things) 204 | 205 | Recap: 206 | 207 | - Once our pipeline is working sequentially, we want to figure 208 | out how to **scale** to more data and more frequent updates 209 | 210 | - We talked about parallelism: multiple workers working at once 211 | 212 | - Conveyer belt analogy: 213 | parallel = multiple workers working at the samt eim. 214 | 215 | === Related definitions and sneak peak === 216 | 217 | Difference between parallelism & concurrency & distribution: 218 | 219 | - Parallelism: multiple workers working at the same time 220 | - Concurrency: multiple workers accessing the same data (even at different times) by performing potentially conflicting operations 221 | 222 | For now: think about it as multiple workers modifying or moving items 223 | on the same conveyer belt. 224 | 225 | - Distribution: workers operating over multiple physical devices which are independently controlled and may fail independently. 226 | 227 | Good analogy: 228 | Distributed computing is like multiple warehouses, each with its own 229 | workers and conveyer belts. 230 | 231 | For the purposes of this class: if code is running on multiple devices, 232 | it is distributed; otherwise it's not. 233 | 234 | In the conveyor belt analogy, this means... 235 | 236 | - Parallelism can exist without concurrency! 237 | (How?) 238 | 239 | Multiple belts, each worker has its own belt 240 | 241 | - Concurrency can exist without parallelism! 242 | (How?) 243 | 244 | Operations can conflict even if the two workers 245 | are not working at the same time! 246 | 247 | Worker1 takes an item off the belt 248 | Worker1 makes some modifications and puts it back 249 | Then Worker 1 goes on break 250 | Worker2 comes to the belt 251 | Doesn't realize that worker 1 was doing anything here 252 | Takes the item off the belt 253 | Worker 2 tries to make the same modifications. 254 | 255 | We have a conflict! 256 | 257 | In fact, this is what happens in Python if you 258 | use threads. 259 | (Threads vs. processes) 260 | 261 | Multiple workers working concurrently, only one 262 | active at a given time. 
263 | 264 | - Both parallelism and concurrency can exist with/without distribution! 265 | 266 | Are the different conveyer belts operated by different computers? 267 | Do they function and fail independently? 268 | 269 | Are the different workers running on different computers? 270 | Do they function and fail independently? 271 | """ 272 | -------------------------------------------------------------------------------- /lecture6/parts/2-microbatching.py: -------------------------------------------------------------------------------- 1 | """ 2 | Part 2: 3 | Spark Streaming and Microbatching 4 | 5 | === Poll === 6 | 7 | Which of the following are most likely application scenarios for which latency is a primary concern? 8 | 9 | . 10 | . 11 | . 12 | 13 | https://forms.gle/Le4NZTDEujzcqmg47 14 | 15 | === Spark Streaming === 16 | 17 | In particular: Structured Streaming 18 | Structured = using relational and SQL abstractions 19 | Structured Streaming syntax is similar (often almost identical) to Spark DataFrames 20 | 21 | There's an analogy going on here! 22 | Batch processing application using DataFrames <---> Streaming application using Structured Streaming 23 | 24 | Let's see our streaming example in more detail. 25 | 26 | (We demoed this example last time) 27 | """ 28 | 29 | # Old imports 30 | import pyspark 31 | from pyspark.sql import SparkSession 32 | spark = SparkSession.builder.appName("OrderProcessing").getOrCreate() 33 | sc = spark.sparkContext 34 | 35 | # New imports 36 | from pyspark.sql.functions import array_repeat, from_json, col, explode 37 | from pyspark.sql.types import StructType, StructField, StringType, IntegerType 38 | 39 | # Define the schema of the incoming JSON data 40 | schema = StructType([ 41 | StructField("order_number", IntegerType()), 42 | StructField("item", StringType()), 43 | StructField("timestamp", StringType()), 44 | StructField("qty", IntegerType()) 45 | ]) 46 | 47 | def process_orders_stream(order_stream): 48 | """ 49 | important: 50 | order_stream: now a stream handle, instead of a list of plain data! 51 | """ 52 | 53 | # Parse the JSON data 54 | df0 = order_stream.select(from_json(col("value").cast("string"), schema).alias("parsed_value")) 55 | 56 | # First cut: just return the parsed orders 57 | # return df0 58 | 59 | ### Full computation 60 | 61 | # df0 is all bunched up in a single column, can we expand it? 62 | 63 | # Yes: select the data we want 64 | df1 = df0.select( 65 | col("parsed_value.order_number").alias("order_number"), 66 | col("parsed_value.item").alias("item"), 67 | col("parsed_value.timestamp").alias("timestamp"), 68 | col("parsed_value.qty").alias("qty") 69 | ) 70 | 71 | # (Notice this looks very similar to SQL! 72 | # Structured Streams uses an almost identical API to Spark DataFrames.) 73 | 74 | # return df1 75 | 76 | # Create a new field which is a list [item, item, ...] for each qty 77 | df2 = df1.withColumn("order_numbers", array_repeat(col("order_number"), col("qty"))) 78 | 79 | # Explode the list into separate rows 80 | df3 = df2.select(explode(col("order_numbers")).alias("order_number"), col("item"), col("timestamp")) 81 | 82 | return df3 83 | 84 | """ 85 | We need to decide where to get our input! Spark supports getting input from: 86 | - Apache Kafka (good production option) 87 | - Files on a distributed file system (HDFS (Hadoop File system), S3) 88 | - A network socket (basically a connection to some other worker process or network service) 89 | 90 | We're looking for a toy example, so let's use a plain socket. 
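(Aside, not part of the class demo: any program that listens on localhost:9999 and writes one JSON object per line can act as the source here, since Spark's socket reader connects to that port as a client. A rough Python sketch of such a sender, reusing the field names from the schema above -- the example values are made up, and you would start it in a separate terminal before the Spark job:

    import socket, json, time

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("localhost", 9999))
    server.listen(1)
    conn, _addr = server.accept()   # wait for Spark to connect
    for i in range(1, 6):
        order = {"order_number": i, "item": "apple",
                 "timestamp": "2025-01-01 00:00:00", "qty": 2}
        conn.sendall((json.dumps(order) + "\n").encode())
        time.sleep(1)               # one order per second
    conn.close()

)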
91 | 92 | This will require us to open up another terminal and run the following command: 93 | 94 | nc -lk 9999 95 | 96 | """ 97 | 98 | # We need an input (source) 99 | 100 | # (Uncomment to run) 101 | # Set up the input stream using a local network socket 102 | # One of the ways to get input - from a network socket 103 | order_stream = spark.readStream.format("socket") \ 104 | .option("host", "localhost") \ 105 | .option("port", 9999) \ 106 | .load() 107 | 108 | # Call the function 109 | out_stream = process_orders_stream(order_stream) 110 | 111 | # We need an output (sink) 112 | 113 | # Print the output stream and run the computation 114 | out = out_stream.writeStream.outputMode("append").format("console").start() 115 | 116 | # Run the pipeline 117 | 118 | # Run until the connection closes. 119 | out.awaitTermination() 120 | 121 | """ 122 | There are actually two streaming APIs in Spark, 123 | the old DStream API and the newer Structured Streaming API. 124 | 125 | Above uses the Structured Streaming API (which is more modern and flexible 126 | and solves some problems with DStreams, I also personally found it better 127 | to work with on my machine.) 128 | 129 | === Q + A === 130 | 131 | Q: How is the syntax different from a batch pipeline? 132 | 133 | A: It's nearly identical to DataFrames, except the input/output 134 | input: pass in a stream instead of a dataframe 135 | output: we call .writeStream.outputMode(...) 136 | 137 | Q: How is the behavior different from a batch pipeline? 138 | 139 | A: 140 | It groups events into "microbatches", and processes each microbatch 141 | in real time (aiming to achieve low latency) 142 | 143 | You can also set the microbatch duration 144 | (1 batch every 1 second, 1 batch every 0.5 seconds, 1 batch every 0.25 seconds, ...) 145 | depending on your application needs 146 | 147 | Q: How does Spark determine when to "run" the pipeline? 148 | 149 | A: Calling .start() 150 | 151 | Q: How do we know when the pipeline is finished? 152 | 153 | A: 154 | With .awaitTermination() we just wait for the user to terminate the connection 155 | (ctrl-C) 156 | You can also configure different options, for example terminate after inactivity 157 | for 1 hour, etc. 158 | 159 | Upshot: 160 | input/output configuration is different, 161 | actual application logic of the pipline is the same. 162 | 163 | === Microbatching === 164 | 165 | To process a streaming pipeline, Spark groups several recent orders into something 166 | called a "microbatch", and processes that all at once. 167 | 168 | (Side note: this isn't how all streaming frameworks work, but this is 169 | the main idea behind how Spark Streaming works.) 170 | 171 | Why does Spark do this? 172 | 173 | - If you process every item one at a time, you get minimum latency! 174 | 175 | Every user will get their order processed right away. 176 | 177 | But, throughput will suck. 
178 | 179 | - We never benefit from parallelism (because we're never doing more than one order at the same time) 180 | 181 | - We never benefit from techniques like vectorization (we can't put multiple orders into vector operations on a CPU or GPU) 182 | 183 | (Recall: turning operatinos into vector or matrix multiplications is often faster and 184 | I can only do that if I have many rows that look the same) 185 | 186 | So: by grouping items into these "microbatches", Spark hopes to still provide good latency (by processing microbatches frequently, e.g., every half a second) but at the same time benefit from 187 | throughput optimizations, e.g., parallelism and vectorization. 188 | 189 | That leads to an interesting question: how do we determine the microbatches? 190 | 191 | - There is a tension here; the smaller the batches, the better the latency; 192 | but the worse the throughput! 193 | 194 | Possible ways? 195 | 196 | 1. Wait 0.5 seconds, collect all items during that second and group it into a batch 197 | 2. Set a limit on the number of items per batch (e.g. 100), once you get 100 198 | items, close the batch 199 | 3. Set a limit on the number of items, OR wait for a 2 second timeout 200 | 201 | Some observations: 202 | 203 | - Suggestion 1 will always result in latency < 0.5 s, but batches may be small 204 | 205 | - Suggestion 2 has a serious problem, on a non-busy day, imagine there's only 1 Amazon 206 | user. 207 | 208 | Amazon user submits their order -> we wait for the other 99 orders to come in (which 209 | never happens, or takes hours) 210 | 211 | Our one user will never get their results. 212 | 213 | - Suggestion 3 fixes the problem with suggestion 2, by imposing a timeout. 214 | 215 | Suggestion 3 tries to achieve large batch sizes, but caps out at a certain maximum; 216 | latency will always be at most 2 seconds (often smaller if there are many orders). 217 | 218 | === Another way of thinking about this === 219 | 220 | It comes down to a question of "time" and how to measure progress. 221 | 222 | All distributed systems measure progress by enforcing some notion of "time" 223 | 224 | Suggestion #1 measures time in terms of the operating system clock 225 | (e.g., time.time()) 226 | 227 | Suggestions #2 measures time in terms of how many items arrive in the pipeline, 228 | and uses that to decide when to move forward. 229 | 230 | This is related to something called "logical time", and we will cover it in 231 | the following part 3. 232 | 233 | Both of these suggestions become more interesting/complicated when you consider 234 | a distributed application, where you might have (say) 5-10 different machines 235 | taking in input requests, and all of them have their own notion of time that 236 | they are measuring and enforcing. 237 | 238 | This turns out to be very important, so it is the next thing we will 239 | cover in the context of streaming pipelines. 240 | 241 | It is also important to how we measure latency and can be important 242 | to the actual behavior of our pipeline. 243 | 244 | That's a bit about why time is important, and we'll get into 245 | different notions of time in this context next time. 
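For reference: in Spark Structured Streaming, suggestion 1 roughly corresponds to the processing-time trigger. By default Spark starts the next microbatch as soon as the previous one finishes; here is a sketch of how you could pin the interval on the writeStream from earlier in this file (the trigger option is a real API, the 1-second choice is just an example):

    out = out_stream.writeStream \
        .outputMode("append") \
        .format("console") \
        .trigger(processingTime="1 second") \
        .start()

Shrinking the interval improves latency at the cost of throughput -- exactly the tension described above.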
246 | """ 247 | -------------------------------------------------------------------------------- /lecture1/extras/failures.py: -------------------------------------------------------------------------------- 1 | 2 | """ 3 | === Failures and risks === 4 | 5 | Failures and risks are problems 6 | which might invalidate our pipeline (wrong results) 7 | or cause it to misbehave (crash or worse). 8 | 9 | What could go wrong in our pipeline above? 10 | Let's go through each stage at a time: 11 | 12 | 1. Input stage 13 | 14 | What could go wrong here? 15 | 16 | - Malformed data and type mismatches 17 | - Wrong data 18 | - Missing data 19 | - Private data 20 | """ 21 | 22 | """ 23 | Problem: input data could be malformed 24 | """ 25 | 26 | # Exercise 4: Insert a syntax error by adding an extra comma into the CSV file. What happens? 27 | 28 | # A: that row gets shifted over by one 29 | # All data in each column is now misaligned; 30 | # some columns contain a mix of year and life expectancy data. 31 | 32 | # Exercise 5: Insert a row with a value that is not a number. What happens? 33 | 34 | # A: changing the year on just one entry to a string, 35 | # the "Year" field turned into a string field. 36 | 37 | # Reminder about dataframes: every column has a uniform 38 | # type. (Integer, string, real/float value, etc.) 39 | 40 | # Take home point: even a single mislabeled or 41 | # malformed row can mess up the entire DataFrame 42 | 43 | # Solutions? 44 | 45 | # - be careful about input data (get your data from 46 | # a good source and ensure that it's well formed) 47 | 48 | # - validation: write and run unit tests to check 49 | # check that the input data has the properties we 50 | # want. 51 | 52 | # e.g.: write a test_year function that goes through 53 | # the year column and checks that we have integers. 54 | 55 | """ 56 | Problem: input data could be wrong 57 | """ 58 | 59 | # Example: 60 | # Code with correct input data: 61 | # avg. 61.61799192059744 62 | # Code with incorrect input data: 63 | # avg.: 48242.7791579047 64 | 65 | # Exercise 6: Delete a country or add a new country. What happens? 66 | 67 | # Deleting a country: 68 | # 61.67449487559832 instead of 61.61799192059744 69 | # (very slightly different) 70 | 71 | # Solutions? 72 | 73 | # Put extra effort into validating your data! 74 | 75 | """ 76 | Discussion questions: 77 | - If we download multiple versions of this data 78 | from different sources (for example, from Wikipedia, from GitHub, 79 | etc.) are they likely to have the same countries? Why or why not? 80 | 81 | - What can be done to help validate our data has the right set 82 | of countries? 83 | 84 | - How might choosing a different set of countries affect the 85 | app we are using? 86 | 87 | Recap from today: 88 | 89 | - Python main functions (ways to run code: python3 lecture.py (main function), python3 -i lecture.py (main function + interactive), pytest lecture.py to run unit tests) 90 | - what can go wrong in a pipeline? 91 | - input data issues & validation. 92 | 93 | =============================================================== 94 | 95 | === Poll === 96 | 97 | 1. Which of the following are common problems with input data that you might encounter in the real world and in industry? 98 | 99 | - (poll options cut) 100 | 101 | 2. How many countries are there in the world? 
102 | 103 | Common answers: 104 | 105 | - 193: UN Members 106 | - 195: UN Members + Observers 107 | - 197: UN Members + Observers + Widely recognized 108 | - 200-300something: if including all partially recognized countries or territories. 109 | 110 | As we saw before, our dataset happens to have 261. 111 | - e.g.: our dataset did not include all countries with some form of 112 | limited recognition, e.g. Somaliland 113 | but it would include the 193, 195, or 197 above. 114 | 115 | Further resources: 116 | 117 | - https://en.wikipedia.org/wiki/List_of_states_with_limited_recognition 118 | 119 | - CGP Grey: How many countries are there? https://www.youtube.com/watch?v=4AivEQmfPpk 120 | 121 | In any dataset in the real world, it is common for there to be some 122 | subjective inclusion criteria or measurement choices. 123 | 124 | """ 125 | 126 | """ 127 | 2. Processing stage 128 | 129 | What could go wrong here? 130 | 131 | - Software bugs -- pipeline is not correct (gives the wrong answer) 132 | - Performance bugs -- pipeline is correct but is slow 133 | - Nondeterminism -- pipelines to produce different answers on different runs 134 | 135 | This is actually very common in the data processing world! 136 | - e.g.: your pipeline may be a stream of data and every time you run 137 | it you are running on a different snapshot, or "window" of the data 138 | - e.g.: your pipeline measures something in real time, such as timing 139 | - a calculation that requires a random subset of the data (e.g., 140 | statistical random sample) 141 | - Neural network? 142 | - Running a neural network or large language model with different versions 143 | (e.g., different version of GPT every time you call the GPT API) 144 | - ML model with stochastic components 145 | - Due to parallel and distributed computing 146 | If you decide to parallelize your pipeline, and you do it incorrectly, 147 | depending on the order in which different operations complete you 148 | might get a different answer. 149 | """ 150 | 151 | """ 152 | 3. Output stage 153 | 154 | What could go wrong here? 155 | 156 | - System errors and exceptions 157 | - Output formatting 158 | - Readability 159 | - Usability 160 | 161 | Often output might be: saving to a file or saving to a database, or even 162 | saving data to a cloud framework or cloud provider; 163 | and all of three of these cases could fail. 164 | e.g. error: you don't have permissions to the file; file already exists; 165 | not enough memory on the machine/cloud instance; etc. 166 | 167 | Summary: whenever saving output, there is the possibility that the save operation 168 | might fail 169 | 170 | Output formatting: make sure to use a good library! 171 | Things like Pandas will help here -- formatting requirements already solved 172 | 173 | When displaying output directly to the user: 174 | - Are you displaying the most relevant information? 175 | - Are you displaying too much information? 176 | - Are you displaying too little information? 177 | - Are you displaying confusing/incomprehensible information? 178 | 179 | e.g.: displaying 10 items we might have a different strategy than if 180 | we want to display 10,000 181 | 182 | example: review dataframe display function 183 | - dataframe: display header row, first 5 rows, last 5 rows 184 | - shrink the window size ==> fields get replaced by "..." 185 | 186 | There are some exercises at the bottom of the file. 187 | """ 188 | 189 | """ 190 | === Poll === 191 | 192 | 1. 
Which stage do you think is likely to be the most computationally intensive part of a data processing pipeline? 193 | 194 | 2. Which stage do you think is likely to present the biggest opportunity for failure cases, including crashes, ethical concerns or bias, or unexpected/wrong/faulty data? 195 | 196 | =============================================================== 197 | """ 198 | 199 | """ 200 | === Rewriting our pipeline one more time === 201 | 202 | Before we continue, let's rewrite our pipeline one last time as a function 203 | (I will explain why in a moment -- this is so we can easily measure its performance). 204 | """ 205 | 206 | """ 207 | === Additional exercises (skip depending on time) === 208 | """ 209 | 210 | """ 211 | Problem: input data could be missing 212 | """ 213 | 214 | # Exercise 7: Insert a row with a missing value. What happens? 215 | 216 | # Solutions? 217 | 218 | """ 219 | Problem: input data could be private 220 | """ 221 | 222 | # Exercise 8: Insert private data into the CSV file. What happens? 223 | 224 | # Solutions? 225 | 226 | """ 227 | Problem: software bugs 228 | """ 229 | 230 | # Exercise 9: Introduce a software bug 231 | 232 | # Solutions? 233 | 234 | """ 235 | Problem: performance bugs 236 | """ 237 | 238 | # Exercise 10: Introduce a performance bug 239 | 240 | # Solutions? 241 | 242 | """ 243 | Problem: order-dependent and non-deterministic behavior 244 | """ 245 | 246 | # Exercise 11: Introduce order-dependent behavior into the pipeline 247 | 248 | # Exercise 12: Introduce non-deterministic behavior into the pipeline 249 | 250 | # Solutions? 251 | 252 | """ 253 | Problem: output errors and exceptions 254 | """ 255 | 256 | # Exercise 13: Save the output to a file that already exists. What happens? 257 | 258 | # Exercise 14: Call the program from a different working directory (CWD) 259 | # (Note: CWD) 260 | 261 | # Exercise 15: Save the output to a file that is read-only. What happens? 262 | 263 | # Exercise 16: Save the output to a file outside the current directory. What happens? 264 | 265 | # (other issues: symlinks, read permissions, busy/conflicting writes, etc.) 266 | 267 | # Solutions? 268 | 269 | """ 270 | Problem: output formatting 271 | 272 | Applications must agree on a common format for data exchange. 273 | """ 274 | 275 | # Exercise 17: save data to a CSV file with a wrong delimiter 276 | 277 | # Exercise 18: save data to a CSV file without escaping commas 278 | 279 | # Solutions? 280 | 281 | """ 282 | Problem: readability and usability concerns -- 283 | too much information, too little information, unhelpful information 284 | """ 285 | 286 | # Exercise 19: Provide too much information as output 287 | 288 | # Exercise 20: Provide too little information as output 289 | 290 | # Exercise 21: Provide unhelpful information as output 291 | 292 | # Solutions? 293 | -------------------------------------------------------------------------------- /lecture2/parts/4-help-and-doing-stuff.py: -------------------------------------------------------------------------------- 1 | """ 2 | Monday, October 20 3 | 4 | Part 4: Help, Doing Stuff, anatomy of shell commands 5 | 6 | ----- 7 | 8 | Continuing with the shell! 9 | 10 | Showing two more commands before the discussion question: 11 | 12 | - cat : 13 | Prints out the contents of a file at 14 | 15 | - less : 16 | Show "less" of the contents of the file at 17 | u to go up, d to go down, q to quit 18 | 19 | - open : 20 | Open the file in your default program for that file. 
21 | 22 | (Three ways to view/open a file) 23 | 24 | Discussion Question & Poll: 25 | Which of the following are "informational" commands? 26 | 27 | 28 | https://forms.gle/XkbkUL2QxsLz6dLq7 29 | 30 | ===== 31 | 32 | Informational commands (finishing up) 33 | 34 | Information about the current state of our shell includes: 35 | - what folder we are in 36 | - what environment variables (local variables) are set (and to what values) 37 | - other system information and system data 38 | - file contents etc. 39 | 40 | A few other commands: 41 | 42 | - ls : 43 | 44 | ls .. 45 | ls ../lecture2 46 | 47 | Examples in Python: 48 | """ 49 | 50 | def cat_1(): 51 | with open("lecture.py") as f: 52 | print(f.read()) 53 | 54 | def cat_2(): 55 | subprocess.run(["cat", "lecture.py"]) 56 | 57 | def less(): 58 | subprocess.run(["less", "lecture.py"]) 59 | 60 | # cat_1() 61 | # cat_2() 62 | 63 | # less() 64 | 65 | """ 66 | This concludes the first part on "looking around" 67 | 68 | === Getting help === 69 | 70 | Recall the three-part model: Looking around, getting help, doing something 71 | 72 | Another thing that is fundamentally important -- and perhaps even more important 73 | than the last thing -- is getting help if you *don't* know what to do. 74 | 75 | One of the following 3 things usually works: 76 | - `man cmd` or `cmd --help` or `cmd -h` 77 | 78 | Examples: 79 | - ls: has a man entry, but no --help or -h 80 | - python3: has all three options 81 | 82 | Some ways to get help (examples running these from Python): 83 | """ 84 | 85 | def get_help_for_command(cmd): 86 | subprocess.run([cmd, "--help"]) 87 | subprocess.run([cmd, "-h"]) 88 | subprocess.run(["man", cmd]) 89 | 90 | # get_help_for_command("python3") 91 | 92 | """ 93 | Other ways to get help: 94 | 95 | Using Google/StackOverflow/AI can also be really useful for a number of reasons! 96 | 97 | - A more recent development: 98 | AI tools in the shell: e.g. https://github.com/ibigio/shell-ai 99 | (use at your own risk) 100 | 101 | Example: q make a new git branch -> returns the right git syntax 102 | 103 | to determine the right command to run for what you want to do. 104 | 105 | Important caveat: you need to know what it is you want to do first! 106 | """ 107 | 108 | # Example: 109 | # how to find all files matching a name unix? 110 | # https://www.google.com/search?client=firefox-b-1-d&q=how+to+find+all+files+matching+a+name+unix 111 | # https://stackoverflow.com/questions/3786606/find-all-files-matching-name-on-linux-system-and-search-with-them-for-text 112 | # find ../lecture1 -type f -name lecture.py -exec grep -l "=== Poll ===" {} + 113 | 114 | """ 115 | Some observations: 116 | Using AI doesn't obliviate the need to understand things ourselves. 117 | - we still needed to know how to modify the command for your own purposes 118 | - we still needed to know the platform we are on (Unix) 119 | - (for the AI tool) you still need to figure out how to install it (: 120 | + as some of you have noticed (especially on Windows), installing some software dev tools 121 | can seem like even more work than using/understanding the program itself. 122 | 123 | === Doing stuff === 124 | 125 | Once we know how to "look around", and how to "get help", 126 | we can make a plan for what to do. 127 | 128 | The same advice applies to all commands: knowing how to "modify" the current 129 | state relevant to your command is often the second step to get a grip on how 130 | the command works. 
131 | (In the context of a Python library such as Pandas: 132 | python3 -i to interactively "look around" 133 | the values of variables, the online documentation to see the 134 | different functions available, actually write code to do what 135 | you want.) 136 | 137 | (And, once again, this is also exactly what we would do in a text-based adventure :)) 138 | 139 | So what should we do? 140 | We need a way to move around and modify stuff: 141 | 142 | - cd -- change directory 143 | This modifies the state of the system by changing the current 144 | working directory 145 | 146 | - mkdir -- make a new (empty) directory in the current locaiton 147 | (current working directory) 148 | 149 | - cp -- copy a file from one place to another 150 | 151 | (demo: copy folder.txt to ../folder.txt) 152 | 153 | (I follow this pattern a lot -- information first, then do something, then information again) 154 | 155 | - touch -- make a new file 156 | 157 | Create a new empty file at 158 | 159 | (Another example - creating a new Python module) 160 | - mkdir subfolder 161 | - cd subfolder 162 | - touch mod.py 163 | - open mod.py 164 | 165 | - mv : 166 | Move a file from one path to another, or rename it 167 | from one file name to another. 168 | 169 | Examples of how to accomplish similar purposes in Python: 170 | """ 171 | 172 | def cd(dir): 173 | # Sometimes necessary to change the directory from which your 174 | # script was called 175 | os.chdir(dir) 176 | 177 | def touch(file): 178 | with open(file, 'w') as fh: 179 | fh.write("\n") 180 | 181 | # touch("mod-2.py") 182 | 183 | """ 184 | === Anatomy of a shell command === 185 | 186 | Commands are given arguments, like this: 187 | 188 | cmd - 189 | cmd -- 190 | 191 | Some arguments don't have values: 192 | 193 | cmd - 194 | 195 | You can chain together any number of arguments: 196 | 197 | cmd - - ... 198 | 199 | Example: 200 | git --version to get the version of git 201 | git -v : equivalent to the above 202 | 203 | (Informational commands for git) 204 | 205 | This is typical: usually we use a single dash + a single letter 206 | as a shortcut for a double dash plus a long argument name. 207 | 208 | We have seen some of these already. 209 | 210 | Commands also have "positional" arguments, which don't use - or -- flags 211 | 212 | - cd 213 | 214 | - cp 215 | 216 | (More examples in Python:) 217 | """ 218 | 219 | def run_git_version(): 220 | # Both of these are equivalent 221 | subprocess.run(["git", "--version"]) 222 | subprocess.run(["git", "-v"]) 223 | 224 | # run_git_version() 225 | 226 | def run_python3_file_interactive(file): 227 | subprocess.run(["python3", "-i", file]) 228 | 229 | # run_python3_file_interactive("subfolder/mod.py") 230 | 231 | """ 232 | === I/O & Composing Shell Commands === 233 | 234 | What about I/O? 235 | Remember that one of the primary reasons for the shell's existence is to 236 | "glue" different programs together. What does that mean? 237 | 238 | Selected list of important operators 239 | (also called shell combinators): 240 | - |, ||, &&, >, >>, <, << 241 | 242 | Most useful: 243 | - Operator > 244 | Ends the output into a file. 
245 | (This is called redirection) 246 | 247 | - Operator >> 248 | Instead of replacing the file, append new content to the end of it 249 | 250 | - || and && 251 | Behave like "or" and "and" in regular programs 252 | Useful for error handling 253 | 254 | cmd1 || cmd2 -- do cmd1, if it fails, do command 2 255 | cmd1 && cmd2 -- do cmd1, if it succeeds, do command 2 256 | 257 | These are "shortcircuiting" boolean operations, 258 | just as in most programming languages, but based 259 | on whether the command succeeds or fails. 260 | 261 | Examples: 262 | python3 lecture.py || echo "Hello" 263 | python3 lecture.py && echo "Hello" 264 | 265 | ===== Skip the following for time ===== 266 | 267 | - | 268 | Chains together two commands 269 | 270 | Exercises: 271 | 272 | - cat followed by ls 273 | 274 | Fixed example from class: cat folder.txt | xargs ls 275 | 276 | Better example (more common): 277 | Using "grep" to search for a particular pattern 278 | 279 | Example, find all polls in lecture 1: 280 | 281 | cat ../lecture1/lecture.py | grep "forms.gle" 282 | 283 | Find all packages installed with conda that contain the word "data": 284 | 285 | conda list | grep "data" 286 | 287 | Output: 288 | 289 | astropy-iers-data 0.2024.6.3.0.31.14 py312hca03da5_0 290 | datashader 0.16.2 py312hca03da5_0 291 | importlib-metadata 7.0.1 py312hca03da5_0 292 | python-tzdata 2023.3 pyhd3eb1b0_0 293 | stack_data 0.2.0 pyhd3eb1b0_0 294 | tzdata 2024a h04d1e81_0 295 | unicodedata2 15.1.0 py312h80987f9_0 296 | 297 | - ls followed by cat 298 | (equivalent to just ls) 299 | - cat followed by cd 300 | (using xargs) 301 | - ls, save the results to a file 302 | (using >) 303 | - python3, save the results to a file 304 | (using >) 305 | - (Hard) cat followed by cd into the first directory of interest 306 | 307 | Recap: 308 | 309 | Help commands: see a command usage & options 310 | 311 | Doing stuff commands: 312 | various ways of creating files, moving files, 313 | copying files, etc. 314 | 315 | Anatomy of commands: 316 | cmd ... or 317 | cmd - - etc. 318 | 319 | We saw various ways of combining and composing 320 | different commands, which can be used for 321 | advanced shell programming to write arbitrary 322 | scripts in the shell. 323 | 324 | ****** Where we ended for today ****** 325 | """ 326 | -------------------------------------------------------------------------------- /lecture5/parts/1-RDDs.py: -------------------------------------------------------------------------------- 1 | """ 2 | Lecture 5: Distributed Pipelines 3 | 4 | Part 1: Introduction to Spark: Scalable collection types and RDDs 5 | 6 | === Poll === 7 | 8 | Speedup through parallelism alone (vertical scaling) is most significantly limited by... 9 | (Select all that apply) 10 | 11 | 1. The number of lines in the Python source code 12 | 2. The version of the operating system (e.g., MacOS Sonoma) 13 | 3. The number of CPU cores on the machine 14 | 4. The number of wire connections on the computer's motherboard 15 | 5. The amount of RAM (memory) and disk space (storage) available 16 | 17 | https://forms.gle/LUsqdy7YYKy7JFVH6 18 | 19 | === Apache Spark (PySpark) === 20 | 21 | In this lecture, we will use Apache Spark (PySpark). 22 | 23 | Spark is a parallel and distributed data processing framework. 24 | 25 | (Note: Spark also has APIs in several other languages, most typically 26 | Scala and Java. The Python version aligns best with the sorts of 27 | code we have been writing so far and is generally quite accessible.) 
28 | 29 | Documentation: 30 | https://spark.apache.org/docs/latest/api/python/index.html 31 | 32 | To test whether you have PySpark installed successfully, try running 33 | the lecture now: 34 | 35 | python3 1-RDDs.py 36 | """ 37 | 38 | # Test whether import works 39 | import pyspark 40 | 41 | # All spark code generally starts with the following setup code: 42 | # (Boiler plate code - ignore for now) 43 | from pyspark.sql import SparkSession 44 | spark = SparkSession.builder.appName("SparkExample").getOrCreate() 45 | sc = spark.sparkContext 46 | 47 | """ 48 | Motivation: last lecture (over the last couple of weeks) we saw that: 49 | 50 | - Parallelism can exist in many forms (hard to identify and exploit!) 51 | 52 | - Concurrency can lead to a lot of trouble (naively trying to write concurrent code can lead to bugs) 53 | 54 | - Parallelism alone (without distribution) can only scale your compute, and only by a limited amount (limited by your CPU bandwidth, # cores, and amount of RAM on your laptop!) 55 | 56 | + e.g.: I have 800 GB available, but if I want to work with a dataset 57 | bigger than that, I'm out of luck 58 | 59 | + e.g.: I have 16 CPU cores available, but if I want more than 16X 60 | speedup, I'm out of luck 61 | 62 | We want to be able to scale pipelines automatically to larger datasets. 63 | How? 64 | 65 | Idea: 66 | 67 | - **Build our pipelines** at a higher level of abstraction -- build data sets and operators over data sets 68 | 69 | - **Deploy our pipelines** using a framework or library that will automatically scale and take advantage of parallel and distributed compute resources. 70 | 71 | Analogy: kind of like a compiler or interpreter! 72 | (A long time ago, people use to write all code in assembly language/ 73 | machine code) 74 | 75 | We say "what" we want, the distributed data processing software framework will 76 | handle the "how" 77 | 78 | So what is that higher level abstraction? 79 | 80 | Spoiler: 81 | It's dataflow graphs! 82 | 83 | (With one additional thing) 84 | """ 85 | 86 | """ 87 | === Introduction to distributed programming === 88 | 89 | What is a scalable collection type? 90 | 91 | What is a collection type? A set, a list, a dictionary, a table, 92 | a DataFrame, a database (for example), any collection of objects, rows, 93 | or data items. 94 | 95 | - Pandas DataFrame is one example. 96 | 97 | When we talk about collection types, we usually assume the whole 98 | thing is stored in memory. (Refer to 800GB limit comment above.) 99 | 100 | A: "Scalable" part means the collection is automatically distributed 101 | and parallelized over many different workers and/or computers or devices. 102 | 103 | The magic of this is that we can think of it just like a standard 104 | collection type! 
105 | 106 | If I have a scalable set, I can just think of that as a set 107 | 108 | If I have a scalable DataFrame, I can just think of that as a DataFrame 109 | 110 | Basic scalable collection types in Spark: 111 | 112 | - RDD 113 | Resilient Distributed Dataset 114 | 115 | - PySpark DataFrame API 116 | Will bear resemblance to DataFrames in Pandas (and Dask) 117 | """ 118 | 119 | # Uncomment to run 120 | # # RDD - scalable version of a Python set of integers 121 | # basic_rdd = sc.parallelize(range(0, 1_000)) 122 | 123 | # print(basic_rdd) 124 | 125 | # # --- run some commands on the RDD --- 126 | # mapped_rdd = basic_rdd.map(lambda x: x + 2) 127 | # filtered_rdd = mapped_rdd.filter(lambda x: x > 500) 128 | # result = filtered_rdd.collect() 129 | 130 | # print(result) 131 | 132 | """ 133 | We can visualize our pipeline! 134 | 135 | Open up your browser to: 136 | http://localhost:4040/ 137 | 138 | === More examples === 139 | 140 | Scalable collection types are just like normal collection types! 141 | 142 | Let's show this: 143 | 144 | Exercises: 145 | 1. 146 | Write a function 147 | a) in Python 148 | b) in PySpark using RDDs 149 | that takes an input list of integers, 150 | and finds only the integers x such that x * x is exactly 3 digits... 151 | 152 | - .map 153 | - .filter 154 | - .collect 155 | """ 156 | 157 | def ex1_python(l1): 158 | # anonymous functions with map filter! 159 | l2 = map(lambda x: x * x, l1) 160 | # ^^ equivalent to 161 | # def anon_function_square(x): 162 | # return x * x 163 | # l2 = map(anon_function_square, l1) 164 | # list comprehension syntax: 165 | # [x * x for x in l1] 166 | l3 = filter(lambda x: 100 <= x <= 999, l2) 167 | print(list(l3)) 168 | 169 | INPUT_EXAMPLE = list(range(100)) 170 | 171 | # ex1_python(INPUT_EXAMPLE) 172 | 173 | # Output: 174 | # [100, 121, 144, 169, 196, 225, 256, 289, 324, 361, 400, 441, 484, 529, 576, 625, 676, 729, 784, 841, 900, 961] 175 | # All the 3 digit square numbers! 176 | 177 | def ex1_rdd(list): 178 | l1 = sc.parallelize(list) # how you construct an RDD 179 | l2 = l1.map(lambda x: x * x) 180 | # BTW: equivalent to: 181 | # def square(x): 182 | # return x * x 183 | # l2 = l1.map(square) 184 | l3 = l2.filter(lambda x: 100 <= x <= 999) 185 | print(l3.collect()) 186 | 187 | # ex1_rdd(INPUT_EXAMPLE) 188 | 189 | """ 190 | 2. 191 | Write a function 192 | a) in Python 193 | b) in PySpark using RDDs 194 | that takes as input a list of integers, 195 | and adds up all the even integers and all the odd integers 196 | 197 | - .groupBy 198 | - .reduceBy 199 | - .reduceByKey 200 | - .partitionBy 201 | """ 202 | 203 | def ex2_python(l1): 204 | # (Skip: leave as exercise) 205 | # TODO 206 | raise NotImplementedError 207 | 208 | def ex2_rdd(l1): 209 | l2 = sc.parallelize(l1) 210 | l3 = l2.groupBy(lambda x: x % 2) 211 | l4 = l3.flatMapValues(lambda x: x) 212 | # ^^ needed for technical reasons 213 | # actually, would be easier just to run a map to (x % 2, x) 214 | # then call reduceByKey, but I wanted to conceptually separate 215 | # out the groupBy step from the sum step. 216 | l5 = l4.reduceByKey(lambda x, y: x + y) 217 | for key, val in l5.collect(): 218 | print(f"{key}: {val}") 219 | # Uncomment to inspect l1, l2, l3, l4, and l5 220 | # breakpoint() 221 | 222 | # ex2_rdd(INPUT_EXAMPLE) 223 | 224 | """ 225 | Good! But there's one thing left -- we haven't really measured 226 | that our pipeline is actually getting run in parallel. 227 | 228 | Q: Can we check that? 
229 | 230 | Test: parallel_test.py 231 | 232 | A: Tools: 233 | 234 | time (doesn't work) 235 | 236 | Activity monitor 237 | 238 | localhost:4040 239 | (see Executors tab) 240 | 241 | Q: what is localhost? What is going on behind the scenes? 242 | 243 | A: Spark is running a local cluster on our machine to schedule and run 244 | tasks (batch jobs). 245 | 246 | Q: Why do we need sc.context? 247 | 248 | A: 249 | Not locally using Python compute, so any operation we do 250 | needs to get submitted and run as a job through the cluster. 251 | 252 | Q: What does RDD stand for? 253 | 254 | RDD means Resilient Distributed Dataset. 255 | https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf 256 | 257 | === That's Just Data Parallelism! (TM) === 258 | 259 | Yes, the following is a punchline: 260 | 261 | scalable collection type == data parallelism. 262 | 263 | They are really the same thing. 264 | 265 | Task, pipeline parallelism are limited by the # of nodes in the graph! 266 | Data parallelism = arbitrary scaling, so it's what enables scalable 267 | collectiont types. 268 | 269 | Brings us to: how can we tell from looking at a dataflow graph if it can 270 | be parallelized and distributed automatically in a framework like PySpark? 271 | 272 | A: All tasks must be data-parallel. 273 | 274 | === Summary === 275 | 276 | We saw scalable collection types 277 | (with some initial RDD examples) 278 | 279 | Scalable collection types are just like normal collection types, 280 | but they behave (behind the scenes) like they work in parallel! 281 | 282 | They do this by automatically exploiting data parallelism. 283 | 284 | Behind the scenes, both vertical scaling and horizontal scaling 285 | can be performed automatically by the underlying data processing 286 | engine (in our case, Spark). 287 | 288 | This depends on the engine to do its job well -- for the most part, 289 | we will assume in this class that the engine does a better job than 290 | we do, but we will get to some limitations later on. 291 | 292 | Many other data processing engines exist... 293 | (to name a few, Hadoop, Google Cloud Dataflow, Materialize, Storm, Flink) 294 | (we will discuss more later on and the technology behind these.) 295 | 296 | === Plan for remaining parts === 297 | 298 | Overall plan for Lecture 5: 299 | 300 | - Scalable collection types 301 | 302 | - Programming over collection types 303 | 304 | - Important properties: immutability, laziness 305 | 306 | - MapReduce 307 | 308 | Simpler abstraction underlying RDDs and Spark 309 | 310 | - Partitioning in RDDs and collection types 311 | 312 | Possible topics/optional: 313 | 314 | - Distributed consistency: crashes, failures, duplicated/dropped/reorder messages 315 | 316 | - Pitfalls. 317 | """ 318 | -------------------------------------------------------------------------------- /lecture4/parts/5-quantifying.py: -------------------------------------------------------------------------------- 1 | 2 | """ 3 | Part 5: Quantifying Parallelism and Amdahl's Law. 4 | 5 | Content from Nov 7 poll was moved to end of Part 4 lecture. 6 | 7 | === Quantifying parallelism === 8 | 9 | We know how to tell *if* there's parallelism. 10 | What about *how much*? 11 | 12 | i.e.: What amount of parallelism is available in a system? 13 | 14 | Definition: 15 | **Speedup** is defined by: 16 | (running time of sequential code) / (running time of parallel code) 17 | 18 | example: 19 | 4.6s for parallel impl 20 | 9.2s for sequential impl 21 | 22 | Speedup would be 2x. 
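One way to measure this yourself, instead of using the shell's time command, is to time both versions from inside Python. A rough sketch -- it assumes average_numbers and average_numbers_parallel from 2-parallelism.py are defined in the same script:

    import time

    def timed(f):
        # run f once, return elapsed wall-clock seconds
        start = time.perf_counter()
        f()
        return time.perf_counter() - start

    # Uncomment to run:
    # t_seq = timed(lambda: average_numbers(200_000_000))
    # t_par = timed(average_numbers_parallel)
    # print(f"Speedup: {t_seq / t_par:.2f}x")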
23 | (We can run 2-parallelism.py to check; and we might get different numbers 24 | on different platforms or machines, for example, if your machine has only one CPU 25 | you might not see any speedup.) 26 | 27 | Re-running: 28 | Speedup = 9.6 / 5.2 = 1.84x speedup. 29 | 30 | You could run with four workers, and get up to a 4x speedup 31 | ... or with 8 workers, and get up to an 8x speedup ... 32 | 33 | You might wonder, how much can I keep speeding up this computation, 34 | won't this stop working at some point? 35 | 36 | At some point ... we hit a bottleneck 37 | 38 | Fundamental law of parallelism: 39 | Amdahl's law: 40 | https://en.wikipedia.org/wiki/Amdahl%27s_law 41 | 42 | Amdahl's law gives a theoretical upper bound on the amount of speedup that is possible for any task (in arbitrary code, but also applying specifically 43 | to data processing code). 44 | 45 | It's a useful way to quantify parallelism & see how useful it would be. 46 | 47 | === Amdahl's Law === 48 | 49 | We're interested in knowing: how much speedup is possible? 50 | 51 | Standard form of the law: 52 | 53 | Suppose we have a computation that I think could benefit from one or more types of parallelism. 54 | The amount of speedup in a computation is at most 55 | 56 | Speedup <= 1 / (1 - p) 57 | 58 | where: 59 | 60 | p is the percentage of the task (in running time) that can be parallelized. 61 | 62 | === Example with a simple task === 63 | 64 | We have written a complex combination of C and Python code to train our ML model. 65 | Based on profiling the code (callgrind or some other profiling tool), we believe that 66 | 95% of the code can be fully parallelized, however there is a 5% of the time of the code 67 | that is spent parsing the input model file and producing as output an output model file 68 | that we have determined cannot be parallelized. 69 | 70 | Q: What is the maximum speedup for our code? 71 | 72 | Applying Amdahl's law: 73 | 74 | p = .95 75 | 76 | Speedup <= 1 / (1 - .95) = 1 / .05 = 20x. 77 | 78 | Pretty good - but not infinite! 79 | 80 | Example: the best we can get is from 100 hours to 5 hours, a 20x speedup. 81 | 82 | How to apply this knowledge? 83 | I was considering purchasing a supercomputer server machine with 160 cores. 84 | Based on the above calculation, I realize that I'm only going to effectively 85 | be able to make use of an at most 20x speedup, 86 | so I think my 160 cores may not be useful, and I buy a smaller machine 87 | with 24 cores. 88 | 89 | === Alternate form === 90 | 91 | Here is an alternative form of the law that is equivalent, but sometimes a bit more useful. 92 | Let: 93 | - T be the total amount of time to complete a task sequentially 94 | (without any parallelism) 95 | (in our example: 100 hours) 96 | 97 | - S be the amount of time to compute some inherently sequential bottleneck 98 | --> We don't believe it's possible to do any part of S in parallel 99 | (in our example: 5 hours) 100 | 101 | Then the maximum speedup of the task is at most 102 | 103 | speedup <= (T / S) = 100 hours / 5 hours = 20x. 104 | 105 | Note: this applies to distributed computations as well! 106 | 107 | This is giving a theoretical upper bound, not taking into account 108 | other overheads (for example, it doesn't take into account 109 | communication overhead between threads, processes or distributed devices). 110 | 111 | So it's not an actual number on what speedup we will get, but it still can be a 112 | useful upper bound. 
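To make the two forms concrete, here is a tiny sketch that just evaluates the bound, using the numbers from the ML-training example above:

    def max_speedup_from_p(p):
        # standard form: Speedup <= 1 / (1 - p)
        return 1 / (1 - p)

    def max_speedup_from_times(T, S):
        # alternate form: Speedup <= T / S
        return T / S

    print(round(max_speedup_from_p(0.95), 2))    # 20.0
    print(max_speedup_from_times(100, 5))        # 20.0

Both give the same 20x cap, as expected, since p = (T - S) / T.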
113 | 
114 | Recap:
115 | 
116 | - We reviewed 3 types of parallelism in dataflow graphs
117 | 
118 | - We defined speedup
119 | 
120 | - We talked about estimating the "maximum speedup" in a pipeline, using
121 | a law called Amdahl's Law
122 | 
123 | - We saw two forms of the law:
124 | 
125 | Speedup <= 1 / (1 - p)
126 | 
127 | Speedup <= T / S
128 | 
129 | where:
130 | T is the running time of the sequential code
131 | S is the running time of a bottleneck that can't be parallelized
132 | p is the fraction of the running time that can be parallelized
133 | 
134 | p = (T - S) / T.
135 | 
136 | ---- where we ended for Nov 7 ----
137 | 
138 | Recall the formulas from last time.
139 | 
140 | === Example ===
141 | 
142 | 1. SQL query example
143 | 
144 | - imagine an SQL query where you need to match
145 | each employee name with their salary and produce a joined table
146 | (a join of name_table and salary_table)
147 | 
148 | Assume that all operations take 1 ms per row:
149 | - 1 ms to load each input row from name_table
150 | - 1 ms to load each input row from salary_table
151 | - 1 ms to join -- per row in the joined table
152 | 
153 | Also assume that there are 100 employees in name_table,
154 | 100 in salary_table, and 100 in the joined table.
155 | 
156 | Q: What is the maximum speedup here?
157 | 
158 | Dataflow graph:
159 | 
160 | (load name_table) ----|
161 |                       |---> (join tables)
162 | (load salary_table) --|
163 | 
164 | speedup <= (T / S)
165 | 
166 | What are T and S?
167 | 
168 | T = ?
169 | 300 ms =
170 | 100 ms to load the first table
171 | 100 ms to load the second table
172 | 100 ms to calculate the joined table
173 | 
174 | all with no parallelism!
175 | 
176 | S = what cannot be parallelized?
177 | 
178 | Idea: view it at the level of input rows!
179 | 
180 | Let's identify what needs to happen for some specific employee:
181 | 
182 | - I need to load the employee name before I produce the particular output row in the joined table for that employee
183 | 
184 | - I need to load the employee salary before I produce the particular output row in the joined table for that employee
185 | 
186 | x I need to load the employee name, then load the employee salary, then produce the particular output row
187 | 
188 | ^^^ not really a sequential bottleneck (the two loads can happen in parallel)
189 | 
190 | Minimum "sequential bottleneck" is 2 ms! (load one input row, then produce the corresponding output row)
191 | 
192 | Therefore:
193 | 
194 | Speedup <= T / S = 300 ms / 2 ms = 150x.
195 | 
196 | === Poll ===
197 | 
198 | Use Amdahl's law to estimate the maximum speedup in the following scenario.
199 | 
200 | As in last Monday's poll, a Python script needs to:
201 | - load a dataset into Pandas: students.csv, with 100 rows
202 | - calculate a new column which is the total course load for each student
203 | - send an email to each student with their total course load
204 | 
205 | Assume that it takes 1 ms (per row) to read in each input row, 1 ms (per row) to calculate the new column, and 1 ms (per row) to send an email.
206 | 
207 | Q: What is the theoretical bound on the maximum speedup in the pipeline?
208 | 
209 | https://forms.gle/W5NpbuZGs4Se45VCA
210 | 
211 | DFG:
212 | 
213 | (load) -> (calculate col) -> (send email)
214 | 100 ms       100 ms            100 ms
215 | 
216 | T = 100 + 100 + 100 = 300 ms
217 | 
218 | S = 3 ms -- we need to perform 3 actions in sequence for a single student; that can't be parallelized!
219 | 
220 | Speedup <= 300 ms / 3 ms = 100x.
221 | 
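For linear per-row pipelines like this one, the reasoning above can be packaged in a small helper (a hypothetical sketch, not part of the course code; it assumes every stage processes every row exactly once):

def amdahl_bound_linear_pipeline(n_rows, stage_costs_ms):
    # Every row must pass through every stage in order, so the per-row
    # critical path S is the sum of the per-row stage costs, while the
    # fully sequential time T multiplies that by the number of rows.
    T = n_rows * sum(stage_costs_ms)   # total time with no parallelism
    S = sum(stage_costs_ms)            # sequential bottleneck for one row
    return T, S, T / S                 # T, S, and the bound T / S

# Poll scenario: 100 students; 1 ms each to load, compute the column, send the email.
T, S, bound = amdahl_bound_linear_pipeline(100, [1, 1, 1])
print(T, S, bound)   # 300 3 100.0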
222 | === More examples and exercises ===
223 | (Skip - may do in discussion section)
224 | 
225 | 2. Let's take our data parallelism example:
226 | 
227 | We had an employee database, and tasks:
228 | 
229 | 1. load the employee dataset
230 | 
231 | 2. strip the spaces from employee names
232 | 
233 | 3. extract the first/given name
234 | 
235 | with dataflow graph:
236 | 
237 | (1) -> (2) -> (3)
238 | 
239 | Again assume 1 ms for each task per input row.
240 | What are T and S here?
241 | 
242 | 3. An extended version of the table join example.
243 | We have two tables, of employee names and employee salaries.
244 | We want to compute which employees make more than 1 million Euros.
245 | The employee salaries are listed in dollars.
246 | 
247 | We are given the CEO name as input.
248 | We want to get the salary associated with the CEO,
249 | convert it from USD to Euros, and keep only the rows where the
250 | result is over 1 million.
251 | Assume all basic operations take 1 unit of time per row.
252 | 
253 | === Additional notes ===
254 | 
255 | Note 1:
256 | You can think of this as the limit as the number of cores/processes
257 | goes to infinity.
258 | 
259 | T = the time it takes to complete the task with 1 worker
260 | S = the time it takes to complete the task with a theoretically infinite number of workers and no communication overhead between workers.
261 | 
262 | **Advanced topics note:**
263 | There's a version of the law that takes the number of processors into account.
264 | 
265 | - basically, that version takes the portion of the pipeline that *can* be parallelized and divides it by the
266 | # of processors: with N processors, Speedup <= 1 / ((1 - p) + p / N)
267 | 
268 | S portion: cannot be parallelized
269 | T - S portion: can be parallelized
270 | 
271 | - You don't need to know this version for this class.
272 | 
273 | Note 2:
274 | How Amdahl's law applies to aggregation cases.
275 | 
276 | average_numbers example
277 | 
278 | Our average_numbers example is slightly more complex than the above, as it involves an aggregation
279 | (group-by).
280 | Aggregation can be parallelized.
281 | (Why? What type of parallelism?)
282 | 
283 | For the purposes of Amdahl's law, let's think of aggregation as requiring at least 1 operation
284 | (1 unit of time to compute the total).
285 | 
286 | Q: What does Amdahl's law say is the maximum speedup for our simple average_numbers pipeline?
287 | 
288 | === Connection to throughput & latency ===
289 | 
290 | Let's also connect Amdahl's law back to throughput & latency.
291 | 
292 | Given T and S...
293 | 
294 | 1. Rephrase in terms of throughput:
295 | 
296 | If there are N input items, then the _maximum_ throughput is
297 | 
298 | throughput <= (N / S)
299 | 
300 | since throughput = (num input items) / (running time of the pipeline),
301 | 
302 | and Amdahl's law says that the minimum running time of the pipeline is S (the "maximum speedup" case).
303 | 
304 | 2. Rephrase in terms of latency:
305 | 
306 | Observation:
307 | In the above examples, the "sequential bottleneck" we chose
308 | is essentially the latency of a single item!
309 | 
310 | Therefore we have:
311 | 
312 | latency >= S,
313 | 
314 | if S is computed in the way that we have computed it above.
315 | """
316 | 
--------------------------------------------------------------------------------