├── .gitignore ├── .gitattributes ├── data.zip ├── LICENSE.md ├── README.md ├── installation.sh ├── scala ├── lab3-startups.md ├── lab1-wordcount.md ├── lab4-propprices.md ├── lab0-scala.md ├── lab6-pagerank.md └── lab2-airlines.md └── python ├── lab3-startups.md ├── lab0-python.md ├── lab5-streaming.md ├── lab4-propprices.md ├── lab7-plagiarism.md ├── lab1-wordcount.md ├── lab6-pagerank.md └── lab2-airlines.md /.gitignore: -------------------------------------------------------------------------------- 1 | data/ -------------------------------------------------------------------------------- /.gitattributes: -------------------------------------------------------------------------------- 1 | data.zip filter=lfs diff=lfs merge=lfs -text 2 | -------------------------------------------------------------------------------- /data.zip: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:cd16a232434ae7e84c41b806c61d91047c5335dbca26d47d6d2b38417dcc70cc 3 | size 80991892 4 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2016 Sasha Goldshtein 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ### Spark Workshop 2 | 3 | This repository contains hands-on labs and data files for a full-day Apache Spark workshop. This file is also the index for the hands-on labs. 4 | 5 | ____ 6 | 7 | #### Python Labs 8 | 9 | 1. [Lab 0 - Python Fundamentals](python/lab0-python.md) 10 | 11 | 1. [Lab 1 - Multi-File Word Count](python/lab1-wordcount.md) 12 | 13 | 1. [Lab 2 - Analyzing Flight Delays](python/lab2-airlines.md) 14 | 15 | 1. [Lab 3 - Analyzing Startup Companies](python/lab3-startups.md) 16 | 17 | 1. [Lab 4 - Analyzing UK Property Prices](python/lab4-propprices.md) 18 | 19 | 1. [Lab 5 - Streaming Tweet Analysis](python/lab5-streaming.md) 20 | 21 | 1. [Lab 6 - PageRank over Movie References](python/lab6-pagerank.md) 22 | 23 | 1. [Lab 7 - Plagiarism Detection](python/lab7-plagiarism.md) 24 | 25 | ____ 26 | 27 | #### Scala Labs (under development) 28 | 29 | 1. [Lab 0 - Scala Fundamentals](scala/lab0-scala.md) 30 | 31 | 1. 
[Lab 1 - Multi-File Word Count](scala/lab1-wordcount.md) 32 | 33 | 1. [Lab 2 - Analyzing Flight Delays](scala/lab2-airlines.md) 34 | 35 | 1. [Lab 3 - Analyzing Startup Companies](scala/lab3-startups.md) 36 | 37 | 1. [Lab 4 - Analyzing UK Property Prices](scala/lab4-propprices.md) 38 | 39 | 1. [Lab 6 - PageRank over Movie References](scala/lab6-pagerank.md) 40 | 41 | ____ 42 | 43 | Copyright (C) Sasha Goldshtein, 2016. All rights reserved. 44 | -------------------------------------------------------------------------------- /installation.sh: -------------------------------------------------------------------------------- 1 | # oracle java 8 2 | echo "\n" | sudo add-apt-repository ppa:openjdk-r/ppa 3 | sudo apt-get update -y 4 | sudo apt-get install -y openjdk-8-jdk 5 | 6 | # spark download and setup 7 | wget https://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz -O /tmp/spark-1.6.1.tgz 8 | sudo ufw disable 9 | sudo mkdir -p /usr/lib/spark 10 | sudo tar -xf /tmp/spark-1.6.1.tgz --strip 1 -C /usr/lib/spark 11 | echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> ~/.bash_profile 12 | echo "export PATH=\$JAVA_HOME/bin:\$PATH" >> ~/.bash_profile 13 | echo "export SPARK_HOME=/usr/lib/spark" >> ~/.bash_profile 14 | echo "export PATH=\$SPARK_HOME/bin:\$PATH" >> ~/.bash_profile 15 | source ~/.bash_profile 16 | 17 | # spark log config 18 | sudo rm /usr/lib/spark/conf/log4j.properties 19 | sudo touch /usr/lib/spark/conf/log4j.properties 20 | sudo bash -c 'cat << EOF > /usr/lib/spark/conf/log4j.properties 21 | # Set everything to be logged to the console 22 | log4j.rootCategory=WARN, console 23 | log4j.appender.console=org.apache.log4j.ConsoleAppender 24 | log4j.appender.console.target=System.err 25 | log4j.appender.console.layout=org.apache.log4j.PatternLayout 26 | log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n 27 | 28 | # Settings to quiet third party logs that are too verbose 29 | log4j.logger.org.spark-project.jetty=WARN 30 | log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR 31 | log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=WARN 32 | log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=WARN 33 | log4j.logger.org.apache.parquet=ERROR 34 | log4j.logger.parquet=ERROR 35 | 36 | # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support 37 | log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL 38 | log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR 39 | EOF' 40 | 41 | # spark default config 42 | sudo rm /usr/lib/spark/conf/spark-defaults.conf 43 | sudo touch /usr/lib/spark/conf/spark-defaults.conf 44 | sudo bash -c 'cat << EOF > /usr/lib/spark/conf/spark-defaults.conf 45 | spark.master spark://$(hostname):7077 46 | spark.eventLog.enabled true 47 | spark.eventLog.dir file:///usr/lib/spark/logs/eventlog 48 | EOF' 49 | sudo mkdir -p /usr/lib/spark/logs/eventlog 50 | sudo chmod -R 777 /usr/lib/spark/logs 51 | 52 | # zeppelin setup 53 | wget http://apache.mivzakim.net/zeppelin/zeppelin-0.6.0/zeppelin-0.6.0-bin-all.tgz -O /tmp/zeppelin-0.6.0.tgz 54 | sudo mkdir -p /usr/lib/zeppelin 55 | sudo tar -xf /tmp/zeppelin-0.6.0.tgz --strip 1 -C /usr/lib/zeppelin 56 | 57 | # zeppelin config 58 | sudo rm /usr/lib/zeppelin/conf/zeppelin-env.sh 59 | sudo touch /usr/lib/zeppelin/conf/zeppelin-env.sh 60 | sudo bash -c 'cat << EOF > /usr/lib/zeppelin/conf/zeppelin-env.sh 61 | #!/bin/bash 62 | export 
JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 63 | export MASTER=spark://$(hostname):7077 64 | export SPARK_HOME=/usr/lib/spark 65 | export ZEPPELIN_PORT=9995 66 | EOF' 67 | 68 | sudo ufw disable 69 | 70 | # start everything up 71 | sudo /usr/lib/spark/sbin/stop-master.sh 72 | sudo /usr/lib/spark/sbin/stop-slave.sh 73 | sudo /usr/lib/spark/sbin/start-master.sh 74 | sudo bash -c '/usr/lib/spark/sbin/start-slave.sh spark://$(hostname):7077' 75 | sudo /usr/lib/zeppelin/bin/zeppelin-daemon.sh restart 76 | -------------------------------------------------------------------------------- /scala/lab3-startups.md: -------------------------------------------------------------------------------- 1 | ### Lab 3: Analyzing Startup Companies 2 | 3 | In this lab, you will analyze a real-world dataset -- information about startup companies. The source of this dataset is [jSONAR](http://jsonstudio.com/resources/). 4 | 5 | ___ 6 | 7 | #### Task 1: Inspecting the Data 8 | 9 | This time, the data is provided as a JSON document, one entry per line. You can find it in `/home/vagrant/data/companies.json`. Take a look at the first entry by using the following command: 10 | 11 | ``` 12 | head -n 1 /home/vagrant/data/companies.json 13 | ``` 14 | 15 | As you can see, the schema is fairly complicated -- it has a bunch of fields, nested objects, arrays, and so on. It describes the company's products, key people, acquisition data, and more. We are going to use Spark SQL to infer the schema of this JSON document, and then issue queries using a natural SQL syntax. 16 | 17 | ___ 18 | 19 | #### Task 2: Parsing the Data 20 | 21 | Create a `DataFrame` from the JSON file so that its schema is automatically inferred, print out the resulting schema, and register it as a temporary table called "companies". 22 | 23 | **Solution**: 24 | 25 | ```scala 26 | val companies = sqlContext.read.json("file:///home/vagrant/data/companies.json") 27 | companies.printSchema() 28 | companies.registerTempTable("companies") 29 | ``` 30 | 31 | ___ 32 | 33 | #### Task 3: Querying the Data 34 | 35 | First, let's talk about the money; figure out what the average acquisition price was. 36 | 37 | **Solution**: 38 | 39 | ```scala 40 | sqlContext.sql("select avg(acquisition.price_amount) from companies").first() 41 | ``` 42 | 43 | Not too shabby. Let's get some additional detail -- print the average acquisition price grouped by number of years the company was active. 44 | 45 | **Solution**: 46 | 47 | ```scala 48 | sqlContext.sql( 49 | """select acquisition.acquired_year-founded_year as years_active, 50 | avg(acquisition.price_amount) as acq_price 51 | from companies 52 | where acquisition.price_amount is not null 53 | group by acquisition.acquired_year-founded_year 54 | order by acq_price desc""").collect() 55 | ``` 56 | 57 | Finally, let's try to figure out the relationship between the company's total funding and acquisition price. In order to do that, you'll need a UDF (user-defined function) that, given a company, returns the sum of all its funding rounds. First, build that function and register it with the name "total_funding". 58 | 59 | **Solution**: 60 | 61 | 62 | ```scala 63 | import org.apache.spark.sql.Row 64 | 65 | sqlContext.udf.register("total_funding", (investments: Seq[Row]) => { 66 | val totals = investments.map(_.getAs[Row]("funding_round").getAs[Long]("raised_amount")) 67 | totals.sum 68 | }) 69 | ``` 70 | 71 | Test your function by retrieving the total funding for a few companies, such as Facebook, Paypal, and Alibaba. 
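For example, a quick sanity check might look like the following sketch (it assumes the dataset exposes a top-level `name` field; the exact spelling of the company names in the data may differ):

```scala
// Sanity check for the UDF -- not a reference solution.
sqlContext.sql(
  """select name, total_funding(investments) as funding
     from companies
     where name in ('Facebook', 'PayPal', 'Alibaba')""").collect().foreach(println)
```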
Now, find the average ratio between the acquisition price and the total funding (which, in a simplistic way, represents return on investment). 72 | 73 | **Solution**: 74 | 75 | ```scala 76 | sqlContext.sql( 77 | """select avg(acquisition.price_amount/total_funding(investments)) 78 | from companies 79 | where acquisition.price_amount is not null 80 | and total_funding(investments) != 0""").collect() 81 | ``` 82 | 83 | ___ 84 | 85 | #### Discussion 86 | 87 | See discussion for the [next lab](lab4-propprices.md). 88 | -------------------------------------------------------------------------------- /python/lab3-startups.md: -------------------------------------------------------------------------------- 1 | ### Lab 3: Analyzing Startup Companies 2 | 3 | In this lab, you will analyze a real-world dataset -- information about startup companies. The source of this dataset is [jSONAR](http://jsonstudio.com/resources/). 4 | 5 | ___ 6 | 7 | #### Task 1: Inspecting the Data 8 | 9 | This time, the data is provided as a JSON document, one entry per line. You can find it in `~/data/companies.json`. Take a look at the first entry by using the following command: 10 | 11 | ``` 12 | head -n 1 ~/data/companies.json 13 | ``` 14 | 15 | As you can see, the schema is fairly complicated -- it has a bunch of fields, nested objects, arrays, and so on. It describes the company's products, key people, acquisition data, and more. We are going to use Spark SQL to infer the schema of this JSON document, and then issue queries using a natural SQL syntax. 16 | 17 | ___ 18 | 19 | #### Task 2: Parsing the Data 20 | 21 | Open a PySpark shell (by running `bin/pyspark` from the Spark installation directory in a terminal window). Note that you have access to a pre-initialized `SQLContext` object named `sqlContext`. 22 | 23 | Create a `DataFrame` from the JSON file so that its schema is automatically inferred, print out the resulting schema, and register it as a temporary table called "companies". 24 | 25 | **Solution**: 26 | 27 | ```python 28 | companies = sqlContext.read.json("file:///home/ubuntu/data/companies.json") 29 | companies.printSchema() 30 | companies.registerTempTable("companies") 31 | ``` 32 | 33 | ___ 34 | 35 | #### Task 3: Querying the Data 36 | 37 | First, let's talk about the money; figure out what the average acquisition price was. 38 | 39 | **Solution**: 40 | 41 | ```python 42 | sqlContext.sql("select avg(acquisition.price_amount) from companies").first() 43 | ``` 44 | 45 | Not too shabby. Let's get some additional detail -- print the average acquisition price grouped by number of years the company was active. 46 | 47 | **Solution**: 48 | 49 | ```python 50 | sqlContext.sql( 51 | """select acquisition.acquired_year-founded_year as years_active, 52 | avg(acquisition.price_amount) as acq_price 53 | from companies 54 | where acquisition.price_amount is not null 55 | group by acquisition.acquired_year-founded_year 56 | order by acq_price desc""").collect() 57 | ``` 58 | 59 | Finally, let's try to figure out the relationship between the company's total funding and acquisition price. In order to do that, you'll need a UDF (user-defined function) that, given a company, returns the sum of all its funding rounds. First, build that function and register it with the name "total_funding". 
60 | 61 | **Solution**: 62 | 63 | ```python 64 | from pyspark.sql.types import IntegerType 65 | 66 | sqlContext.registerFunction("total_funding", lambda investments: sum( 67 | [inv.funding_round.raised_amount or 0 for inv in investments] 68 | ), IntegerType()) 69 | ``` 70 | 71 | Test your function by retrieving the total funding for a few companies, such as Facebook, Paypal, and Alibaba. Now, find the average ratio between the acquisition price and the total funding (which, in a simplistic way, represents return on investment). 72 | 73 | **Solution**: 74 | 75 | ```python 76 | sqlContext.sql( 77 | """select avg(acquisition.price_amount/total_funding(investments)) 78 | from companies 79 | where acquisition.price_amount is not null 80 | and total_funding(investments) != 0""").collect() 81 | ``` 82 | 83 | ___ 84 | 85 | #### Discussion 86 | 87 | See discussion for the [next lab](lab4-propprices.md). 88 | -------------------------------------------------------------------------------- /scala/lab1-wordcount.md: -------------------------------------------------------------------------------- 1 | ### Lab 1: Multi-File Word Count 2 | 3 | In this lab, you will get familiar with Spark and run your first Spark job -- a multi-file word count. 4 | 5 | ___ 6 | 7 | #### Task 1: Inspecting the Spark 8 | 9 | Open a terminal window. Navigate to the directory where you extracted Apache Spark. On the instructor-provided virtual machine, this is `~/spark`. 10 | 11 | Inspect the files in the `bin` directory. You will soon use `spark-shell` to launch your first Spark job. Also note `spark-submit`, which is used to submit standalone Spark programs to a cluster. 12 | 13 | Inspect the scripts in the `sbin` directory. These scripts help with setting up a stand-alone Spark cluster, deploying Spark to EC2 virtual machines, and a bunch of additional tasks. 14 | 15 | Finally, take a look at the `examples` directory. You can find a number of stand-alone demo programs here, covering a variety of Spark APIs. 16 | 17 | ___ 18 | 19 | #### Task 2: Inspecting the Lab Data Files 20 | 21 | In this lab, you will implement a multi-file word count. The texts you will use are freely available books from [Project Gutenberg](http://www.gutenberg.org), including classics such as Lewis Carroll's "Alice in Wonderland" and Jane Austin's "Pride and Prejudice". 22 | 23 | Take a look at some of the text files in the `/home/vagrant/data` directory. From the terminal, run: 24 | 25 | ``` 26 | head -n 50 /home/vagrant/data/*.txt | less 27 | ``` 28 | 29 | This shows the first 50 lines of each file. Press SPACE to scroll, or `q` to exit `less`. 30 | 31 | ___ 32 | 33 | #### Task 3: Implementing a Multi-File Word Count 34 | 35 | Navigate to the Spark installation directory, and run `./bin/spark-shell`. 36 | 37 | In this lab, you are going to use the `sc.textFile` method. 38 | The `textFile` method can work with a directory path or a wildcard filter such as `/home/vagrant/data/*.txt`. 39 | 40 | > Of course, if you are not using the instructor-supplied appliance, your `data` directory might reside in a different location. 41 | 42 | Your first task is to print out the number of lines in all the text files, combined. In general, you should try to come up with the solution yourself, and only then continue reading for the "school" solution. 43 | 44 | **Solution**: 45 | 46 | ```scala 47 | sc.textFile("file:///home/vagrant/data/*.txt").count() 48 | ``` 49 | 50 | Great! Your next task is to implement the actual word-counting program. 
You've already seen one in class, and now it's time for your own. Print the top 10 most frequent words in the provided books. 51 | 52 | **Solution**: 53 | 54 | ```scala 55 | val lines = sc.textFile("file:///home/vagrant/data/*.txt") 56 | val words = lines.flatMap(line => line.split(" ").filter(w => w != null && !w.isEmpty)) 57 | val pairs = words.map(word => (word, 1)) 58 | val freqs = pairs.reduceByKey((a, b) => a + b) 59 | val top10 = freqs.sortBy(_._2, false).take(10) 60 | top10.foreach(println) 61 | ``` 62 | 63 | To be honest, we don't really care about words like "the", "a", and "of". Ideally, we would have a list of stop words to ignore. For now, modify your solution to filter out words shorter than 4 characters. 64 | 65 | Additionally, you might be wondering about the types of all these variables -- most of them are RDDs. To trace the lineage of an RDD, use the `toDebugString` method. For example, `freqs.toDebugString()` should display the logical plan for that RDD's evaluation. We will discuss some of these concepts later. If you have window asking to select modules to include make sure that 2 selected and click OK. 66 | 67 | ___ 68 | 69 | #### Task 4: Run a Stand-Alone Spark Program 70 | 71 | Open Zeppelin at port 9995. This is a scala interpreter with web UI that will be used in the labs. 72 | Create new note: Notebook -> Create new note. 73 | Now, you can copy and paste your solution into the note and run it(shift+Enter) after changing path of the files to "file:///home/data/*.txt" 74 | First lines in the result are transformations(fast computation) and later(top10) taking much more time as it is an action. 75 | ___ 76 | 77 | #### Discussion 78 | 79 | Instead of using `reduceByKey`, you could have used a method called `countByValue`. Read its documentation, and try to understand how it works. Would using it be a good idea? 80 | -------------------------------------------------------------------------------- /scala/lab4-propprices.md: -------------------------------------------------------------------------------- 1 | ### Lab 4: Analyzing UK Property Prices 2 | 3 | In this lab, you will work with another real-world dataset that contains residential property sales across the UK, as reported to the Land Registry. You can download this dataset and many others from [data.gov.uk](https://data.gov.uk/dataset/land-registry-monthly-price-paid-data). 4 | 5 | ___ 6 | 7 | #### Task 1: Inspecting the Data 8 | 9 | As always, we begin by inspecting the data, which is in the `/home/vagrant/data/prop-prices.csv` file. Run the following command to take a look at some of the entries: 10 | 11 | ``` 12 | head /home/vagrant/data/prop-prices.csv 13 | ``` 14 | 15 | Note that this time, the CSV file does not have headers. To determine which fields are available, consult the [guidance page](https://www.gov.uk/guidance/about-the-price-paid-data). 16 | 17 | ___ 18 | 19 | #### Task 2: Importing the Data 20 | 21 | We are going to use the `com.databricks.spark.csv` library to create a `DataFrame` from CSV file. 22 | 23 | First we need to restart Scala interpreter of Zeppelin: 24 | Interpreter -> spark box(first one) -> restart button 25 | Then we need to import `com.databricks.spark.csv` as we did in Lab 2. 26 | 27 | ```scala 28 | %dep 29 | z.reset() 30 | z.load("com.databricks:spark-csv_2.11:1.4.0") 31 | ``` 32 | 33 | After we will define a schema for our data. 
34 | And load the `prop-prices.csv` file as a `DataFrame` and register it as a temporary table so that you can run SQL queries: 35 | 36 | 37 | ```scala 38 | import org.apache.spark.sql.types._ 39 | val custSchema = StructType(Array( 40 | StructField("id",StringType,true), 41 | StructField("price",IntegerType,true), 42 | StructField("date",StringType,true), 43 | StructField("zip",StringType,true), 44 | StructField("type",StringType,true), 45 | StructField("new",StringType,true), 46 | StructField("duration",StringType,true), 47 | StructField("PAON",StringType,true), 48 | StructField("SAON",StringType,true), 49 | StructField("street",StringType,true), 50 | StructField("locality",StringType,true), 51 | StructField("town",StringType,true), 52 | StructField("district",StringType,true), 53 | StructField("county",StringType,true), 54 | StructField("ppd",StringType,true), 55 | StructField("status",StringType,true))) 56 | 57 | val df = sqlContext.read 58 | .format("com.databricks.spark.csv") 59 | .schema(custSchema) 60 | .load("file:///home/vagrant/data/prop-prices.csv") 61 | 62 | df.registerTempTable("properties") 63 | df.persist() 64 | ``` 65 | 66 | ___ 67 | 68 | #### Task 3: Analyzing Property Price Trends 69 | 70 | First, let's do some basic analysis on the data. Find how many records we have per year, and print them out sorted by year. 71 | 72 | **Solution**: 73 | 74 | ```Scala 75 | sqlContext.sql("""select substring(date, 0, 4), count(*) 76 | from properties 77 | group by substring(date, 0, 4) 78 | order by substring(date, 0, 4)""").collect() 79 | ``` 80 | 81 | All right, so everyone knows that properties in London are expensive. Find the average property price by county, and print the top 10 most expensive counties. 82 | 83 | **Solution**: 84 | 85 | ```Scala 86 | sqlContext.sql("""select county, avg(price) 87 | from properties 88 | group by county 89 | order by avg(price) desc 90 | limit 10""").collect() 91 | ``` 92 | 93 | Is there any trend for property sales during the year? Find the average property price in Greater London month over month in 2015 and 2016, and print it out by month. 94 | 95 | **Solution**: 96 | 97 | ```Scala 98 | sqlContext.sql("""select substring(date,0,4) as yr, substring(date,5,2) as mth, avg(price) 99 | from properties 100 | where county='GREATER LONDON' 101 | and substring(date,0,4) >= 2015 102 | group by substring(date,0,4), substring(date,5,2) 103 | order by substring(date,0,4), substring(date,5,2)""").collect() 104 | ``` 105 | 106 | 107 | 108 | Bonus: use the %sql to plot the property price changes month-over-month across the entire dataset. 109 | 110 | **Solution**: 111 | 112 | ```Scala 113 | %sql 114 | select year(date), month(date), avg(price) from properties group by year(date), month(date) order by year(date), month(date) 115 | ``` 116 | Open `settings` and in `Values` put `_c2` field 117 | ___ 118 | 119 | #### Discussion 120 | 121 | Now that you have experience in working with Spark SQL and `DataFrames`, what are the advantages and disadvantages of using it compared to the core RDD functionality (such as `map`, `filter`, `reduceByKey`, and so on)? Consider which approach produces more maintainable code, offers more opportunities for optimization, makes it easier to solve certain problems, and so on. 
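To make the comparison concrete, here is a sketch of the same "average price by county" query from Task 3, expressed with core RDD operations instead of SQL (it assumes the `df` created in Task 2 and is meant only as a point of comparison):

```scala
// Same aggregation as the SQL query above, written against the underlying RDD of Rows.
// Column indices follow custSchema: price is column 1, county is column 13.
val avgByCounty = df.rdd
  .filter(row => !row.isNullAt(1) && !row.isNullAt(13))
  .map(row => (row.getString(13), (row.getInt(1).toLong, 1L)))
  .reduceByKey { case ((sum1, cnt1), (sum2, cnt2)) => (sum1 + sum2, cnt1 + cnt2) }
  .mapValues { case (sum, cnt) => sum.toDouble / cnt }

avgByCounty.sortBy(-_._2).take(10).foreach(println)
```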
122 | -------------------------------------------------------------------------------- /python/lab0-python.md: -------------------------------------------------------------------------------- 1 | ### Lab 0: Python Fundamentals 2 | 3 | The purpose of this lab is to make sure you are sufficiently acquainted with Python to succeed in the rest of the labs. If Python is one of your primary language, this should be smooth sailing; otherwise, please make sure you complete these tasks before moving on to the next labs. 4 | 5 | This lab assumes that you have Python 2.6+ installed on your system. If you're using the instructor-provided appliance, you're all set. Otherwise, please make sure that Python is installed and is in the path, so you can type `python` to launch it from a terminal window. 6 | 7 | > If you're installing Python yourself, please install Python 2.x and not 3.x. Even though everything in these labs is supposed to work just fine with Python 3, a lot of libraries and frameworks still don't support it. 8 | 9 | ___ 10 | 11 | #### Task 1: Experimenting with the Python REPL 12 | 13 | Open a terminal window and run `python`. An interactive prompt similar to the following should appear: 14 | 15 | ``` 16 | Python 2.7.6 (default, Jun 22 2015, 17:58:13) 17 | [GCC 4.8.2] on linux2 18 | Type "help", "copyright", "credits" or "license" for more information. 19 | >>> 20 | ``` 21 | 22 | This is the Python REPL -- Read, Eval, Print Loop environment. Try some basic commands to make sure everything works (do not type the `>>>` prompt): 23 | 24 | ``` 25 | >>> 2 + 2 26 | 4 27 | >>> print("Hello, REPL") 28 | Hello, REPL 29 | >>> exit() 30 | ``` 31 | 32 | Instead of `exit()`, you can also type Ctrl+D to leave the REPL environment. 33 | 34 | ___ 35 | 36 | #### Task 2: Implementing Python Functions 37 | 38 | Create a new file called `functions.py`. Use the following template so that when the file is executed directly, the `run` function will be called: 39 | 40 | ```python 41 | def run(): 42 | print("Hey there!") 43 | 44 | if __name__ == "__main__": 45 | run() 46 | ``` 47 | 48 | > Which editor should you use in the appliance? If you want to get into the spirit of the course, you could use `vim`, but if you're looking for something more user-friendly, use `nano` or the built-in web-based editor. 49 | 50 | To make sure everything's fine so far, run your Python program from a terminal window: 51 | 52 | ``` 53 | python functions.py 54 | ``` 55 | 56 | You should see "Hey there!" printed out. 57 | 58 | Next, implement a function called `wordcount` that takes a list of strings, and produces a dict with the number of times each string appears. Here is an example of its invocation and expected output: 59 | 60 | ```python 61 | print(wordcount(["the", "fox", "jumped", "over", "the", "dog"])) 62 | # Expecting { 'the': 2, 'fox': 1 }, and so on 63 | ``` 64 | 65 | You might find dict's `setdefault` method useful. To find out how it works, run `help(dict.setdefault)` from the Python REPL. Alternatively, to test whether a key is present in a dictionary, use `if key in dict ...`. 66 | 67 | **Solution**: 68 | 69 | ```python 70 | def wordcount(words): 71 | freqs = {} 72 | for word in words: 73 | freqs[word] = freqs.setdefault(word, 0) + 1 74 | return freqs 75 | ``` 76 | 77 | ___ 78 | 79 | #### Task 3: Using Collection Pipelines 80 | 81 | Given a collection of items, the `map`, `filter`, `reduce` and other functions we learned are very useful for transforming the collection into your desired dataset. 
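As a quick refresher, here is how they behave on a small list (in Python 2, `map` and `filter` return plain lists, and `reduce` is a built-in):

```python
# map applies a function to each element, filter keeps matching elements,
# and reduce folds the collection down to a single value.
print(map(lambda n: n + 1, [1, 2, 3]))           # [2, 3, 4]
print(filter(lambda n: n > 1, [1, 2, 3]))        # [2, 3]
print(reduce(lambda acc, n: acc + n, [1, 2, 3])) # 6
```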
Implement the following functions according to the instructions provided, and do not use loops in your implementation: 82 | 83 | * Given a list of numbers, use `filter` to filter out only the even numbers. 84 | 85 | * Given a list of numbers, use `map` to raise each number to the power of 2. 86 | 87 | * Given a list of words, use `reduce` to find the average word length. 88 | 89 | * Use `map` and `reduce` to solve [problem 6](https://projecteuler.net/problem=6) from Project Euler, which states: 90 | 91 | > Find the difference between the sum of the squares of the first one hundred natural numbers and the square of the sum. 92 | 93 | **Solution**: 94 | 95 | ```python 96 | def evens(numbers): 97 | return filter(lambda n: n % 2 == 0, numbers) 98 | 99 | def squares(numbers): 100 | return map(lambda n: n * n, numbers) 101 | 102 | def avg_length(words): 103 | return reduce(lambda sum, word: sum + len(word), words, 0) / \ 104 | float(len(words)) 105 | 106 | def problem6(): 107 | def _sum(numbers): 108 | return reduce(lambda a, b: a + b, numbers) # or use built-in sum() 109 | def square(n): 110 | return n * n 111 | return _sum(squares(xrange(1, 100))) - square(_sum(xrange(1, 100))) 112 | ``` 113 | 114 | ___ 115 | 116 | #### Discussion 117 | 118 | Why do you think Python is so successful in the data science, data analysis, machine learning, and scientific computing fields? 119 | 120 | Compare the solutions above to your favorite programming language (or at least the one you're using in your day job). Do you feel the lack of strong typing makes Python code harder to read or write? 121 | -------------------------------------------------------------------------------- /python/lab5-streaming.md: -------------------------------------------------------------------------------- 1 | ### Lab 5: Social Panic Analysis 2 | 3 | In this lab, you will use Spark Streaming to analyze Twitter statuses for civil unrest and map them by the place they are coming from. This lab is based on Will Farmer's work, ["Twitter Civil Unrest Analysis with Apache Spark"](http://will-farmer.com/twitter-civil-unrest-analysis-with-apache-spark.html). It is a simplified version that doesn't have as many external dependencies. 4 | 5 | > **NOTE**: If you are running this lab on your system (and not the instructor-provided appliance), you will need to install a couple of Python modules in case you don't have them already. Run the following commands from a terminal window: 6 | 7 | ``` 8 | sudo easy_install requests 9 | sudo easy_install requests_oauthlib 10 | ``` 11 | 12 | ___ 13 | 14 | #### Task 1: Creating a Twitter Application and Obtaining Credentials 15 | 16 | Making requests to the [Twitter Streaming API](https://dev.twitter.com/streaming/overview) requires credentials. You will need a Twitter account, and you will need to create a Twitter application and connect it to your account. That's all a lot simpler than it sounds! 17 | 18 | First, navigate to the [Twitter Application Management](https://apps.twitter.com) page. Sign in if necessary. If you do not have a Twitter account, this is the opportunity to create one. 19 | 20 | Next, create a new app. You will be prompted for a name, a description, and a website. Fill in anything you want (the name must be unique, though), accept the developer agreement, and continue. 21 | 22 | Switch to the **Keys and Access Tokens** tab on your new application's page. Copy the **Consumer Key** and **Consumer Secret** to a separate text file (in this order). 
Next, click **Create my access token** to authorize the application to access your own account. Copy the **Access Token** and **Access Token Secret** to the same text file (again, in this order). These four credentials are necessary for making requests to the Twitter Streaming API. 23 | 24 | ___ 25 | 26 | #### Task 2: Inspecting the Analysis Program 27 | 28 | Open the `analysis.py` file from the `~/data` folder in a text editor. This is a Spark Streaming application that connects to the Twitter Streaming API and produces a stream of (up to 50) tweets from England every 60 seconds. These tweets are then analyzed for suspicious words like "riot" and "http", and grouped by the location they are coming from. 29 | 30 | Inspect the source code for the application -- make sure you understand what the various functions do, and how data flows through the application. Most importantly, here is the key analysis piece: 31 | 32 | ```python 33 | stream.map(lambda line: ast.literal_eval(line)) \ 34 | .filter(filter_posts) \ 35 | .map(lambda data: (data[1]['name'], 1)) \ 36 | .reduceByKey(lambda a, b: a + b) \ 37 | .pprint() 38 | ``` 39 | 40 | To make this program work with your credentials, insert the four values you copied in the previous task in the appropriate locations in the source code: 41 | 42 | ```python 43 | auth = requests_oauthlib.OAuth1('API KEY', 'API SECRET', 44 | 'ACCESS TOKEN', 'ACCESS TOKEN SECRET') 45 | ``` 46 | 47 | ___ 48 | 49 | #### Task 3: Looking for Civil Unrest 50 | 51 | You're now ready to run the program and look for civil unrest! From a terminal window, navigate to the Spark installation directory (`~/spark` on the appliance) and run: 52 | 53 | ``` 54 | bin/spark-submit ~/data/analysis.py 55 | ``` 56 | 57 | You should see the obtained statistics printed every 60 seconds. If you aren't getting enough results, modify the keywords the program is looking for, or modify the bounding box to a larger area. 58 | 59 | If anything goes wrong, you should see the Twitter HTTP response details amidst the Spark log stream. For example: 60 | 61 | ``` 62 | https://stream.twitter.com/1.1/statuses/filter.json?language=en&locations=-0.489,51.28,0.236,51.686 63 | Exceeded connection limit for user 64 | ``` 65 | 66 | By the way, while we're at it, it's a good idea to learn how to configure the Spark driver's default log level. Navigate to the `~/spark/conf` directory in a terminal window, and inspect the `log4j.properties.template` file. Copy it to a file called `log4j.properties` (this is the one Spark actually reads), and in a text editor modify the following line to read "WARN" instead of "INFO": 67 | 68 | ``` 69 | log4j.rootCategory=INFO, console 70 | ``` 71 | 72 | Subsequent launches of `pyspark`, `spark-submit`, etc. will use the new log configuration, and print out only messages that have log level WARN or higher. 73 | 74 | ___ 75 | 76 | #### Discussion 77 | 78 | Spark Streaming is not a real-time data processing engine -- it still relies on micro-batches of elements, grouped into RDDs. Is this a serious limitation for our scenario? What are some scenarios in which it can be a serious limitation? 79 | 80 | Bonus reading: the [Apache Flink](https://flink.apache.org) project is an alternative data processing framework that is real-time-first, batch-second. It can be a better fit in some scenarios that require real-time processing with no batching at all. 
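For reference during the discussion, the micro-batch granularity is fixed when the `StreamingContext` is created; a minimal sketch (using the same 60-second interval as `analysis.py`) looks like this:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="unrest-analysis")
ssc = StreamingContext(sc, 60)  # every operation below runs on 60-second micro-batches

# ... build the DStream pipeline here, as in analysis.py ...

ssc.start()
ssc.awaitTermination()
```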
81 | -------------------------------------------------------------------------------- /python/lab4-propprices.md: -------------------------------------------------------------------------------- 1 | ### Lab 4: Analyzing UK Property Prices 2 | 3 | In this lab, you will work with another real-world dataset that contains residential property sales across the UK, as reported to the Land Registry. You can download this dataset and many others from [data.gov.uk](https://data.gov.uk/dataset/land-registry-monthly-price-paid-data). 4 | 5 | ___ 6 | 7 | #### Task 1: Inspecting the Data 8 | 9 | As always, we begin by inspecting the data, which is in the `~/data/prop-prices.csv` file. Run the following command to take a look at some of the entries: 10 | 11 | ``` 12 | head ~/data/prop-prices.csv 13 | ``` 14 | 15 | Note that this time, the CSV file does not have headers. To determine which fields are available, consult the [guidance page](https://www.gov.uk/guidance/about-the-price-paid-data). 16 | 17 | ___ 18 | 19 | #### Task 2: Importing the Data 20 | 21 | In a previous lab, we used the Python `csv` module to parse CSV files. However, because we're working with structured data, the Spark SQL framework can be easier to use and provide better performance. We are going to use the `pyspark_csv` third-party open source module to create a `DataFrame` from an RDD of CSV lines. 22 | 23 | > **NOTE**: The `pyspark_csv.py` file is in the `~/externals` directory on the appliance. You can also [download it yourself](https://github.com/seahboonsiew/pyspark-csv) and place it in some directory. 24 | > 25 | > This module also depends on the `dateutils` module, which typically doesn't ship with Python. It is already installed in the appliance. To install it on your own machine, run the following from a terminal window: 26 | 27 | ``` 28 | sudo easy_install dateutils 29 | ``` 30 | 31 | To import `pyspark_csv`, you'll need the following snippet of code that adds its path to the module search path, and adds it to the Spark executors so they can find it as well: 32 | 33 | ```python 34 | import sys 35 | sys.path.append('/home/ubuntu/externals') # replace as necessary 36 | import pyspark_csv 37 | sc.addFile('/home/ubuntu/externals/pyspark_csv.py') # ditto 38 | ``` 39 | 40 | Next, load the `prop-prices.csv` file as an RDD, and use the `csvToDataFrame` function from the `pyspark_csv` module to create a `DataFrame` and register it as a temporary table so that you can run SQL queries: 41 | 42 | ```python 43 | columns = ['id', 'price', 'date', 'zip', 'type', 'new', 'duration', 'PAON', 44 | 'SAON', 'street', 'locality', 'town', 'district', 'county', 'ppd', 45 | 'status'] 46 | 47 | rdd = sc.textFile("file:///home/ubuntu/data/prop-prices.csv") 48 | df = pyspark_csv.csvToDataFrame(sqlContext, rdd, columns=columns) 49 | df.registerTempTable("properties") 50 | df.persist() 51 | ``` 52 | 53 | ___ 54 | 55 | #### Task 3: Analyzing Property Price Trends 56 | 57 | First, let's do some basic analysis on the data. Find how many records we have per year, and print them out sorted by year. 58 | 59 | **Solution**: 60 | 61 | ```python 62 | sqlContext.sql("""select year(date), count(*) 63 | from properties 64 | group by year(date) 65 | order by year(date)""").collect() 66 | ``` 67 | 68 | All right, so everyone knows that properties in London are expensive. Find the average property price by county, and print the top 10 most expensive counties. 
69 | 70 | **Solution**: 71 | 72 | ```python 73 | sqlContext.sql("""select county, avg(price) 74 | from properties 75 | group by county 76 | order by avg(price) desc 77 | limit 10""").collect() 78 | ``` 79 | 80 | Is there any trend for property sales during the year? Find the average property price in Greater London month over month in 2015 and 2016, and print it out by month. 81 | 82 | **Solution**: 83 | 84 | ```python 85 | sqlContext.sql("""select year(date) as yr, month(date) as mth, avg(price) 86 | from properties 87 | where county='GREATER LONDON' 88 | and year(date) >= 2015 89 | group by year(date), month(date) 90 | order by year(date), month(date)""").collect() 91 | ``` 92 | 93 | Bonus: use the Python `matplotlib` module to plot the property price changes month-over-month across the entire dataset. 94 | 95 | > The `matplotlib` module is installed in the instructor-provided appliance. However, there is no X environment, so you will not be able to view the actual plot. For your own system, follow the [installation instructions](http://matplotlib.org/users/installing.html). 96 | 97 | **Solution**: 98 | 99 | ```python 100 | monthPrices = sqlContext.sql("""select year(date), month(date), avg(price) 101 | from properties 102 | group by year(date), month(date) 103 | order by year(date), month(date)""").collect() 104 | import matplotlib.pyplot as plt 105 | values = map(lambda row: row._c2, monthPrices) 106 | plt.rcdefaults() 107 | plt.scatter(xrange(0,len(values)), values) 108 | plt.show() 109 | ``` 110 | 111 | ___ 112 | 113 | #### Discussion 114 | 115 | Now that you have experience in working with Spark SQL and `DataFrames`, what are the advantages and disadvantages of using it compared to the core RDD functionality (such as `map`, `filter`, `reduceByKey`, and so on)? Consider which approach produces more maintainable code, offers more opportunities for optimization, makes it easier to solve certain problems, and so on. 116 | -------------------------------------------------------------------------------- /python/lab7-plagiarism.md: -------------------------------------------------------------------------------- 1 | ### Lab 7: Plagiarism Detection 2 | 3 | In this lab, you will use Spark's Machine Learning library (MLLib) to perform plagiarism detection -- determine how similar a document is to a collection of existing documents. 4 | 5 | You will use the [TF-IDF](https://en.wikipedia.org/wiki/Tf–idf) algorithm, which extracts numeric features (vectors) from text documents. TF-IDF stands for Term Frequency Inverse Document Frequency, and it is a normalized representation of how frequently a term (word) occurs in a document that belongs to a set of documents: 6 | 7 | * The *TF*[*t*, *D*] -- term frequency of term *t* in a document *D* -- is simply the number of times *t* appears in *D*. 8 | 9 | * The *DF*[*t*] -- document frequency of term *t* in a collection of documents *D*(1), ..., *D*(*n*) -- is the number of documents in which *t* appears. 10 | 11 | * The *TFIDF*[*t*, *D*(*i*)] of term *t* in a document *D*(*i*) in a collection of documents *D*(1), ..., *D*(*n*) is *TF*[*t*, *D*(*i*)] · *log*[(*n* + 1) / (*DF*[*t*, *D*(*i*)] + 1)]. 12 | 13 | These values are not very hard to compute, but when the documents are very large there is a lot of room for optimization. MLLib (the machine learning library that ships with Spark) has optimized versions of these feature extraction algorithms, among many other ML algorithms for clustering, classification, dimensionality reduction, etc. 
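In other words (restating the definitions above in a single formula, with *n* the number of documents and *DF*[*t*] the collection-wide document frequency):

```latex
\mathrm{TFIDF}[t, D_i] = \mathrm{TF}[t, D_i] \cdot \log\frac{n + 1}{\mathrm{DF}[t] + 1}
```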
14 | 15 | The similarity between two documents can be obtained by computing the cosine similarity (normalized dot product) of their TF-IDF vectors. For two documents *D*, *E* with TF-IDF vectors *t*, *s* the cosine similarity is defined as *t* ○ *s* / |*t*| · |*s*| -- note this is a number between 0 and 1, due to normalization. If the cosine similarity is 1, the documents are identical; if the similarity is 0, the documents have nothing in common. 16 | 17 | ___ 18 | 19 | #### Task 1: Inspecting the Data 20 | 21 | In the `~/data/essays` directory you'll find a collection of 1497 essays written by students of English at the University of Uppsala, also known as the [Uppsala Student English Corpus (USE)](http://www.engelska.uu.se/Forskning/engelsk_sprakvetenskap/Forskningsomraden/Electronic_Resource_Projects/USE-Corpus/). Your task will be to determine whether another essay, in the file `~/data/essays/candidate`, has been plagiarized from one of the other essays, or whether it is original work. 22 | 23 | First, let's take a look at some of the files. From a terminal window, execute the following command to inspect the first 10 files: 24 | 25 | ``` 26 | ls ~/data/essays/*.txt | head -n 10 | xargs less 27 | ``` 28 | 29 | In the resulting `less` window, use `:n` to move to the next file, and `q` to quit. As you can see, these are student essays on various topics. Now take a look at the candidate file: 30 | 31 | ``` 32 | less ~/data/essays/candidate 33 | ``` 34 | 35 | ___ 36 | 37 | #### Task 2: Detecting Document Similarity 38 | 39 | First, you need to load the documents to an RDD of word vectors, one per document. Note that the documents are need to be cleaned up so that we indeed produce a vector per document. These will be processed by MLLib to obtain an RDD of TF-IDF vectors. 40 | 41 | ```python 42 | import re 43 | 44 | # An even better cleanup would include stemming, 45 | # careful punctuation removal, etc. 46 | def clean(doc): 47 | return filter(lambda w: len(w) > 2, 48 | map(lambda s: s.lower(), re.split(r'\W+', doc))) 49 | 50 | essays = sc.wholeTextFiles("file:///home/ubuntu/data/essays/*.txt") \ 51 | .mapValues(clean) \ 52 | .cache() 53 | essayNames = essays.map(lambda (filename, contents): filename).collect() 54 | docs = essays.map(lambda (filename, contents): contents) 55 | ``` 56 | 57 | Next, you can compute the TF vectors for all the document vectors using the `HashingTF` algorithm: 58 | 59 | ```python 60 | from pyspark.mllib.feature import HashingTF, IDF 61 | 62 | hashingTF = HashingTF() 63 | tf = hashingTF.transform(docs) 64 | tf.cache() # we will reuse it twice for TF-IDF 65 | ``` 66 | 67 | And now you can find the TF-IDF vectors -- this requires two passes: one to find the IDF vectors and another to scale the terms in the vectors. 68 | 69 | ```python 70 | idf = IDF().fit(tf) 71 | tfidf = idf.transform(tf) 72 | ``` 73 | 74 | Now that you have the TF-IDF vectors for the entire dataset, you can compute the similarity of a new document, `candidate`, to all the existing documents. 
To do so, you need to find that document's TF-IDF vector, and then find the cosine similarity of that vector with all the existing TF-IDF vectors: 75 | 76 | ```python 77 | candidate = clean(open('/home/ubuntu/data/essays/candidate').read()) 78 | candidateTf = hashingTF.transform(candidate) 79 | candidateTfIdf = idf.transform(candidateTf) 80 | similarities = tfidf.map(lambda v: v.dot(candidateTfIdf) / 81 | (v.norm(2) * candidateTfIdf.norm(2))) 82 | ``` 83 | 84 | All that's left is pick the most similar documents and see if there's high similarity: 85 | 86 | ```python 87 | topFive = sorted(enumerate(similarities.collect()), key=lambda (k, v): -v)[0:5] 88 | for idx, val in topFive: 89 | print("doc '%s' has score %.4f" % (essayNames[idx], val)) 90 | ``` 91 | 92 | You can experiment with slight modifications to the text of `candidate` and see if our naive algorithm can still detect its origin. 93 | 94 | ___ 95 | 96 | #### Discussion 97 | 98 | Why did we use `similarities.collect()` to bring the dataset to the driver program and then sort the results? 99 | 100 | Which parts of working with MLLib do you find particularly useful, and which parts seem confusing? 101 | -------------------------------------------------------------------------------- /scala/lab0-scala.md: -------------------------------------------------------------------------------- 1 | In this lab, you will become acquainted with your Spark installation. 2 | 3 | > The instructor should have explained how to install Spark on your machine. One option is to use the instructor's VirtualBox appliance, which you can import in the VirtualBox application. The appliance has Spark 1.6.1 installed, and has all the necessary data files for this and subsequent exercises in the `~/data` directory. 4 | > 5 | > Alternatively, you can install Spark yourself. Download it from [spark.apache.org](http://spark.apache.org/downloads.html) -- make sure to select a prepackaged binary version, such as [Spark 1.6.1 for Hadoop 2.6](http://www.apache.org/dyn/closer.lua/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz). Extract the archive to some location on your system. Then, download the [data files](https://www.dropbox.com/s/un1zr1jg6buoe3a/data.zip?dl=0) for the labs and place them in `~/data`. 6 | > 7 | > **NOTE**: If you install Spark on Windows (not in a virtual machine), many things are going to be more difficult. Ask the instructor for advice if necessary. 8 | 9 | The purpose of this lab is to make sure you are sufficiently acquainted with Scala to succeed in the rest of the labs. 10 | If Scala is one of your primary language, this should be smooth sailing; otherwise, please make sure you complete these 11 | tasks before moving on to the next labs. 12 | 13 | This lab assumes that you have Spark 1.6+ installed on your system. 14 | If you're using the instructor-provided VirtualBox appliance, you're all set. 15 | 16 | 17 | 18 | #### Task 1: Experimenting with the Spark REPL 19 | 20 | Open a terminal window navigate to Spark/bin and run `./spark-shell`. An interactive prompt similar to the following should appear: 21 | 22 | ``` 23 | Welcome to 24 | ____ __ 25 | / __/__ ___ _____/ /__ 26 | _\ \/ _ \/ _ `/ __/ '_/ 27 | /___/ .__/\_,_/_/ /_/\_\ version 1.6.1 28 | /_/ 29 | 30 | Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_71) 31 | 32 | 33 | SQL context available as sqlContext. 34 | 35 | scala> 36 | ``` 37 | 38 | This is the scala REPL -- Read, Eval, Print Loop environment. 
Try some basic commands to make sure everything works: 39 | ``` 40 | scala> 2+2 41 | res0: Int = 4 42 | 43 | scala> println("Hello") 44 | Hello 45 | ``` 46 | ___ 47 | 48 | #### Task 2: Scala basics 49 | 50 | Scala is object-oriented language and everything is an object, including numbers or functions. 51 | The expresion: 1 + 2 * 3 52 | is equivalent to: (1).+((2).*(3)) here we used unary numerical methods. 53 | 54 | Functions are objects. They can be passed into functions as arguments, stored in variable or return them from function. 55 | This is the core of the paradigm called "Functional programming". 56 | 57 | #### Object 58 | ``` 59 | Object ScalaBasics { 60 | def Foo(bar: () => Unit) { 61 | bar() 62 | } 63 | 64 | def Bar() { 65 | println("This is Bar") 66 | } 67 | 68 | def main(args: Array[String]){ 69 | Foo(Bar) 70 | } 71 | } 72 | ``` 73 | 74 | * In first line we see "Object" keyword. This is declaration of class with a single instance (commonly known as singelton object). 75 | * Function parameter declaration in line 2 "() => Unit" translated as: no input parameters and the function returns nothing (like void in C#) 76 | 77 | #### Variables (val vs. var) 78 | 79 | Both vals and vars must be initialized when defined, but only vars can be later reassigned to refer to a different object. Both are evaluated once. 80 | ``` 81 | val x = 3 82 | x: Int = 3 83 | 84 | x = 4 85 | error: reassignment to val 86 | 87 | var y = 5 88 | y: Int = 5 89 | 90 | y = 6 91 | y: Int = 6 92 | ``` 93 | 94 | #### Case Class 95 | 96 | This is regular class that export constuctor parameters and provide a decomposition mechanism via "pattern matching" 97 | ``` 98 | abstract class Employee 99 | case class Worker(name: String, managerName: String) extends Employee 100 | case class Manager(name: String) extends Employee 101 | ``` 102 | 103 | The constuctor parameters can be accessed directly 104 | ``` 105 | val emp1 = Manager("Dan") 106 | emp1.name 107 | res0: String = Dan 108 | 109 | def IsWorkerOrManager(emp: Employee): String = { 110 | val result = emp match { 111 | case Worker(name, _) => { 112 | println("Worker: " + name) 113 | "Worker" 114 | } 115 | case Manager(name) => { 116 | println("Manager: " + name) 117 | "Manager" 118 | } 119 | } 120 | result 121 | } 122 | 123 | IsWorkerOrManager(emp1) 124 | Manager: Dan 125 | res1: String = Manager 126 | ``` 127 | 128 | #### Tuples 129 | 130 | Tuples are collection of items not of the same types, but they are immutable. 131 | ``` 132 | val t = (1, "Hello", 3.0) 133 | t: (Int, String, Double) = (1,Hello,3.0) 134 | ``` 135 | 136 | The access to elemets done by ._ of element. 137 | ``` 138 | scala> println(t._1) 139 | 1 140 | 141 | scala> println(t._2) 142 | Hello 143 | 144 | scala> println(t._3) 145 | 3.0 146 | ``` 147 | 148 | #### Lambda 149 | ``` 150 | def fun1 = (x: Int) => println(x) 151 | fun1(3) 152 | 3 153 | 154 | def f1 = () => "Hello" 155 | f1() 156 | res3: String = Hello 157 | ``` 158 | 159 | #### Using "_" (Underscore) 160 | 161 | In Scala we can replace variables by "_" 162 | ``` 163 | val intList=List(1,2,3,4) 164 | intList.map(_ + 1) is equivalent to following: 165 | intList.map(x => x + 1) 166 | res4: List[Int] = List(2, 3, 4, 5) 167 | 168 | intList.reduce(_ + _) is equivalent to following: 169 | intList.reduce((acc, x) => acc + x) 170 | ``` 171 | 172 | In pattern matching the use of "_" is done when we do not care about the variable. 173 | Review of the match from IsWorkerOrManager function. 174 | 175 | ``` 176 | ... 
177 | emp match { 178 | case Worker(name, _) => { 179 | println("Worker: " + name) 180 | "Worker" 181 | } 182 | ... 183 | } 184 | ... 185 | ``` 186 | 187 | We want to know the name of the worker, but we do not care about the name of manager. 188 | -------------------------------------------------------------------------------- /python/lab1-wordcount.md: -------------------------------------------------------------------------------- 1 | ### Lab 1: Multi-File Word Count 2 | 3 | In this lab, you will become acquainted with your Spark installation, and run your first Spark job -- a multi-file word count. 4 | 5 | > The instructor should have explained how to install Spark on your machine. One option is to use the instructor's appliance, which you can access through any web browser. The appliance has Spark 1.6.2 installed, and has all the necessary data files for this and subsequent exercises in the `~/data` directory. 6 | > 7 | > Alternatively, you can install Spark yourself. Download it from [spark.apache.org](http://spark.apache.org/downloads.html) -- make sure to select a prepackaged binary version, such as [Spark 1.6.1 for Hadoop 2.6](http://www.apache.org/dyn/closer.lua/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz). Extract the archive to some location on your system. Then, download the [data files](../data.zip) for the labs and place them in `~/data`. 8 | > 9 | > **NOTE**: If you install Spark on Windows (not in a virtual machine), many things are going to be more difficult. Ask the instructor for advice if necessary. 10 | 11 | ___ 12 | 13 | #### Task 1: Inspecting the Spark Installation 14 | 15 | Open a terminal window. Navigate to the directory where you extracted Apache Spark. On the instructor-provided virtual machine, this is `~/spark`. 16 | 17 | Inspect the files in the `bin` directory. You will soon use `pyspark` to launch your first Spark job. Also note `spark-submit`, which is used to submit standalone Spark programs to a cluster. 18 | 19 | Inspect the scripts in the `sbin` directory. These scripts help with setting up a stand-alone Spark cluster, deploying Spark to EC2 virtual machines, and a bunch of additional tasks. 20 | 21 | Finally, take a look at the `examples` directory. You can find a number of stand-alone demo programs here, covering a variety of Spark APIs. 22 | 23 | ___ 24 | 25 | #### Task 2: Inspecting the Lab Data Files 26 | 27 | In this lab, you will implement a multi-file word count. The texts you will use are freely available books from [Project Gutenberg](http://www.gutenberg.org), including classics such as Lewis Carroll's "Alice in Wonderland" and Jane Austin's "Pride and Prejudice". 28 | 29 | Take a look at some of the text files in the `~/data` directory. From the terminal, run: 30 | 31 | ``` 32 | head -n 50 ~/data/*.txt | less 33 | ``` 34 | 35 | This shows the first 50 lines of each file. Press SPACE to scroll, or `q` to exit `less`. 36 | 37 | ___ 38 | 39 | #### Task 3: Implementing a Multi-File Word Count 40 | 41 | Navigate to the Spark installation directory, and run `bin/pyspark`. After a few seconds, you should see an interactive Python shell, which has a pre-initialized `SparkContext` object called `sc`. 42 | 43 | ``` 44 | Welcome to 45 | ____ __ 46 | / __/__ ___ _____/ /__ 47 | _\ \/ _ \/ _ `/ __/ '_/ 48 | /__ / .__/\_,_/_/ /_/\_\ version 1.6.1 49 | /_/ 50 | 51 | Using Python version 2.7.6 (default, Jun 22 2015 17:58:13) 52 | SparkContext available as sc, HiveContext available as sqlContext. 
53 | >>> 54 | ``` 55 | 56 | To explore the available methods, run the following command: 57 | 58 | ```python 59 | dir(sc) 60 | ``` 61 | 62 | In this lab, you are going to use the `sc.textFile` method. To figure out what it does, run the following command: 63 | 64 | ```python 65 | help(sc.textFile) 66 | ``` 67 | 68 | Note that even though it's not mentioned in the short documentation snippet you just read, the `textFile` method can also work with a directory path or a wildcard filter such as `/home/ubuntu/data/*.txt`. 69 | 70 | > Of course, if you are not using the instructor-supplied appliance, your `data` directory might reside in a different location. 71 | 72 | Your first task is to print out the number of lines in all the text files, combined. In general, you should try to come up with the solution yourself, and only then continue reading for the "school" solution. 73 | 74 | **Solution**: 75 | 76 | ```python 77 | sc.textFile("/home/ubuntu/data/*.txt").count() 78 | ``` 79 | 80 | Great! Your next task is to implement the actual word-counting program. You've already seen one in class, and now it's time for your own. Print the top 10 most frequent words in the provided books. 81 | 82 | **Solution**: 83 | 84 | ```python 85 | lines = sc.textFile("/home/ubuntu/data/*.txt") 86 | words = lines.flatMap(lambda line: line.split()) 87 | pairs = words.map(lambda word: (word, 1)) 88 | freqs = pairs.reduceByKey(lambda a, b: a + b) 89 | top10 = freqs.sortBy(lambda (word, count): -count).take(10) 90 | for (word, count) in top10: 91 | print("the word '%s' appears %d times" % (word, count)) 92 | ``` 93 | 94 | To be honest, we don't really care about words like "the", "a", and "of". Ideally, we would have a list of stop words to ignore. For now, modify your solution to filter out words shorter than 4 characters. 95 | 96 | Additionally, you might be wondering about the types of all these variables -- most of them are RDDs. To trace the lineage of an RDD, use the `toDebugString` method. For example, `print(freqs.toDebugString())` should display the logical plan for that RDD's evaluation. We will discuss some of these concepts later. 97 | 98 | ___ 99 | 100 | #### Task 4: Run a Stand-Alone Spark Program 101 | 102 | You're now ready to convert your multi-file word count into a stand-alone Spark program. Create a new file called `wordcount.py`. 103 | 104 | Initialize a `SparkContext` as follows: 105 | 106 | ```python 107 | from pyspark import SparkContext 108 | 109 | def run(): 110 | sc = SparkContext() 111 | # TODO Your code goes here 112 | 113 | if __name__ == "__main__": 114 | run() 115 | ``` 116 | 117 | Now, you can copy and paste your solution in the `run` method. Congratulations -- you have a stand-alone Spark program! To run it, navigate back to the Spark installation directory in your terminal, and run the following command: 118 | 119 | ``` 120 | bin/spark-submit --master 'local[*]' path/to/wordcount.py 121 | ``` 122 | 123 | You should replace `path/to/wordcount.py` with the actual path on your system. If everything went fine, you should see a lot of diagnostic output, but somewhere buried in it would be your top 10 words. 124 | 125 | ___ 126 | 127 | #### Discussion 128 | 129 | Instead of using `reduceByKey`, you could have used a method called `countByValue`. Read its documentation, and try to understand how it works. Would using it be a good idea? 
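To ground the comparison, here is a sketch of the same top-10 computation using `countByValue` instead (note that it returns an ordinary Python dictionary on the driver rather than an RDD, which is the heart of the trade-off):

```python
lines = sc.textFile("/home/ubuntu/data/*.txt")
words = lines.flatMap(lambda line: line.split())

# countByValue() ships every distinct word and its count back to the driver,
# so the sorting below happens in plain Python, not in Spark.
counts = words.countByValue()
top10 = sorted(counts.items(), key=lambda kv: -kv[1])[:10]
for word, count in top10:
    print("the word '%s' appears %d times" % (word, count))
```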
130 | -------------------------------------------------------------------------------- /scala/lab6-pagerank.md: -------------------------------------------------------------------------------- 1 | ### Lab 6: Movie PageRank 2 | 3 | In this lab, you will run the [PageRank](https://en.wikipedia.org/wiki/PageRank) algorithm on a dataset of movie references, and try to identify the most popular movies based on how many references they have. The dataset you'll be working with is [provided by IMDB](http://www.imdb.com/interfaces). 4 | 5 | ___ 6 | 7 | #### Task 1: Inspecting the Data 8 | 9 | The original IMDB dataset is not very friendly for automatic processing. You can find it in the `~/data` folder of the VirtualBox appliance, or download it yourself from the IMDB FTP website -- it's the `movie-links.list` dataset. Here's a sampler: 10 | 11 | ``` 12 | "#1 Single" (2006) 13 | (referenced in "Howard Stern on Demand" (2005) {Lisa Loeb & Sister}) 14 | 15 | "#LawstinWoods" (2013) {The Arrival (#1.1)} 16 | (references "Lost" (2004)) 17 | (references Kenny Rogers and Dolly Parton: Together (1985) (TV)) 18 | (references The Grudge (2004)) 19 | (references The Ring (2002)) 20 | ``` 21 | 22 | Instead of using this raw dataset, there's a pre-processed one available in the `processed-movie-links.txt` file (it doesn't contain all the information from the first one, but we can live with that). Again, here's a sample: 23 | 24 | ``` 25 | $ head processed-movie-links.txt 26 | #LawstinWoods --> Lost 27 | #LawstinWoods --> Kenny Rogers and Dolly Parton: Together 28 | #LawstinWoods --> The Grudge 29 | #LawstinWoods --> The Ring 30 | #MonologueWars --> Trainspotting 31 | Community --> $#*! My Dad Says 32 | Conan --> $#*! My Dad Says 33 | Geeks Who Drink --> $#*! My Dad Says 34 | Late Show with David Letterman --> $#*! My Dad Says 35 | ``` 36 | 37 | ___ 38 | 39 | #### Task 2: Finding Top Movies 40 | 41 | Now it's time to implement the PageRank algorithm. It's probably the most challenging task so far, so here are some instructions that might help. 42 | 43 | > **NOTE**: This is a very naive implementation of PageRank, which doesn't really try to optimize and minimize data shuffling. The GraphX library, which is also part of Spark, has a [native implementation of PageRank](https://spark.apache.org/docs/1.1.0/graphx-programming-guide.html#pagerank). You can try it in Task 3. 44 | 45 | Begin by parsing the movie references to an RDD called `links` (using `SparkContext.textFile` and `map`) and processing it into key/value pairs where the key is the movie and the value is a list of all movies referenced by it. 46 | 47 | Next, create an RDD called `ranks` of key/value pairs where the key is the movie and the value is its rank, set to 1.0 initially for all movies. 48 | 49 | Next, write a function `computeContribs` that takes a list of referenced movies and the referencing movie's rank, and returns a list of key/value pairs where the key is the movie and the value is its rank contribution. Each of the referenced movies gets an equal portion of the referencing movie's rank. For example, if "Star Wars" currently has rank 1.0 and references "Wizard of Oz" and "Star Trek", then the function should return two pairs: `("Wizard of Oz", 0.5)` and `("Star Trek", 0.5)`. 50 | 51 | Next, we're getting to the heart of the algorithm. In a loop that repeats 10 times, compute a new RDD called `contribs` which is formed by joining `links` and `ranks` (the join is on the movie name). 
Use `flatMap` to collect the results from `computeContribs` on each key/value pair in the result of the join. To understand what we're doing, consider that joining `links` and `ranks` produces a pair RDD whose elements look like this: 
52 | 
53 | ```scala
54 | ("Star Wars", ("Wizard of Oz", "Star Trek", 0.8))
55 | ```
56 | 
57 | Now, invoking `computeContribs` on the value of this pair produces a list of pairs:
58 | 
59 | ```scala
60 | Array(("Wizard of Oz", 0.4), ("Star Trek", 0.4))
61 | ```
62 | 
63 | By applying `computeContribs` and collecting the results with `flatMap`, we get a pair RDD that has, for each movie, its contribution from each of its neighbors. You should now sum (reduce) this pair RDD by key, so we get the sum of each movie's contributions from its neighbors.
64 | 
65 | Next, the PageRank algorithm dictates that we should recompute each movie's rank from the `ranks` RDD as 0.15 + 0.85 times its neighbors' contribution (you can use `mapValues` for this). This recomputation produces a new input value for `ranks`.
66 | 
67 | Finally, when your loop is done, display the 10 highest-ranked movies and their PageRank.
68 | 
69 | **Solution**:
70 | 
71 | ```scala
72 | // links is RDD of (movie, [referenced movies])
73 | val links = sc.textFile("file:///home/vagrant/data/processed-movie-links.txt")
74 |     .map(line => line.split("-->"))
75 |     .map(x => (x(0).trim, x(1).trim))
76 |     .distinct()
77 |     .groupByKey()
78 |     .cache()
79 | 
80 | // ranks is RDD of (movie, 1.0)
81 | var ranks = links.map(movie => (movie._1, 1.0))
82 | 
83 | // each of our references gets a contribution of our rank divided by the
84 | // total number of our references
85 | def computeContribs(referenced: Array[String], rank: Double) = {
86 |   val count = referenced.length
87 |   referenced.map(movie => (movie, rank / count))
88 | }
89 | 
90 | for (a <- 1 to 10)
91 | {
92 |   // recompute each movie's contributions from its referencing movies
93 |   val contribs = links.join(ranks).flatMap(x => computeContribs(x._2._1.toArray, x._2._2))
94 | 
95 |   // recompute the movie's ranks by accounting for all its referencing
96 |   // movies' contributions
97 |   ranks = contribs.reduceByKey(_ + _)
98 |     .mapValues(rank => rank*0.85 + 0.15)
99 | }
100 | 
101 | 
102 | ranks.sortBy(x => -1*x._2).take(10).foreach(println)
103 | ```
104 | 
105 | ___
106 | 
107 | #### Task 3: GraphX PageRank
108 | 
109 | The PageRank algorithm we implemented in the previous task is not very efficient. For example, running it on our dataset for 100 iterations took approximately 15 minutes on a 4-core machine. Considering that there are "just" about 25,000 movies ranked, this is not a very good result.
110 | 
111 | Spark ships with a native graph algorithm library called GraphX. Unfortunately, it doesn't yet have a Python binding -- you can only use it from Scala and Java. But we're not going to let that stop us!
112 | 
113 | Navigate to the Spark installation directory (`~/spark` in the VirtualBox appliance) and run `bin/spark-shell`. This is the Spark Scala REPL, which is very similar to PySpark, except it uses Scala. 
First, you're going to need a couple of import statements: 114 | 115 | ```scala 116 | import org.apache.spark._ 117 | import org.apache.spark.graphx._ 118 | import org.apache.spark.graphx.lib._ 119 | ``` 120 | 121 | Next, load the graph edges from the supplied `~/data/movie-edges.txt` file: 122 | 123 | ```scala 124 | val graph = GraphLoader.edgeListFile(sc, 125 | "file:///home/vagrant/data/movie-edges.txt") 126 | ``` 127 | 128 | This file was generated from the same dataset, but it has a format that GraphX natively supports. You can check out the format by running the following commands: 129 | 130 | ``` 131 | $ head ~/data/movie-edges.txt 132 | 0 1 133 | 2 3 134 | 2 4 135 | 2 5 136 | 2 6 137 | 7 8 138 | 9 10 139 | 11 10 140 | 12 10 141 | 13 10 142 | $ head ~/data/movie-vertices.txt 143 | 0 Howard Stern on Demand 144 | 1 #1 Single 145 | 2 #LawstinWoods 146 | 3 Lost 147 | 4 Kenny Rogers and Dolly Parton: Together 148 | 5 The Grudge 149 | 6 The Ring 150 | 7 #MonologueWars 151 | 8 Trainspotting 152 | 9 Community 153 | ``` 154 | 155 | That's it -- we can run PageRank. Instead of working with a set number of iterations, the PageRank implementation in GraphX can run until the ranks converge (stop changing). We'll set the tolerance threshold to 0.0001, which means we're waiting for convergence up to that threshold. This computation took just under 2 minutes on the same machine! 156 | 157 | ```scala 158 | val pageRank = PageRank.runUntilConvergence(graph, 0.0001).vertices.map( 159 | p => (p._1.toInt, p._2)).cache() 160 | ``` 161 | 162 | > The resulting graph vertices are pairs of the vertex id and its rank. We use `toInt` to convert it to an int for the subsequent join operation. 163 | 164 | Next, load the vertices file that specifies the movie title for each id: 165 | 166 | ```scala 167 | val titles = sc.textFile("file:///home/vagrant/data/movie-vertices.txt").map( 168 | line => { 169 | val parts = line.split(" ") 170 | (parts(0).toInt, parts.drop(1).mkString(" ")) 171 | } 172 | ) 173 | ``` 174 | 175 | Finally, join the ranks and the titles and sort the result to print the top 10 movies by rank: 176 | 177 | ```scala 178 | titles.join(pageRank).sortBy(-_._2._2).map(_._2).take(10) 179 | ``` 180 | 181 | ___ 182 | 183 | #### Discussion 184 | 185 | Besides being easier to use than implementing your own algorithms, why do you think GraphX has potential for being faster than something you'd roll by hand? 186 | -------------------------------------------------------------------------------- /python/lab6-pagerank.md: -------------------------------------------------------------------------------- 1 | ### Lab 6: Movie PageRank 2 | 3 | In this lab, you will run the [PageRank](https://en.wikipedia.org/wiki/PageRank) algorithm on a dataset of movie references, and try to identify the most popular movies based on how many references they have. The dataset you'll be working with is [provided by IMDB](http://www.imdb.com/interfaces). 4 | 5 | ___ 6 | 7 | #### Task 1: Inspecting the Data 8 | 9 | The original IMDB dataset is not very friendly for automatic processing. You can find it in the `~/data` folder of the VirtualBox appliance, or download it yourself from the IMDB FTP website -- it's the `movie-links.list` dataset. 
Here's a sampler: 10 | 11 | ``` 12 | "#1 Single" (2006) 13 | (referenced in "Howard Stern on Demand" (2005) {Lisa Loeb & Sister}) 14 | 15 | "#LawstinWoods" (2013) {The Arrival (#1.1)} 16 | (references "Lost" (2004)) 17 | (references Kenny Rogers and Dolly Parton: Together (1985) (TV)) 18 | (references The Grudge (2004)) 19 | (references The Ring (2002)) 20 | ``` 21 | 22 | Instead of using this raw dataset, there's a pre-processed one available in the `processed-movie-links.txt` file (it doesn't contain all the information from the first one, but we can live with that). Again, here's a sample: 23 | 24 | ``` 25 | $ head processed-movie-links.txt 26 | #LawstinWoods --> Lost 27 | #LawstinWoods --> Kenny Rogers and Dolly Parton: Together 28 | #LawstinWoods --> The Grudge 29 | #LawstinWoods --> The Ring 30 | #MonologueWars --> Trainspotting 31 | Community --> $#*! My Dad Says 32 | Conan --> $#*! My Dad Says 33 | Geeks Who Drink --> $#*! My Dad Says 34 | Late Show with David Letterman --> $#*! My Dad Says 35 | ``` 36 | 37 | ___ 38 | 39 | #### Task 2: Finding Top Movies 40 | 41 | Now it's time to implement the PageRank algorithm. It's probably the most challenging task so far, so here are some instructions that might help. 42 | 43 | > **NOTE**: This is a very naive implementation of PageRank, which doesn't really try to optimize and minimize data shuffling. The GraphX library, which is also part of Spark, has a [native implementation of PageRank](https://spark.apache.org/docs/1.1.0/graphx-programming-guide.html#pagerank). You can try it in Task 3. 44 | 45 | Begin by parsing the movie references to an RDD called `links` (using `SparkContext.textFile` and `map`) and processing it into key/value pairs where the key is the movie and the value is a list of all movies referenced by it. 46 | 47 | Next, create an RDD called `ranks` of key/value pairs where the key is the movie and the value is its rank, set to 1.0 initially for all movies. 48 | 49 | Next, write a function `computeContribs` that takes a list of referenced movies and the referencing movie's rank, and returns a list of key/value pairs where the key is the movie and the value is its rank contribution. Each of the referenced movies gets an equal portion of the referencing movie's rank. For example, if "Star Wars" currently has rank 1.0 and references "Wizard of Oz" and "Star Trek", then the function should return two pairs: `("Wizard of Oz", 0.5)` and `("Star Trek", 0.5)`. 50 | 51 | Next, we're getting to the heart of the algorithm. In a loop that repeats 10 times, compute a new RDD called `contribs` which is formed by joining `links` and `ranks` (the join is on the movie name). Use `flatMap` to collect the results from `computeContribs` on each key/value pair in the result of the join. To understand what we're doing, consider that joining `links` and `ranks` produces a pair RDD whose elements look like this: 52 | 53 | ```python 54 | ("Star Wars", ("Wizard of Oz", "Star Trek", 0.8)) 55 | ``` 56 | 57 | Now, invoking `computeContribs` on the value of this pair produces a list of pairs: 58 | 59 | ```python 60 | [("Wizard of Oz", 0.4), ("Star Trek", 0.4)] 61 | ``` 62 | 63 | By applying `computeContribs` and collecting the results with `flatMap`, we get a pair RDD that has, for each movie, its contribution from each of its neighbors. You should now sum (reduce) this pair RDD by key, so we get the sum of each movie's contributions from its neighbors. 
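If the reduce step feels abstract, here is a tiny sketch with made-up contribution pairs (not the real dataset) showing what summing by key produces; the actual loop appears in the solution below.

```python
# Hypothetical contributions produced by flatMap over computeContribs:
contribs = sc.parallelize([
    ("Wizard of Oz", 0.4), ("Star Trek", 0.4),   # contributed by "Star Wars"
    ("Wizard of Oz", 0.25),                      # contributed by some other movie
])
sums = contribs.reduceByKey(lambda a, b: a + b)
print(sums.collect())  # e.g. [('Star Trek', 0.4), ('Wizard of Oz', 0.65)] -- order may vary
```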
64 | 65 | Next, the PageRank algorithm dictates that we should recompute each movie's rank from the `ranks` RDD as 0.15 + 0.85 times its neighbors' contribution (you can use `mapValues` for this). This recomputation produces a new input value for `ranks`. 66 | 67 | Finally, when your loop is done, display the 10 highest-ranked movies and their PageRank. 68 | 69 | **Solution**: 70 | 71 | ```python 72 | # links is RDD of (movie, [referenced movies]) 73 | links = sc.textFile("file:///home/ubuntu/data/processed-movie-links.txt") \ 74 | .map(lambda line: line.split("-->")) \ 75 | .map(lambda (a, b): (a.strip(), b.strip())) \ 76 | .distinct() \ 77 | .groupByKey() \ 78 | .cache() 79 | 80 | # ranks is RDD of (movie, 1.0) 81 | ranks = links.map(lambda (movie, _): (movie, 1.0)) 82 | 83 | # each of our references gets a contribution of our rank divided by the 84 | # total number of our references 85 | def computeContribs(referenced, rank): 86 | count = len(referenced) 87 | for movie in referenced: 88 | yield (movie, rank / count) 89 | 90 | for _ in range(0, 10): 91 | # recompute each movie's contributions from its referencing movies 92 | contribs = links.join(ranks).flatMap(lambda (_, (referenced, rank)): 93 | computeContribs(referenced, rank) 94 | ) 95 | # recompute the movie's ranks by accounting all its referencing 96 | # movies' contributions 97 | ranks = contribs.reduceByKey(lambda a, b: a + b) \ 98 | .mapValues(lambda rank: rank*0.85 + 0.15) 99 | 100 | for movie, rank in ranks.sortBy(lambda (_, rank): -rank).take(10): 101 | print('"%s" has rank %2.2f' % (movie, rank)) 102 | ``` 103 | 104 | ___ 105 | 106 | #### Task 3: GraphX PageRank 107 | 108 | The PageRank algorithm we implemented in the previous task is not very efficient. For example, running it on our dataset for 100 iterations took approximately 15 minutes on a 4-core machine. Considering that there are "just" about 25,000 movies ranked, this is not a very good result. 109 | 110 | Spark ships with a native graph algorithm library called GraphX. Unfortunately, it doesn't yet have a Python binding -- you can only use it from Scala and Java. But we're not going to let that stop us! 111 | 112 | Navigate to the Spark installation directory (`~/spark` in the appliance) and run `bin/spark-shell`. This is the Spark Scala REPL, which is very similar to PySpark, except it uses Scala. First, you're going to need a couple of import statements: 113 | 114 | ```scala 115 | import org.apache.spark._ 116 | import org.apache.spark.graphx._ 117 | import org.apache.spark.graphx.lib._ 118 | ``` 119 | 120 | Next, load the graph edges from the supplied `~/data/movie-edges.txt` file: 121 | 122 | ```scala 123 | val graph = GraphLoader.edgeListFile(sc, 124 | "file:///home/ubuntu/data/movie-edges.txt") 125 | ``` 126 | 127 | This file was generated from the same dataset, but it has a format that GraphX natively supports. You can check out the format by running the following commands: 128 | 129 | ``` 130 | $ head ~/data/movie-edges.txt 131 | 0 1 132 | 2 3 133 | 2 4 134 | 2 5 135 | 2 6 136 | 7 8 137 | 9 10 138 | 11 10 139 | 12 10 140 | 13 10 141 | $ head ~/data/movie-vertices.txt 142 | 0 Howard Stern on Demand 143 | 1 #1 Single 144 | 2 #LawstinWoods 145 | 3 Lost 146 | 4 Kenny Rogers and Dolly Parton: Together 147 | 5 The Grudge 148 | 6 The Ring 149 | 7 #MonologueWars 150 | 8 Trainspotting 151 | 9 Community 152 | ``` 153 | 154 | That's it -- we can run PageRank. 
Instead of working with a set number of iterations, the PageRank implementation in GraphX can run until the ranks converge (stop changing). We'll set the tolerance threshold to 0.0001, which means we're waiting for convergence up to that threshold. This computation took just under 2 minutes on the same machine! 155 | 156 | ```scala 157 | val pageRank = PageRank.runUntilConvergence(graph, 0.0001).vertices.map( 158 | p => (p._1.toInt, p._2)).cache() 159 | ``` 160 | 161 | > The resulting graph vertices are pairs of the vertex id and its rank. We use `toInt` to convert it to an int for the subsequent join operation. 162 | 163 | Next, load the vertices file that specifies the movie title for each id: 164 | 165 | ```scala 166 | val titles = sc.textFile("file:///home/ubuntu/data/movie-vertices.txt").map( 167 | line => { 168 | val parts = line.split(" "); 169 | (parts(0).toInt, parts.drop(1).mkString(" ")) 170 | } 171 | ) 172 | ``` 173 | 174 | Finally, join the ranks and the titles and sort the result to print the top 10 movies by rank: 175 | 176 | ```scala 177 | titles.join(pageRank).sortBy(-_._2._2).map(_._2).take(10) 178 | ``` 179 | 180 | ___ 181 | 182 | #### Discussion 183 | 184 | Besides being easier to use than implementing your own algorithms, why do you think GraphX has potential for being faster than something you'd roll by hand? 185 | -------------------------------------------------------------------------------- /scala/lab2-airlines.md: -------------------------------------------------------------------------------- 1 | ### Lab 2: Flight Delay Analysis 2 | 3 | In this lab, you will analyze a real-world dataset -- information about US flight delays in January 2016, courtesy of the United States Department of Transportation. You can [download additional datasets](http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time) later. Here's another example you might find interesting -- [US border crossing/entry data per port of entry](http://transborder.bts.gov/programs/international/transborder/TBDR_BC/TBDR_BCQ.html). 4 | 5 | ___ 6 | 7 | #### Task 1: Inspecting the Data 8 | 9 | This dataset ships with two files (in the `/home/vagrant/data` directory, if you are using the instructor-provided VirtualBox appliance). First, the `airline-format.html` file contains a brief description of the dataset, and the various data fields. For example, the `ArrDelay` field is the flight's arrival delay, in minutes. Second, the `airline-delays.csv` file is a comma-separated collection of flight records, one record per line. 10 | 11 | Inspect the fields described in the `airline-format.html` file. Make a note of fields that describe the flight, its origin and destination airports, and any delays encountered on departure and arrival. 12 | 13 | Let's start by counting the number of records in our dataset. Run the following command in a terminal window: 14 | 15 | ``` 16 | wc -l airline-delays.csv 17 | ``` 18 | 19 | This dataset has hundreds of thousands of records. 
To sample 10 records from the dataset picked at probability 0.005%, run the following command (for convenience, its output is also quoted here): 
20 | 
21 | ```
22 | $ cat airline-delays.csv | cut -d',' -f1-20 | awk '{ if (rand() <= 0.00005 || FNR==1) { print $0; if (++count > 11) exit; } }'
23 | "Year","Quarter","Month","DayofMonth","DayOfWeek","FlightDate","UniqueCarrier","AirlineID","Carrier","TailNum","FlightNum","OriginAirportID","OriginAirportSeqID","OriginCityMarketID","Origin","OriginCityName","OriginState","OriginStateFips","OriginStateName","OriginWac"
24 | 2016,1,1,20,3,2016-01-20,"AA",19805,"AA","N3AFAA","242",14771,1477102,32457,"SFO","San Francisco, CA","CA","06","California"
25 | 2016,1,1,9,6,2016-01-09,"AA",19805,"AA","N859AA","284",12173,1217302,32134,"HNL","Honolulu, HI","HI","15","Hawaii"
26 | 2016,1,1,9,6,2016-01-09,"AA",19805,"AA","N3GRAA","1227",11278,1127803,30852,"DCA","Washington, DC","VA","51","Virginia"
27 | 2016,1,1,4,1,2016-01-04,"AA",19805,"AA","N3BGAA","1450",11298,1129804,30194,"DFW","Dallas/Fort Worth, TX","TX","48","Texas"
28 | 2016,1,1,5,2,2016-01-05,"AA",19805,"AA","N3AMAA","1616",11298,1129804,30194,"DFW","Dallas/Fort Worth, TX","TX","48","Texas"
29 | 2016,1,1,20,3,2016-01-20,"AA",19805,"AA","N916US","1783",11057,1105703,31057,"CLT","Charlotte, NC","NC","37","North Carolina"
30 | 2016,1,1,2,6,2016-01-02,"AS",19930,"AS","N517AS","879",14747,1474703,30559,"SEA","Seattle, WA","WA","53","Washington"
31 | 2016,1,1,20,3,2016-01-20,"AS",19930,"AS","N769AS","568",14057,1405702,34057,"PDX","Portland, OR","OR","41","Oregon"
32 | 2016,1,1,24,7,2016-01-24,"UA",19977,"UA","","706",14843,1484304,34819,"SJU","San Juan, PR","PR","72","Puerto Rico"
33 | 2016,1,1,15,5,2016-01-15,"UA",19977,"UA","N34460","1077",12266,1226603,31453,"IAH","Houston, TX","TX","48","Texas"
34 | 2016,1,1,12,2,2016-01-12,"UA",19977,"UA","N423UA","1253",13303,1330303,32467,"MIA","Miami, FL","FL","12","Florida"
35 | ```
36 | 
37 | This displays the first 20 fields of the 10 sampled records from the file. The first line is a header line, so we printed it unconditionally. This is a typical example of structured data that we would have to parse first before analyzing it with Spark.
38 | 
39 | > We could examine the full dataset using shell commands, because it is not exceptionally big. For larger datasets that couldn't conceivably be processed or even stored on a single machine, we could have used Spark itself to perform the sampling. If you're interested, examine the `takeSample` method that Spark RDDs provide.
40 | 
41 | ___
42 | 
43 | #### Task 2: Parsing CSV Data
44 | 
45 | Next, you have to parse the CSV data. The header line provides the column names, and then each subsequent line can be parsed taking these into account. Spark has no built-in functionality for parsing CSV files, but there is a library, `com.databricks.spark.csv`, that you can use to parse CSV lines.
46 | 
47 | You must run the following code before running any other Spark code in your notebook.
48 | Otherwise, restart the Zeppelin Spark interpreter first: Interpreter -> spark (the first box) -> restart button.
49 | Now go back to your note and run:
50 | 
51 | ```scala
52 | %dep
53 | z.reset()
54 | z.load("com.databricks:spark-csv_2.11:1.4.0")
55 | ```
56 | 
57 | Next, create a DataFrame based on the `airline-delays.csv` file using this library.
58 | Note that you have access to a pre-initialized `SQLContext` object named `sqlContext`. 
59 | 
60 | ```scala
61 | val flightsDF = sqlContext.read
62 |     .format("com.databricks.spark.csv")
63 |     .option("header", "true") // Use first line of all files as header
64 |     .option("inferSchema", "true") // Automatically infer data types
65 |     .load("file:///home/vagrant/data/airline-delays.csv")
66 | ```
67 | 
68 | You can check the schema by printing it:
69 | ```scala
70 | flightsDF.printSchema
71 | ```
72 | 
73 | ___
74 | 
75 | #### Task 3: Converting the DataFrame to an RDD
76 | 
77 | In this lab we want to practice RDD operations; later in the workshop we will use DataFrames to manipulate data.
78 | First, create a case class `Flight` containing only the fields you will use in this lab: Carrier, OriginCityName, ArrDelay, DestCityName,
79 | and Distance.
80 | 
81 | Then create a `flightRdd` RDD from `flightsDF`.
82 | 
83 | **Solution**:
84 | 
85 | ```scala
86 | case class Flight(Carrier: String, OriginCityName: String, ArrDelay: Double, DestCityName: String, Distance: Double)
87 | val flightRdd = flightsDF.map(row => Flight(row.getAs("Carrier"), row.getAs("OriginCityName"), row.getAs("ArrDelay"), row.getAs("DestCityName"), row.getAs("Distance")))
88 | ```
89 | 
90 | ___
91 | 
92 | #### Task 4: Querying Flights and Delays
93 | 
94 | Now that you have the flight objects, it's time to perform a few queries and gather some useful information. Suppose you're in Boston, MA. Which airline has the most flights departing from Boston?
95 | 
96 | 
97 | **Solution**:
98 | 
99 | ```scala
100 | val carriersFromBoston = flightRdd.filter(f => f.OriginCityName == "Boston, MA").map(f => (f.Carrier, 1))
101 | val carrierWithMostFlights = carriersFromBoston.reduceByKey(_ + _).sortBy(_._2, false).take(1)
102 | ```
103 | 
104 | 
105 | Overall, which airline has the worst average delay? How bad was that delay?
106 | 
107 | > **HINT**: Use `combineByKey`.
108 | 
109 | 
110 | **Solution**:
111 | 
112 | ```scala
113 | val avgDelay = flightRdd.filter(f => f.ArrDelay > 0)
114 |     .map(f => (f.Carrier, f.ArrDelay))
115 |     .combineByKey(d => (d, 1),
116 |       (s: (Double, Int), d: Double) => (s._1 + d, s._2 + 1),
117 |       (s1: (Double, Int), s2: (Double, Int)) => (s1._1 + s2._1, s1._2 + s2._2)
118 |     )
119 | 
120 | val worstAirline = avgDelay.map{ case (car, (av, cnt)) => (car, av/cnt) }
121 | worstAirline.collect()
122 | ```
123 | 
124 | 
125 | Living in Chicago, IL, what are the farthest 10 destinations that you could fly to? (Note that our dataset contains only US domestic flights.)
126 | 
127 | **Solution**:
128 | 
129 | ```scala
130 | val chicagoFarthest = flightRdd.filter(f => f.OriginCityName == "Chicago, IL")
131 |     .map(f => (f.DestCityName, f.Distance))
132 |     .distinct()
133 |     .sortBy(_._2, false)
134 |     .take(10)
135 | ```
136 | 
137 | 
138 | Suppose you're in New York, NY and are contemplating direct flights to San Francisco, CA. In terms of arrival delay, which airline has the best record on that route?
139 | 
140 | **Solution**:
141 | 
142 | ```scala
143 | val nyToSF = flightRdd.filter(f => (f.OriginCityName == "New York, NY") && (f.DestCityName == "San Francisco, CA") && (f.ArrDelay > 0))
144 |     .map(f => (f.Carrier, f.ArrDelay))
145 |     .reduceByKey(_ + _)
146 |     .sortBy(_._2)
147 |     .take(1)
148 | ```
149 | 
150 | 
151 | Suppose you live in San Jose, CA, and there don't seem to be many direct flights taking you to Boston, MA. Of all the 1-stop flights, which would be the best option in terms of average arrival delay? (It's OK to assume that every pair of flights from San Jose to X and from X to Boston is an option that you could use.) 
152 | 153 | > **NOTE**: To answer this question, you will probably need a cartesian product of the dataset with itself. Beside the fact that it's a fairly expensive operation, we haven't learned about multi-RDD operations yet. Still, you can explore the `join` RDD method, which applies to pair (key-value) RDDs, discussed later in our workshop. 154 | 155 | **Solution**: 156 | 157 | ```scala 158 | val flightsByDst = flightRdd.filter(f => f.OriginCityName == "San Jose, CA") 159 | .map(f => (f.DestCityName, f)) 160 | 161 | val flightsByOrg = flightRdd.filter(f => f.DestCityName == "Boston, MA") 162 | .map(f => (f.OriginCityName, f)) 163 | 164 | def addDelays(f1:Flight, f2:Flight) = { 165 | var total = 0.0 166 | total += f1.ArrDelay 167 | total += f2.ArrDelay 168 | total 169 | } 170 | 171 | flightsByDst.join(flightsByOrg) 172 | .map{ case (city, (f1, f2)) => (city, addDelays(f1, f2)) } 173 | .combineByKey(d => (d, 1), 174 | (s: (Double, Int), d: Double) => (s._1 + d, s._2 + 1), 175 | (s1: (Double, Int), s2: (Double, Int)) => (s1._1 + s2._1, s1._2 + s2._2)) 176 | .map { case (city, s) => (city, s._1/s._2) } 177 | .sortBy(_._2) 178 | .take(1) 179 | ``` 180 | 181 | ___ 182 | 183 | ### Discussion 184 | 185 | Suppose you had to calculate multiple aggregated values from the `flights` RDD -- e.g., the average arrival delay, the average departure delay, and the average flight duration for flights from Boston. How would you express it using SQL, if `flights` was a table in a relational database? How would you express it using transformations and actions on RDDs? Which is easier to develop and maintain? 186 | 187 | 188 | 189 | -------------------------------------------------------------------------------- /python/lab2-airlines.md: -------------------------------------------------------------------------------- 1 | ### Lab 2: Flight Delay Analysis 2 | 3 | In this lab, you will analyze a real-world dataset -- information about US flight delays in January 2016, courtesy of the United States Department of Transportation. You can [download additional datasets](http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time) later. Here's another example you might find interesting -- [US border crossing/entry data per port of entry](http://transborder.bts.gov/programs/international/transborder/TBDR_BC/TBDR_BCQ.html). 4 | 5 | ___ 6 | 7 | #### Task 1: Inspecting the Data 8 | 9 | This dataset ships with two files (in the `~/data` directory, if you are using the instructor-provided appliance). First, the `airline-format.html` file contains a brief description of the dataset, and the various data fields. For example, the `ArrDelay` field is the flight's arrival delay, in minutes. Second, the `airline-delays.csv` file is a comma-separated collection of flight records, one record per line. 10 | 11 | Inspect the fields described in the `airline-format.html` file. Make a note of fields that describe the flight, its origin and destination airports, and any delays encountered on departure and arrival. 12 | 13 | Let's start by counting the number of records in our dataset. Run the following command in a terminal window: 14 | 15 | ``` 16 | wc -l airline-delays.csv 17 | ``` 18 | 19 | This dataset has hundreds of thousands of records. 
To sample 10 records from the dataset picked at probability 0.005%, run the following command (for convenience, its output is also quoted here): 20 | 21 | ``` 22 | $ cat airline-delays.csv | cut -d',' -f1-20 | awk '{ if (rand() <= 0.00005 || FNR==1) { print $0; if (++count > 11) exit; } }' 23 | "Year","Quarter","Month","DayofMonth","DayOfWeek","FlightDate","UniqueCarrier","AirlineID","Carrier","TailNum","FlightNum","OriginAirportID","OriginAirportSeqID","OriginCityMarketID","Origin","OriginCityName","OriginState","OriginStateFips","OriginStateName","OriginWac" 24 | 2016,1,1,20,3,2016-01-20,"AA",19805,"AA","N3AFAA","242",14771,1477102,32457,"SFO","San Francisco, CA","CA","06","California" 25 | 2016,1,1,9,6,2016-01-09,"AA",19805,"AA","N859AA","284",12173,1217302,32134,"HNL","Honolulu, HI","HI","15","Hawaii" 26 | 2016,1,1,9,6,2016-01-09,"AA",19805,"AA","N3GRAA","1227",11278,1127803,30852,"DCA","Washington, DC","VA","51","Virginia" 27 | 2016,1,1,4,1,2016-01-04,"AA",19805,"AA","N3BGAA","1450",11298,1129804,30194,"DFW","Dallas/Fort Worth, TX","TX","48","Texas" 28 | 2016,1,1,5,2,2016-01-05,"AA",19805,"AA","N3AMAA","1616",11298,1129804,30194,"DFW","Dallas/Fort Worth, TX","TX","48","Texas" 29 | 2016,1,1,20,3,2016-01-20,"AA",19805,"AA","N916US","1783",11057,1105703,31057,"CLT","Charlotte, NC","NC","37","North Carolina" 30 | 2016,1,1,2,6,2016-01-02,"AS",19930,"AS","N517AS","879",14747,1474703,30559,"SEA","Seattle, WA","WA","53","Washington" 31 | 2016,1,1,20,3,2016-01-20,"AS",19930,"AS","N769AS","568",14057,1405702,34057,"PDX","Portland, OR","OR","41","Oregon" 32 | 2016,1,1,24,7,2016-01-24,"UA",19977,"UA","","706",14843,1484304,34819,"SJU","San Juan, PR","PR","72","Puerto Rico" 33 | 2016,1,1,15,5,2016-01-15,"UA",19977,"UA","N34460","1077",12266,1226603,31453,"IAH","Houston, TX","TX","48","Texas" 34 | 2016,1,1,12,2,2016-01-12,"UA",19977,"UA","N423UA","1253",13303,1330303,32467,"MIA","Miami, FL","FL","12","Florida" 35 | ``` 36 | 37 | This displays the first 20 fields of the 10 sampled records from the file. The first line is a header line, so we printed it unconditionally. This is a typical example of structured data that we would have to parse first before analyzing it with Spark. 38 | 39 | > We could examine the full dataset using shell commands, because it is not exceptionally big. For larger datasets that couldn't conceivably be processed or even stored on a single machine, we could have used Spark itself to perform the sampling. If you're interested, examine the `takeSample` method that Spark RDDs provide. 40 | 41 | ___ 42 | 43 | #### Task 2: Parsing CSV Data 44 | 45 | Next, you have to parse the CSV data. The header line provides the column names, and then each subsequent line can be parsed taking these into account. Python has a built-in `csv` module (unrelated to Spark) that you can use to parse CSV lines. Try it out in a Python shell (by running `python` in a terminal window) or the Pyspark shell (by running `bin/pyspark` from Spark's installation directory in a terminal window): 46 | 47 | ```python 48 | import csv 49 | from StringIO import StringIO 50 | 51 | si = StringIO('"Alice",14,"panda"') 52 | fields = ["name", "age", "favorite animal"] 53 | csv.DictReader(si, fieldnames=fields).next() 54 | ``` 55 | 56 | Great! Next, write a function that parses one line from the flight delays CSV file. You can call that function `parseLine`, and it should return the Python dict that `DictReader.next` returns. 
57 | 58 | **Solution**: 59 | 60 | ```python 61 | def parseLine(line, fieldnames): 62 | si = StringIO(line) 63 | return csv.DictReader(si, fieldnames=fieldnames).next() 64 | ``` 65 | 66 | Next, create an RDD based on the `airline-delays.csv` file, and map each line of that file using the `parseLine` function you wrote. The result should be an RDD of Python dicts representing the flight delay data. Note that the first line (the header line) should be discarded. 67 | 68 | **Solution**: 69 | 70 | ```python 71 | rdd = sc.textFile("file:////home/ubuntu/data/airline-delays.csv") 72 | headerline = rdd.first() 73 | fieldnames = filter(lambda field: len(field) > 0, 74 | map(lambda field: field.strip('"'), headerline.split(','))) 75 | flights = rdd.filter(lambda line: line != headerline) \ 76 | .map(lambda line: parseLine(line, fieldnames)) 77 | flights.persist() 78 | ``` 79 | 80 | ___ 81 | 82 | #### Task 3: Querying Flights and Delays 83 | 84 | Now that you have the flight objects, it's time to perform a few queries and gather some useful information. Suppose you're in Boston, MA. Which airline has the most flights departing from Boston? 85 | 86 | **Solution**: 87 | 88 | ```python 89 | flightsByCarrier = flights.filter( 90 | lambda flight: flight['OriginCityName'] == "Boston, MA") \ 91 | .map(lambda flight: flight['Carrier']) \ 92 | .countByValue() 93 | sorted(flightsByCarrier.items(), key=lambda p: -p[1])[0] 94 | ``` 95 | 96 | Overall, which airline has the worst average delay? How bad was that delay? 97 | 98 | > **HINT**: Use `combineByKey`. 99 | 100 | **Solution**: 101 | 102 | ```python 103 | flights.filter(lambda f: f['ArrDelay'] != '') \ 104 | .map(lambda f: (f['Carrier'], float(f['ArrDelay']))) \ 105 | .combineByKey(lambda d: (d, 1), 106 | lambda s, d: (s[0]+d, s[1]+1), 107 | lambda s1, s2: (s1[0]+s2[0], s1[1]+s2[1])) \ 108 | .map(lambda (k, (s, c)): (k, s/float(c))) \ 109 | .collect() 110 | ``` 111 | 112 | Living in Chicago, IL, what are the farthest 10 destinations that you could fly to? (Note that our dataset contains only US domestic flights.) 113 | 114 | **Solution**: 115 | 116 | ```python 117 | flights.filter(lambda f: f['OriginCityName'] == "Chicago, IL") \ 118 | .map(lambda f: (f['DestCityName'], float(f['Distance']))) \ 119 | .distinct() \ 120 | .sortBy(lambda (dest, dist): -dist) \ 121 | .take(10) 122 | ``` 123 | 124 | Suppose you're in New York, NY and are contemplating direct flights to San Francisco, CA. In terms of arrival delay, which airline has the best record on that route? 125 | 126 | **Solution**: 127 | 128 | ```python 129 | flights.filter(lambda flight: flight['OriginCityName'] == "New York, NY" and 130 | flight['DestCityName'] == "San Francisco, CA" and 131 | flight['ArrDelay'] != '') \ 132 | .map(lambda flight: (flight['Carrier'], float(flight['ArrDelay']))) \ 133 | .reduceByKey(lambda a, b: a + b) \ 134 | .sortBy(lambda (carrier, delay): delay) \ 135 | .first() 136 | ``` 137 | 138 | Suppose you live in San Jose, CA, and there don't seem to be many direct flights taking you to Boston, MA. Of all the 1-stop flights, which would be the best option in terms of average arrival delay? (It's OK to assume that every pair of flights from San Jose to X and from X to Boston is an option that you could use.) 139 | 140 | > **NOTE**: To answer this question, you will probably need a cartesian product of the dataset with itself. Beside the fact that it's a fairly expensive operation, we haven't learned about multi-RDD operations yet. 
Still, you can explore the `join` RDD method, which applies to pair (key-value) RDDs, discussed later in our workshop. 141 | 142 | **Solution**: 143 | 144 | ```python 145 | flightsByDst = flights.filter(lambda f: f['OriginCityName'] == 'San Jose, CA')\ 146 | .map(lambda f: (f['DestCityName'], f)) 147 | flightsByOrg = flights.filter(lambda f: f['DestCityName'] == 'Boston, MA') \ 148 | .map(lambda f: (f['OriginCityName'], f)) 149 | 150 | def addDelays(f1, f2): 151 | total = 0 152 | total += float(f1['ArrDelay']) if f1['ArrDelay'] != '' else 0 153 | total += float(f2['ArrDelay']) if f2['ArrDelay'] != '' else 0 154 | return total 155 | 156 | flightsByDst.join(flightsByOrg) \ 157 | .map(lambda (city, (f1, f2)): (city, addDelays(f1, f2))) \ 158 | .combineByKey(lambda d: (d, 1), 159 | lambda s, d: (s[0]+d, s[1]+1), 160 | lambda s1, s2: (s1[0]+s2[0], s1[1]+s2[1])) \ 161 | .map(lambda (city, s): (city, s[0]/float(s[1]))) \ 162 | .sortBy(lambda (city, delay): delay) \ 163 | .first() 164 | ``` 165 | 166 | ___ 167 | 168 | ### Discussion 169 | 170 | Suppose you had to calculate multiple aggregated values from the `flights` RDD -- e.g., the average arrival delay, the average departure delay, and the average flight duration for flights from Boston. How would you express it using SQL, if `flights` was a table in a relational database? How would you express it using transformations and actions on RDDs? Which is easier to develop and maintain? 171 | --------------------------------------------------------------------------------