├── .gitignore ├── .gitattributes ├── data.zip ├── LICENSE.md ├── README.md ├── installation.sh ├── scala ├── lab3-startups.md ├── lab1-wordcount.md ├── lab4-propprices.md ├── lab0-scala.md ├── lab6-pagerank.md └── lab2-airlines.md └── python ├── lab3-startups.md ├── lab0-python.md ├── lab5-streaming.md ├── lab4-propprices.md ├── lab7-plagiarism.md ├── lab1-wordcount.md ├── lab6-pagerank.md └── lab2-airlines.md /.gitignore: -------------------------------------------------------------------------------- 1 | data/ -------------------------------------------------------------------------------- /.gitattributes: -------------------------------------------------------------------------------- 1 | data.zip filter=lfs diff=lfs merge=lfs -text 2 | -------------------------------------------------------------------------------- /data.zip: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:cd16a232434ae7e84c41b806c61d91047c5335dbca26d47d6d2b38417dcc70cc 3 | size 80991892 4 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2016 Sasha Goldshtein 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ### Spark Workshop 2 | 3 | This repository contains hands-on labs and data files for a full-day Apache Spark workshop. This file is also the index for the hands-on labs. 4 | 5 | ____ 6 | 7 | #### Python Labs 8 | 9 | 1. [Lab 0 - Python Fundamentals](python/lab0-python.md) 10 | 11 | 1. [Lab 1 - Multi-File Word Count](python/lab1-wordcount.md) 12 | 13 | 1. [Lab 2 - Analyzing Flight Delays](python/lab2-airlines.md) 14 | 15 | 1. [Lab 3 - Analyzing Startup Companies](python/lab3-startups.md) 16 | 17 | 1. [Lab 4 - Analyzing UK Property Prices](python/lab4-propprices.md) 18 | 19 | 1. [Lab 5 - Streaming Tweet Analysis](python/lab5-streaming.md) 20 | 21 | 1. [Lab 6 - PageRank over Movie References](python/lab6-pagerank.md) 22 | 23 | 1. [Lab 7 - Plagiarism Detection](python/lab7-plagiarism.md) 24 | 25 | ____ 26 | 27 | #### Scala Labs (under development) 28 | 29 | 1. [Lab 0 - Scala Fundamentals](scala/lab0-scala.md) 30 | 31 | 1. 
[Lab 1 - Multi-File Word Count](scala/lab1-wordcount.md) 32 | 33 | 1. [Lab 2 - Analyzing Flight Delays](scala/lab2-airlines.md) 34 | 35 | 1. [Lab 3 - Analyzing Startup Companies](scala/lab3-startups.md) 36 | 37 | 1. [Lab 4 - Analyzing UK Property Prices](scala/lab4-propprices.md) 38 | 39 | 1. [Lab 6 - PageRank over Movie References](scala/lab6-pagerank.md) 40 | 41 | ____ 42 | 43 | Copyright (C) Sasha Goldshtein, 2016. All rights reserved. 44 | -------------------------------------------------------------------------------- /installation.sh: -------------------------------------------------------------------------------- 1 | # oracle java 8 2 | echo "\n" | sudo add-apt-repository ppa:openjdk-r/ppa 3 | sudo apt-get update -y 4 | sudo apt-get install -y openjdk-8-jdk 5 | 6 | # spark download and setup 7 | wget https://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz -O /tmp/spark-1.6.1.tgz 8 | sudo ufw disable 9 | sudo mkdir -p /usr/lib/spark 10 | sudo tar -xf /tmp/spark-1.6.1.tgz --strip 1 -C /usr/lib/spark 11 | echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> ~/.bash_profile 12 | echo "export PATH=\$JAVA_HOME/bin:\$PATH" >> ~/.bash_profile 13 | echo "export SPARK_HOME=/usr/lib/spark" >> ~/.bash_profile 14 | echo "export PATH=\$SPARK_HOME/bin:\$PATH" >> ~/.bash_profile 15 | source ~/.bash_profile 16 | 17 | # spark log config 18 | sudo rm /usr/lib/spark/conf/log4j.properties 19 | sudo touch /usr/lib/spark/conf/log4j.properties 20 | sudo bash -c 'cat << EOF > /usr/lib/spark/conf/log4j.properties 21 | # Set everything to be logged to the console 22 | log4j.rootCategory=WARN, console 23 | log4j.appender.console=org.apache.log4j.ConsoleAppender 24 | log4j.appender.console.target=System.err 25 | log4j.appender.console.layout=org.apache.log4j.PatternLayout 26 | log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n 27 | 28 | # Settings to quiet third party logs that are too verbose 29 | log4j.logger.org.spark-project.jetty=WARN 30 | log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR 31 | log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=WARN 32 | log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=WARN 33 | log4j.logger.org.apache.parquet=ERROR 34 | log4j.logger.parquet=ERROR 35 | 36 | # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support 37 | log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL 38 | log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR 39 | EOF' 40 | 41 | # spark default config 42 | sudo rm /usr/lib/spark/conf/spark-defaults.conf 43 | sudo touch /usr/lib/spark/conf/spark-defaults.conf 44 | sudo bash -c 'cat << EOF > /usr/lib/spark/conf/spark-defaults.conf 45 | spark.master spark://$(hostname):7077 46 | spark.eventLog.enabled true 47 | spark.eventLog.dir file:///usr/lib/spark/logs/eventlog 48 | EOF' 49 | sudo mkdir -p /usr/lib/spark/logs/eventlog 50 | sudo chmod -R 777 /usr/lib/spark/logs 51 | 52 | # zeppelin setup 53 | wget http://apache.mivzakim.net/zeppelin/zeppelin-0.6.0/zeppelin-0.6.0-bin-all.tgz -O /tmp/zeppelin-0.6.0.tgz 54 | sudo mkdir -p /usr/lib/zeppelin 55 | sudo tar -xf /tmp/zeppelin-0.6.0.tgz --strip 1 -C /usr/lib/zeppelin 56 | 57 | # zeppelin config 58 | sudo rm /usr/lib/zeppelin/conf/zeppelin-env.sh 59 | sudo touch /usr/lib/zeppelin/conf/zeppelin-env.sh 60 | sudo bash -c 'cat << EOF > /usr/lib/zeppelin/conf/zeppelin-env.sh 61 | #!/bin/bash 62 | export 
JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 63 | export MASTER=spark://$(hostname):7077 64 | export SPARK_HOME=/usr/lib/spark 65 | export ZEPPELIN_PORT=9995 66 | EOF' 67 | 68 | sudo ufw disable 69 | 70 | # start everything up 71 | sudo /usr/lib/spark/sbin/stop-master.sh 72 | sudo /usr/lib/spark/sbin/stop-slave.sh 73 | sudo /usr/lib/spark/sbin/start-master.sh 74 | sudo bash -c '/usr/lib/spark/sbin/start-slave.sh spark://$(hostname):7077' 75 | sudo /usr/lib/zeppelin/bin/zeppelin-daemon.sh restart 76 | -------------------------------------------------------------------------------- /scala/lab3-startups.md: -------------------------------------------------------------------------------- 1 | ### Lab 3: Analyzing Startup Companies 2 | 3 | In this lab, you will analyze a real-world dataset -- information about startup companies. The source of this dataset is [jSONAR](http://jsonstudio.com/resources/). 4 | 5 | ___ 6 | 7 | #### Task 1: Inspecting the Data 8 | 9 | This time, the data is provided as a JSON document, one entry per line. You can find it in `/home/vagrant/data/companies.json`. Take a look at the first entry by using the following command: 10 | 11 | ``` 12 | head -n 1 /home/vagrant/data/companies.json 13 | ``` 14 | 15 | As you can see, the schema is fairly complicated -- it has a bunch of fields, nested objects, arrays, and so on. It describes the company's products, key people, acquisition data, and more. We are going to use Spark SQL to infer the schema of this JSON document, and then issue queries using a natural SQL syntax. 16 | 17 | ___ 18 | 19 | #### Task 2: Parsing the Data 20 | 21 | Create a `DataFrame` from the JSON file so that its schema is automatically inferred, print out the resulting schema, and register it as a temporary table called "companies". 22 | 23 | **Solution**: 24 | 25 | ```scala 26 | val companies = sqlContext.read.json("file:///home/vagrant/data/companies.json") 27 | companies.printSchema() 28 | companies.registerTempTable("companies") 29 | ``` 30 | 31 | ___ 32 | 33 | #### Task 3: Querying the Data 34 | 35 | First, let's talk about the money; figure out what the average acquisition price was. 36 | 37 | **Solution**: 38 | 39 | ```scala 40 | sqlContext.sql("select avg(acquisition.price_amount) from companies").first() 41 | ``` 42 | 43 | Not too shabby. Let's get some additional detail -- print the average acquisition price grouped by number of years the company was active. 44 | 45 | **Solution**: 46 | 47 | ```scala 48 | sqlContext.sql( 49 | """select acquisition.acquired_year-founded_year as years_active, 50 | avg(acquisition.price_amount) as acq_price 51 | from companies 52 | where acquisition.price_amount is not null 53 | group by acquisition.acquired_year-founded_year 54 | order by acq_price desc""").collect() 55 | ``` 56 | 57 | Finally, let's try to figure out the relationship between the company's total funding and acquisition price. In order to do that, you'll need a UDF (user-defined function) that, given a company, returns the sum of all its funding rounds. First, build that function and register it with the name "total_funding". 58 | 59 | **Solution**: 60 | 61 | 62 | ```scala 63 | import org.apache.spark.sql.Row 64 | 65 | sqlContext.udf.register("total_funding", (investments: Seq[Row]) => { 66 | val totals = investments.map(_.getAs[Row]("funding_round").getAs[Long]("raised_amount")) 67 | totals.sum 68 | }) 69 | ``` 70 | 71 | Test your function by retrieving the total funding for a few companies, such as Facebook, Paypal, and Alibaba. 
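For example, a quick sanity check might look like the following sketch (it assumes the dataset exposes a top-level `name` field; the exact spelling of the company names in the data may differ):

```scala
// Sanity check for the UDF -- not a reference solution.
sqlContext.sql(
  """select name, total_funding(investments) as funding
     from companies
     where name in ('Facebook', 'PayPal', 'Alibaba')""").collect().foreach(println)
```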
Now, find the average ratio between the acquisition price and the total funding (which, in a simplistic way, represents return on investment). 72 | 73 | **Solution**: 74 | 75 | ```scala 76 | sqlContext.sql( 77 | """select avg(acquisition.price_amount/total_funding(investments)) 78 | from companies 79 | where acquisition.price_amount is not null 80 | and total_funding(investments) != 0""").collect() 81 | ``` 82 | 83 | ___ 84 | 85 | #### Discussion 86 | 87 | See discussion for the [next lab](lab4-propprices.md). 88 | -------------------------------------------------------------------------------- /python/lab3-startups.md: -------------------------------------------------------------------------------- 1 | ### Lab 3: Analyzing Startup Companies 2 | 3 | In this lab, you will analyze a real-world dataset -- information about startup companies. The source of this dataset is [jSONAR](http://jsonstudio.com/resources/). 4 | 5 | ___ 6 | 7 | #### Task 1: Inspecting the Data 8 | 9 | This time, the data is provided as a JSON document, one entry per line. You can find it in `~/data/companies.json`. Take a look at the first entry by using the following command: 10 | 11 | ``` 12 | head -n 1 ~/data/companies.json 13 | ``` 14 | 15 | As you can see, the schema is fairly complicated -- it has a bunch of fields, nested objects, arrays, and so on. It describes the company's products, key people, acquisition data, and more. We are going to use Spark SQL to infer the schema of this JSON document, and then issue queries using a natural SQL syntax. 16 | 17 | ___ 18 | 19 | #### Task 2: Parsing the Data 20 | 21 | Open a PySpark shell (by running `bin/pyspark` from the Spark installation directory in a terminal window). Note that you have access to a pre-initialized `SQLContext` object named `sqlContext`. 22 | 23 | Create a `DataFrame` from the JSON file so that its schema is automatically inferred, print out the resulting schema, and register it as a temporary table called "companies". 24 | 25 | **Solution**: 26 | 27 | ```python 28 | companies = sqlContext.read.json("file:///home/ubuntu/data/companies.json") 29 | companies.printSchema() 30 | companies.registerTempTable("companies") 31 | ``` 32 | 33 | ___ 34 | 35 | #### Task 3: Querying the Data 36 | 37 | First, let's talk about the money; figure out what the average acquisition price was. 38 | 39 | **Solution**: 40 | 41 | ```python 42 | sqlContext.sql("select avg(acquisition.price_amount) from companies").first() 43 | ``` 44 | 45 | Not too shabby. Let's get some additional detail -- print the average acquisition price grouped by number of years the company was active. 46 | 47 | **Solution**: 48 | 49 | ```python 50 | sqlContext.sql( 51 | """select acquisition.acquired_year-founded_year as years_active, 52 | avg(acquisition.price_amount) as acq_price 53 | from companies 54 | where acquisition.price_amount is not null 55 | group by acquisition.acquired_year-founded_year 56 | order by acq_price desc""").collect() 57 | ``` 58 | 59 | Finally, let's try to figure out the relationship between the company's total funding and acquisition price. In order to do that, you'll need a UDF (user-defined function) that, given a company, returns the sum of all its funding rounds. First, build that function and register it with the name "total_funding". 
60 | 61 | **Solution**: 62 | 63 | ```python 64 | from pyspark.sql.types import IntegerType 65 | 66 | sqlContext.registerFunction("total_funding", lambda investments: sum( 67 | [inv.funding_round.raised_amount or 0 for inv in investments] 68 | ), IntegerType()) 69 | ``` 70 | 71 | Test your function by retrieving the total funding for a few companies, such as Facebook, Paypal, and Alibaba. Now, find the average ratio between the acquisition price and the total funding (which, in a simplistic way, represents return on investment). 72 | 73 | **Solution**: 74 | 75 | ```python 76 | sqlContext.sql( 77 | """select avg(acquisition.price_amount/total_funding(investments)) 78 | from companies 79 | where acquisition.price_amount is not null 80 | and total_funding(investments) != 0""").collect() 81 | ``` 82 | 83 | ___ 84 | 85 | #### Discussion 86 | 87 | See discussion for the [next lab](lab4-propprices.md). 88 | -------------------------------------------------------------------------------- /scala/lab1-wordcount.md: -------------------------------------------------------------------------------- 1 | ### Lab 1: Multi-File Word Count 2 | 3 | In this lab, you will get familiar with Spark and run your first Spark job -- a multi-file word count. 4 | 5 | ___ 6 | 7 | #### Task 1: Inspecting the Spark 8 | 9 | Open a terminal window. Navigate to the directory where you extracted Apache Spark. On the instructor-provided virtual machine, this is `~/spark`. 10 | 11 | Inspect the files in the `bin` directory. You will soon use `spark-shell` to launch your first Spark job. Also note `spark-submit`, which is used to submit standalone Spark programs to a cluster. 12 | 13 | Inspect the scripts in the `sbin` directory. These scripts help with setting up a stand-alone Spark cluster, deploying Spark to EC2 virtual machines, and a bunch of additional tasks. 14 | 15 | Finally, take a look at the `examples` directory. You can find a number of stand-alone demo programs here, covering a variety of Spark APIs. 16 | 17 | ___ 18 | 19 | #### Task 2: Inspecting the Lab Data Files 20 | 21 | In this lab, you will implement a multi-file word count. The texts you will use are freely available books from [Project Gutenberg](http://www.gutenberg.org), including classics such as Lewis Carroll's "Alice in Wonderland" and Jane Austin's "Pride and Prejudice". 22 | 23 | Take a look at some of the text files in the `/home/vagrant/data` directory. From the terminal, run: 24 | 25 | ``` 26 | head -n 50 /home/vagrant/data/*.txt | less 27 | ``` 28 | 29 | This shows the first 50 lines of each file. Press SPACE to scroll, or `q` to exit `less`. 30 | 31 | ___ 32 | 33 | #### Task 3: Implementing a Multi-File Word Count 34 | 35 | Navigate to the Spark installation directory, and run `./bin/spark-shell`. 36 | 37 | In this lab, you are going to use the `sc.textFile` method. 38 | The `textFile` method can work with a directory path or a wildcard filter such as `/home/vagrant/data/*.txt`. 39 | 40 | > Of course, if you are not using the instructor-supplied appliance, your `data` directory might reside in a different location. 41 | 42 | Your first task is to print out the number of lines in all the text files, combined. In general, you should try to come up with the solution yourself, and only then continue reading for the "school" solution. 43 | 44 | **Solution**: 45 | 46 | ```scala 47 | sc.textFile("file:///home/vagrant/data/*.txt").count() 48 | ``` 49 | 50 | Great! Your next task is to implement the actual word-counting program. 
You've already seen one in class, and now it's time for your own. Print the top 10 most frequent words in the provided books. 51 | 52 | **Solution**: 53 | 54 | ```scala 55 | val lines = sc.textFile("file:///home/vagrant/data/*.txt") 56 | val words = lines.flatMap(line => line.split(" ").filter(w => w != null && !w.isEmpty)) 57 | val pairs = words.map(word => (word, 1)) 58 | val freqs = pairs.reduceByKey((a, b) => a + b) 59 | val top10 = freqs.sortBy(_._2, false).take(10) 60 | top10.foreach(println) 61 | ``` 62 | 63 | To be honest, we don't really care about words like "the", "a", and "of". Ideally, we would have a list of stop words to ignore. For now, modify your solution to filter out words shorter than 4 characters. 64 | 65 | Additionally, you might be wondering about the types of all these variables -- most of them are RDDs. To trace the lineage of an RDD, use the `toDebugString` method. For example, `freqs.toDebugString()` should display the logical plan for that RDD's evaluation. We will discuss some of these concepts later. If you have window asking to select modules to include make sure that 2 selected and click OK. 66 | 67 | ___ 68 | 69 | #### Task 4: Run a Stand-Alone Spark Program 70 | 71 | Open Zeppelin at port 9995. This is a scala interpreter with web UI that will be used in the labs. 72 | Create new note: Notebook -> Create new note. 73 | Now, you can copy and paste your solution into the note and run it(shift+Enter) after changing path of the files to "file:///home/data/*.txt" 74 | First lines in the result are transformations(fast computation) and later(top10) taking much more time as it is an action. 75 | ___ 76 | 77 | #### Discussion 78 | 79 | Instead of using `reduceByKey`, you could have used a method called `countByValue`. Read its documentation, and try to understand how it works. Would using it be a good idea? 80 | -------------------------------------------------------------------------------- /scala/lab4-propprices.md: -------------------------------------------------------------------------------- 1 | ### Lab 4: Analyzing UK Property Prices 2 | 3 | In this lab, you will work with another real-world dataset that contains residential property sales across the UK, as reported to the Land Registry. You can download this dataset and many others from [data.gov.uk](https://data.gov.uk/dataset/land-registry-monthly-price-paid-data). 4 | 5 | ___ 6 | 7 | #### Task 1: Inspecting the Data 8 | 9 | As always, we begin by inspecting the data, which is in the `/home/vagrant/data/prop-prices.csv` file. Run the following command to take a look at some of the entries: 10 | 11 | ``` 12 | head /home/vagrant/data/prop-prices.csv 13 | ``` 14 | 15 | Note that this time, the CSV file does not have headers. To determine which fields are available, consult the [guidance page](https://www.gov.uk/guidance/about-the-price-paid-data). 16 | 17 | ___ 18 | 19 | #### Task 2: Importing the Data 20 | 21 | We are going to use the `com.databricks.spark.csv` library to create a `DataFrame` from CSV file. 22 | 23 | First we need to restart Scala interpreter of Zeppelin: 24 | Interpreter -> spark box(first one) -> restart button 25 | Then we need to import `com.databricks.spark.csv` as we did in Lab 2. 26 | 27 | ```scala 28 | %dep 29 | z.reset() 30 | z.load("com.databricks:spark-csv_2.11:1.4.0") 31 | ``` 32 | 33 | After we will define a schema for our data. 
34 | And load the `prop-prices.csv` file as a `DataFrame` and register it as a temporary table so that you can run SQL queries: 35 | 36 | 37 | ```scala 38 | import org.apache.spark.sql.types._ 39 | val custSchema = StructType(Array( 40 | StructField("id",StringType,true), 41 | StructField("price",IntegerType,true), 42 | StructField("date",StringType,true), 43 | StructField("zip",StringType,true), 44 | StructField("type",StringType,true), 45 | StructField("new",StringType,true), 46 | StructField("duration",StringType,true), 47 | StructField("PAON",StringType,true), 48 | StructField("SAON",StringType,true), 49 | StructField("street",StringType,true), 50 | StructField("locality",StringType,true), 51 | StructField("town",StringType,true), 52 | StructField("district",StringType,true), 53 | StructField("county",StringType,true), 54 | StructField("ppd",StringType,true), 55 | StructField("status",StringType,true))) 56 | 57 | val df = sqlContext.read 58 | .format("com.databricks.spark.csv") 59 | .schema(custSchema) 60 | .load("file:///home/vagrant/data/prop-prices.csv") 61 | 62 | df.registerTempTable("properties") 63 | df.persist() 64 | ``` 65 | 66 | ___ 67 | 68 | #### Task 3: Analyzing Property Price Trends 69 | 70 | First, let's do some basic analysis on the data. Find how many records we have per year, and print them out sorted by year. 71 | 72 | **Solution**: 73 | 74 | ```Scala 75 | sqlContext.sql("""select substring(date, 0, 4), count(*) 76 | from properties 77 | group by substring(date, 0, 4) 78 | order by substring(date, 0, 4)""").collect() 79 | ``` 80 | 81 | All right, so everyone knows that properties in London are expensive. Find the average property price by county, and print the top 10 most expensive counties. 82 | 83 | **Solution**: 84 | 85 | ```Scala 86 | sqlContext.sql("""select county, avg(price) 87 | from properties 88 | group by county 89 | order by avg(price) desc 90 | limit 10""").collect() 91 | ``` 92 | 93 | Is there any trend for property sales during the year? Find the average property price in Greater London month over month in 2015 and 2016, and print it out by month. 94 | 95 | **Solution**: 96 | 97 | ```Scala 98 | sqlContext.sql("""select substring(date,0,4) as yr, substring(date,5,2) as mth, avg(price) 99 | from properties 100 | where county='GREATER LONDON' 101 | and substring(date,0,4) >= 2015 102 | group by substring(date,0,4), substring(date,5,2) 103 | order by substring(date,0,4), substring(date,5,2)""").collect() 104 | ``` 105 | 106 | 107 | 108 | Bonus: use the %sql to plot the property price changes month-over-month across the entire dataset. 109 | 110 | **Solution**: 111 | 112 | ```Scala 113 | %sql 114 | select year(date), month(date), avg(price) from properties group by year(date), month(date) order by year(date), month(date) 115 | ``` 116 | Open `settings` and in `Values` put `_c2` field 117 | ___ 118 | 119 | #### Discussion 120 | 121 | Now that you have experience in working with Spark SQL and `DataFrames`, what are the advantages and disadvantages of using it compared to the core RDD functionality (such as `map`, `filter`, `reduceByKey`, and so on)? Consider which approach produces more maintainable code, offers more opportunities for optimization, makes it easier to solve certain problems, and so on. 
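To make the comparison concrete, here is a sketch of the same "average price by county" query from Task 3, expressed with core RDD operations instead of SQL (it assumes the `df` created in Task 2 and is meant only as a point of comparison):

```scala
// Same aggregation as the SQL query above, written against the underlying RDD of Rows.
// Column indices follow custSchema: price is column 1, county is column 13.
val avgByCounty = df.rdd
  .filter(row => !row.isNullAt(1) && !row.isNullAt(13))
  .map(row => (row.getString(13), (row.getInt(1).toLong, 1L)))
  .reduceByKey { case ((sum1, cnt1), (sum2, cnt2)) => (sum1 + sum2, cnt1 + cnt2) }
  .mapValues { case (sum, cnt) => sum.toDouble / cnt }

avgByCounty.sortBy(-_._2).take(10).foreach(println)
```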
122 | -------------------------------------------------------------------------------- /python/lab0-python.md: -------------------------------------------------------------------------------- 1 | ### Lab 0: Python Fundamentals 2 | 3 | The purpose of this lab is to make sure you are sufficiently acquainted with Python to succeed in the rest of the labs. If Python is one of your primary language, this should be smooth sailing; otherwise, please make sure you complete these tasks before moving on to the next labs. 4 | 5 | This lab assumes that you have Python 2.6+ installed on your system. If you're using the instructor-provided appliance, you're all set. Otherwise, please make sure that Python is installed and is in the path, so you can type `python` to launch it from a terminal window. 6 | 7 | > If you're installing Python yourself, please install Python 2.x and not 3.x. Even though everything in these labs is supposed to work just fine with Python 3, a lot of libraries and frameworks still don't support it. 8 | 9 | ___ 10 | 11 | #### Task 1: Experimenting with the Python REPL 12 | 13 | Open a terminal window and run `python`. An interactive prompt similar to the following should appear: 14 | 15 | ``` 16 | Python 2.7.6 (default, Jun 22 2015, 17:58:13) 17 | [GCC 4.8.2] on linux2 18 | Type "help", "copyright", "credits" or "license" for more information. 19 | >>> 20 | ``` 21 | 22 | This is the Python REPL -- Read, Eval, Print Loop environment. Try some basic commands to make sure everything works (do not type the `>>>` prompt): 23 | 24 | ``` 25 | >>> 2 + 2 26 | 4 27 | >>> print("Hello, REPL") 28 | Hello, REPL 29 | >>> exit() 30 | ``` 31 | 32 | Instead of `exit()`, you can also type Ctrl+D to leave the REPL environment. 33 | 34 | ___ 35 | 36 | #### Task 2: Implementing Python Functions 37 | 38 | Create a new file called `functions.py`. Use the following template so that when the file is executed directly, the `run` function will be called: 39 | 40 | ```python 41 | def run(): 42 | print("Hey there!") 43 | 44 | if __name__ == "__main__": 45 | run() 46 | ``` 47 | 48 | > Which editor should you use in the appliance? If you want to get into the spirit of the course, you could use `vim`, but if you're looking for something more user-friendly, use `nano` or the built-in web-based editor. 49 | 50 | To make sure everything's fine so far, run your Python program from a terminal window: 51 | 52 | ``` 53 | python functions.py 54 | ``` 55 | 56 | You should see "Hey there!" printed out. 57 | 58 | Next, implement a function called `wordcount` that takes a list of strings, and produces a dict with the number of times each string appears. Here is an example of its invocation and expected output: 59 | 60 | ```python 61 | print(wordcount(["the", "fox", "jumped", "over", "the", "dog"])) 62 | # Expecting { 'the': 2, 'fox': 1 }, and so on 63 | ``` 64 | 65 | You might find dict's `setdefault` method useful. To find out how it works, run `help(dict.setdefault)` from the Python REPL. Alternatively, to test whether a key is present in a dictionary, use `if key in dict ...`. 66 | 67 | **Solution**: 68 | 69 | ```python 70 | def wordcount(words): 71 | freqs = {} 72 | for word in words: 73 | freqs[word] = freqs.setdefault(word, 0) + 1 74 | return freqs 75 | ``` 76 | 77 | ___ 78 | 79 | #### Task 3: Using Collection Pipelines 80 | 81 | Given a collection of items, the `map`, `filter`, `reduce` and other functions we learned are very useful for transforming the collection into your desired dataset. 
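As a quick refresher, here is how they behave on a small list (in Python 2, `map` and `filter` return plain lists, and `reduce` is a built-in):

```python
# map applies a function to each element, filter keeps matching elements,
# and reduce folds the collection down to a single value.
print(map(lambda n: n + 1, [1, 2, 3]))           # [2, 3, 4]
print(filter(lambda n: n > 1, [1, 2, 3]))        # [2, 3]
print(reduce(lambda acc, n: acc + n, [1, 2, 3])) # 6
```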
Implement the following functions according to the instructions provided, and do not use loops in your implementation: 82 | 83 | * Given a list of numbers, use `filter` to filter out only the even numbers. 84 | 85 | * Given a list of numbers, use `map` to raise each number to the power of 2. 86 | 87 | * Given a list of words, use `reduce` to find the average word length. 88 | 89 | * Use `map` and `reduce` to solve [problem 6](https://projecteuler.net/problem=6) from Project Euler, which states: 90 | 91 | > Find the difference between the sum of the squares of the first one hundred natural numbers and the square of the sum. 92 | 93 | **Solution**: 94 | 95 | ```python 96 | def evens(numbers): 97 | return filter(lambda n: n % 2 == 0, numbers) 98 | 99 | def squares(numbers): 100 | return map(lambda n: n * n, numbers) 101 | 102 | def avg_length(words): 103 | return reduce(lambda sum, word: sum + len(word), words, 0) / \ 104 | float(len(words)) 105 | 106 | def problem6(): 107 | def _sum(numbers): 108 | return reduce(lambda a, b: a + b, numbers) # or use built-in sum() 109 | def square(n): 110 | return n * n 111 | return _sum(squares(xrange(1, 100))) - square(_sum(xrange(1, 100))) 112 | ``` 113 | 114 | ___ 115 | 116 | #### Discussion 117 | 118 | Why do you think Python is so successful in the data science, data analysis, machine learning, and scientific computing fields? 119 | 120 | Compare the solutions above to your favorite programming language (or at least the one you're using in your day job). Do you feel the lack of strong typing makes Python code harder to read or write? 121 | -------------------------------------------------------------------------------- /python/lab5-streaming.md: -------------------------------------------------------------------------------- 1 | ### Lab 5: Social Panic Analysis 2 | 3 | In this lab, you will use Spark Streaming to analyze Twitter statuses for civil unrest and map them by the place they are coming from. This lab is based on Will Farmer's work, ["Twitter Civil Unrest Analysis with Apache Spark"](http://will-farmer.com/twitter-civil-unrest-analysis-with-apache-spark.html). It is a simplified version that doesn't have as many external dependencies. 4 | 5 | > **NOTE**: If you are running this lab on your system (and not the instructor-provided appliance), you will need to install a couple of Python modules in case you don't have them already. Run the following commands from a terminal window: 6 | 7 | ``` 8 | sudo easy_install requests 9 | sudo easy_install requests_oauthlib 10 | ``` 11 | 12 | ___ 13 | 14 | #### Task 1: Creating a Twitter Application and Obtaining Credentials 15 | 16 | Making requests to the [Twitter Streaming API](https://dev.twitter.com/streaming/overview) requires credentials. You will need a Twitter account, and you will need to create a Twitter application and connect it to your account. That's all a lot simpler than it sounds! 17 | 18 | First, navigate to the [Twitter Application Management](https://apps.twitter.com) page. Sign in if necessary. If you do not have a Twitter account, this is the opportunity to create one. 19 | 20 | Next, create a new app. You will be prompted for a name, a description, and a website. Fill in anything you want (the name must be unique, though), accept the developer agreement, and continue. 21 | 22 | Switch to the **Keys and Access Tokens** tab on your new application's page. Copy the **Consumer Key** and **Consumer Secret** to a separate text file (in this order). 
Next, click **Create my access token** to authorize the application to access your own account. Copy the **Access Token** and **Access Token Secret** to the same text file (again, in this order). These four credentials are necessary for making requests to the Twitter Streaming API. 23 | 24 | ___ 25 | 26 | #### Task 2: Inspecting the Analysis Program 27 | 28 | Open the `analysis.py` file from the `~/data` folder in a text editor. This is a Spark Streaming application that connects to the Twitter Streaming API and produces a stream of (up to 50) tweets from England every 60 seconds. These tweets are then analyzed for suspicious words like "riot" and "http", and grouped by the location they are coming from. 29 | 30 | Inspect the source code for the application -- make sure you understand what the various functions do, and how data flows through the application. Most importantly, here is the key analysis piece: 31 | 32 | ```python 33 | stream.map(lambda line: ast.literal_eval(line)) \ 34 | .filter(filter_posts) \ 35 | .map(lambda data: (data[1]['name'], 1)) \ 36 | .reduceByKey(lambda a, b: a + b) \ 37 | .pprint() 38 | ``` 39 | 40 | To make this program work with your credentials, insert the four values you copied in the previous task in the appropriate locations in the source code: 41 | 42 | ```python 43 | auth = requests_oauthlib.OAuth1('API KEY', 'API SECRET', 44 | 'ACCESS TOKEN', 'ACCESS TOKEN SECRET') 45 | ``` 46 | 47 | ___ 48 | 49 | #### Task 3: Looking for Civil Unrest 50 | 51 | You're now ready to run the program and look for civil unrest! From a terminal window, navigate to the Spark installation directory (`~/spark` on the appliance) and run: 52 | 53 | ``` 54 | bin/spark-submit ~/data/analysis.py 55 | ``` 56 | 57 | You should see the obtained statistics printed every 60 seconds. If you aren't getting enough results, modify the keywords the program is looking for, or modify the bounding box to a larger area. 58 | 59 | If anything goes wrong, you should see the Twitter HTTP response details amidst the Spark log stream. For example: 60 | 61 | ``` 62 | https://stream.twitter.com/1.1/statuses/filter.json?language=en&locations=-0.489,51.28,0.236,51.686 63 | Exceeded connection limit for user 64 | ``` 65 | 66 | By the way, while we're at it, it's a good idea to learn how to configure the Spark driver's default log level. Navigate to the `~/spark/conf` directory in a terminal window, and inspect the `log4j.properties.template` file. Copy it to a file called `log4j.properties` (this is the one Spark actually reads), and in a text editor modify the following line to read "WARN" instead of "INFO": 67 | 68 | ``` 69 | log4j.rootCategory=INFO, console 70 | ``` 71 | 72 | Subsequent launches of `pyspark`, `spark-submit`, etc. will use the new log configuration, and print out only messages that have log level WARN or higher. 73 | 74 | ___ 75 | 76 | #### Discussion 77 | 78 | Spark Streaming is not a real-time data processing engine -- it still relies on micro-batches of elements, grouped into RDDs. Is this a serious limitation for our scenario? What are some scenarios in which it can be a serious limitation? 79 | 80 | Bonus reading: the [Apache Flink](https://flink.apache.org) project is an alternative data processing framework that is real-time-first, batch-second. It can be a better fit in some scenarios that require real-time processing with no batching at all. 
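For reference during the discussion, the micro-batch granularity is fixed when the `StreamingContext` is created; a minimal sketch (using the same 60-second interval as `analysis.py`) looks like this:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="unrest-analysis")
ssc = StreamingContext(sc, 60)  # every operation below runs on 60-second micro-batches

# ... build the DStream pipeline here, as in analysis.py ...

ssc.start()
ssc.awaitTermination()
```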
81 | -------------------------------------------------------------------------------- /python/lab4-propprices.md: -------------------------------------------------------------------------------- 1 | ### Lab 4: Analyzing UK Property Prices 2 | 3 | In this lab, you will work with another real-world dataset that contains residential property sales across the UK, as reported to the Land Registry. You can download this dataset and many others from [data.gov.uk](https://data.gov.uk/dataset/land-registry-monthly-price-paid-data). 4 | 5 | ___ 6 | 7 | #### Task 1: Inspecting the Data 8 | 9 | As always, we begin by inspecting the data, which is in the `~/data/prop-prices.csv` file. Run the following command to take a look at some of the entries: 10 | 11 | ``` 12 | head ~/data/prop-prices.csv 13 | ``` 14 | 15 | Note that this time, the CSV file does not have headers. To determine which fields are available, consult the [guidance page](https://www.gov.uk/guidance/about-the-price-paid-data). 16 | 17 | ___ 18 | 19 | #### Task 2: Importing the Data 20 | 21 | In a previous lab, we used the Python `csv` module to parse CSV files. However, because we're working with structured data, the Spark SQL framework can be easier to use and provide better performance. We are going to use the `pyspark_csv` third-party open source module to create a `DataFrame` from an RDD of CSV lines. 22 | 23 | > **NOTE**: The `pyspark_csv.py` file is in the `~/externals` directory on the appliance. You can also [download it yourself](https://github.com/seahboonsiew/pyspark-csv) and place it in some directory. 24 | > 25 | > This module also depends on the `dateutils` module, which typically doesn't ship with Python. It is already installed in the appliance. To install it on your own machine, run the following from a terminal window: 26 | 27 | ``` 28 | sudo easy_install dateutils 29 | ``` 30 | 31 | To import `pyspark_csv`, you'll need the following snippet of code that adds its path to the module search path, and adds it to the Spark executors so they can find it as well: 32 | 33 | ```python 34 | import sys 35 | sys.path.append('/home/ubuntu/externals') # replace as necessary 36 | import pyspark_csv 37 | sc.addFile('/home/ubuntu/externals/pyspark_csv.py') # ditto 38 | ``` 39 | 40 | Next, load the `prop-prices.csv` file as an RDD, and use the `csvToDataFrame` function from the `pyspark_csv` module to create a `DataFrame` and register it as a temporary table so that you can run SQL queries: 41 | 42 | ```python 43 | columns = ['id', 'price', 'date', 'zip', 'type', 'new', 'duration', 'PAON', 44 | 'SAON', 'street', 'locality', 'town', 'district', 'county', 'ppd', 45 | 'status'] 46 | 47 | rdd = sc.textFile("file:///home/ubuntu/data/prop-prices.csv") 48 | df = pyspark_csv.csvToDataFrame(sqlContext, rdd, columns=columns) 49 | df.registerTempTable("properties") 50 | df.persist() 51 | ``` 52 | 53 | ___ 54 | 55 | #### Task 3: Analyzing Property Price Trends 56 | 57 | First, let's do some basic analysis on the data. Find how many records we have per year, and print them out sorted by year. 58 | 59 | **Solution**: 60 | 61 | ```python 62 | sqlContext.sql("""select year(date), count(*) 63 | from properties 64 | group by year(date) 65 | order by year(date)""").collect() 66 | ``` 67 | 68 | All right, so everyone knows that properties in London are expensive. Find the average property price by county, and print the top 10 most expensive counties. 
69 | 70 | **Solution**: 71 | 72 | ```python 73 | sqlContext.sql("""select county, avg(price) 74 | from properties 75 | group by county 76 | order by avg(price) desc 77 | limit 10""").collect() 78 | ``` 79 | 80 | Is there any trend for property sales during the year? Find the average property price in Greater London month over month in 2015 and 2016, and print it out by month. 81 | 82 | **Solution**: 83 | 84 | ```python 85 | sqlContext.sql("""select year(date) as yr, month(date) as mth, avg(price) 86 | from properties 87 | where county='GREATER LONDON' 88 | and year(date) >= 2015 89 | group by year(date), month(date) 90 | order by year(date), month(date)""").collect() 91 | ``` 92 | 93 | Bonus: use the Python `matplotlib` module to plot the property price changes month-over-month across the entire dataset. 94 | 95 | > The `matplotlib` module is installed in the instructor-provided appliance. However, there is no X environment, so you will not be able to view the actual plot. For your own system, follow the [installation instructions](http://matplotlib.org/users/installing.html). 96 | 97 | **Solution**: 98 | 99 | ```python 100 | monthPrices = sqlContext.sql("""select year(date), month(date), avg(price) 101 | from properties 102 | group by year(date), month(date) 103 | order by year(date), month(date)""").collect() 104 | import matplotlib.pyplot as plt 105 | values = map(lambda row: row._c2, monthPrices) 106 | plt.rcdefaults() 107 | plt.scatter(xrange(0,len(values)), values) 108 | plt.show() 109 | ``` 110 | 111 | ___ 112 | 113 | #### Discussion 114 | 115 | Now that you have experience in working with Spark SQL and `DataFrames`, what are the advantages and disadvantages of using it compared to the core RDD functionality (such as `map`, `filter`, `reduceByKey`, and so on)? Consider which approach produces more maintainable code, offers more opportunities for optimization, makes it easier to solve certain problems, and so on. 116 | -------------------------------------------------------------------------------- /python/lab7-plagiarism.md: -------------------------------------------------------------------------------- 1 | ### Lab 7: Plagiarism Detection 2 | 3 | In this lab, you will use Spark's Machine Learning library (MLLib) to perform plagiarism detection -- determine how similar a document is to a collection of existing documents. 4 | 5 | You will use the [TF-IDF](https://en.wikipedia.org/wiki/Tf–idf) algorithm, which extracts numeric features (vectors) from text documents. TF-IDF stands for Term Frequency Inverse Document Frequency, and it is a normalized representation of how frequently a term (word) occurs in a document that belongs to a set of documents: 6 | 7 | * The *TF*[*t*, *D*] -- term frequency of term *t* in a document *D* -- is simply the number of times *t* appears in *D*. 8 | 9 | * The *DF*[*t*] -- document frequency of term *t* in a collection of documents *D*(1), ..., *D*(*n*) -- is the number of documents in which *t* appears. 10 | 11 | * The *TFIDF*[*t*, *D*(*i*)] of term *t* in a document *D*(*i*) in a collection of documents *D*(1), ..., *D*(*n*) is *TF*[*t*, *D*(*i*)] · *log*[(*n* + 1) / (*DF*[*t*, *D*(*i*)] + 1)]. 12 | 13 | These values are not very hard to compute, but when the documents are very large there is a lot of room for optimization. MLLib (the machine learning library that ships with Spark) has optimized versions of these feature extraction algorithms, among many other ML algorithms for clustering, classification, dimensionality reduction, etc. 
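In other words (restating the definitions above in a single formula, with *n* the number of documents and *DF*[*t*] the collection-wide document frequency):

```latex
\mathrm{TFIDF}[t, D_i] = \mathrm{TF}[t, D_i] \cdot \log\frac{n + 1}{\mathrm{DF}[t] + 1}
```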
14 | 15 | The similarity between two documents can be obtained by computing the cosine similarity (normalized dot product) of their TF-IDF vectors. For two documents *D*, *E* with TF-IDF vectors *t*, *s* the cosine similarity is defined as *t* ○ *s* / |*t*| · |*s*| -- note this is a number between 0 and 1, due to normalization. If the cosine similarity is 1, the documents are identical; if the similarity is 0, the documents have nothing in common. 16 | 17 | ___ 18 | 19 | #### Task 1: Inspecting the Data 20 | 21 | In the `~/data/essays` directory you'll find a collection of 1497 essays written by students of English at the University of Uppsala, also known as the [Uppsala Student English Corpus (USE)](http://www.engelska.uu.se/Forskning/engelsk_sprakvetenskap/Forskningsomraden/Electronic_Resource_Projects/USE-Corpus/). Your task will be to determine whether another essay, in the file `~/data/essays/candidate`, has been plagiarized from one of the other essays, or whether it is original work. 22 | 23 | First, let's take a look at some of the files. From a terminal window, execute the following command to inspect the first 10 files: 24 | 25 | ``` 26 | ls ~/data/essays/*.txt | head -n 10 | xargs less 27 | ``` 28 | 29 | In the resulting `less` window, use `:n` to move to the next file, and `q` to quit. As you can see, these are student essays on various topics. Now take a look at the candidate file: 30 | 31 | ``` 32 | less ~/data/essays/candidate 33 | ``` 34 | 35 | ___ 36 | 37 | #### Task 2: Detecting Document Similarity 38 | 39 | First, you need to load the documents to an RDD of word vectors, one per document. Note that the documents are need to be cleaned up so that we indeed produce a vector per document. These will be processed by MLLib to obtain an RDD of TF-IDF vectors. 40 | 41 | ```python 42 | import re 43 | 44 | # An even better cleanup would include stemming, 45 | # careful punctuation removal, etc. 46 | def clean(doc): 47 | return filter(lambda w: len(w) > 2, 48 | map(lambda s: s.lower(), re.split(r'\W+', doc))) 49 | 50 | essays = sc.wholeTextFiles("file:///home/ubuntu/data/essays/*.txt") \ 51 | .mapValues(clean) \ 52 | .cache() 53 | essayNames = essays.map(lambda (filename, contents): filename).collect() 54 | docs = essays.map(lambda (filename, contents): contents) 55 | ``` 56 | 57 | Next, you can compute the TF vectors for all the document vectors using the `HashingTF` algorithm: 58 | 59 | ```python 60 | from pyspark.mllib.feature import HashingTF, IDF 61 | 62 | hashingTF = HashingTF() 63 | tf = hashingTF.transform(docs) 64 | tf.cache() # we will reuse it twice for TF-IDF 65 | ``` 66 | 67 | And now you can find the TF-IDF vectors -- this requires two passes: one to find the IDF vectors and another to scale the terms in the vectors. 68 | 69 | ```python 70 | idf = IDF().fit(tf) 71 | tfidf = idf.transform(tf) 72 | ``` 73 | 74 | Now that you have the TF-IDF vectors for the entire dataset, you can compute the similarity of a new document, `candidate`, to all the existing documents. 
To do so, you need to find that document's TF-IDF vector, and then find the cosine similarity of that vector with all the existing TF-IDF vectors: 75 | 76 | ```python 77 | candidate = clean(open('/home/ubuntu/data/essays/candidate').read()) 78 | candidateTf = hashingTF.transform(candidate) 79 | candidateTfIdf = idf.transform(candidateTf) 80 | similarities = tfidf.map(lambda v: v.dot(candidateTfIdf) / 81 | (v.norm(2) * candidateTfIdf.norm(2))) 82 | ``` 83 | 84 | All that's left is pick the most similar documents and see if there's high similarity: 85 | 86 | ```python 87 | topFive = sorted(enumerate(similarities.collect()), key=lambda (k, v): -v)[0:5] 88 | for idx, val in topFive: 89 | print("doc '%s' has score %.4f" % (essayNames[idx], val)) 90 | ``` 91 | 92 | You can experiment with slight modifications to the text of `candidate` and see if our naive algorithm can still detect its origin. 93 | 94 | ___ 95 | 96 | #### Discussion 97 | 98 | Why did we use `similarities.collect()` to bring the dataset to the driver program and then sort the results? 99 | 100 | Which parts of working with MLLib do you find particularly useful, and which parts seem confusing? 101 | -------------------------------------------------------------------------------- /scala/lab0-scala.md: -------------------------------------------------------------------------------- 1 | In this lab, you will become acquainted with your Spark installation. 2 | 3 | > The instructor should have explained how to install Spark on your machine. One option is to use the instructor's VirtualBox appliance, which you can import in the VirtualBox application. The appliance has Spark 1.6.1 installed, and has all the necessary data files for this and subsequent exercises in the `~/data` directory. 4 | > 5 | > Alternatively, you can install Spark yourself. Download it from [spark.apache.org](http://spark.apache.org/downloads.html) -- make sure to select a prepackaged binary version, such as [Spark 1.6.1 for Hadoop 2.6](http://www.apache.org/dyn/closer.lua/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz). Extract the archive to some location on your system. Then, download the [data files](https://www.dropbox.com/s/un1zr1jg6buoe3a/data.zip?dl=0) for the labs and place them in `~/data`. 6 | > 7 | > **NOTE**: If you install Spark on Windows (not in a virtual machine), many things are going to be more difficult. Ask the instructor for advice if necessary. 8 | 9 | The purpose of this lab is to make sure you are sufficiently acquainted with Scala to succeed in the rest of the labs. 10 | If Scala is one of your primary language, this should be smooth sailing; otherwise, please make sure you complete these 11 | tasks before moving on to the next labs. 12 | 13 | This lab assumes that you have Spark 1.6+ installed on your system. 14 | If you're using the instructor-provided VirtualBox appliance, you're all set. 15 | 16 | 17 | 18 | #### Task 1: Experimenting with the Spark REPL 19 | 20 | Open a terminal window navigate to Spark/bin and run `./spark-shell`. An interactive prompt similar to the following should appear: 21 | 22 | ``` 23 | Welcome to 24 | ____ __ 25 | / __/__ ___ _____/ /__ 26 | _\ \/ _ \/ _ `/ __/ '_/ 27 | /___/ .__/\_,_/_/ /_/\_\ version 1.6.1 28 | /_/ 29 | 30 | Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_71) 31 | 32 | 33 | SQL context available as sqlContext. 34 | 35 | scala> 36 | ``` 37 | 38 | This is the scala REPL -- Read, Eval, Print Loop environment. 
Try some basic commands to make sure everything works: 39 | ``` 40 | scala> 2+2 41 | res0: Int = 4 42 | 43 | scala> println("Hello") 44 | Hello 45 | ``` 46 | ___ 47 | 48 | #### Task 2: Scala basics 49 | 50 | Scala is object-oriented language and everything is an object, including numbers or functions. 51 | The expresion: 1 + 2 * 3 52 | is equivalent to: (1).+((2).*(3)) here we used unary numerical methods. 53 | 54 | Functions are objects. They can be passed into functions as arguments, stored in variable or return them from function. 55 | This is the core of the paradigm called "Functional programming". 56 | 57 | #### Object 58 | ``` 59 | Object ScalaBasics { 60 | def Foo(bar: () => Unit) { 61 | bar() 62 | } 63 | 64 | def Bar() { 65 | println("This is Bar") 66 | } 67 | 68 | def main(args: Array[String]){ 69 | Foo(Bar) 70 | } 71 | } 72 | ``` 73 | 74 | * In first line we see "Object" keyword. This is declaration of class with a single instance (commonly known as singelton object). 75 | * Function parameter declaration in line 2 "() => Unit" translated as: no input parameters and the function returns nothing (like void in C#) 76 | 77 | #### Variables (val vs. var) 78 | 79 | Both vals and vars must be initialized when defined, but only vars can be later reassigned to refer to a different object. Both are evaluated once. 80 | ``` 81 | val x = 3 82 | x: Int = 3 83 | 84 | x = 4 85 | error: reassignment to val 86 | 87 | var y = 5 88 | y: Int = 5 89 | 90 | y = 6 91 | y: Int = 6 92 | ``` 93 | 94 | #### Case Class 95 | 96 | This is regular class that export constuctor parameters and provide a decomposition mechanism via "pattern matching" 97 | ``` 98 | abstract class Employee 99 | case class Worker(name: String, managerName: String) extends Employee 100 | case class Manager(name: String) extends Employee 101 | ``` 102 | 103 | The constuctor parameters can be accessed directly 104 | ``` 105 | val emp1 = Manager("Dan") 106 | emp1.name 107 | res0: String = Dan 108 | 109 | def IsWorkerOrManager(emp: Employee): String = { 110 | val result = emp match { 111 | case Worker(name, _) => { 112 | println("Worker: " + name) 113 | "Worker" 114 | } 115 | case Manager(name) => { 116 | println("Manager: " + name) 117 | "Manager" 118 | } 119 | } 120 | result 121 | } 122 | 123 | IsWorkerOrManager(emp1) 124 | Manager: Dan 125 | res1: String = Manager 126 | ``` 127 | 128 | #### Tuples 129 | 130 | Tuples are collection of items not of the same types, but they are immutable. 131 | ``` 132 | val t = (1, "Hello", 3.0) 133 | t: (Int, String, Double) = (1,Hello,3.0) 134 | ``` 135 | 136 | The access to elemets done by ._ of element. 137 | ``` 138 | scala> println(t._1) 139 | 1 140 | 141 | scala> println(t._2) 142 | Hello 143 | 144 | scala> println(t._3) 145 | 3.0 146 | ``` 147 | 148 | #### Lambda 149 | ``` 150 | def fun1 = (x: Int) => println(x) 151 | fun1(3) 152 | 3 153 | 154 | def f1 = () => "Hello" 155 | f1() 156 | res3: String = Hello 157 | ``` 158 | 159 | #### Using "_" (Underscore) 160 | 161 | In Scala we can replace variables by "_" 162 | ``` 163 | val intList=List(1,2,3,4) 164 | intList.map(_ + 1) is equivalent to following: 165 | intList.map(x => x + 1) 166 | res4: List[Int] = List(2, 3, 4, 5) 167 | 168 | intList.reduce(_ + _) is equivalent to following: 169 | intList.reduce((acc, x) => acc + x) 170 | ``` 171 | 172 | In pattern matching the use of "_" is done when we do not care about the variable. 173 | Review of the match from IsWorkerOrManager function. 174 | 175 | ``` 176 | ... 
177 | emp match { 178 | case Worker(name, _) => { 179 | println("Worker: " + name) 180 | "Worker" 181 | } 182 | ... 183 | } 184 | ... 185 | ``` 186 | 187 | We want to know the name of the worker, but we do not care about the name of manager. 188 | -------------------------------------------------------------------------------- /python/lab1-wordcount.md: -------------------------------------------------------------------------------- 1 | ### Lab 1: Multi-File Word Count 2 | 3 | In this lab, you will become acquainted with your Spark installation, and run your first Spark job -- a multi-file word count. 4 | 5 | > The instructor should have explained how to install Spark on your machine. One option is to use the instructor's appliance, which you can access through any web browser. The appliance has Spark 1.6.2 installed, and has all the necessary data files for this and subsequent exercises in the `~/data` directory. 6 | > 7 | > Alternatively, you can install Spark yourself. Download it from [spark.apache.org](http://spark.apache.org/downloads.html) -- make sure to select a prepackaged binary version, such as [Spark 1.6.1 for Hadoop 2.6](http://www.apache.org/dyn/closer.lua/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz). Extract the archive to some location on your system. Then, download the [data files](../data.zip) for the labs and place them in `~/data`. 8 | > 9 | > **NOTE**: If you install Spark on Windows (not in a virtual machine), many things are going to be more difficult. Ask the instructor for advice if necessary. 10 | 11 | ___ 12 | 13 | #### Task 1: Inspecting the Spark Installation 14 | 15 | Open a terminal window. Navigate to the directory where you extracted Apache Spark. On the instructor-provided virtual machine, this is `~/spark`. 16 | 17 | Inspect the files in the `bin` directory. You will soon use `pyspark` to launch your first Spark job. Also note `spark-submit`, which is used to submit standalone Spark programs to a cluster. 18 | 19 | Inspect the scripts in the `sbin` directory. These scripts help with setting up a stand-alone Spark cluster, deploying Spark to EC2 virtual machines, and a bunch of additional tasks. 20 | 21 | Finally, take a look at the `examples` directory. You can find a number of stand-alone demo programs here, covering a variety of Spark APIs. 22 | 23 | ___ 24 | 25 | #### Task 2: Inspecting the Lab Data Files 26 | 27 | In this lab, you will implement a multi-file word count. The texts you will use are freely available books from [Project Gutenberg](http://www.gutenberg.org), including classics such as Lewis Carroll's "Alice in Wonderland" and Jane Austin's "Pride and Prejudice". 28 | 29 | Take a look at some of the text files in the `~/data` directory. From the terminal, run: 30 | 31 | ``` 32 | head -n 50 ~/data/*.txt | less 33 | ``` 34 | 35 | This shows the first 50 lines of each file. Press SPACE to scroll, or `q` to exit `less`. 36 | 37 | ___ 38 | 39 | #### Task 3: Implementing a Multi-File Word Count 40 | 41 | Navigate to the Spark installation directory, and run `bin/pyspark`. After a few seconds, you should see an interactive Python shell, which has a pre-initialized `SparkContext` object called `sc`. 42 | 43 | ``` 44 | Welcome to 45 | ____ __ 46 | / __/__ ___ _____/ /__ 47 | _\ \/ _ \/ _ `/ __/ '_/ 48 | /__ / .__/\_,_/_/ /_/\_\ version 1.6.1 49 | /_/ 50 | 51 | Using Python version 2.7.6 (default, Jun 22 2015 17:58:13) 52 | SparkContext available as sc, HiveContext available as sqlContext. 
53 | >>> 54 | ``` 55 | 56 | To explore the available methods, run the following command: 57 | 58 | ```python 59 | dir(sc) 60 | ``` 61 | 62 | In this lab, you are going to use the `sc.textFile` method. To figure out what it does, run the following command: 63 | 64 | ```python 65 | help(sc.textFile) 66 | ``` 67 | 68 | Note that even though it's not mentioned in the short documentation snippet you just read, the `textFile` method can also work with a directory path or a wildcard filter such as `/home/ubuntu/data/*.txt`. 69 | 70 | > Of course, if you are not using the instructor-supplied appliance, your `data` directory might reside in a different location. 71 | 72 | Your first task is to print out the number of lines in all the text files, combined. In general, you should try to come up with the solution yourself, and only then continue reading for the "school" solution. 73 | 74 | **Solution**: 75 | 76 | ```python 77 | sc.textFile("/home/ubuntu/data/*.txt").count() 78 | ``` 79 | 80 | Great! Your next task is to implement the actual word-counting program. You've already seen one in class, and now it's time for your own. Print the top 10 most frequent words in the provided books. 81 | 82 | **Solution**: 83 | 84 | ```python 85 | lines = sc.textFile("/home/ubuntu/data/*.txt") 86 | words = lines.flatMap(lambda line: line.split()) 87 | pairs = words.map(lambda word: (word, 1)) 88 | freqs = pairs.reduceByKey(lambda a, b: a + b) 89 | top10 = freqs.sortBy(lambda (word, count): -count).take(10) 90 | for (word, count) in top10: 91 | print("the word '%s' appears %d times" % (word, count)) 92 | ``` 93 | 94 | To be honest, we don't really care about words like "the", "a", and "of". Ideally, we would have a list of stop words to ignore. For now, modify your solution to filter out words shorter than 4 characters. 95 | 96 | Additionally, you might be wondering about the types of all these variables -- most of them are RDDs. To trace the lineage of an RDD, use the `toDebugString` method. For example, `print(freqs.toDebugString())` should display the logical plan for that RDD's evaluation. We will discuss some of these concepts later. 97 | 98 | ___ 99 | 100 | #### Task 4: Run a Stand-Alone Spark Program 101 | 102 | You're now ready to convert your multi-file word count into a stand-alone Spark program. Create a new file called `wordcount.py`. 103 | 104 | Initialize a `SparkContext` as follows: 105 | 106 | ```python 107 | from pyspark import SparkContext 108 | 109 | def run(): 110 | sc = SparkContext() 111 | # TODO Your code goes here 112 | 113 | if __name__ == "__main__": 114 | run() 115 | ``` 116 | 117 | Now, you can copy and paste your solution in the `run` method. Congratulations -- you have a stand-alone Spark program! To run it, navigate back to the Spark installation directory in your terminal, and run the following command: 118 | 119 | ``` 120 | bin/spark-submit --master 'local[*]' path/to/wordcount.py 121 | ``` 122 | 123 | You should replace `path/to/wordcount.py` with the actual path on your system. If everything went fine, you should see a lot of diagnostic output, but somewhere buried in it would be your top 10 words. 124 | 125 | ___ 126 | 127 | #### Discussion 128 | 129 | Instead of using `reduceByKey`, you could have used a method called `countByValue`. Read its documentation, and try to understand how it works. Would using it be a good idea? 
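To ground the comparison, here is a sketch of the same top-10 computation using `countByValue` instead (note that it returns an ordinary Python dictionary on the driver rather than an RDD, which is the heart of the trade-off):

```python
lines = sc.textFile("/home/ubuntu/data/*.txt")
words = lines.flatMap(lambda line: line.split())

# countByValue() ships every distinct word and its count back to the driver,
# so the sorting below happens in plain Python, not in Spark.
counts = words.countByValue()
top10 = sorted(counts.items(), key=lambda kv: -kv[1])[:10]
for word, count in top10:
    print("the word '%s' appears %d times" % (word, count))
```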
130 | -------------------------------------------------------------------------------- /scala/lab6-pagerank.md: -------------------------------------------------------------------------------- 1 | ### Lab 6: Movie PageRank 2 | 3 | In this lab, you will run the [PageRank](https://en.wikipedia.org/wiki/PageRank) algorithm on a dataset of movie references, and try to identify the most popular movies based on how many references they have. The dataset you'll be working with is [provided by IMDB](http://www.imdb.com/interfaces). 4 | 5 | ___ 6 | 7 | #### Task 1: Inspecting the Data 8 | 9 | The original IMDB dataset is not very friendly for automatic processing. You can find it in the `~/data` folder of the VirtualBox appliance, or download it yourself from the IMDB FTP website -- it's the `movie-links.list` dataset. Here's a sampler: 10 | 11 | ``` 12 | "#1 Single" (2006) 13 | (referenced in "Howard Stern on Demand" (2005) {Lisa Loeb & Sister}) 14 | 15 | "#LawstinWoods" (2013) {The Arrival (#1.1)} 16 | (references "Lost" (2004)) 17 | (references Kenny Rogers and Dolly Parton: Together (1985) (TV)) 18 | (references The Grudge (2004)) 19 | (references The Ring (2002)) 20 | ``` 21 | 22 | Instead of using this raw dataset, there's a pre-processed one available in the `processed-movie-links.txt` file (it doesn't contain all the information from the first one, but we can live with that). Again, here's a sample: 23 | 24 | ``` 25 | $ head processed-movie-links.txt 26 | #LawstinWoods --> Lost 27 | #LawstinWoods --> Kenny Rogers and Dolly Parton: Together 28 | #LawstinWoods --> The Grudge 29 | #LawstinWoods --> The Ring 30 | #MonologueWars --> Trainspotting 31 | Community --> $#*! My Dad Says 32 | Conan --> $#*! My Dad Says 33 | Geeks Who Drink --> $#*! My Dad Says 34 | Late Show with David Letterman --> $#*! My Dad Says 35 | ``` 36 | 37 | ___ 38 | 39 | #### Task 2: Finding Top Movies 40 | 41 | Now it's time to implement the PageRank algorithm. It's probably the most challenging task so far, so here are some instructions that might help. 42 | 43 | > **NOTE**: This is a very naive implementation of PageRank, which doesn't really try to optimize and minimize data shuffling. The GraphX library, which is also part of Spark, has a [native implementation of PageRank](https://spark.apache.org/docs/1.1.0/graphx-programming-guide.html#pagerank). You can try it in Task 3. 44 | 45 | Begin by parsing the movie references to an RDD called `links` (using `SparkContext.textFile` and `map`) and processing it into key/value pairs where the key is the movie and the value is a list of all movies referenced by it. 46 | 47 | Next, create an RDD called `ranks` of key/value pairs where the key is the movie and the value is its rank, set to 1.0 initially for all movies. 48 | 49 | Next, write a function `computeContribs` that takes a list of referenced movies and the referencing movie's rank, and returns a list of key/value pairs where the key is the movie and the value is its rank contribution. Each of the referenced movies gets an equal portion of the referencing movie's rank. For example, if "Star Wars" currently has rank 1.0 and references "Wizard of Oz" and "Star Trek", then the function should return two pairs: `("Wizard of Oz", 0.5)` and `("Star Trek", 0.5)`. 50 | 51 | Next, we're getting to the heart of the algorithm. In a loop that repeats 10 times, compute a new RDD called `contribs` which is formed by joining `links` and `ranks` (the join is on the movie name). 
Use `flatMap` to collect the results from `computeContribs` on each key/value pair in the result of the join. To understand what we're doing, consider that joining `links` and `ranks` produces a pair RDD whose elements look like this: 
52 | 
53 | ```scala
54 | ("Star Wars", ("Wizard of Oz", "Star Trek", 0.8))
55 | ```
56 | 
57 | Now, invoking `computeContribs` on the value of this pair produces a list of pairs:
58 | 
59 | ```scala
60 | Array(("Wizard of Oz", 0.4), ("Star Trek", 0.4))
61 | ```
62 | 
63 | By applying `computeContribs` and collecting the results with `flatMap`, we get a pair RDD that has, for each movie, its contribution from each of its neighbors. You should now sum (reduce) this pair RDD by key, so we get the sum of each movie's contributions from its neighbors.
64 | 
65 | Next, the PageRank algorithm dictates that we should recompute each movie's rank from the `ranks` RDD as 0.15 + 0.85 times its neighbors' contribution (you can use `mapValues` for this). This recomputation produces a new input value for `ranks`.
66 | 
67 | Finally, when your loop is done, display the 10 highest-ranked movies and their PageRank.
68 | 
69 | **Solution**:
70 | 
71 | ```scala
72 | // links is RDD of (movie, [referenced movies])
73 | val links = sc.textFile("file:///home/vagrant/data/processed-movie-links.txt")
74 |     .map(line => line.split("-->"))
75 |     .map(x => (x(0).trim, x(1).trim))
76 |     .distinct()
77 |     .groupByKey()
78 |     .cache()
79 | 
80 | // ranks is RDD of (movie, 1.0)
81 | var ranks = links.map(movie => (movie._1, 1.0))
82 | 
83 | // each of our references gets a contribution of our rank divided by the
84 | // total number of our references
85 | def computeContribs(referenced: Array[String], rank: Double) = {
86 |   val count = referenced.length
87 |   referenced.map(movie => (movie, rank / count))
88 | }
89 | 
90 | for (a <- 1 to 10)
91 | {
92 |   // recompute each movie's contributions from its referencing movies
93 |   val contribs = links.join(ranks).flatMap(x => computeContribs(x._2._1.toArray, x._2._2))
94 | 
95 |   // recompute the movie's ranks by accounting for all its referencing
96 |   // movies' contributions
97 |   ranks = contribs.reduceByKey(_ + _)
98 |     .mapValues(rank => rank*0.85 + 0.15)
99 | }
100 | 
101 | 
102 | ranks.sortBy(x => -1*x._2).take(10).foreach(println)
103 | ```
104 | 
105 | ___
106 | 
107 | #### Task 3: GraphX PageRank
108 | 
109 | The PageRank algorithm we implemented in the previous task is not very efficient. For example, running it on our dataset for 100 iterations took approximately 15 minutes on a 4-core machine. Considering that there are "just" about 25,000 movies ranked, this is not a very good result.
110 | 
111 | Spark ships with a native graph algorithm library called GraphX. Unfortunately, it doesn't yet have a Python binding -- you can only use it from Scala and Java. But we're not going to let that stop us!
112 | 
113 | Navigate to the Spark installation directory (`~/spark` in the VirtualBox appliance) and run `bin/spark-shell`. This is the Spark Scala REPL, which is very similar to PySpark, except it uses Scala. 
First, you're going to need a couple of import statements: 114 | 115 | ```scala 116 | import org.apache.spark._ 117 | import org.apache.spark.graphx._ 118 | import org.apache.spark.graphx.lib._ 119 | ``` 120 | 121 | Next, load the graph edges from the supplied `~/data/movie-edges.txt` file: 122 | 123 | ```scala 124 | val graph = GraphLoader.edgeListFile(sc, 125 | "file:///home/vagrant/data/movie-edges.txt") 126 | ``` 127 | 128 | This file was generated from the same dataset, but it has a format that GraphX natively supports. You can check out the format by running the following commands: 129 | 130 | ``` 131 | $ head ~/data/movie-edges.txt 132 | 0 1 133 | 2 3 134 | 2 4 135 | 2 5 136 | 2 6 137 | 7 8 138 | 9 10 139 | 11 10 140 | 12 10 141 | 13 10 142 | $ head ~/data/movie-vertices.txt 143 | 0 Howard Stern on Demand 144 | 1 #1 Single 145 | 2 #LawstinWoods 146 | 3 Lost 147 | 4 Kenny Rogers and Dolly Parton: Together 148 | 5 The Grudge 149 | 6 The Ring 150 | 7 #MonologueWars 151 | 8 Trainspotting 152 | 9 Community 153 | ``` 154 | 155 | That's it -- we can run PageRank. Instead of working with a set number of iterations, the PageRank implementation in GraphX can run until the ranks converge (stop changing). We'll set the tolerance threshold to 0.0001, which means we're waiting for convergence up to that threshold. This computation took just under 2 minutes on the same machine! 156 | 157 | ```scala 158 | val pageRank = PageRank.runUntilConvergence(graph, 0.0001).vertices.map( 159 | p => (p._1.toInt, p._2)).cache() 160 | ``` 161 | 162 | > The resulting graph vertices are pairs of the vertex id and its rank. We use `toInt` to convert it to an int for the subsequent join operation. 163 | 164 | Next, load the vertices file that specifies the movie title for each id: 165 | 166 | ```scala 167 | val titles = sc.textFile("file:///home/vagrant/data/movie-vertices.txt").map( 168 | line => { 169 | val parts = line.split(" ") 170 | (parts(0).toInt, parts.drop(1).mkString(" ")) 171 | } 172 | ) 173 | ``` 174 | 175 | Finally, join the ranks and the titles and sort the result to print the top 10 movies by rank: 176 | 177 | ```scala 178 | titles.join(pageRank).sortBy(-_._2._2).map(_._2).take(10) 179 | ``` 180 | 181 | ___ 182 | 183 | #### Discussion 184 | 185 | Besides being easier to use than implementing your own algorithms, why do you think GraphX has potential for being faster than something you'd roll by hand? 186 | -------------------------------------------------------------------------------- /python/lab6-pagerank.md: -------------------------------------------------------------------------------- 1 | ### Lab 6: Movie PageRank 2 | 3 | In this lab, you will run the [PageRank](https://en.wikipedia.org/wiki/PageRank) algorithm on a dataset of movie references, and try to identify the most popular movies based on how many references they have. The dataset you'll be working with is [provided by IMDB](http://www.imdb.com/interfaces). 4 | 5 | ___ 6 | 7 | #### Task 1: Inspecting the Data 8 | 9 | The original IMDB dataset is not very friendly for automatic processing. You can find it in the `~/data` folder of the VirtualBox appliance, or download it yourself from the IMDB FTP website -- it's the `movie-links.list` dataset. 
Here's a sampler: 10 | 11 | ``` 12 | "#1 Single" (2006) 13 | (referenced in "Howard Stern on Demand" (2005) {Lisa Loeb & Sister}) 14 | 15 | "#LawstinWoods" (2013) {The Arrival (#1.1)} 16 | (references "Lost" (2004)) 17 | (references Kenny Rogers and Dolly Parton: Together (1985) (TV)) 18 | (references The Grudge (2004)) 19 | (references The Ring (2002)) 20 | ``` 21 | 22 | Instead of using this raw dataset, there's a pre-processed one available in the `processed-movie-links.txt` file (it doesn't contain all the information from the first one, but we can live with that). Again, here's a sample: 23 | 24 | ``` 25 | $ head processed-movie-links.txt 26 | #LawstinWoods --> Lost 27 | #LawstinWoods --> Kenny Rogers and Dolly Parton: Together 28 | #LawstinWoods --> The Grudge 29 | #LawstinWoods --> The Ring 30 | #MonologueWars --> Trainspotting 31 | Community --> $#*! My Dad Says 32 | Conan --> $#*! My Dad Says 33 | Geeks Who Drink --> $#*! My Dad Says 34 | Late Show with David Letterman --> $#*! My Dad Says 35 | ``` 36 | 37 | ___ 38 | 39 | #### Task 2: Finding Top Movies 40 | 41 | Now it's time to implement the PageRank algorithm. It's probably the most challenging task so far, so here are some instructions that might help. 42 | 43 | > **NOTE**: This is a very naive implementation of PageRank, which doesn't really try to optimize and minimize data shuffling. The GraphX library, which is also part of Spark, has a [native implementation of PageRank](https://spark.apache.org/docs/1.1.0/graphx-programming-guide.html#pagerank). You can try it in Task 3. 44 | 45 | Begin by parsing the movie references to an RDD called `links` (using `SparkContext.textFile` and `map`) and processing it into key/value pairs where the key is the movie and the value is a list of all movies referenced by it. 46 | 47 | Next, create an RDD called `ranks` of key/value pairs where the key is the movie and the value is its rank, set to 1.0 initially for all movies. 48 | 49 | Next, write a function `computeContribs` that takes a list of referenced movies and the referencing movie's rank, and returns a list of key/value pairs where the key is the movie and the value is its rank contribution. Each of the referenced movies gets an equal portion of the referencing movie's rank. For example, if "Star Wars" currently has rank 1.0 and references "Wizard of Oz" and "Star Trek", then the function should return two pairs: `("Wizard of Oz", 0.5)` and `("Star Trek", 0.5)`. 50 | 51 | Next, we're getting to the heart of the algorithm. In a loop that repeats 10 times, compute a new RDD called `contribs` which is formed by joining `links` and `ranks` (the join is on the movie name). Use `flatMap` to collect the results from `computeContribs` on each key/value pair in the result of the join. To understand what we're doing, consider that joining `links` and `ranks` produces a pair RDD whose elements look like this: 52 | 53 | ```python 54 | ("Star Wars", ("Wizard of Oz", "Star Trek", 0.8)) 55 | ``` 56 | 57 | Now, invoking `computeContribs` on the value of this pair produces a list of pairs: 58 | 59 | ```python 60 | [("Wizard of Oz", 0.4), ("Star Trek", 0.4)] 61 | ``` 62 | 63 | By applying `computeContribs` and collecting the results with `flatMap`, we get a pair RDD that has, for each movie, its contribution from each of its neighbors. You should now sum (reduce) this pair RDD by key, so we get the sum of each movie's contributions from its neighbors. 
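If the reduce step feels abstract, here is a tiny sketch with made-up contribution pairs (not the real dataset) showing what summing by key produces; the actual loop appears in the solution below.

```python
# Hypothetical contributions produced by flatMap over computeContribs:
contribs = sc.parallelize([
    ("Wizard of Oz", 0.4), ("Star Trek", 0.4),   # contributed by "Star Wars"
    ("Wizard of Oz", 0.25),                      # contributed by some other movie
])
sums = contribs.reduceByKey(lambda a, b: a + b)
print(sums.collect())  # e.g. [('Star Trek', 0.4), ('Wizard of Oz', 0.65)] -- order may vary
```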
64 | 65 | Next, the PageRank algorithm dictates that we should recompute each movie's rank from the `ranks` RDD as 0.15 + 0.85 times its neighbors' contribution (you can use `mapValues` for this). This recomputation produces a new input value for `ranks`. 66 | 67 | Finally, when your loop is done, display the 10 highest-ranked movies and their PageRank. 68 | 69 | **Solution**: 70 | 71 | ```python 72 | # links is RDD of (movie, [referenced movies]) 73 | links = sc.textFile("file:///home/ubuntu/data/processed-movie-links.txt") \ 74 | .map(lambda line: line.split("-->")) \ 75 | .map(lambda (a, b): (a.strip(), b.strip())) \ 76 | .distinct() \ 77 | .groupByKey() \ 78 | .cache() 79 | 80 | # ranks is RDD of (movie, 1.0) 81 | ranks = links.map(lambda (movie, _): (movie, 1.0)) 82 | 83 | # each of our references gets a contribution of our rank divided by the 84 | # total number of our references 85 | def computeContribs(referenced, rank): 86 | count = len(referenced) 87 | for movie in referenced: 88 | yield (movie, rank / count) 89 | 90 | for _ in range(0, 10): 91 | # recompute each movie's contributions from its referencing movies 92 | contribs = links.join(ranks).flatMap(lambda (_, (referenced, rank)): 93 | computeContribs(referenced, rank) 94 | ) 95 | # recompute the movie's ranks by accounting all its referencing 96 | # movies' contributions 97 | ranks = contribs.reduceByKey(lambda a, b: a + b) \ 98 | .mapValues(lambda rank: rank*0.85 + 0.15) 99 | 100 | for movie, rank in ranks.sortBy(lambda (_, rank): -rank).take(10): 101 | print('"%s" has rank %2.2f' % (movie, rank)) 102 | ``` 103 | 104 | ___ 105 | 106 | #### Task 3: GraphX PageRank 107 | 108 | The PageRank algorithm we implemented in the previous task is not very efficient. For example, running it on our dataset for 100 iterations took approximately 15 minutes on a 4-core machine. Considering that there are "just" about 25,000 movies ranked, this is not a very good result. 109 | 110 | Spark ships with a native graph algorithm library called GraphX. Unfortunately, it doesn't yet have a Python binding -- you can only use it from Scala and Java. But we're not going to let that stop us! 111 | 112 | Navigate to the Spark installation directory (`~/spark` in the appliance) and run `bin/spark-shell`. This is the Spark Scala REPL, which is very similar to PySpark, except it uses Scala. First, you're going to need a couple of import statements: 113 | 114 | ```scala 115 | import org.apache.spark._ 116 | import org.apache.spark.graphx._ 117 | import org.apache.spark.graphx.lib._ 118 | ``` 119 | 120 | Next, load the graph edges from the supplied `~/data/movie-edges.txt` file: 121 | 122 | ```scala 123 | val graph = GraphLoader.edgeListFile(sc, 124 | "file:///home/ubuntu/data/movie-edges.txt") 125 | ``` 126 | 127 | This file was generated from the same dataset, but it has a format that GraphX natively supports. You can check out the format by running the following commands: 128 | 129 | ``` 130 | $ head ~/data/movie-edges.txt 131 | 0 1 132 | 2 3 133 | 2 4 134 | 2 5 135 | 2 6 136 | 7 8 137 | 9 10 138 | 11 10 139 | 12 10 140 | 13 10 141 | $ head ~/data/movie-vertices.txt 142 | 0 Howard Stern on Demand 143 | 1 #1 Single 144 | 2 #LawstinWoods 145 | 3 Lost 146 | 4 Kenny Rogers and Dolly Parton: Together 147 | 5 The Grudge 148 | 6 The Ring 149 | 7 #MonologueWars 150 | 8 Trainspotting 151 | 9 Community 152 | ``` 153 | 154 | That's it -- we can run PageRank. 
Instead of working with a set number of iterations, the PageRank implementation in GraphX can run until the ranks converge (stop changing). We'll set the tolerance threshold to 0.0001, which means we're waiting for convergence up to that threshold. This computation took just under 2 minutes on the same machine! 155 | 156 | ```scala 157 | val pageRank = PageRank.runUntilConvergence(graph, 0.0001).vertices.map( 158 | p => (p._1.toInt, p._2)).cache() 159 | ``` 160 | 161 | > The resulting graph vertices are pairs of the vertex id and its rank. We use `toInt` to convert it to an int for the subsequent join operation. 162 | 163 | Next, load the vertices file that specifies the movie title for each id: 164 | 165 | ```scala 166 | val titles = sc.textFile("file:///home/ubuntu/data/movie-vertices.txt").map( 167 | line => { 168 | val parts = line.split(" "); 169 | (parts(0).toInt, parts.drop(1).mkString(" ")) 170 | } 171 | ) 172 | ``` 173 | 174 | Finally, join the ranks and the titles and sort the result to print the top 10 movies by rank: 175 | 176 | ```scala 177 | titles.join(pageRank).sortBy(-_._2._2).map(_._2).take(10) 178 | ``` 179 | 180 | ___ 181 | 182 | #### Discussion 183 | 184 | Besides being easier to use than implementing your own algorithms, why do you think GraphX has potential for being faster than something you'd roll by hand? 185 | -------------------------------------------------------------------------------- /scala/lab2-airlines.md: -------------------------------------------------------------------------------- 1 | ### Lab 2: Flight Delay Analysis 2 | 3 | In this lab, you will analyze a real-world dataset -- information about US flight delays in January 2016, courtesy of the United States Department of Transportation. You can [download additional datasets](http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time) later. Here's another example you might find interesting -- [US border crossing/entry data per port of entry](http://transborder.bts.gov/programs/international/transborder/TBDR_BC/TBDR_BCQ.html). 4 | 5 | ___ 6 | 7 | #### Task 1: Inspecting the Data 8 | 9 | This dataset ships with two files (in the `/home/vagrant/data` directory, if you are using the instructor-provided VirtualBox appliance). First, the `airline-format.html` file contains a brief description of the dataset, and the various data fields. For example, the `ArrDelay` field is the flight's arrival delay, in minutes. Second, the `airline-delays.csv` file is a comma-separated collection of flight records, one record per line. 10 | 11 | Inspect the fields described in the `airline-format.html` file. Make a note of fields that describe the flight, its origin and destination airports, and any delays encountered on departure and arrival. 12 | 13 | Let's start by counting the number of records in our dataset. Run the following command in a terminal window: 14 | 15 | ``` 16 | wc -l airline-delays.csv 17 | ``` 18 | 19 | This dataset has hundreds of thousands of records. 
To sample 10 records from the dataset picked at probability 0.005%, run the following command (for convenience, its output is also quoted here): 
20 | 
21 | ```
22 | $ cat airline-delays.csv | cut -d',' -f1-20 | awk '{ if (rand() <= 0.00005 || FNR==1) { print $0; if (++count > 11) exit; } }'
23 | "Year","Quarter","Month","DayofMonth","DayOfWeek","FlightDate","UniqueCarrier","AirlineID","Carrier","TailNum","FlightNum","OriginAirportID","OriginAirportSeqID","OriginCityMarketID","Origin","OriginCityName","OriginState","OriginStateFips","OriginStateName","OriginWac"
24 | 2016,1,1,20,3,2016-01-20,"AA",19805,"AA","N3AFAA","242",14771,1477102,32457,"SFO","San Francisco, CA","CA","06","California"
25 | 2016,1,1,9,6,2016-01-09,"AA",19805,"AA","N859AA","284",12173,1217302,32134,"HNL","Honolulu, HI","HI","15","Hawaii"
26 | 2016,1,1,9,6,2016-01-09,"AA",19805,"AA","N3GRAA","1227",11278,1127803,30852,"DCA","Washington, DC","VA","51","Virginia"
27 | 2016,1,1,4,1,2016-01-04,"AA",19805,"AA","N3BGAA","1450",11298,1129804,30194,"DFW","Dallas/Fort Worth, TX","TX","48","Texas"
28 | 2016,1,1,5,2,2016-01-05,"AA",19805,"AA","N3AMAA","1616",11298,1129804,30194,"DFW","Dallas/Fort Worth, TX","TX","48","Texas"
29 | 2016,1,1,20,3,2016-01-20,"AA",19805,"AA","N916US","1783",11057,1105703,31057,"CLT","Charlotte, NC","NC","37","North Carolina"
30 | 2016,1,1,2,6,2016-01-02,"AS",19930,"AS","N517AS","879",14747,1474703,30559,"SEA","Seattle, WA","WA","53","Washington"
31 | 2016,1,1,20,3,2016-01-20,"AS",19930,"AS","N769AS","568",14057,1405702,34057,"PDX","Portland, OR","OR","41","Oregon"
32 | 2016,1,1,24,7,2016-01-24,"UA",19977,"UA","","706",14843,1484304,34819,"SJU","San Juan, PR","PR","72","Puerto Rico"
33 | 2016,1,1,15,5,2016-01-15,"UA",19977,"UA","N34460","1077",12266,1226603,31453,"IAH","Houston, TX","TX","48","Texas"
34 | 2016,1,1,12,2,2016-01-12,"UA",19977,"UA","N423UA","1253",13303,1330303,32467,"MIA","Miami, FL","FL","12","Florida"
35 | ```
36 | 
37 | This displays the first 20 fields of the 10 sampled records from the file. The first line is a header line, so we printed it unconditionally. This is a typical example of structured data that we would have to parse first before analyzing it with Spark.
38 | 
39 | > We could examine the full dataset using shell commands, because it is not exceptionally big. For larger datasets that couldn't conceivably be processed or even stored on a single machine, we could have used Spark itself to perform the sampling. If you're interested, examine the `takeSample` method that Spark RDDs provide.
40 | 
41 | ___
42 | 
43 | #### Task 2: Parsing CSV Data
44 | 
45 | Next, you have to parse the CSV data. The header line provides the column names, and then each subsequent line can be parsed taking these into account. Spark has no built-in functionality for parsing CSV files, but there is a library, `com.databricks.spark.csv`, that you can use to parse CSV lines.
46 | 
47 | You must run the following code before running any other Spark code in your notebook.
48 | Otherwise, restart the Zeppelin Spark interpreter first: Interpreter -> spark (the first box) -> restart button.
49 | Now go back to your note and run:
50 | 
51 | ```scala
52 | %dep
53 | z.reset()
54 | z.load("com.databricks:spark-csv_2.11:1.4.0")
55 | ```
56 | 
57 | Next, create a DataFrame based on the `airline-delays.csv` file using this library.
58 | Note that you have access to a pre-initialized `SQLContext` object named `sqlContext`. 
59 | 
60 | ```scala
61 | val flightsDF = sqlContext.read
62 |     .format("com.databricks.spark.csv")
63 |     .option("header", "true") // Use first line of all files as header
64 |     .option("inferSchema", "true") // Automatically infer data types
65 |     .load("file:///home/vagrant/data/airline-delays.csv")
66 | ```
67 | 
68 | You can check the schema by printing it:
69 | ```scala
70 | flightsDF.printSchema
71 | ```
72 | 
73 | ___
74 | 
75 | #### Task 3: Converting the DataFrame to an RDD
76 | 
77 | In this lab we want to practice RDD operations; later in the workshop we will use DataFrames to manipulate data.
78 | First, create a case class `Flight` containing only the fields you will use in this lab: Carrier, OriginCityName, ArrDelay, DestCityName,
79 | and Distance.
80 | 
81 | Then create a `flightRdd` RDD from `flightsDF`.
82 | 
83 | **Solution**:
84 | 
85 | ```scala
86 | case class Flight(Carrier: String, OriginCityName: String, ArrDelay: Double, DestCityName: String, Distance: Double)
87 | val flightRdd = flightsDF.map(row => Flight(row.getAs("Carrier"), row.getAs("OriginCityName"), row.getAs("ArrDelay"), row.getAs("DestCityName"), row.getAs("Distance")))
88 | ```
89 | 
90 | ___
91 | 
92 | #### Task 4: Querying Flights and Delays
93 | 
94 | Now that you have the flight objects, it's time to perform a few queries and gather some useful information. Suppose you're in Boston, MA. Which airline has the most flights departing from Boston?
95 | 
96 | 
97 | **Solution**:
98 | 
99 | ```scala
100 | val carriersFromBoston = flightRdd.filter(f => f.OriginCityName == "Boston, MA").map(f => (f.Carrier, 1))
101 | val carrierWithMostFlights = carriersFromBoston.reduceByKey(_ + _).sortBy(_._2, false).take(1)
102 | ```
103 | 
104 | 
105 | Overall, which airline has the worst average delay? How bad was that delay?
106 | 
107 | > **HINT**: Use `combineByKey`.
108 | 
109 | 
110 | **Solution**:
111 | 
112 | ```scala
113 | val avgDelay = flightRdd.filter(f => f.ArrDelay > 0)
114 |     .map(f => (f.Carrier, f.ArrDelay))
115 |     .combineByKey(d => (d, 1),
116 |       (s: (Double, Int), d: Double) => (s._1 + d, s._2 + 1),
117 |       (s1: (Double, Int), s2: (Double, Int)) => (s1._1 + s2._1, s1._2 + s2._2)
118 |     )
119 | 
120 | val worstAirline = avgDelay.map{ case (car, (av, cnt)) => (car, av/cnt) }
121 | worstAirline.collect()
122 | ```
123 | 
124 | 
125 | Living in Chicago, IL, what are the farthest 10 destinations that you could fly to? (Note that our dataset contains only US domestic flights.)
126 | 
127 | **Solution**:
128 | 
129 | ```scala
130 | val chicagoFarthest = flightRdd.filter(f => f.OriginCityName == "Chicago, IL")
131 |     .map(f => (f.DestCityName, f.Distance))
132 |     .distinct()
133 |     .sortBy(_._2, false)
134 |     .take(10)
135 | ```
136 | 
137 | 
138 | Suppose you're in New York, NY and are contemplating direct flights to San Francisco, CA. In terms of arrival delay, which airline has the best record on that route?
139 | 
140 | **Solution**:
141 | 
142 | ```scala
143 | val nyToSF = flightRdd.filter(f => (f.OriginCityName == "New York, NY") && (f.DestCityName == "San Francisco, CA") && (f.ArrDelay > 0))
144 |     .map(f => (f.Carrier, f.ArrDelay))
145 |     .reduceByKey(_ + _)
146 |     .sortBy(_._2)
147 |     .take(1)
148 | ```
149 | 
150 | 
151 | Suppose you live in San Jose, CA, and there don't seem to be many direct flights taking you to Boston, MA. Of all the 1-stop flights, which would be the best option in terms of average arrival delay? (It's OK to assume that every pair of flights from San Jose to X and from X to Boston is an option that you could use.) 
152 | 153 | > **NOTE**: To answer this question, you will probably need a cartesian product of the dataset with itself. Beside the fact that it's a fairly expensive operation, we haven't learned about multi-RDD operations yet. Still, you can explore the `join` RDD method, which applies to pair (key-value) RDDs, discussed later in our workshop. 154 | 155 | **Solution**: 156 | 157 | ```scala 158 | val flightsByDst = flightRdd.filter(f => f.OriginCityName == "San Jose, CA") 159 | .map(f => (f.DestCityName, f)) 160 | 161 | val flightsByOrg = flightRdd.filter(f => f.DestCityName == "Boston, MA") 162 | .map(f => (f.OriginCityName, f)) 163 | 164 | def addDelays(f1:Flight, f2:Flight) = { 165 | var total = 0.0 166 | total += f1.ArrDelay 167 | total += f2.ArrDelay 168 | total 169 | } 170 | 171 | flightsByDst.join(flightsByOrg) 172 | .map{ case (city, (f1, f2)) => (city, addDelays(f1, f2)) } 173 | .combineByKey(d => (d, 1), 174 | (s: (Double, Int), d: Double) => (s._1 + d, s._2 + 1), 175 | (s1: (Double, Int), s2: (Double, Int)) => (s1._1 + s2._1, s1._2 + s2._2)) 176 | .map { case (city, s) => (city, s._1/s._2) } 177 | .sortBy(_._2) 178 | .take(1) 179 | ``` 180 | 181 | ___ 182 | 183 | ### Discussion 184 | 185 | Suppose you had to calculate multiple aggregated values from the `flights` RDD -- e.g., the average arrival delay, the average departure delay, and the average flight duration for flights from Boston. How would you express it using SQL, if `flights` was a table in a relational database? How would you express it using transformations and actions on RDDs? Which is easier to develop and maintain? 186 | 187 | 188 | 189 | -------------------------------------------------------------------------------- /python/lab2-airlines.md: -------------------------------------------------------------------------------- 1 | ### Lab 2: Flight Delay Analysis 2 | 3 | In this lab, you will analyze a real-world dataset -- information about US flight delays in January 2016, courtesy of the United States Department of Transportation. You can [download additional datasets](http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time) later. Here's another example you might find interesting -- [US border crossing/entry data per port of entry](http://transborder.bts.gov/programs/international/transborder/TBDR_BC/TBDR_BCQ.html). 4 | 5 | ___ 6 | 7 | #### Task 1: Inspecting the Data 8 | 9 | This dataset ships with two files (in the `~/data` directory, if you are using the instructor-provided appliance). First, the `airline-format.html` file contains a brief description of the dataset, and the various data fields. For example, the `ArrDelay` field is the flight's arrival delay, in minutes. Second, the `airline-delays.csv` file is a comma-separated collection of flight records, one record per line. 10 | 11 | Inspect the fields described in the `airline-format.html` file. Make a note of fields that describe the flight, its origin and destination airports, and any delays encountered on departure and arrival. 12 | 13 | Let's start by counting the number of records in our dataset. Run the following command in a terminal window: 14 | 15 | ``` 16 | wc -l airline-delays.csv 17 | ``` 18 | 19 | This dataset has hundreds of thousands of records. 
To sample 10 records from the dataset picked at probability 0.005%, run the following command (for convenience, its output is also quoted here): 20 | 21 | ``` 22 | $ cat airline-delays.csv | cut -d',' -f1-20 | awk '{ if (rand() <= 0.00005 || FNR==1) { print $0; if (++count > 11) exit; } }' 23 | "Year","Quarter","Month","DayofMonth","DayOfWeek","FlightDate","UniqueCarrier","AirlineID","Carrier","TailNum","FlightNum","OriginAirportID","OriginAirportSeqID","OriginCityMarketID","Origin","OriginCityName","OriginState","OriginStateFips","OriginStateName","OriginWac" 24 | 2016,1,1,20,3,2016-01-20,"AA",19805,"AA","N3AFAA","242",14771,1477102,32457,"SFO","San Francisco, CA","CA","06","California" 25 | 2016,1,1,9,6,2016-01-09,"AA",19805,"AA","N859AA","284",12173,1217302,32134,"HNL","Honolulu, HI","HI","15","Hawaii" 26 | 2016,1,1,9,6,2016-01-09,"AA",19805,"AA","N3GRAA","1227",11278,1127803,30852,"DCA","Washington, DC","VA","51","Virginia" 27 | 2016,1,1,4,1,2016-01-04,"AA",19805,"AA","N3BGAA","1450",11298,1129804,30194,"DFW","Dallas/Fort Worth, TX","TX","48","Texas" 28 | 2016,1,1,5,2,2016-01-05,"AA",19805,"AA","N3AMAA","1616",11298,1129804,30194,"DFW","Dallas/Fort Worth, TX","TX","48","Texas" 29 | 2016,1,1,20,3,2016-01-20,"AA",19805,"AA","N916US","1783",11057,1105703,31057,"CLT","Charlotte, NC","NC","37","North Carolina" 30 | 2016,1,1,2,6,2016-01-02,"AS",19930,"AS","N517AS","879",14747,1474703,30559,"SEA","Seattle, WA","WA","53","Washington" 31 | 2016,1,1,20,3,2016-01-20,"AS",19930,"AS","N769AS","568",14057,1405702,34057,"PDX","Portland, OR","OR","41","Oregon" 32 | 2016,1,1,24,7,2016-01-24,"UA",19977,"UA","","706",14843,1484304,34819,"SJU","San Juan, PR","PR","72","Puerto Rico" 33 | 2016,1,1,15,5,2016-01-15,"UA",19977,"UA","N34460","1077",12266,1226603,31453,"IAH","Houston, TX","TX","48","Texas" 34 | 2016,1,1,12,2,2016-01-12,"UA",19977,"UA","N423UA","1253",13303,1330303,32467,"MIA","Miami, FL","FL","12","Florida" 35 | ``` 36 | 37 | This displays the first 20 fields of the 10 sampled records from the file. The first line is a header line, so we printed it unconditionally. This is a typical example of structured data that we would have to parse first before analyzing it with Spark. 38 | 39 | > We could examine the full dataset using shell commands, because it is not exceptionally big. For larger datasets that couldn't conceivably be processed or even stored on a single machine, we could have used Spark itself to perform the sampling. If you're interested, examine the `takeSample` method that Spark RDDs provide. 40 | 41 | ___ 42 | 43 | #### Task 2: Parsing CSV Data 44 | 45 | Next, you have to parse the CSV data. The header line provides the column names, and then each subsequent line can be parsed taking these into account. Python has a built-in `csv` module (unrelated to Spark) that you can use to parse CSV lines. Try it out in a Python shell (by running `python` in a terminal window) or the Pyspark shell (by running `bin/pyspark` from Spark's installation directory in a terminal window): 46 | 47 | ```python 48 | import csv 49 | from StringIO import StringIO 50 | 51 | si = StringIO('"Alice",14,"panda"') 52 | fields = ["name", "age", "favorite animal"] 53 | csv.DictReader(si, fieldnames=fields).next() 54 | ``` 55 | 56 | Great! Next, write a function that parses one line from the flight delays CSV file. You can call that function `parseLine`, and it should return the Python dict that `DictReader.next` returns. 
57 | 58 | **Solution**: 59 | 60 | ```python 61 | def parseLine(line, fieldnames): 62 | si = StringIO(line) 63 | return csv.DictReader(si, fieldnames=fieldnames).next() 64 | ``` 65 | 66 | Next, create an RDD based on the `airline-delays.csv` file, and map each line of that file using the `parseLine` function you wrote. The result should be an RDD of Python dicts representing the flight delay data. Note that the first line (the header line) should be discarded. 67 | 68 | **Solution**: 69 | 70 | ```python 71 | rdd = sc.textFile("file:////home/ubuntu/data/airline-delays.csv") 72 | headerline = rdd.first() 73 | fieldnames = filter(lambda field: len(field) > 0, 74 | map(lambda field: field.strip('"'), headerline.split(','))) 75 | flights = rdd.filter(lambda line: line != headerline) \ 76 | .map(lambda line: parseLine(line, fieldnames)) 77 | flights.persist() 78 | ``` 79 | 80 | ___ 81 | 82 | #### Task 3: Querying Flights and Delays 83 | 84 | Now that you have the flight objects, it's time to perform a few queries and gather some useful information. Suppose you're in Boston, MA. Which airline has the most flights departing from Boston? 85 | 86 | **Solution**: 87 | 88 | ```python 89 | flightsByCarrier = flights.filter( 90 | lambda flight: flight['OriginCityName'] == "Boston, MA") \ 91 | .map(lambda flight: flight['Carrier']) \ 92 | .countByValue() 93 | sorted(flightsByCarrier.items(), key=lambda p: -p[1])[0] 94 | ``` 95 | 96 | Overall, which airline has the worst average delay? How bad was that delay? 97 | 98 | > **HINT**: Use `combineByKey`. 99 | 100 | **Solution**: 101 | 102 | ```python 103 | flights.filter(lambda f: f['ArrDelay'] != '') \ 104 | .map(lambda f: (f['Carrier'], float(f['ArrDelay']))) \ 105 | .combineByKey(lambda d: (d, 1), 106 | lambda s, d: (s[0]+d, s[1]+1), 107 | lambda s1, s2: (s1[0]+s2[0], s1[1]+s2[1])) \ 108 | .map(lambda (k, (s, c)): (k, s/float(c))) \ 109 | .collect() 110 | ``` 111 | 112 | Living in Chicago, IL, what are the farthest 10 destinations that you could fly to? (Note that our dataset contains only US domestic flights.) 113 | 114 | **Solution**: 115 | 116 | ```python 117 | flights.filter(lambda f: f['OriginCityName'] == "Chicago, IL") \ 118 | .map(lambda f: (f['DestCityName'], float(f['Distance']))) \ 119 | .distinct() \ 120 | .sortBy(lambda (dest, dist): -dist) \ 121 | .take(10) 122 | ``` 123 | 124 | Suppose you're in New York, NY and are contemplating direct flights to San Francisco, CA. In terms of arrival delay, which airline has the best record on that route? 125 | 126 | **Solution**: 127 | 128 | ```python 129 | flights.filter(lambda flight: flight['OriginCityName'] == "New York, NY" and 130 | flight['DestCityName'] == "San Francisco, CA" and 131 | flight['ArrDelay'] != '') \ 132 | .map(lambda flight: (flight['Carrier'], float(flight['ArrDelay']))) \ 133 | .reduceByKey(lambda a, b: a + b) \ 134 | .sortBy(lambda (carrier, delay): delay) \ 135 | .first() 136 | ``` 137 | 138 | Suppose you live in San Jose, CA, and there don't seem to be many direct flights taking you to Boston, MA. Of all the 1-stop flights, which would be the best option in terms of average arrival delay? (It's OK to assume that every pair of flights from San Jose to X and from X to Boston is an option that you could use.) 139 | 140 | > **NOTE**: To answer this question, you will probably need a cartesian product of the dataset with itself. Beside the fact that it's a fairly expensive operation, we haven't learned about multi-RDD operations yet. 
Still, you can explore the `join` RDD method, which applies to pair (key-value) RDDs, discussed later in our workshop. 141 | 142 | **Solution**: 143 | 144 | ```python 145 | flightsByDst = flights.filter(lambda f: f['OriginCityName'] == 'San Jose, CA')\ 146 | .map(lambda f: (f['DestCityName'], f)) 147 | flightsByOrg = flights.filter(lambda f: f['DestCityName'] == 'Boston, MA') \ 148 | .map(lambda f: (f['OriginCityName'], f)) 149 | 150 | def addDelays(f1, f2): 151 | total = 0 152 | total += float(f1['ArrDelay']) if f1['ArrDelay'] != '' else 0 153 | total += float(f2['ArrDelay']) if f2['ArrDelay'] != '' else 0 154 | return total 155 | 156 | flightsByDst.join(flightsByOrg) \ 157 | .map(lambda (city, (f1, f2)): (city, addDelays(f1, f2))) \ 158 | .combineByKey(lambda d: (d, 1), 159 | lambda s, d: (s[0]+d, s[1]+1), 160 | lambda s1, s2: (s1[0]+s2[0], s1[1]+s2[1])) \ 161 | .map(lambda (city, s): (city, s[0]/float(s[1]))) \ 162 | .sortBy(lambda (city, delay): delay) \ 163 | .first() 164 | ``` 165 | 166 | ___ 167 | 168 | ### Discussion 169 | 170 | Suppose you had to calculate multiple aggregated values from the `flights` RDD -- e.g., the average arrival delay, the average departure delay, and the average flight duration for flights from Boston. How would you express it using SQL, if `flights` was a table in a relational database? How would you express it using transformations and actions on RDDs? Which is easier to develop and maintain? 171 | --------------------------------------------------------------------------------