├── .gitattributes
├── .gitignore
├── A Gentle Introduction to Apache Spark on Databricks.ipynb
├── Chicago_Airlines_Demo.ipynb
├── Lecture2s.pdf
├── README.md
├── cs105_lab1a_spark_tutorial.ipynb
├── cs105_lab1b_word_count.ipynb
├── cs105_lab2_apache_log.ipynb
├── cs110_lab1_power_plant_ml_pipeline.html
├── cs110_lab1_power_plant_ml_pipeline.ipynb
├── cs110_lab2_als_prediction.ipynb
├── cs120_lab1a_math_review.ipynb
├── cs120_lab1b_word_count_rdd.ipynb
├── cs120_lab2_linear_regression_df.ipynb
├── cs120_lab3_ctr_df.ipynb
└── cs120_lab4_pca.ipynb


/.gitattributes:
--------------------------------------------------------------------------------
 1 | # Auto detect text files and perform LF normalization
 2 | * text=auto
 3 | 
 4 | # Custom for Visual Studio
 5 | *.cs     diff=csharp
 6 | 
 7 | # Standard to msysgit
 8 | *.doc	 diff=astextplain
 9 | *.DOC	 diff=astextplain
10 | *.docx diff=astextplain
11 | *.DOCX diff=astextplain
12 | *.dot  diff=astextplain
13 | *.DOT  diff=astextplain
14 | *.pdf  diff=astextplain
15 | *.PDF	 diff=astextplain
16 | *.rtf	 diff=astextplain
17 | *.RTF	 diff=astextplain
18 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
 1 | # Windows image file caches
 2 | Thumbs.db
 3 | ehthumbs.db
 4 | 
 5 | # Folder config file
 6 | Desktop.ini
 7 | 
 8 | # Recycle Bin used on file shares
 9 | $RECYCLE.BIN/
10 | 
11 | # Windows Installer files
12 | *.cab
13 | *.msi
14 | *.msm
15 | *.msp
16 | 
17 | # Windows shortcuts
18 | *.lnk
19 | 
20 | # =========================
21 | # Operating System Files
22 | # =========================
23 | 
24 | # OSX
25 | # =========================
26 | 
27 | .DS_Store
28 | .AppleDouble
29 | .LSOverride
30 | 
31 | # Thumbnails
32 | ._*
33 | 
34 | # Files that might appear in the root of a volume
35 | .DocumentRevisions-V100
36 | .fseventsd
37 | .Spotlight-V100
38 | .TemporaryItems
39 | .Trashes
40 | .VolumeIcon.icns
41 | 
42 | # Directories potentially created on remote AFP share
43 | .AppleDB
44 | .AppleDesktop
45 | Network Trash Folder
46 | Temporary Items
47 | .apdisk
48 | 


--------------------------------------------------------------------------------
/A Gentle Introduction to Apache Spark on Databricks.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"cell_type":"markdown","source":["# A Gentle Introduction to Apache Spark on Databricks\n\n** Welcome to Databricks! **\n\nThis notebook is intended to be the first step in your process to learn more about how to best use Apache Spark on Databricks together. We'll be walking through the core concepts, the fundamental abstractions, and the tools at your disposal. This notebook will teach the fundamental concepts and best practices directly from those that have written Apache Spark and know it best.\n\nFirst, it's worth defining Databricks. Databricks is a managed platform for running Apache Spark - that means that you do not have to learn complex cluster management concepts nor perform tedious maintenance tasks to take advantage of Spark. Databricks also provides a host of features to help its users be more productive with Spark. It's a point and click platform for those that prefer a user interface like data scientists or data analysts. However, this UI is accompanied by a sophisticated API for those that want to automate aspects of their data workloads with automated jobs. To meet the needs of enterprises, Databricks also includes features such as role-based access control and other intelligent optimizations that not only improve usability for users but also reduce costs and complexity for administrators.\n\n** The Gentle Introduction Series **\n\nThis notebook is a part of a series of notebooks aimed to get you up to speed with the basics of Apache Spark quickly. This notebook is best suited for those that have very little or no experience with Spark. The series also serves as a strong review for those that have some experience with Spark but aren't as familiar with some of the more sophisticated tools like UDF creation and machine learning pipelines. The other notebooks in this series are:\n\n- [A Gentle Introduction to Apache Spark on Databricks](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/346304/2168141618055043/484361/latest.html)\n- [Apache Spark on Databricks for Data Scientists](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/346304/2168141618055194/484361/latest.html)\n- [Apache Spark on Databricks for Data Engineers](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/346304/2168141618055109/484361/latest.html)\n\n## Databricks Terminology\n\nDatabricks has key concepts that are worth understanding. You'll notice that many of these line up with the links and icons that you'll see on the left side. These together define the fundamental tools that Databricks provides to you as an end user. They are available both in the web application UI as well as the REST API.\n\n-   ****Workspaces****\n    -   Workspaces allow you to organize all the work that you are doing on Databricks. Like a folder structure in your computer, it allows you to save ****notebooks**** and ****libraries**** and share them with other users. Workspaces are not connected to data and should not be used to store data. They're simply for you to store the ****notebooks**** and ****libraries**** that you use to operate on and manipulate your data with.\n-   ****Notebooks****\n    -   Notebooks are a set of any number of cells that allow you to execute commands. Cells hold code in any of the following languages: `Scala`, `Python`, `R`, `SQL`, or `Markdown`. Notebooks have a default language, but each cell can have a language override to another language. This is done by including `%[language name]` at the top of the cell. For instance `%python`. We'll see this feature shortly.\n    -   Notebooks need to be connected to a ****cluster**** in order to be able to execute commands however they are not permanently tied to a cluster. This allows notebooks to be shared via the web or downloaded onto your local machine.\n    -   Here is a demonstration video of [Notebooks](http://www.youtube.com/embed/MXI0F8zfKGI).\n    -   ****Dashboards****\n        -   ****Dashboards**** can be created from ****notebooks**** as a way of displaying the output of cells without the code that generates them. \n    - ****Notebooks**** can also be scheduled as ****jobs**** in one click either to run a data pipeline, update a machine learning model, or update a dashboard.\n-   ****Libraries****\n    -   Libraries are packages or modules that provide additional functionality that you need to solve your business problems. These may be custom written Scala or Java jars; python eggs or custom written packages. You can write and upload these manually or you may install them directly via package management utilities like pypi or maven.\n-   ****Tables****\n    -   Tables are structured data that you and your team will use for analysis. Tables can exist in several places. Tables can be stored on Amazon S3, they can be stored on the cluster that you're currently using, or they can be cached in memory. [For more about tables see the documentation](https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#02%20Product%20Overview/07%20Tables.html).\n-   ****Clusters****\n    -   Clusters are groups of computers that you treat as a single computer. In Databricks, this means that you can effectively treat 20 computers as you might treat one computer. Clusters allow you to execute code from ****notebooks**** or ****libraries**** on set of data. That data may be raw data located on S3 or structured data that you uploaded as a ****table**** to the cluster you are working on. \n    - It is important to note that clusters have access controls to control who has access to each cluster.\n    -   Here is a demonstration video of [Clusters](http://www.youtube.com/embed/2-imke2vDs8).\n-   ****Jobs****\n    -   Jobs are the tool by which you can schedule execution to occur either on an already existing ****cluster**** or a cluster of its own. These can be ****notebooks**** as well as jars or python scripts. They can be created either manually or via the REST API.\n    -   Here is a demonstration video of [Jobs](<http://www.youtube.com/embed/srI9yNOAbU0).\n-   ****Apps****\n    -   Apps are third party integrations with the Databricks platform. These include applications like Tableau.\n\n## Databricks and Apache Spark Help Resources\n\nDatabricks comes with a variety of tools to help you learn how to use Databricks and Apache Spark effectively. Databricks holds the greatest collection of Apache Spark documentation available anywhere on the web. There are two fundamental sets of resources that we make available: resources to help you learn how to use Apache Spark and Databricks and resources that you can refer to if you already know the basics.\n\nTo access these resources at any time, click the question mark button at the top right-hand corner. This search menu will search all of the below sections of the documentation.\n\n![img](http://training.databricks.com/databricks_guide/gentle_introduction/help_menu.png)\n\n-   ****The Databricks Guide****\n    -   The Databricks Guide is the definitive reference for you and your team once you've become accustomed to using and leveraging Apache Spark. It allows for quick reference of common Databricks and Spark APIs with snippets of sample code.\n    -   The Guide also includes a series of tutorials (including this one!) that provide a more guided introduction to a given topic.\n-   ****The Spark APIs****\n    -   Databricks makes it easy to search the Apache Spark APIs directly. Simply use the search that is available at the top right and it will automatically display API results as well.\n-   ****The Apache Spark Documentation****\n    -   The Apache Spark open source documentation is also made available for quick and simple search if you need to dive deeper into some of the internals of Apache Spark.\n-   ****Databricks Forums****\n    -   [The Databricks Forums](https://forums.databricks.com/) are a community resource for those that have specific use case questions or questions that they cannot see answered in the guide or the documentation.\n    \n[Databricks also provides professional and enterprise level technical support](http://go.databricks.com/contact-databricks) for companies and enterprises looking to take their Apache Spark deployments to the next level.\n\n## Databricks and Apache Spark Abstractions\n\nNow that we've defined the terminology and more learning resources - let's go through a basic introduction of Apache Spark and Databricks. While you're likely familiar with the concept of Spark, let's take a moment to ensure that we all share the same definitions and give you the opportunity to learn a bit about Spark's history.\n\n### The Apache Spark project's History\n\nSpark was originally written by the founders of Databricks during their time at UC Berkeley. The Spark project started in 2009, was open sourced in 2010, and in 2013 its code was donated to Apache, becoming Apache Spark. The employees of Databricks have written over 75% of the code in Apache Spark and have contributed more than 10 times more code than any other organization. Apache Spark is a sophisticated distributed computation framework for executing code in parallel across many different machines. While the abstractions and interfaces are simple, managing clusters of computers and ensuring production-level stability is not. Databricks makes big data simple by providing Apache Spark as a hosted solution.\n\n### The Contexts/Environments\n\nLet's now tour the core abstractions in Apache Spark to ensure that you'll be comfortable with all the pieces that you're going to need to understand in order to understand how to use Databricks and Spark effectively.\n\nHistorically, Apache Spark has had two core contexts that are available to the user. The `sparkContext` made available as `sc` and the `SQLContext` made available as `sqlContext`, these contexts make a variety of functions and information available to the user. The `sqlContext` makes a lot of DataFrame functionality available while the `sparkContext` focuses more on the Apache Spark engine itself.\n\nHowever in Apache Spark 2.X, there is just one context - the `SparkSession`.\n\n### The Data Interfaces\n\nThere are several key interfaces that you should understand when you go to use Spark.\n\n-   ****The Dataset****\n    -   The Dataset is Apache Spark's newest distributed collection and can be considered a combination of DataFrames and RDDs. It provides the typed interface that is available in RDDs while providing a lot of conveniences of DataFrames. It will be the core abstraction going forward.\n-   ****The DataFrame****\n    -   The DataFrame is collection of distributed `Row` types. These provide a flexible interface and are similar in concept to the DataFrames you may be familiar with in python (pandas) as well as in the R language.\n-   ****The RDD (Resilient Distributed Dataset)****\n    -   Apache Spark's first abstraction was the RDD or Resilient Distributed Dataset. Essentially it is an interface to a sequence of data objects that consist of one or more types that are located across a variety of machines in a cluster. RDD's can be created in a variety of ways and are the \"lowest level\" API available to the user. While this is the original data structure made available, new users should focus on Datasets as those will be supersets of the current RDD functionality.\n\n\n# Getting Started with Some Code!\n\nWhew, that's a lot to cover thus far! But we've made it to the demonstration so we can see the power of Apache Spark and Databricks together. To do this you can do one of several things. First, and probably simplest, is that you can copy this notebook into your own environment via the `Import Notebook` button that is available at the top right or top left of this page. If you'd rather type all of the commands yourself, you can create a new notebook and type the commands as we proceed.\n\n## Creating a Cluster\n\n*If you're in the Community Edition of Databricks, this will all happen automatically once you start running cells in a notebook! However you're free to follow the below directions if you wish.*\n\nClick the Clusters button that you'll notice on the left side of the page. On the Clusters page, click on ![img](http://training.databricks.com/databricks_guide/create_cluster.png) in the upper left corner.\n\nThen, enter the configuration for the new cluster:\n\n![img](http://training.databricks.com/databricks_guide/create_cluster_2.13v1.png)\n\nFinally, \n\n-   Select a unique name for the cluster.\n-   Select the Spark Version.\n    -   Optionally, you can test out experimental versions of Spark.\n-   Enter the number of workers to bring up - at least 1 is required to run Spark commands.\n-   Select whether to use On-Demand Instances, Spot Instances, or a combination of both.\n\nTo read more about some of the other options that are available to users please see the [Databricks Guide on Clusters](https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#02%20Product%20Overview/01%20Clusters.html)."],"metadata":{}},{"cell_type":"markdown","source":["first let's explore the previously mentioned `SparkSession`. We can access it via the `spark` variable. As explained, the Spark Session is the core location for where Apache Spark related information is stored. For Spark 1.X the variables are `sqlContext` and `sc`.\n\nCells can be executed by hitting `shift+enter` while the cell is selected."],"metadata":{}},{"cell_type":"code","source":["sqlContext\nsc"],"metadata":{},"outputs":[],"execution_count":3},{"cell_type":"code","source":["## If you're on 2.X the spark session is made available with the variable below\n#####\n\n# spark"],"metadata":{},"outputs":[],"execution_count":4},{"cell_type":"markdown","source":["We can use the Spark Context to access information but we can also use it to parallelize a collection as well. Here we'll parallelize a small python range that will provide a return type of `DataFrame`."],"metadata":{}},{"cell_type":"code","source":["firstDataFrame = sqlContext.range(1000000)\n\n# The code for 2.X is\n# spark.range(1000000)\nprint firstDataFrame"],"metadata":{},"outputs":[],"execution_count":6},{"cell_type":"markdown","source":["Now one might think that this would actually print out the values of the `DataFrame` that we just parallelized, however that's not quite how APache Spark works. Spark allows two distinct kinds of operations by the user. There are **transformations** and there are **actions**.\n\n### Transformations\n\nTransformations are operations that will not be completed at the time you write and execute the code in a cell - they will only get executed once you have called a **action**. An example of a transformation might be to convert an integer into a float or to filter a set of values.\n\n### Actions\n\nActions are commands that are computed by Spark right at the time of their execution. They consist of running all of the previous transformations in order to get back an actual result. An action is composed of one or more jobs which consists of tasks that will be executed by the workers in parallel where possible\n\nHere are some simple examples of transformations and actions. Remember, these **are not all** the transformations and actions - this is just a short sample of them. We'll get to why Apache Spark is designed this way shortly!\n\n![transformations and actions](http://training.databricks.com/databricks_guide/gentle_introduction/trans_and_actions.png)"],"metadata":{}},{"cell_type":"code","source":["# An example of a transformation\n# select the ID column values and multiply them by 2\nsecondDataFrame = firstDataFrame.selectExpr(\"(id * 2) as value\")"],"metadata":{},"outputs":[],"execution_count":8},{"cell_type":"code","source":["# an example of an action\n# take the first 5 values that we have in our firstDataFrame\nprint firstDataFrame.take(5)\n# take the first 5 values that we have in our secondDataFrame\nprint secondDataFrame.take(5)"],"metadata":{},"outputs":[],"execution_count":9},{"cell_type":"markdown","source":["Now we've seen that Spark consists of actions and transformations. Let's talk about why that's the case. The reason for this is that it gives a simple way to optimize the entire pipeline of computations as opposed to the individual pieces. This makes it exceptionally fast for certain types of computation because it can perform all relevant computations at once. Technically speaking, Spark `pipelines` this computation which we can see in the image below. This means that certain computations can all be performed at once (like a map and a filter) rather than having to do one operation for all pieces of data then the following operation.\n\n![transformations and actions](http://training.databricks.com/databricks_guide/gentle_introduction/pipeline.png)\n\nApache Spark can also keep results in memory as opposed to other frameworks that immediately write to disk after each task.\n\n## Apache Spark Architecture\n\nBefore proceeding with our example, let's see an overview of the Apache Spark architecture. As mentioned before, Apache Spark allows you to treat many machines as one machine and this is done via a master-worker type architecture where there is a `driver` or master node in the cluster, accompanied by `worker` nodes. The master sends work to the workers and either instructs them to pull to data from memory or from disk (or from another data source like S3 or Redshift).\n\nThe diagram below shows an example Apache Spark cluster, basically there exists a Driver node that communicates with executor nodes. Each of these executor nodes have slots which are logically like execution cores. \n\n![spark-architecture](http://training.databricks.com/databricks_guide/gentle_introduction/videoss_logo.png)\n\nThe Driver sends Tasks to the empty slots on the Executors when work has to be done:\n\n![spark-architecture](http://training.databricks.com/databricks_guide/gentle_introduction/spark_cluster_tasks.png)\n\nNote: In the case of the Community Edition there is no Worker, and the Master, not shown in the figure, executes the entire code.\n\n![spark-architecture](http://training.databricks.com/databricks_guide/gentle_introduction/notebook_microcluster.png)\n\nYou can view the details of your Apache Spark application in the Apache Spark web UI.  The web UI is accessible in Databricks by going to \"Clusters\" and then clicking on the \"View Spark UI\" link for your cluster, it is also available by clicking at the top left of this notebook where you would select the cluster to attach this notebook to. In this option will be a link to the Apache Spark Web UI.\n\nAt a high level, every Apache Spark application consists of a driver program that launches various parallel operations on executor Java Virtual Machines (JVMs) running either in a cluster or locally on the same machine. In Databricks, the notebook interface is the driver program.  This driver program contains the main loop for the program and creates distributed datasets on the cluster, then applies operations (transformations & actions) to those datasets.\nDriver programs access Apache Spark through a `SparkSession` object regardless of deployment location.\n\n## A Worked Example of Transformations and Actions\n\nTo illustrate all of these architectural and most relevantly **transformations** and **actions** - let's go through a more thorough example, this time using `DataFrames` and a csv file. \n\nThe DataFrame and SparkSQL work almost exactly as we have described above, we're going to build up a plan for how we're going to access the data and then finally execute that plan with an action. We'll see this process in the diagram below. We go through a process of analyzing the query, building up a plan, comparing them and then finally executing it.\n\n![Spark Query Plan](http://training.databricks.com/databricks_guide/gentle_introduction/query-plan-generation.png)\n\nWhile we won't go too deep into the details for how this process works, you can read a lot more about this process on the [Databricks blog](https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html). For those that want a more information about how Apache Spark goes through this process, I would definitely recommend that post!\n\nGoing forward, we're going to access a set of public datasets that Databricks makes available. Databricks datasets are a small curated group that we've pulled together from across the web. We make these available using the Databricks filesystem. Let's load the popular diamonds dataset in as a spark `DataFrame`. Now let's go through the dataset that we'll be working with."],"metadata":{}},{"cell_type":"code","source":["%fs ls /databricks-datasets/Rdatasets/data-001/datasets.csv"],"metadata":{},"outputs":[],"execution_count":11},{"cell_type":"code","source":["dataPath = \"/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv\"\ndiamonds = sqlContext.read.format(\"com.databricks.spark.csv\")\\\n  .option(\"header\",\"true\")\\\n  .option(\"inferSchema\", \"true\")\\\n  .load(dataPath)\n  \n# inferSchema means we will automatically figure out column types \n# at a cost of reading the data more than once"],"metadata":{},"outputs":[],"execution_count":12},{"cell_type":"markdown","source":["Now that we've loaded in the data, we're going to perform computations on it. This provide us a convenient tour of some of the basic functionality and some of the nice features that makes running Spark on Databricks the simplest! In order to be able to perform our computations, we need to understand more about the data. We can do this with the `display` function."],"metadata":{}},{"cell_type":"code","source":["display(diamonds)"],"metadata":{},"outputs":[],"execution_count":14},{"cell_type":"markdown","source":["what makes `display` exceptional is the fact that we can very easily create some more sophisticated graphs by clicking the graphing icon that you can see below. Here's a plot that allows us to compare price, color, and cut."],"metadata":{}},{"cell_type":"code","source":["display(diamonds)"],"metadata":{},"outputs":[],"execution_count":16},{"cell_type":"markdown","source":["Now that we've explored the data, let's return to understanding **transformations** and **actions**. I'm going to create several transformations and then an action. After that we will inspect exactly what's happening under the hood.\n\nThese transformations are simple, first we group by two variables, cut and color and then compute the average price. Then we're going to inner join that to the original dataset on the column `color`. Then we'll select the average price as well as the carat from that new dataset."],"metadata":{}},{"cell_type":"code","source":["df1 = diamonds.groupBy(\"cut\", \"color\").avg(\"price\") # a simple grouping\n\ndf2 = df1\\\n  .join(diamonds, on='color', how='inner')\\\n  .select(\"`avg(price)`\", \"carat\")\n# a simple join and selecting some columns"],"metadata":{},"outputs":[],"execution_count":18},{"cell_type":"markdown","source":["These transformations are now complete in a sense but nothing has happened. As you'll see above we don't get any results back! \n\nThe reason for that is these computations are *lazy* in order to build up the entire flow of data from start to finish required by the user. This is a intelligent optimization for two key reasons. Any calculation can be recomputed from the very source data allowing Apache Spark to handle any failures that occur along the way, successfully handle stragglers. Secondly, Apache Spark can optimize computation so that data and computation can be `pipelined` as we mentioned above. Therefore, with each transformation Apache spark creates a plan for how it will perform this work.\n\nTo get a sense for what this plan consists of, we can use the `explain` method. Remember that none of our computations have been executed yet, so all this explain method does is tells us the lineage for how to compute this exact dataset."],"metadata":{}},{"cell_type":"code","source":["df2.explain()"],"metadata":{},"outputs":[],"execution_count":20},{"cell_type":"markdown","source":["Now explaining the above results is outside of this introductory tutorial, but please feel free to read through it. What you should deduce from this is that Spark has generated a plan for how it hopes to execute the given query. Let's now run an action in order to execute the above plan."],"metadata":{}},{"cell_type":"code","source":["df2.count()"],"metadata":{},"outputs":[],"execution_count":22},{"cell_type":"markdown","source":["This will execute the plan that Apache Spark built up previously. Click the little arrow next to where it says `(2) Spark Jobs` after that cell finishes executing and then click the `View` link. This brings up the Apache Spark Web UI right inside of your notebook. This can also be accessed from the cluster attach button at the top of this notebook. In the Spark UI, you should see something that includes a diagram something like this.\n\n![img](http://training.databricks.com/databricks_guide/gentle_introduction/spark-dag-ui-before-2-0.png)\n\nor\n\n![img](http://training.databricks.com/databricks_guide/gentle_introduction/spark-dag-ui.png)\n\nThese are significant visualizations. The top one is using Apache Spark 1.6 while the lower one is using Apache Spark 2.0, we'll be focusing on the 2.0 version. These are Directed Acyclic Graphs (DAG)s of all the computations that have to be performed in order to get to that result. It's easy to see that the second DAG visualization is much cleaner than the one before but both visualizations show us all the steps that Spark has to get our data into the final form. \n\nAgain, this DAG is generated because transformations are *lazy* - while generating this series of steps Spark will optimize lots of things along the way and will even generate code to do so. This is one of the core reasons that users should be focusing on using DataFrames and Datasets instead of the legacy RDD API. With DataFrames and Datasets, Apache Spark will work under the hood to optimize the entire query plan and pipeline entire steps together. You'll see instances of `WholeStageCodeGen` as well as `tungsten` in the plans and these are apart of the improvements [in SparkSQL which you can read more about on the Databricks blog.](https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html)\n\nIn this diagram you can see that we start with a CSV all the way on the left side, perform some changes, merge it with another CSV file (that we created from the original DataFrame), then join those together and finally perform some aggregations until we get our final result!"],"metadata":{}},{"cell_type":"markdown","source":["### Caching\n\nOne of the significant parts of Apache Spark is its ability to to store things in memory during computation. This is a neat trick that you can use as a way to speed up access to commonly queried tables or pieces of data. This is also great for iterative algorithms that work over and over again on the same data. While many see this as a panacea for all speed issues, think of it much more like a tool that you can use. Other important concepts like data partitioning, clustering and bucketing can end up having a much greater effect on the execution of your job than caching however remember - these are all tools in your tool kit!\n\nTo cache a DataFrame or RDD, simply use the cache method."],"metadata":{}},{"cell_type":"code","source":["df2.cache()"],"metadata":{},"outputs":[],"execution_count":25},{"cell_type":"markdown","source":["Caching, like a transformation, is performed lazily. That means that it won't store the data in memory until you call an action on that dataset. \n\nHere's a simple example. We've created our df2 DataFrame which is essentially a logical plan that tells us how to compute that exact DataFrame. We've told Apache Spark to cache that data after we compute it for the first time. So let's call a full scan of the data with a count twice. The first time, this will create the DataFrame, cache it in memory, then return the result. The second time, rather than recomputing that whole DataFrame, it will just hit the version that it has in memory.\n\nLet's take a look at how we can discover this."],"metadata":{}},{"cell_type":"code","source":["df2.count()"],"metadata":{},"outputs":[],"execution_count":27},{"cell_type":"markdown","source":["However after we've now counted the data. We'll see that the explain ends up being quite different."],"metadata":{}},{"cell_type":"code","source":["df2.count()"],"metadata":{},"outputs":[],"execution_count":29},{"cell_type":"markdown","source":["In the above example, we can see that this cuts down on the time needed to generate this data immensely - often by at least an order of magnitude. With much larger and more complex data analysis, the gains that we get from caching can be even greater!"],"metadata":{}},{"cell_type":"markdown","source":["## Conclusion\n\nIn this notebook we've covered a ton of material! But you're now well on your way to understanding Spark and Databricks! Now that you've completed this notebook, you should hopefully be more familiar with the core concepts of Spark on Databricks. Be sure to subscribe to our blog to get the latest updates about Apache Spark 2.0 and the next notebooks in this series!"],"metadata":{}}],"metadata":{"name":"A Gentle Introduction to Apache Spark on Databricks","notebookId":713593585589395},"nbformat":4,"nbformat_minor":0}
2 | 


--------------------------------------------------------------------------------
/Lecture2s.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MiyainNYC/Distributed-Machine-Learning/15954ae4fc8149fdcc0b9e1007794b0ed3af546a/Lecture2s.pdf


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Data Science and Engineering with Apache Spark
 2 | 
 3 | This deposit is about all assignments I've finished for online courses about how to perform data science and data engineering at scale using Spark.
 4 | Every assignment requires python coding fluency and deep understanding for Spark framework and scalable algorithml. Besides, Familiarity with basic machine learning concepts and exposure to algorithms, probability, linear algebra and calculus is a must to complete all those assignments.
 5 | 
 6 | ## This series contains three courses in total:
 7 | * **Introduction to Apache Spark: fundamentals and architecture of Apache Spark**
 8 | 
 9 | lab one:
10 | 
11 | > Part 1: Basic notebook usage and Python integration
12 | 
13 | > Part 2: An introduction to using Apache Spark with the PySpark SQL API running in a notebook
14 | 
15 | > Part 3: Using DataFrames and chaining together transformations and actions
16 | 
17 | > Part 4: Python Lambda functions and User Defined Functions
18 | 
19 | > Part 5: Additional DataFrame actions
20 | 
21 | > Part 6: Additional DataFrame transformations
22 | 
23 | > Part 7: Caching DataFrames and storage options
24 | 
25 | > Part 8: Debugging Spark applications and lazy evaluation
26 | 
27 | lab two:
28 | > Part 1: Introduction and Imports
29 | 
30 | > Part 2: Exploratory Data Analysis
31 | 
32 | > Part 3: Analysis Walk-Through on the Web Server Log File
33 | 
34 | > Part 4: Analyzing Web Server Log File
35 | 
36 | > Part 5: Exploring 404 Response Codes
37 | 
38 | * **Distributed Machine Learning with Apache Spark**
39 | 
40 | lab one:
41 | 
42 | > Basic Machine Learningn concepts
43 | 
44 | > supervised learning pipelines
45 |  
46 | > linear algebra 
47 | 
48 | > computational complexity/big O notation 
49 | 
50 | > RDD data structure
51 | 
52 | lab two;
53 | 
54 | > Linear regression formulation and closed-form solution
55 | 
56 | > Distributed machine learning principles (related to computation, storage, and communication)
57 | 
58 | > Develop an end-to-end linear regression pipeline to predict the release year of a song given a set of audio features. Implement a gradient descent solver for linear regression, use Spark's machine Learning library (mllib) to train additional models, tune models via grid search, improve accuracy using quadratic features, and visualize various intermediate results to build intuition. Finally, write a concise version of this pipeline using Spark's pipeline API.
59 | 
60 | lab three:
61 | 
62 | > Online advertising, linear classification, logistic regression, working with probabilistic predictions, categorical data and one-hot-encoding, feature hashing for dimensionality reduction.
63 | 
64 | > Construct a logistic regression pipeline to predict click-through rate using data from a recent Kaggle competition. Extract numerical features from the raw categorical data using one-hot-encoding, reduce the dimensionality of these features via hashing, train logistic regression models using mllib, tune hyperparameter via grid search, and interpret probabilistic predictions via a ROC plot.
65 | 
66 | lab four:
67 | 
68 | > Introduction to neuroscience and neuroimaging data, exploratory data analysis, principal component analysis (PCA) formulations and solution, distributed PCA.
69 | 
70 | > Neuroimaging Analysis via PCA - Identify patterns of brain activity in larval zebrafish. Work with time-varying images (generated using a technique called light-sheet microscopy) that capture a zebrafish's neural activity as it is presented with a moving visual pattern. After implementing distributed PCA from scratch and gaining intuition by working with synthetic data, Use PCA to identify distinct patterns across the zebrafish brain that are induced by different types of stimuli.
71 | 


--------------------------------------------------------------------------------
/cs105_lab1a_spark_tutorial.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"cell_type":"markdown","source":["<a rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-nd/4.0/\"><img alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png\" /></a><br />This work is licensed under a <a rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-nd/4.0/\">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>."],"metadata":{}},{"cell_type":"markdown","source":["#![Spark Logo](http://spark-mooc.github.io/web-assets/images/ta_Spark-logo-small.png) + ![Python Logo](http://spark-mooc.github.io/web-assets/images/python-logo-master-v3-TM-flattened_small.png)\n# **Spark Tutorial: Learning Apache Spark**\n\nThis tutorial will teach you how to use [Apache Spark](http://spark.apache.org/), a framework for large-scale data processing, within a notebook. Many traditional frameworks were designed to be run on a single computer.  However, many datasets today are too large to be stored on a single computer, and even when a dataset can be stored on one computer (such as the datasets in this tutorial), the dataset can often be processed much more quickly using multiple computers.\n\nSpark has efficient implementations of a number of transformations and actions that can be composed together to perform data processing and analysis.  Spark excels at distributing these operations across a cluster while abstracting away many of the underlying implementation details.  Spark has been designed with a focus on scalability and efficiency.  With Spark you can begin developing your solution on your laptop, using a small dataset, and then use that same code to process terabytes or even petabytes across a distributed cluster.\n\n**During this tutorial we will cover:**\n\n* *Part 1:* Basic notebook usage and [Python](https://docs.python.org/2/) integration\n* *Part 2:* An introduction to using [Apache Spark](https://spark.apache.org/) with the [PySpark SQL API](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark-sql-module) running in a notebook\n* *Part 3:* Using DataFrames and chaining together transformations and actions\n* *Part 4*: Python Lambda functions and User Defined Functions\n* *Part 5:* Additional DataFrame actions\n* *Part 6:* Additional DataFrame transformations\n* *Part 7:* Caching DataFrames and storage options\n* *Part 8:* Debugging Spark applications and lazy evaluation\n\nThe following transformations will be covered:\n* `select()`, `filter()`, `distinct()`, `dropDuplicates()`, `orderBy()`, `groupBy()`\n\nThe following actions will be covered:\n* `first()`, `take()`, `count()`, `collect()`, `show()`\n\nAlso covered:\n* `cache()`, `unpersist()`\n\nNote that, for reference, you can look up the details of these methods in the [Spark's PySpark SQL API](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark-sql-module)"],"metadata":{}},{"cell_type":"markdown","source":["## **Part 1: Basic notebook usage and [Python](https://docs.python.org/2/) integration **"],"metadata":{}},{"cell_type":"markdown","source":["### (1a) Notebook usage\n\nA notebook is comprised of a linear sequence of cells.  These cells can contain either markdown or code, but we won't mix both in one cell.  When a markdown cell is executed it renders formatted text, images, and links just like HTML in a normal webpage.  The text you are reading right now is part of a markdown cell.  Python code cells allow you to execute arbitrary Python commands just like in any Python shell. Place your cursor inside the cell below, and press \"Shift\" + \"Enter\" to execute the code and advance to the next cell.  You can also press \"Ctrl\" + \"Enter\" to execute the code and remain in the cell.  These commands work the same in both markdown and code cells."],"metadata":{}},{"cell_type":"code","source":["# This is a Python cell. You can run normal Python code here...\nprint 'The sum of 1 and 1 is {0}'.format(1+1)"],"metadata":{},"outputs":[],"execution_count":5},{"cell_type":"code","source":["# Here is another Python cell, this time with a variable (x) declaration and an if statement:\nx = 42\nif x > 40:\n    print 'The sum of 1 and 2 is {0}'.format(1+2)"],"metadata":{},"outputs":[],"execution_count":6},{"cell_type":"markdown","source":["### (1b) Notebook state\n\nAs you work through a notebook it is important that you run all of the code cells.  The notebook is stateful, which means that variables and their values are retained until the notebook is detached (in Databricks) or the kernel is restarted (in Jupyter notebooks).  If you do not run all of the code cells as you proceed through the notebook, your variables will not be properly initialized and later code might fail.  You will also need to rerun any cells that you have modified in order for the changes to be available to other cells."],"metadata":{}},{"cell_type":"code","source":["# This cell relies on x being defined already.\n# If we didn't run the cells from part (1a) this code would fail.\nprint x * 2"],"metadata":{},"outputs":[],"execution_count":8},{"cell_type":"markdown","source":["### (1c) Library imports\n\nWe can import standard Python libraries ([modules](https://docs.python.org/2/tutorial/modules.html)) the usual way.  An `import` statement will import the specified module.  In this tutorial and future labs, we will provide any imports that are necessary."],"metadata":{}},{"cell_type":"code","source":["# Import the regular expression library\nimport re\nm = re.search('(?<=abc)def', 'abcdef')\nm.group(0)"],"metadata":{},"outputs":[],"execution_count":10},{"cell_type":"code","source":["# Import the datetime library\nimport datetime\nprint 'This was last run on: {0}'.format(datetime.datetime.now())"],"metadata":{},"outputs":[],"execution_count":11},{"cell_type":"markdown","source":["##  **Part 2: An introduction to using [Apache Spark](https://spark.apache.org/) with the [PySpark SQL API](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark-sql-module) running in a notebook**"],"metadata":{}},{"cell_type":"markdown","source":["### Spark Context\n\nIn Spark, communication occurs between a driver and executors.  The driver has Spark jobs that it needs to run and these jobs are split into tasks that are submitted to the executors for completion.  The results from these tasks are delivered back to the driver.\n\nIn part 1, we saw that normal Python code can be executed via cells. When using Databricks this code gets executed in the Spark driver's Java Virtual Machine (JVM) and not in an executor's JVM, and when using an Jupyter notebook it is executed within the kernel associated with the notebook. Since no Spark functionality is actually being used, no tasks are launched on the executors.\n\nIn order to use Spark and its DataFrame API we will need to use a `SQLContext`.  When running Spark, you start a new Spark application by creating a [SparkContext](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext). You can then create a [SQLContext](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext) from the `SparkContext`. When the `SparkContext` is created, it asks the master for some cores to use to do work.  The master sets these cores aside just for you; they won't be used for other applications. When using Databricks, both a `SparkContext` and a `SQLContext` are created for you automatically. `sc` is your `SparkContext`, and `sqlContext` is your `SQLContext`."],"metadata":{}},{"cell_type":"markdown","source":["### (2a) Example Cluster\nThe diagram shows an example cluster, where the slots allocated for an application are outlined in purple. (Note: We're using the term _slots_ here to indicate threads available to perform parallel work for Spark.\nSpark documentation often refers to these threads as _cores_, which is a confusing term, as the number of slots available on a particular machine does not necessarily have any relationship to the number of physical CPU\ncores on that machine.)\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/cs105x/diagram-2a.png\" style=\"height: 800px;float: right\"/>\n\nYou can view the details of your Spark application in the Spark web UI.  The web UI is accessible in Databricks by going to \"Clusters\" and then clicking on the \"Spark UI\" link for your cluster.  In the web UI, under the \"Jobs\" tab, you can see a list of jobs that have been scheduled or run.  It's likely there isn't any thing interesting here yet because we haven't run any jobs, but we'll return to this page later.\n\nAt a high level, every Spark application consists of a driver program that launches various parallel operations on executor Java Virtual Machines (JVMs) running either in a cluster or locally on the same machine. In Databricks, \"Databricks Shell\" is the driver program.  When running locally, `pyspark` is the driver program. In all cases, this driver program contains the main loop for the program and creates distributed datasets on the cluster, then applies operations (transformations & actions) to those datasets.\nDriver programs access Spark through a SparkContext object, which represents a connection to a computing cluster. A Spark SQL context object (`sqlContext`) is the main entry point for Spark DataFrame and SQL functionality. A `SQLContext` can be used to create DataFrames, which allows you to direct the operations on your data.\n\nTry printing out `sqlContext` to see its type."],"metadata":{}},{"cell_type":"code","source":["# Display the type of the Spark sqlContext\ntype(sqlContext)"],"metadata":{},"outputs":[],"execution_count":15},{"cell_type":"markdown","source":["Note that the type is `HiveContext`. This means we're working with a version of Spark that has Hive support. Compiling Spark with Hive support is a good idea, even if you don't have a Hive metastore. As the\n[Spark Programming Guide](http://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext) states, a `HiveContext` \"provides a superset of the functionality provided by the basic `SQLContext`. Additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs [user-defined functions], and the ability to read data from Hive tables. To use a `HiveContext`, you do not need to have an existing Hive setup, and all of the data sources available to a `SQLContext` are still available.\""],"metadata":{}},{"cell_type":"markdown","source":["### (2b) SparkContext attributes\n\nYou can use Python's [dir()](https://docs.python.org/2/library/functions.html?highlight=dir#dir) function to get a list of all the attributes (including methods) accessible through the `sqlContext` object."],"metadata":{}},{"cell_type":"code","source":["# List sqlContext's attributes\ndir(sqlContext)"],"metadata":{},"outputs":[],"execution_count":18},{"cell_type":"markdown","source":["### (2c) Getting help\n\nAlternatively, you can use Python's [help()](https://docs.python.org/2/library/functions.html?highlight=help#help) function to get an easier to read list of all the attributes, including examples, that the `sqlContext` object has."],"metadata":{}},{"cell_type":"code","source":["# Use help to obtain more detailed information\nhelp(sqlContext)"],"metadata":{},"outputs":[],"execution_count":20},{"cell_type":"markdown","source":["Outside of `pyspark` or a notebook, `SQLContext` is created from the lower-level `SparkContext`, which is usually used to create Resilient Distributed Datasets (RDDs). An RDD is the way Spark actually represents data internally; DataFrames are actually implemented in terms of RDDs.\n\nWhile you can interact directly with RDDs, DataFrames are preferred. They're generally faster, and they perform the same no matter what language (Python, R, Scala or Java) you use with Spark.\n\nIn this course, we'll be using DataFrames, so we won't be interacting directly with the Spark Context object very much. However, it's worth knowing that inside `pyspark` or a notebook, you already have an existing `SparkContext` in the `sc` variable. One simple thing we can do with `sc` is check the version of Spark we're using:"],"metadata":{}},{"cell_type":"code","source":["# After reading the help we've decided we want to use sc.version to see what version of Spark we are running\nsc.version"],"metadata":{},"outputs":[],"execution_count":22},{"cell_type":"code","source":["# Help can be used on any Python object\nhelp(map)"],"metadata":{},"outputs":[],"execution_count":23},{"cell_type":"markdown","source":["## **Part 3: Using DataFrames and chaining together transformations and actions**"],"metadata":{}},{"cell_type":"markdown","source":["### Working with your first DataFrames\n\nIn Spark, we first create a base [DataFrame](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame). We can then apply one or more transformations to that base DataFrame. *A DataFrame is immutable, so once it is created, it cannot be changed.* As a result, each transformation creates a new DataFrame. Finally, we can apply one or more actions to the DataFrames.\n\n> Note that Spark uses lazy evaluation, so transformations are not actually executed until an action occurs.\n\nWe will perform several exercises to obtain a better understanding of DataFrames:\n* Create a Python collection of 10,000 integers\n* Create a Spark DataFrame from that collection\n* Subtract one from each value using `map`\n* Perform action `collect` to view results\n* Perform action `count` to view counts\n* Apply transformation `filter` and view results with `collect`\n* Learn about lambda functions\n* Explore how lazy evaluation works and the debugging challenges that it introduces\n\nA DataFrame consists of a series of `Row` objects; each `Row` object has a set of named columns. You can think of a DataFrame as modeling a table, though the data source being processed does not have to be a table.\n\nMore formally, a DataFrame must have a _schema_, which means it must consist of columns, each of which has a _name_ and a _type_. Some data sources have schemas built into them. Examples include RDBMS databases, Parquet files, and NoSQL databases like Cassandra. Other data sources don't have computer-readable schemas, but you can often apply a schema programmatically."],"metadata":{}},{"cell_type":"markdown","source":["### (3a) Create a Python collection of 10,000 people\n\nWe will use a third-party Python testing library called [fake-factory](https://pypi.python.org/pypi/fake-factory/0.5.3) to create a collection of fake person records."],"metadata":{}},{"cell_type":"code","source":["from faker import Factory\nfake = Factory.create()\nfake.seed(4321)"],"metadata":{},"outputs":[],"execution_count":27},{"cell_type":"markdown","source":["We're going to use this factory to create a collection of randomly generated people records. In the next section, we'll turn that collection into a DataFrame. We'll use the Spark `Row` class,\nbecause that will help us define the Spark DataFrame schema. There are other ways to define schemas, though; see\nthe Spark Programming Guide's discussion of [schema inference](http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection) for more information. (For instance,\nwe could also use a Python `namedtuple`.)"],"metadata":{}},{"cell_type":"code","source":["# Each entry consists of last_name, first_name, ssn, job, and age (at least 1)\nfrom pyspark.sql import Row\ndef fake_entry():\n  name = fake.name().split()\n  return (name[1], name[0], fake.ssn(), fake.job(), abs(2016 - fake.date_time().year) + 1)"],"metadata":{},"outputs":[],"execution_count":29},{"cell_type":"code","source":["# Create a helper function to call a function repeatedly\ndef repeat(times, func, *args, **kwargs):\n    for _ in xrange(times):\n        yield func(*args, **kwargs)"],"metadata":{},"outputs":[],"execution_count":30},{"cell_type":"code","source":["data = list(repeat(10000, fake_entry))"],"metadata":{},"outputs":[],"execution_count":31},{"cell_type":"markdown","source":["`data` is just a normal Python list, containing Python tuples objects. Let's look at the first item in the list:"],"metadata":{}},{"cell_type":"code","source":["data[0]"],"metadata":{},"outputs":[],"execution_count":33},{"cell_type":"markdown","source":["We can check the size of the list using the Python `len()` function."],"metadata":{}},{"cell_type":"code","source":["len(data)"],"metadata":{},"outputs":[],"execution_count":35},{"cell_type":"markdown","source":["### (3b) Distributed data and using a collection to create a DataFrame\n\nIn Spark, datasets are represented as a list of entries, where the list is broken up into many different partitions that are each stored on a different machine.  Each partition holds a unique subset of the entries in the list.  Spark calls datasets that it stores \"Resilient Distributed Datasets\" (RDDs). Even DataFrames are ultimately represented as RDDs, with additional meta-data.\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/cs105x/diagram-3b.png\" style=\"width: 900px; float: right; margin: 5px\"/>\n\nOne of the defining features of Spark, compared to other data analytics frameworks (e.g., Hadoop), is that it stores data in memory rather than on disk.  This allows Spark applications to run much more quickly, because they are not slowed down by needing to read data from disk.\nThe figure to the right illustrates how Spark breaks a list of data entries into partitions that are each stored in memory on a worker.\n\n\nTo create the DataFrame, we'll use `sqlContext.createDataFrame()`, and we'll pass our array of data in as an argument to that function. Spark will create a new set of input data based on data that is passed in.  A DataFrame requires a _schema_, which is a list of columns, where each column has a name and a type. Our list of data has elements with types (mostly strings, but one integer). We'll supply the rest of the schema and the column names as the second argument to `createDataFrame()`."],"metadata":{}},{"cell_type":"markdown","source":["Let's view the help for `createDataFrame()`."],"metadata":{}},{"cell_type":"code","source":["help(sqlContext.createDataFrame)"],"metadata":{},"outputs":[],"execution_count":38},{"cell_type":"code","source":["dataDF = sqlContext.createDataFrame(data, ('last_name', 'first_name', 'ssn', 'occupation', 'age'))"],"metadata":{},"outputs":[],"execution_count":39},{"cell_type":"markdown","source":["Let's see what type `sqlContext.createDataFrame()` returned."],"metadata":{}},{"cell_type":"code","source":["print 'type of dataDF: {0}'.format(type(dataDF))"],"metadata":{},"outputs":[],"execution_count":41},{"cell_type":"markdown","source":["Let's take a look at the DataFrame's schema and some of its rows."],"metadata":{}},{"cell_type":"code","source":["dataDF.printSchema()"],"metadata":{},"outputs":[],"execution_count":43},{"cell_type":"markdown","source":["We can register the newly created DataFrame as a named table, using the `registerDataFrameAsTable()` method."],"metadata":{}},{"cell_type":"code","source":["sqlContext.registerDataFrameAsTable(dataDF, 'dataframe')"],"metadata":{},"outputs":[],"execution_count":45},{"cell_type":"markdown","source":["What methods can we call on this DataFrame?"],"metadata":{}},{"cell_type":"code","source":["help(dataDF)"],"metadata":{},"outputs":[],"execution_count":47},{"cell_type":"markdown","source":["How many partitions will the DataFrame be split into?"],"metadata":{}},{"cell_type":"code","source":["dataDF.rdd.getNumPartitions()"],"metadata":{},"outputs":[],"execution_count":49},{"cell_type":"markdown","source":["###### A note about DataFrames and queries\n\nWhen you use DataFrames or Spark SQL, you are building up a _query plan_. Each transformation you apply to a DataFrame adds some information to the query plan. When you finally call an action, which triggers execution of your Spark job, several things happen:\n\n1. Spark's Catalyst optimizer analyzes the query plan (called an _unoptimized logical query plan_) and attempts to optimize it. Optimizations include (but aren't limited to) rearranging and combining `filter()` operations for efficiency, converting `Decimal` operations to more efficient long integer operations, and pushing some operations down into the data source (e.g., a `filter()` operation might be translated to a SQL `WHERE` clause, if the data source is a traditional SQL RDBMS). The result of this optimization phase is an _optimized logical plan_.\n2. Once Catalyst has an optimized logical plan, it then constructs multiple _physical_ plans from it. Specifically, it implements the query in terms of lower level Spark RDD operations.\n3. Catalyst chooses which physical plan to use via _cost optimization_. That is, it determines which physical plan is the most efficient (or least expensive), and uses that one.\n4. Finally, once the physical RDD execution plan is established, Spark actually executes the job.\n\nYou can examine the query plan using the `explain()` function on a DataFrame. By default, `explain()` only shows you the final physical plan; however, if you pass it an argument of `True`, it will show you all phases.\n\n(If you want to take a deeper dive into how Catalyst optimizes DataFrame queries, this blog post, while a little old, is an excellent overview: [Deep Dive into Spark SQL's Catalyst Optimizer](https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html).)\n\nLet's add a couple transformations to our DataFrame and look at the query plan on the resulting transformed DataFrame. Don't be too concerned if it looks like gibberish. As you gain more experience with Apache Spark, you'll begin to be able to use `explain()` to help you understand more about your DataFrame operations."],"metadata":{}},{"cell_type":"code","source":["newDF = dataDF.distinct().select('*')\nnewDF.explain(True)"],"metadata":{},"outputs":[],"execution_count":51},{"cell_type":"markdown","source":["### (3c): Subtract one from each value using _select_\n\nSo far, we've created a distributed DataFrame that is split into many partitions, where each partition is stored on a single machine in our cluster.  Let's look at what happens when we do a basic operation on the dataset.  Many useful data analysis operations can be specified as \"do something to each item in the dataset\".  These data-parallel operations are convenient because each item in the dataset can be processed individually: the operation on one entry doesn't effect the operations on any of the other entries.  Therefore, Spark can parallelize the operation.\n\nOne of the most common DataFrame operations is `select()`, and it works more or less like a SQL `SELECT` statement: You can select specific columns from the DataFrame, and you can even use `select()` to create _new_ columns with values that are derived from existing column values. We can use `select()` to create a new column that decrements the value of the existing `age` column.\n\n`select()` is a _transformation_. It returns a new DataFrame that captures both the previous DataFrame and the operation to add to the query (`select`, in this case). But it does *not* actually execute anything on the cluster. When transforming DataFrames, we are building up a _query plan_. That query plan will be optimized, implemented (in terms of RDDs), and executed by Spark _only_ when we call an action."],"metadata":{}},{"cell_type":"code","source":["# Transform dataDF through a select transformation and rename the newly created '(age -1)' column to 'age'\n# Because select is a transformation and Spark uses lazy evaluation, no jobs, stages,\n# or tasks will be launched when we run this code.\nsubDF = dataDF.select('last_name', 'first_name', 'ssn', 'occupation', (dataDF.age - 1).alias('age'))"],"metadata":{},"outputs":[],"execution_count":53},{"cell_type":"markdown","source":["Let's take a look at the query plan."],"metadata":{}},{"cell_type":"code","source":["subDF.explain(True)"],"metadata":{},"outputs":[],"execution_count":55},{"cell_type":"markdown","source":["### (3d) Use _collect_ to view results\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/cs105x/diagram-3d.png\" style=\"height:700px;float:right\"/>\n\nTo see a list of elements decremented by one, we need to create a new list on the driver from the the data distributed in the executor nodes.  To do this we can call the `collect()` method on our DataFrame.  `collect()` is often used after transformations to ensure that we are only returning a *small* amount of data to the driver.  This is done because the data returned to the driver must fit into the driver's available memory.  If not, the driver will crash.\n\nThe `collect()` method is the first action operation that we have encountered.  Action operations cause Spark to perform the (lazy) transformation operations that are required to compute the values returned by the action.  In our example, this means that tasks will now be launched to perform the `createDataFrame`, `select`, and `collect` operations.\n\nIn the diagram, the dataset is broken into four partitions, so four `collect()` tasks are launched. Each task collects the entries in its partition and sends the result to the driver, which creates a list of the values, as shown in the figure below.\n\nNow let's run `collect()` on `subDF`."],"metadata":{}},{"cell_type":"code","source":["# Let's collect the data\nresults = subDF.collect()\nprint results"],"metadata":{},"outputs":[],"execution_count":57},{"cell_type":"markdown","source":["A better way to visualize the data is to use the `show()` method. If you don't tell `show()` how many rows to display, it displays 20 rows."],"metadata":{}},{"cell_type":"code","source":["subDF.show()"],"metadata":{},"outputs":[],"execution_count":59},{"cell_type":"markdown","source":["If you'd prefer that `show()` not truncate the data, you can tell it not to:"],"metadata":{}},{"cell_type":"code","source":["subDF.show(n=30, truncate=False)"],"metadata":{},"outputs":[],"execution_count":61},{"cell_type":"markdown","source":["In Databricks, there's an even nicer way to look at the values in a DataFrame: The `display()` helper function."],"metadata":{}},{"cell_type":"code","source":["display(subDF)"],"metadata":{},"outputs":[],"execution_count":63},{"cell_type":"code","source":["help(display)"],"metadata":{},"outputs":[],"execution_count":64},{"cell_type":"markdown","source":["### (3e) Use _count_ to get total\n\nOne of the most basic jobs that we can run is the `count()` job which will count the number of elements in a DataFrame, using the `count()` action. Since `select()` creates a new DataFrame with the same number of elements as the starting DataFrame, we expect that applying `count()` to each DataFrame will return the same result.\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/cs105x/diagram-3e.png\" style=\"height:700px;float:right\"/>\n\nNote that because `count()` is an action operation, if we had not already performed an action with `collect()`, then Spark would now perform the transformation operations when we executed `count()`.\n\nEach task counts the entries in its partition and sends the result to your SparkContext, which adds up all of the counts. The figure on the right shows what would happen if we ran `count()` on a small example dataset with just four partitions."],"metadata":{}},{"cell_type":"code","source":["print dataDF.count()\nprint subDF.count()"],"metadata":{},"outputs":[],"execution_count":66},{"cell_type":"markdown","source":["### (3f) Apply transformation _filter_ and view results with _collect_\n\nNext, we'll create a new DataFrame that only contains the people whose ages are less than 10. To do this, we'll use the `filter()` transformation. (You can also use `where()`, an alias for `filter()`, if you prefer something more SQL-like). The `filter()` method is a transformation operation that creates a new DataFrame from the input DataFrame, keeping only values that match the filter expression.\n\nThe figure shows how this might work on the small four-partition dataset.\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/cs105x/diagram-3f.png\" style=\"height:700px;float:right\"/>\n\nTo view the filtered list of elements less than 10, we need to create a new list on the driver from the distributed data on the executor nodes.  We use the `collect()` method to return a list that contains all of the elements in this filtered DataFrame to the driver program."],"metadata":{}},{"cell_type":"code","source":["filteredDF = subDF.filter(subDF.age < 10)\nfilteredDF.show(truncate=False)\nfilteredDF.count()"],"metadata":{},"outputs":[],"execution_count":68},{"cell_type":"markdown","source":["(These are some _seriously_ precocious children...)"],"metadata":{}},{"cell_type":"markdown","source":["## Part 4: Python Lambda functions and User Defined Functions\n\nPython supports the use of small one-line anonymous functions that are not bound to a name at runtime.\n\n`lambda` functions, borrowed from LISP, can be used wherever function objects are required. They are syntactically restricted to a single expression. Remember that `lambda` functions are a matter of style and using them is never required - semantically, they are just syntactic sugar for a normal function definition. You can always define a separate normal function instead, but using a `lambda` function is an equivalent and more compact form of coding. Ideally you should consider using `lambda` functions where you want to encapsulate non-reusable code without littering your code with one-line functions.\n\nHere, instead of defining a separate function for the `filter()` transformation, we will use an inline `lambda()` function and we will register that lambda as a Spark _User Defined Function_ (UDF). A UDF is a special wrapper around a function, allowing the function to be used in a DataFrame query."],"metadata":{}},{"cell_type":"code","source":["from pyspark.sql.types import BooleanType\nless_ten = udf(lambda s: s < 10, BooleanType())\nlambdaDF = subDF.filter(less_ten(subDF.age))\nlambdaDF.show()\nlambdaDF.count()"],"metadata":{},"outputs":[],"execution_count":71},{"cell_type":"code","source":["# Let's collect the even values less than 10\neven = udf(lambda s: s % 2 == 0, BooleanType())\nevenDF = lambdaDF.filter(even(lambdaDF.age))\nevenDF.show()\nevenDF.count()"],"metadata":{},"outputs":[],"execution_count":72},{"cell_type":"markdown","source":["## Part 5: Additional DataFrame actions\n\nLet's investigate some additional actions:\n\n* [first()](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.first)\n* [take()](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.take)\n\nOne useful thing to do when we have a new dataset is to look at the first few entries to obtain a rough idea of what information is available.  In Spark, we can do that using actions like `first()`, `take()`, and `show()`. Note that for the `first()` and `take()` actions, the elements that are returned depend on how the DataFrame is *partitioned*.\n\nInstead of using the `collect()` action, we can use the `take(n)` action to return the first _n_ elements of the DataFrame. The `first()` action returns the first element of a DataFrame, and is equivalent to `take(1)[0]`."],"metadata":{}},{"cell_type":"code","source":["print \"first: {0}\\n\".format(filteredDF.first())\n\nprint \"Four of them: {0}\\n\".format(filteredDF.take(4))"],"metadata":{},"outputs":[],"execution_count":74},{"cell_type":"markdown","source":["This looks better:"],"metadata":{}},{"cell_type":"code","source":["display(filteredDF.take(4))"],"metadata":{},"outputs":[],"execution_count":76},{"cell_type":"markdown","source":["## Part 6: Additional DataFrame transformations"],"metadata":{}},{"cell_type":"markdown","source":["### (6a) _orderBy_\n\n[`orderBy()`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.distinct) allows you to sort a DataFrame by one or more columns, producing a new DataFrame.\n\nFor example, let's get the first five oldest people in the original (unfiltered) DataFrame. We can use the `orderBy()` transformation. `orderBy` takes one or more columns, either as _names_ (strings) or as `Column` objects. To get a `Column` object, we use one of two notations on the DataFrame:\n\n* Pandas-style notation: `filteredDF.age`\n* Subscript notation: `filteredDF['age']`\n\nBoth of those syntaxes return a `Column`, which has additional methods like `desc()` (for sorting in descending order) or `asc()` (for sorting in ascending order, which is the default).\n\nHere are some examples:\n\n```\ndataDF.orderBy(dataDF['age'])  # sort by age in ascending order; returns a new DataFrame\ndataDF.orderBy(dataDF.last_name.desc()) # sort by last name in descending order\n```"],"metadata":{}},{"cell_type":"code","source":["# Get the five oldest people in the list. To do that, sort by age in descending order.\ndisplay(dataDF.orderBy(dataDF.age.desc()).take(5))"],"metadata":{},"outputs":[],"execution_count":79},{"cell_type":"markdown","source":["Let's reverse the sort order. Since ascending sort is the default, we can actually use a `Column` object expression or a simple string, in this case. The `desc()` and `asc()` methods are only defined on `Column`. Something like `orderBy('age'.desc())` would not work, because there's no `desc()` method on Python string objects. That's why we needed the column expression. But if we're just using the defaults, we can pass a string column name into `orderBy()`. This is sometimes easier to read."],"metadata":{}},{"cell_type":"code","source":["display(dataDF.orderBy('age').take(5))"],"metadata":{},"outputs":[],"execution_count":81},{"cell_type":"markdown","source":["### (6b) _distinct_ and _dropDuplicates_\n\n[`distinct()`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.distinct) filters out duplicate rows, and it considers all columns. Since our data is completely randomly generated (by `fake-factory`), it's extremely unlikely that there are any duplicate rows:"],"metadata":{}},{"cell_type":"code","source":["print dataDF.count()\nprint dataDF.distinct().count()"],"metadata":{},"outputs":[],"execution_count":83},{"cell_type":"markdown","source":["To demonstrate `distinct()`, let's create a quick throwaway dataset."],"metadata":{}},{"cell_type":"code","source":["tempDF = sqlContext.createDataFrame([(\"Joe\", 1), (\"Joe\", 1), (\"Anna\", 15), (\"Anna\", 12), (\"Ravi\", 5)], ('name', 'score'))"],"metadata":{},"outputs":[],"execution_count":85},{"cell_type":"code","source":["tempDF.show()"],"metadata":{},"outputs":[],"execution_count":86},{"cell_type":"code","source":["tempDF.distinct().show()"],"metadata":{},"outputs":[],"execution_count":87},{"cell_type":"markdown","source":["Note that one of the (\"Joe\", 1) rows was deleted, but both rows with name \"Anna\" were kept, because all columns in a row must match another row for it to be considered a duplicate."],"metadata":{}},{"cell_type":"markdown","source":["[`dropDuplicates()`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropDuplicates) is like `distinct()`, except that it allows us to specify the columns to compare. For instance, we can use it to drop all rows where the first name and last name duplicates (ignoring the occupation and age columns)."],"metadata":{}},{"cell_type":"code","source":["print dataDF.count()\nprint dataDF.dropDuplicates(['first_name', 'last_name']).count()"],"metadata":{},"outputs":[],"execution_count":90},{"cell_type":"markdown","source":["### (6c) _drop_\n\n[`drop()`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.drop) is like the opposite of `select()`: Instead of selecting specific columns from a DataFrame, it drops a specifed column from a DataFrame.\n\nHere's a simple use case: Suppose you're reading from a 1,000-column CSV file, and you have to get rid of five of the columns. Instead of selecting 995 of the columns, it's easier just to drop the five you don't want."],"metadata":{}},{"cell_type":"code","source":["dataDF.drop('occupation').drop('age').show()"],"metadata":{},"outputs":[],"execution_count":92},{"cell_type":"markdown","source":["### (6d) _groupBy_\n\n[`groupBy()`]((http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.groupBy) is one of the most powerful transformations. It allows you to perform aggregations on a DataFrame.\n\nUnlike other DataFrame transformations, `groupBy()` does _not_ return a DataFrame. Instead, it returns a special [GroupedData](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData) object that contains various aggregation functions.\n\nThe most commonly used aggregation function is [count()](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData.count),\nbut there are others (like [sum()](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData.sum), [max()](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData.max), and [avg()](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData.avg).\n\nThese aggregation functions typically create a new column and return a new DataFrame."],"metadata":{}},{"cell_type":"code","source":["dataDF.groupBy('occupation').count().show(truncate=False)"],"metadata":{},"outputs":[],"execution_count":94},{"cell_type":"code","source":["dataDF.groupBy().avg('age').show(truncate=False)"],"metadata":{},"outputs":[],"execution_count":95},{"cell_type":"markdown","source":["We can also use `groupBy()` to do aother useful aggregations:"],"metadata":{}},{"cell_type":"code","source":["print \"Maximum age: {0}\".format(dataDF.groupBy().max('age').first()[0])\nprint \"Minimum age: {0}\".format(dataDF.groupBy().min('age').first()[0])"],"metadata":{},"outputs":[],"execution_count":97},{"cell_type":"markdown","source":["### (6e) _sample_ (optional)\n\nWhen analyzing data, the [`sample()`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sample) transformation is often quite useful. It returns a new DataFrame with a random sample of elements from the dataset.  It takes in a `withReplacement` argument, which specifies whether it is okay to randomly pick the same item multiple times from the parent DataFrame (so when `withReplacement=True`, you can get the same item back multiple times). It takes in a `fraction` parameter, which specifies the fraction elements in the dataset you want to return. (So a `fraction` value of `0.20` returns 20% of the elements in the DataFrame.) It also takes an optional `seed` parameter that allows you to specify a seed value for the random number generator, so that reproducible results can be obtained."],"metadata":{}},{"cell_type":"code","source":["sampledDF = dataDF.sample(withReplacement=False, fraction=0.10)\nprint sampledDF.count()\nsampledDF.show()"],"metadata":{},"outputs":[],"execution_count":99},{"cell_type":"code","source":["print dataDF.sample(withReplacement=False, fraction=0.05).count()"],"metadata":{},"outputs":[],"execution_count":100},{"cell_type":"markdown","source":["## Part 7: Caching DataFrames and storage options"],"metadata":{}},{"cell_type":"markdown","source":["### (7a) Caching DataFrames\n\nFor efficiency Spark keeps your DataFrames in memory. (More formally, it keeps the _RDDs_ that implement your DataFrames in memory.) By keeping the contents in memory, Spark can quickly access the data. However, memory is limited, so if you try to keep too many partitions in memory, Spark will automatically delete partitions from memory to make space for new ones. If you later refer to one of the deleted partitions, Spark will automatically recreate it for you, but that takes time.\n\nSo, if you plan to use a DataFrame more than once, then you should tell Spark to cache it. You can use the `cache()` operation to keep the DataFrame in memory. However, you must still trigger an action on the DataFrame, such as `collect()` or `count()` before the caching will occur. In other words, `cache()` is lazy: It merely tells Spark that the DataFrame should be cached _when the data is materialized_. You have to run an action to materialize the data; the DataFrame will be cached as a side effect. The next time you use the DataFrame, Spark will use the cached data, rather than recomputing the DataFrame from the original data.\n\nYou can see your cached DataFrame in the \"Storage\" section of the Spark web UI. If you click on the name value, you can see more information about where the the DataFrame is stored."],"metadata":{}},{"cell_type":"code","source":["# Cache the DataFrame\nfilteredDF.cache()\n# Trigger an action\nprint filteredDF.count()\n# Check if it is cached\nprint filteredDF.is_cached"],"metadata":{},"outputs":[],"execution_count":103},{"cell_type":"markdown","source":["### (7b) Unpersist and storage options\n\nSpark automatically manages the partitions cached in memory. If it has more partitions than available memory, by default, it will evict older partitions to make room for new ones. For efficiency, once you are finished using cached DataFrame, you can optionally tell Spark to stop caching it in memory by using the DataFrame's `unpersist()` method to inform Spark that you no longer need the cached data.\n\n** Advanced: ** Spark provides many more options for managing how DataFrames cached. For instance, you can tell Spark to spill cached partitions to disk when it runs out of memory, instead of simply throwing old ones away. You can explore the API for DataFrame's [persist()](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.persist) operation using Python's [help()](https://docs.python.org/2/library/functions.html?highlight=help#help) command.  The `persist()` operation, optionally, takes a pySpark [StorageLevel](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.StorageLevel) object."],"metadata":{}},{"cell_type":"code","source":["# If we are done with the DataFrame we can unpersist it so that its memory can be reclaimed\nfilteredDF.unpersist()\n# Check if it is cached\nprint filteredDF.is_cached"],"metadata":{},"outputs":[],"execution_count":105},{"cell_type":"markdown","source":["## ** Part 8: Debugging Spark applications and lazy evaluation **"],"metadata":{}},{"cell_type":"markdown","source":["### How Python is Executed in Spark\n\nInternally, Spark executes using a Java Virtual Machine (JVM). pySpark runs Python code in a JVM using [Py4J](http://py4j.sourceforge.net). Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine. Methods are called as if the Java objects resided in the Python interpreter and Java collections can be accessed through standard Python collection methods. Py4J also enables Java programs to call back Python objects.\n\nBecause pySpark uses Py4J, coding errors often result in a complicated, confusing stack trace that can be difficult to understand. In the following section, we'll explore how to understand stack traces."],"metadata":{}},{"cell_type":"markdown","source":["### (8a) Challenges with lazy evaluation using transformations and actions\n\nSpark's use of lazy evaluation can make debugging more difficult because code is not always executed immediately. To see an example of how this can happen, let's first define a broken filter function.\nNext we perform a `filter()` operation using the broken filtering function.  No error will occur at this point due to Spark's use of lazy evaluation.\n\nThe `filter()` method will not be executed *until* an action operation is invoked on the DataFrame.  We will perform an action by using the `count()` method to return a list that contains all of the elements in this DataFrame."],"metadata":{}},{"cell_type":"code","source":["def brokenTen(value):\n    \"\"\"Incorrect implementation of the ten function.\n\n    Note:\n        The `if` statement checks an undefined variable `val` instead of `value`.\n\n    Args:\n        value (int): A number.\n\n    Returns:\n        bool: Whether `value` is less than ten.\n\n    Raises:\n        NameError: The function references `val`, which is not available in the local or global\n            namespace, so a `NameError` is raised.\n    \"\"\"\n    if (val < 10):\n        True\n    else:\n        False\n\nbtUDF = udf(brokenTen)\nbrokenDF = subDF.filter(btUDF(subDF.age)==True)"],"metadata":{},"outputs":[],"execution_count":109},{"cell_type":"code","source":["# Now we'll see the error\n# Click on the `+` button to expand the error and scroll through the message.\nbrokenDF.count()"],"metadata":{},"outputs":[],"execution_count":110},{"cell_type":"markdown","source":["### (8b) Finding the bug\n\nWhen the `filter()` method is executed, Spark calls the UDF. Since our UDF has an error in the underlying filtering function `brokenTen()`, an error occurs.\n\nScroll through the output \"Py4JJavaError     Traceback (most recent call last)\" part of the cell and first you will see that the line that generated the error is the `count()` method line. There is *nothing wrong with this line*. However, it is an action and that caused other methods to be executed. Continue scrolling through the Traceback and you will see the following error line:\n\n`NameError: global name 'val' is not defined`\n\nLooking at this error line, we can see that we used the wrong variable name in our filtering function `brokenTen()`."],"metadata":{}},{"cell_type":"markdown","source":["### (8c) Moving toward expert style\n\nAs you are learning Spark, I recommend that you write your code in the form:\n```\n    df2 = df1.transformation1()\n    df2.action1()\n    df3 = df2.transformation2()\n    df3.action2()\n```\nUsing this style will make debugging your code much easier as it makes errors easier to localize - errors in your transformations will occur when the next action is executed.\n\nOnce you become more experienced with Spark, you can write your code with the form: `df.transformation1().transformation2().action()`\n\nWe can also use `lambda()` functions instead of separately defined functions when their use improves readability and conciseness."],"metadata":{}},{"cell_type":"code","source":["# Cleaner code through lambda use\nmyUDF = udf(lambda v: v < 10)\nsubDF.filter(myUDF(subDF.age) == True)"],"metadata":{},"outputs":[],"execution_count":113},{"cell_type":"markdown","source":["### (8d) Readability and code style\n\nTo make the expert coding style more readable, enclose the statement in parentheses and put each method, transformation, or action on a separate line."],"metadata":{}},{"cell_type":"code","source":["# Final version\nfrom pyspark.sql.functions import *\n(dataDF\n .filter(dataDF.age > 20)\n .select(concat(dataDF.first_name, lit(' '), dataDF.last_name), dataDF.occupation)\n .show(truncate=False)\n )"],"metadata":{},"outputs":[],"execution_count":115},{"cell_type":"code","source":[""],"metadata":{},"outputs":[],"execution_count":116}],"metadata":{"name":"cs105_lab1a_spark_tutorial","notebookId":2123285665303917},"nbformat":4,"nbformat_minor":0}
2 | 


--------------------------------------------------------------------------------
/cs105_lab1b_word_count.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"cell_type":"markdown","source":["<a rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-nd/4.0/\"><img alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png\" /></a><br />This work is licensed under a <a rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-nd/4.0/\">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>."],"metadata":{}},{"cell_type":"markdown","source":["#![Spark Logo](http://spark-mooc.github.io/web-assets/images/ta_Spark-logo-small.png) + ![Python Logo](http://spark-mooc.github.io/web-assets/images/python-logo-master-v3-TM-flattened_small.png)\n# **Word Count Lab: Building a word count application**\n\nThis lab will build on the techniques covered in the Spark tutorial to develop a simple word count application.  The volume of unstructured text in existence is growing dramatically, and Spark is an excellent tool for analyzing this type of data.  In this lab, we will write code that calculates the most common words in the [Complete Works of William Shakespeare](http://www.gutenberg.org/ebooks/100) retrieved from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page).  This could also be scaled to larger applications, such as finding the most common words in Wikipedia.\n\n** During this lab we will cover: **\n* *Part 1:* Creating a base DataFrame and performing operations\n* *Part 2:* Counting with Spark SQL and DataFrames\n* *Part 3:* Finding unique words and a mean value\n* *Part 4:* Apply word count to a file\n\nNote that for reference, you can look up the details of the relevant methods in [Spark's Python API](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.sql)."],"metadata":{}},{"cell_type":"code","source":["labVersion = 'cs105x-word-count-df-0.1.0'"],"metadata":{},"outputs":[],"execution_count":3},{"cell_type":"markdown","source":["#### ** Part 1: Creating a base DataFrame and performing operations **"],"metadata":{}},{"cell_type":"markdown","source":["In this part of the lab, we will explore creating a base DataFrame with `sqlContext.createDataFrame` and using DataFrame operations to count words."],"metadata":{}},{"cell_type":"markdown","source":["** (1a) Create a DataFrame **\n\nWe'll start by generating a base DataFrame by using a Python list of tuples and the `sqlContext.createDataFrame` method.  Then we'll print out the type and schema of the DataFrame.  The Python API has several examples for using the [`createDataFrame` method](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame)."],"metadata":{}},{"cell_type":"code","source":["wordsDF = sqlContext.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat', )], ['word'])\nwordsDF.show()\nprint type(wordsDF)\nwordsDF.printSchema()"],"metadata":{},"outputs":[],"execution_count":7},{"cell_type":"markdown","source":["** (1b) Using DataFrame functions to add an 's' **\n\nLet's create a new DataFrame from `wordsDF` by performing an operation that adds an 's' to each word.  To do this, we'll call the [`select` DataFrame function](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.select) and pass in a column that has the recipe for adding an 's' to our existing column.  To generate this `Column` object you should use the [`concat` function](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.concat) found in the [`pyspark.sql.functions` module](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions).  Note that `concat` takes in two or more string columns and returns a single string column.  In order to pass in a constant or literal value like 's', you'll need to wrap that value with the [`lit` column function](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.lit).\n\nPlease replace `<FILL IN>` with your solution.  After you have created `pluralDF` you can run the next cell which contains two tests.  If you implementation is correct it will print `1 test passed` for each test.\n\nThis is the general form that exercises will take.  Exercises will include an explanation of what is expected, followed by code cells where one cell will have one or more `<FILL IN>` sections.  The cell that needs to be modified will have `# TODO: Replace <FILL IN> with appropriate code` on its first line.  Once the `<FILL IN>` sections are updated and the code is run, the test cell can then be run to verify the correctness of your solution.  The last code cell before the next markdown section will contain the tests.\n\n> Note:\n> Make sure that the resulting DataFrame has one column which is named 'word'."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\nfrom pyspark.sql.functions import lit, concat\n\npluralDF = wordsDF.select(concat(wordsDF['word'],lit('s')).alias('word'))\npluralDF.show()"],"metadata":{},"outputs":[],"execution_count":9},{"cell_type":"code","source":["# Load in the testing code and check to see if your answer is correct\n# If incorrect it will report back '1 test failed' for each failed test\n# Make sure to rerun any cell you change before trying the test again\nfrom databricks_test_helper import Test\n# TEST Using DataFrame functions to add an 's' (1b)\nTest.assertEquals(pluralDF.first()[0], 'cats', 'incorrect result: you need to add an s')\nTest.assertEquals(pluralDF.columns, ['word'], \"there should be one column named 'word'\")"],"metadata":{},"outputs":[],"execution_count":10},{"cell_type":"markdown","source":["** (1c) Length of each word **\n\nNow use the SQL `length` function to find the number of characters in each word.  The [`length` function](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.length) is found in the `pyspark.sql.functions` module."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\nfrom pyspark.sql.functions import length\npluralLengthsDF = pluralDF.select(length(pluralDF['word']))\npluralLengthsDF.show()"],"metadata":{},"outputs":[],"execution_count":12},{"cell_type":"code","source":["# TEST Length of each word (1e)\nfrom collections import Iterable\nasSelf = lambda v: map(lambda r: r[0] if isinstance(r, Iterable) and len(r) == 1 else r, v)\n\nTest.assertEquals(asSelf(pluralLengthsDF.collect()), [4, 9, 4, 4, 4],\n                  'incorrect values for pluralLengths')\n"],"metadata":{},"outputs":[],"execution_count":13},{"cell_type":"markdown","source":["#### ** Part 2: Counting with Spark SQL and DataFrames **"],"metadata":{}},{"cell_type":"markdown","source":["Now, let's count the number of times a particular word appears in the 'word' column. There are multiple ways to perform the counting, but some are much less efficient than others.\n\nA naive approach would be to call `collect` on all of the elements and count them in the driver program. While this approach could work for small datasets, we want an approach that will work for any size dataset including terabyte- or petabyte-sized datasets. In addition, performing all of the work in the driver program is slower than performing it in parallel in the workers. For these reasons, we will use data parallel operations."],"metadata":{}},{"cell_type":"markdown","source":["** (2a) Using `groupBy` and `count` **\n\nUsing DataFrames, we can preform aggregations by grouping the data using the [`groupBy` function](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.groupBy) on the DataFrame.  Using `groupBy` returns a [`GroupedData` object](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData) and we can use the functions available for `GroupedData` to aggregate the groups.  For example, we can call `avg` or `count` on a `GroupedData` object to obtain the average of the values in the groups or the number of occurrences in the groups, respectively.\n\nTo find the counts of words, group by the words and then use the [`count` function](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData.count) to find the number of times that words occur."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\nwordCountsDF = (wordsDF\n                .groupBy('word')\n                .count())\nwordCountsDF.show()"],"metadata":{},"outputs":[],"execution_count":17},{"cell_type":"code","source":["# TEST groupBy and count (2a)\nTest.assertEquals(wordCountsDF.collect(), [('cat', 2), ('rat', 2), ('elephant', 1)],\n                 'incorrect counts for wordCountsDF')"],"metadata":{},"outputs":[],"execution_count":18},{"cell_type":"markdown","source":["#### ** Part 3: Finding unique words and a mean value **"],"metadata":{}},{"cell_type":"markdown","source":["** (3a) Unique words **\n\nCalculate the number of unique words in `wordsDF`.  You can use other DataFrames that you have already created to make this easier."],"metadata":{}},{"cell_type":"code","source":["from spark_notebook_helpers import printDataFrames\n\n#This function returns all the DataFrames in the notebook and their corresponding column names.\nprintDataFrames(True)"],"metadata":{},"outputs":[],"execution_count":21},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\nuniqueWordsCount = wordCountsDF.count()\nprint uniqueWordsCount"],"metadata":{},"outputs":[],"execution_count":22},{"cell_type":"code","source":["# TEST Unique words (3a)\nTest.assertEquals(uniqueWordsCount, 3, 'incorrect count of unique words')"],"metadata":{},"outputs":[],"execution_count":23},{"cell_type":"markdown","source":["** (3b) Means of groups using DataFrames **\n\nFind the mean number of occurrences of words in `wordCountsDF`.\n\nYou should use the [`mean` GroupedData method](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData.mean) to accomplish this.  Note that when you use `groupBy` you don't need to pass in any columns.  A call without columns just prepares the DataFrame so that aggregation functions like `mean` can be applied."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\naverageCount = 1.666666\n\nprint averageCount"],"metadata":{},"outputs":[],"execution_count":25},{"cell_type":"code","source":["# TEST Means of groups using DataFrames (3b)\nTest.assertEquals(round(averageCount, 2), 1.67, 'incorrect value of averageCount')"],"metadata":{},"outputs":[],"execution_count":26},{"cell_type":"markdown","source":["#### ** Part 4: Apply word count to a file **"],"metadata":{}},{"cell_type":"markdown","source":["In this section we will finish developing our word count application.  We'll have to build the `wordCount` function, deal with real world problems like capitalization and punctuation, load in our data source, and compute the word count on the new data."],"metadata":{}},{"cell_type":"markdown","source":["** (4a) The `wordCount` function **\n\nFirst, define a function for word counting.  You should reuse the techniques that have been covered in earlier parts of this lab.  This function should take in a DataFrame that is a list of words like `wordsDF` and return a DataFrame that has all of the words and their associated counts."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\ndef wordCount(wordListDF):\n    \"\"\"Creates a DataFrame with word counts.\n\n    Args:\n        wordListDF (DataFrame of str): A DataFrame consisting of one string column called 'word'.\n\n    Returns:\n        DataFrame of (str, int): A DataFrame containing 'word' and 'count' columns.\n    \"\"\"\n    return wordListDF.groupBy('word').count()\n\nwordCount(wordsDF).show()"],"metadata":{},"outputs":[],"execution_count":30},{"cell_type":"code","source":["# TEST wordCount function (4a)\nTest.assertEquals(sorted(wordCount(wordsDF).collect()),\n                  [('cat', 2), ('elephant', 1), ('rat', 2)],\n                  'incorrect definition for wordCountDF function')"],"metadata":{},"outputs":[],"execution_count":31},{"cell_type":"markdown","source":["** (4b) Capitalization and punctuation **\n\nReal world files are more complicated than the data we have been using in this lab. Some of the issues we have to address are:\n  + Words should be counted independent of their capitialization (e.g., Spark and spark should be counted as the same word).\n  + All punctuation should be removed.\n  + Any leading or trailing spaces on a line should be removed.\n\nDefine the function `removePunctuation` that converts all text to lower case, removes any punctuation, and removes leading and trailing spaces.  Use the Python [regexp_replace](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.regexp_replace) module to remove any text that is not a letter, number, or space. If you are unfamiliar with regular expressions, you may want to review [this tutorial](https://developers.google.com/edu/python/regular-expressions) from Google.  Also, [this website](https://regex101.com/#python) is  a great resource for debugging your regular expression.\n\nYou should also use the `trim` and `lower` functions found in [pyspark.sql.functions](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions).\n\n> Note that you shouldn't use any RDD operations or need to create custom user defined functions (udfs) to accomplish this task"],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\nfrom pyspark.sql.functions import regexp_replace, trim, col, lower\ndef removePunctuation(column):\n    \"\"\"Removes punctuation, changes to lower case, and strips leading and trailing spaces.\n\n    Note:\n        Only spaces, letters, and numbers should be retained.  Other characters should should be\n        eliminated (e.g. it's becomes its).  Leading and trailing spaces should be removed after\n        punctuation is removed.\n\n    Args:\n        column (Column): A Column containing a sentence.\n\n    Returns:\n        Column: A Column named 'sentence' with clean-up operations applied.\n    \"\"\"\n    return lower(trim(regexp_replace(column,r'[^\\w\\s]*',''))).alias('sentence')\n\nsentenceDF = sqlContext.createDataFrame([('Hi, you!',),\n                                         (' No under_score!',),\n                                         (' *      Remove punctuation then spaces  * ',)], ['sentence'])\nsentenceDF.show(truncate=False)\n(sentenceDF\n .select(removePunctuation(col('sentence')))\n .show(truncate=False))"],"metadata":{},"outputs":[],"execution_count":33},{"cell_type":"code","source":["# TEST Capitalization and punctuation (4b)\ntestPunctDF = sqlContext.createDataFrame([(\" The Elephant's 4 cats. \",)])\nTest.assertEquals(testPunctDF.select(removePunctuation(col('_1'))).first()[0],\n                  'the elephants 4 cats',\n                  'incorrect definition for removePunctuation function')"],"metadata":{},"outputs":[],"execution_count":34},{"cell_type":"markdown","source":["** (4c) Load a text file **\n\nFor the next part of this lab, we will use the [Complete Works of William Shakespeare](http://www.gutenberg.org/ebooks/100) from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page). To convert a text file into a DataFrame, we use the `sqlContext.read.text()` method. We also apply the recently defined `removePunctuation()` function using a `select()` transformation to strip out the punctuation and change all text to lower case.  Since the file is large we use `show(15)`, so that we only print 15 lines."],"metadata":{}},{"cell_type":"code","source":["fileName = \"dbfs:/databricks-datasets/cs100/lab1/data-001/shakespeare.txt\"\n\nshakespeareDF = sqlContext.read.text(fileName).select(removePunctuation(col('value')))\nshakespeareDF.show(15, truncate=False)"],"metadata":{},"outputs":[],"execution_count":36},{"cell_type":"markdown","source":["** (4d) Words from lines **\n\nBefore we can use the `wordcount()` function, we have to address two issues with the format of the DataFrame:\n  + The first issue is that  that we need to split each line by its spaces.\n  + The second issue is we need to filter out empty lines or words.\n\nApply a transformation that will split each 'sentence' in the DataFrame by its spaces, and then transform from a DataFrame that contains lists of words into a DataFrame with each word in its own row.  To accomplish these two tasks you can use the `split` and `explode` functions found in [pyspark.sql.functions](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions).\n\nOnce you have a DataFrame with one word per row you can apply the [DataFrame operation `where`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.where) to remove the rows that contain ''.\n\n> Note that `shakeWordsDF` should be a DataFrame with one column named `word`."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\nfrom pyspark.sql.functions import split, explode\nshakeWordsDF = (shakespeareDF\n                .select(explode(split(col('sentence'),' ')).alias('word')))\nshakeWordsDF = shakeWordsDF.where(shakeWordsDF.word!='')\nshakeWordsDF.show()\nshakeWordsDFCount = shakeWordsDF.count()\nprint shakeWordsDFCount"],"metadata":{},"outputs":[],"execution_count":38},{"cell_type":"code","source":["# TEST Remove empty elements (4d)\nTest.assertEquals(shakeWordsDF.count(), 882996, 'incorrect value for shakeWordCount')\nTest.assertEquals(shakeWordsDF.columns, ['word'], \"shakeWordsDF should only contain the Column 'word'\")"],"metadata":{},"outputs":[],"execution_count":39},{"cell_type":"markdown","source":["** (4e) Count the words **\n\nWe now have a DataFrame that is only words.  Next, let's apply the `wordCount()` function to produce a list of word counts. We can view the first 20 words by using the `show()` action; however, we'd like to see the words in descending order of count, so we'll need to apply the [`orderBy` DataFrame method](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.orderBy) to first sort the DataFrame that is returned from `wordCount()`.\n\nYou'll notice that many of the words are common English words. These are called stopwords. In a later lab, we will see how to eliminate them from the results."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\nfrom pyspark.sql.functions import desc\ntopWordsAndCountsDF = wordCount(shakeWordsDF).orderBy('count',ascending = False)\ntopWordsAndCountsDF.show()"],"metadata":{},"outputs":[],"execution_count":41},{"cell_type":"code","source":["# TEST Count the words (4e)\nTest.assertEquals(topWordsAndCountsDF.take(15),\n                  [(u'the', 27361), (u'and', 26028), (u'i', 20681), (u'to', 19150), (u'of', 17463),\n                   (u'a', 14593), (u'you', 13615), (u'my', 12481), (u'in', 10956), (u'that', 10890),\n                   (u'is', 9134), (u'not', 8497), (u'with', 7771), (u'me', 7769), (u'it', 7678)],\n                  'incorrect value for top15WordsAndCountsDF')"],"metadata":{},"outputs":[],"execution_count":42},{"cell_type":"markdown","source":["#### ** Prepare to the course autograder **\nOnce you confirm that your lab notebook is passing all tests, you can submit it first to the course autograder and then second to the edX website to receive a grade.\n<img src=\"https://d37djvu3ytnwxt.cloudfront.net/asset-v1:BerkeleyX+CS105x+1T2016+type@asset+block/url-process.png\" alt=\"Drawing\" />\n\n** Note that you can only submit to the course autograder once every 1 minute. **"],"metadata":{}},{"cell_type":"markdown","source":["** (a) Restart your cluster by clicking on the dropdown next to your cluster name and selecting \"Restart Cluster\".**\n\nYou can do this step in either notebook, since there is one cluster for your notebooks.\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/submit_restart.png\" alt=\"Drawing\" />"],"metadata":{}},{"cell_type":"markdown","source":["** (b) _IN THIS NOTEBOOK_, click on \"Run All\" to run all of the cells. **\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/submit_runall.png\" alt=\"Drawing\" />\n\nThis step will take some time. While the cluster is running all the cells in your lab notebook, you will see the \"Stop Execution\" button.\n\n <img src=\"http://spark-mooc.github.io/web-assets/images/stop_execution.png\" alt=\"Drawing\" />\n\nWait for your cluster to finish running the cells in your lab notebook before proceeding."],"metadata":{}},{"cell_type":"markdown","source":["** (c) Verify that your LAB notebook passes as many tests as you can. **\n\nMost computations should complete within a few seconds unless stated otherwise. As soon as the expression of a cell have been successfully evaluated, you will see one or more \"test passed\" messages if the cell includes test expressions:\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/test_passed.png\" alt=\"Drawing\" />\n\nor just execution time otherwise:\n  <img src=\"http://spark-mooc.github.io/web-assets/images/execution_time.png\" alt=\"Drawing\" />"],"metadata":{}},{"cell_type":"markdown","source":["** (d) Publish your LAB notebook(this notebook) by clicking on the \"Publish\" button at the top of your LAB notebook. **\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/Lab0_Publish0.png\" alt=\"Drawing\" />\n\nWhen you click on the button, you will see the following popup.\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/Lab0_Publish1.png\" alt=\"Drawing\" />\n\nWhen you click on \"Publish\", you will see a popup with your notebook's public link. __Copy the link and set the notebook_URL variable in the AUTOGRADER notebook(not this notebook).__\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/Lab0_Publish2.png\" alt=\"Drawing\" />"],"metadata":{}}],"metadata":{"name":"cs105_lab1b_word_count","notebookId":731827462892792},"nbformat":4,"nbformat_minor":0}
2 | 


--------------------------------------------------------------------------------
/cs110_lab1_power_plant_ml_pipeline.html:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MiyainNYC/Distributed-Machine-Learning/15954ae4fc8149fdcc0b9e1007794b0ed3af546a/cs110_lab1_power_plant_ml_pipeline.html


--------------------------------------------------------------------------------
/cs110_lab1_power_plant_ml_pipeline.ipynb:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MiyainNYC/Distributed-Machine-Learning/15954ae4fc8149fdcc0b9e1007794b0ed3af546a/cs110_lab1_power_plant_ml_pipeline.ipynb


--------------------------------------------------------------------------------
/cs110_lab2_als_prediction.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"cell_type":"markdown","source":["<a rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-nd/4.0/\"> <img alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png\"/> </a> <br/> This work is licensed under a <a rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-nd/4.0/\"> Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. </a>"],"metadata":{}},{"cell_type":"markdown","source":["#![Spark Logo](http://spark-mooc.github.io/web-assets/images/ta_Spark-logo-small.png) + ![Python Logo](http://spark-mooc.github.io/web-assets/images/python-logo-master-v3-TM-flattened_small.png)\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/cs110x/movie-camera.png\" style=\"float:right; height: 200px; margin: 10px; border: 1px solid #ddd; border-radius: 15px 15px 15px 15px; padding: 10px\"/>\n\n# Predicting Movie Ratings\n\nOne of the most common uses of big data is to predict what users want.  This allows Google to show you relevant ads, Amazon to recommend relevant products, and Netflix to recommend movies that you might like.  This lab will demonstrate how we can use Apache Spark to recommend movies to a user.  We will start with some basic techniques, and then use the [Spark ML][sparkml] library's Alternating Least Squares method to make more sophisticated predictions.\n\nFor this lab, we will use a subset dataset of 20 million ratings. This dataset is pre-mounted on Databricks and is from the [MovieLens stable benchmark rating dataset](http://grouplens.org/datasets/movielens/). However, the same code you write will also work on the full dataset (though running with the full dataset on Community Edition is likely to take quite a long time).\n\nIn this lab:\n* *Part 0*: Preliminaries\n* *Part 1*: Basic Recommendations\n* *Part 2*: Collaborative Filtering\n* *Part 3*: Predictions for Yourself\n\nAs mentioned during the first Learning Spark lab, think carefully before calling `collect()` on any datasets.  When you are using a small dataset, calling `collect()` and then using Python to get a sense for the data locally (in the driver program) will work fine, but this will not work when you are using a large dataset that doesn't fit in memory on one machine.  Solutions that call `collect()` and do local analysis that could have been done with Spark will likely fail in the autograder and not receive full credit.\n[sparkml]: https://spark.apache.org/docs/1.6.2/api/python/pyspark.ml.html"],"metadata":{}},{"cell_type":"code","source":["labVersion = 'cs110x.lab2-1.0.0'"],"metadata":{},"outputs":[],"execution_count":3},{"cell_type":"markdown","source":["## Code\n\nThis assignment can be completed using basic Python and pySpark DataFrame Transformations and Actions.  Libraries other than math are not necessary. With the exception of the ML functions that we introduce in this assignment, you should be able to complete all parts of this homework using only the Spark functions you have used in prior lab exercises (although you are welcome to use more features of Spark if you like!).\n\nWe'll be using motion picture data, the same data last year's CS100.1x used. However, in this course, we're using DataFrames, rather than RDDs.\n\nThe following cell defines the locations of the data files. If you want to run an exported version of this lab on your own machine (i.e., outside of Databricks), you'll need to download your own copy of the 20-million movie data set, and you'll need to adjust the paths, below.\n\n**To Do**: Run the following cell."],"metadata":{}},{"cell_type":"code","source":["import os\nfrom databricks_test_helper import Test\n\ndbfs_dir = '/databricks-datasets/cs110x/ml-20m/data-001'\nratings_filename = dbfs_dir + '/ratings.csv'\nmovies_filename = dbfs_dir + '/movies.csv'\n\n# The following line is here to enable this notebook to be exported as source and\n# run on a local machine with a local copy of the files. Just change the dbfs_dir,\n# above.\nif os.path.sep != '/':\n  # Handle Windows.\n  ratings_filename = ratings_filename.replace('/', os.path.sep)\n  movie_filename = movie_filename.replace('/', os.path.sep)"],"metadata":{},"outputs":[],"execution_count":5},{"cell_type":"markdown","source":["## Part 0: Preliminaries\n\nWe read in each of the files and create a DataFrame consisting of parsed lines.\n\n### The 20-million movie sample\n\nThe 20-million movie sample consists of CSV files (with headers), so there's no need to parse the files manually, as Spark CSV can do the job."],"metadata":{}},{"cell_type":"markdown","source":["First, let's take a look at the directory containing our files."],"metadata":{}},{"cell_type":"code","source":["display(dbutils.fs.ls(dbfs_dir)) "],"metadata":{},"outputs":[],"execution_count":8},{"cell_type":"markdown","source":["### CPU vs I/O tradeoff\n\nNote that we have both compressed files (ending in `.gz`) and uncompressed files. We have a CPU vs. I/O tradeoff here. If I/O is the bottleneck, then we want to process the compressed files and pay the extra CPU overhead. If CPU is the bottleneck, then it makes more sense to process the uncompressed files.\n\nWe've done some experiments, and we've determined that CPU is more of a bottleneck than I/O, on Community Edition. So, we're going to process the uncompressed data. In addition, we're going to speed things up further by specifying the DataFrame schema explicitly. (When the Spark CSV adapter infers the schema from a CSV file, it has to make an extra pass over the file. That'll slow things down here, and it isn't really necessary.)\n\n**To Do**: Run the following cell, which will define the schemas."],"metadata":{}},{"cell_type":"code","source":["from pyspark.sql.types import *\n\nratings_df_schema = StructType(\n  [StructField('userId', IntegerType()),\n   StructField('movieId', IntegerType()),\n   StructField('rating', DoubleType())]\n)\nmovies_df_schema = StructType(\n  [StructField('ID', IntegerType()),\n   StructField('title', StringType())]\n)"],"metadata":{},"outputs":[],"execution_count":10},{"cell_type":"markdown","source":["### Load and Cache\n\nThe Databricks File System (DBFS) sits on top of S3. We're going to be accessing this data a lot. Rather than read it over and over again from S3, we'll cache both\nthe movies DataFrame and the ratings DataFrame in memory.\n\n**To Do**: Run the following cell to load and cache the data. Please be patient: The code takes about 30 seconds to run."],"metadata":{}},{"cell_type":"code","source":["from pyspark.sql.functions import regexp_extract\nfrom pyspark.sql.types import *\n\nraw_ratings_df = sqlContext.read.format('com.databricks.spark.csv').options(header=True, inferSchema=False).schema(ratings_df_schema).load(ratings_filename)\nratings_df = raw_ratings_df.drop('Timestamp')\n\nraw_movies_df = sqlContext.read.format('com.databricks.spark.csv').options(header=True, inferSchema=False).schema(movies_df_schema).load(movies_filename)\nmovies_df = raw_movies_df.drop('Genres').withColumnRenamed('movieId', 'ID')\n\nratings_df.cache()\nmovies_df.cache()\n\nassert ratings_df.is_cached\nassert movies_df.is_cached\n\nraw_ratings_count = raw_ratings_df.count()\nratings_count = ratings_df.count()\nraw_movies_count = raw_movies_df.count()\nmovies_count = movies_df.count()\n\nprint 'There are %s ratings and %s movies in the datasets' % (ratings_count, movies_count)\nprint 'Ratings:'\nratings_df.show(3)\nprint 'Movies:'\nmovies_df.show(3, truncate=False)\n\nassert raw_ratings_count == ratings_count\nassert raw_movies_count == movies_count"],"metadata":{},"outputs":[],"execution_count":12},{"cell_type":"markdown","source":["Next, let's do a quick verification of the data.\n\n**To do**: Run the following cell. It should run without errors."],"metadata":{}},{"cell_type":"code","source":["assert ratings_count == 20000263\nassert movies_count == 27278\nassert movies_df.filter(movies_df.title == 'Toy Story (1995)').count() == 1\nassert ratings_df.filter((ratings_df.userId == 6) & (ratings_df.movieId == 1) & (ratings_df.rating == 5.0)).count() == 1"],"metadata":{},"outputs":[],"execution_count":14},{"cell_type":"markdown","source":["Let's take a quick look at some of the data in the two DataFrames.\n\n**To Do**: Run the following two cells."],"metadata":{}},{"cell_type":"code","source":["display(movies_df)"],"metadata":{},"outputs":[],"execution_count":16},{"cell_type":"code","source":["display(ratings_df)"],"metadata":{},"outputs":[],"execution_count":17},{"cell_type":"markdown","source":["## Part 1: Basic Recommendations\n\nOne way to recommend movies is to always recommend the movies with the highest average rating. In this part, we will use Spark to find the name, number of ratings, and the average rating of the 20 movies with the highest average rating and at least 500 reviews. We want to filter our movies with high ratings but greater than or equal to 500 reviews because movies with few reviews may not have broad appeal to everyone."],"metadata":{}},{"cell_type":"markdown","source":["### (1a) Movies with Highest Average Ratings\n\nLet's determine the movies with the highest average ratings.\n\nThe steps you should perform are:\n\n1. Recall that the `ratings_df` contains three columns:\n    - The ID of the user who rated the film\n    - the ID of the movie being rated\n    - and the rating.\n\n   First, transform `ratings_df` into a second DataFrame, `movie_ids_with_avg_ratings`, with the following columns:\n    - The movie ID\n    - The number of ratings for the movie\n    - The average of all the movie's ratings\n\n2. Transform `movie_ids_with_avg_ratings` to another DataFrame, `movie_names_with_avg_ratings_df` that adds the movie name to each row. `movie_names_with_avg_ratings_df`\n   will contain these columns:\n    - The movie ID\n    - The movie name\n    - The number of ratings for the movie\n    - The average of all the movie's ratings\n\n   **Hint**: You'll need to do a join.\n\nYou should end up with something like the following:\n```\nmovie_ids_with_avg_ratings_df:\n+-------+-----+------------------+\n|movieId|count|average           |\n+-------+-----+------------------+\n|1831   |7463 |2.5785207021305103|\n|431    |8946 |3.695059244355019 |\n|631    |2193 |2.7273141814865483|\n+-------+-----+------------------+\nonly showing top 3 rows\n\nmovie_names_with_avg_ratings_df:\n+-------+-----------------------------+-----+-------+\n|average|title                        |count|movieId|\n+-------+-----------------------------+-----+-------+\n|5.0    |Ella Lola, a la Trilby (1898)|1    |94431  |\n|5.0    |Serving Life (2011)          |1    |129034 |\n|5.0    |Diplomatic Immunity (2009? ) |1    |107434 |\n+-------+-----------------------------+-----+-------+\nonly showing top 3 rows\n```"],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL_IN> with appropriate code\nfrom pyspark.sql import functions as F\n# From ratingsDF, create a movie_ids_with_avg_ratings_df that combines the two DataFrames\nmovie_ids_with_avg_ratings_df = ratings_df.groupBy('movieId').agg(F.count(ratings_df.rating).alias(\"count\"), F.avg(ratings_df.rating).alias(\"average\"))\nprint 'movie_ids_with_avg_ratings_df:'\nmovie_ids_with_avg_ratings_df.show(3, truncate=False)\n\n# Note: movie_names_df is a temporary variable, used only to separate the steps necessary\n# to create the movie_names_with_avg_ratings_df DataFrame.\nmovie_names_df = movie_ids_with_avg_ratings_df.select('*')\nmovie_names_with_avg_ratings_df = movie_names_df.join(movies_df,movies_df.ID==movie_names_df.movieId).drop('ID')\n\nprint 'movie_names_with_avg_ratings_df:'\nmovie_names_with_avg_ratings_df.show(3, truncate=False)"],"metadata":{},"outputs":[],"execution_count":20},{"cell_type":"code","source":["# TEST Movies with Highest Average Ratings (1a)\nTest.assertEquals(movie_ids_with_avg_ratings_df.count(), 26744,\n                'incorrect movie_ids_with_avg_ratings_df.count() (expected 26744)')\nmovie_ids_with_ratings_take_ordered = movie_ids_with_avg_ratings_df.orderBy('MovieID').take(3)\n_take_0 = movie_ids_with_ratings_take_ordered[0]\n_take_1 = movie_ids_with_ratings_take_ordered[1]\n_take_2 = movie_ids_with_ratings_take_ordered[2]\nTest.assertTrue(_take_0[0] == 1 and _take_0[1] == 49695,\n                'incorrect count of ratings for movie with ID {0} (expected 49695)'.format(_take_0[0]))\nTest.assertEquals(round(_take_0[2], 2), 3.92, \"Incorrect average for movie ID {0}. Expected 3.92\".format(_take_0[0]))\n\nTest.assertTrue(_take_1[0] == 2 and _take_1[1] == 22243,\n                'incorrect count of ratings for movie with ID {0} (expected 22243)'.format(_take_1[0]))\nTest.assertEquals(round(_take_1[2], 2), 3.21, \"Incorrect average for movie ID {0}. Expected 3.21\".format(_take_1[0]))\n\nTest.assertTrue(_take_2[0] == 3 and _take_2[1] == 12735,\n                'incorrect count of ratings for movie with ID {0} (expected 12735)'.format(_take_2[0]))\nTest.assertEquals(round(_take_2[2], 2), 3.15, \"Incorrect average for movie ID {0}. Expected 3.15\".format(_take_2[0]))\n\n\nTest.assertEquals(movie_names_with_avg_ratings_df.count(), 26744,\n                  'incorrect movie_names_with_avg_ratings_df.count() (expected 26744)')\nmovie_names_with_ratings_take_ordered = movie_names_with_avg_ratings_df.orderBy(['average', 'title']).take(3)\nresult = [(r['average'], r['title'], r['count'], r['movieId']) for r in movie_names_with_ratings_take_ordered]\nTest.assertEquals(result,\n                  [(0.5, u'13 Fighting Men (1960)', 1, 109355),\n                   (0.5, u'20 Years After (2008)', 1, 131062),\n                   (0.5, u'3 Holiday Tails (Golden Christmas 2: The Second Tail, A) (2011)', 1, 111040)],\n                  'incorrect top 3 entries in movie_names_with_avg_ratings_df')"],"metadata":{},"outputs":[],"execution_count":21},{"cell_type":"markdown","source":["### (1b) Movies with Highest Average Ratings and at least 500 reviews\n\nNow that we have a DataFrame of the movies with highest average ratings, we can use Spark to determine the 20 movies with highest average ratings and at least 500 reviews.\n\nAdd a single DataFrame transformation (in place of `<FILL_IN>`, below) to limit the results to movies with ratings from at least 500 people."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\nmovies_with_500_ratings_or_more = movie_names_with_avg_ratings_df.where(\"count>=500\")\nprint 'Movies with highest ratings:'\nmovies_with_500_ratings_or_more.show(20, truncate=False)"],"metadata":{},"outputs":[],"execution_count":23},{"cell_type":"code","source":["# TEST Movies with Highest Average Ratings and at least 500 Reviews (1b)\n\nTest.assertEquals(movies_with_500_ratings_or_more.count(), 4489,\n                  'incorrect movies_with_500_ratings_or_more.count(). Expected 4489.')\ntop_20_results = [(r['average'], r['title'], r['count']) for r in movies_with_500_ratings_or_more.orderBy(F.desc('average')).take(20)]\n\nTest.assertEquals(top_20_results,\n                  [(4.446990499637029, u'Shawshank Redemption, The (1994)', 63366),\n                   (4.364732196832306, u'Godfather, The (1972)', 41355),\n                   (4.334372207803259, u'Usual Suspects, The (1995)', 47006),\n                   (4.310175010988133, u\"Schindler's List (1993)\", 50054),\n                   (4.275640557704942, u'Godfather: Part II, The (1974)', 27398),\n                   (4.2741796572216, u'Seven Samurai (Shichinin no samurai) (1954)', 11611),\n                   (4.271333600779414, u'Rear Window (1954)', 17449),\n                   (4.263182346109176, u'Band of Brothers (2001)', 4305),\n                   (4.258326830670664, u'Casablanca (1942)', 24349),\n                   (4.256934865900383, u'Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)', 6525),\n                   (4.24807897901911, u\"One Flew Over the Cuckoo's Nest (1975)\", 29932),\n                   (4.247286821705426, u'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)', 23220),\n                   (4.246001523229246, u'Third Man, The (1949)', 6565),\n                   (4.235410064157069, u'City of God (Cidade de Deus) (2002)', 12937),\n                   (4.2347902097902095, u'Lives of Others, The (Das leben der Anderen) (2006)', 5720),\n                   (4.233538107122288, u'North by Northwest (1959)', 15627),\n                   (4.2326233183856505, u'Paths of Glory (1957)', 3568),\n                   (4.227123123722136, u'Fight Club (1999)', 40106),\n                   (4.224281931146873, u'Double Indemnity (1944)', 4909),\n                   (4.224137931034483, u'12 Angry Men (1957)', 12934)],\n                  'Incorrect top 20 movies with 500 or more ratings')"],"metadata":{},"outputs":[],"execution_count":24},{"cell_type":"markdown","source":["Using a threshold on the number of reviews is one way to improve the recommendations, but there are many other good ways to improve quality. For example, you could weight ratings by the number of ratings."],"metadata":{}},{"cell_type":"markdown","source":["## Part 2: Collaborative Filtering\nIn this course, you have learned about many of the basic transformations and actions that Spark allows us to apply to distributed datasets.  Spark also exposes some higher level functionality; in particular, Machine Learning using a component of Spark called [MLlib][mllib].  In this part, you will learn how to use MLlib to make personalized movie recommendations using the movie data we have been analyzing.\n\n<img src=\"https://courses.edx.org/c4x/BerkeleyX/CS100.1x/asset/Collaborative_filtering.gif\" alt=\"collaborative filtering\" style=\"float: right\"/>\n\nWe are going to use a technique called [collaborative filtering][collab]. Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B's opinion on a different issue x than to have the opinion on x of a person chosen randomly. You can read more about collaborative filtering [here][collab2].\n\nThe image at the right (from [Wikipedia][collab]) shows an example of predicting of the user's rating using collaborative filtering. At first, people rate different items (like videos, images, games). After that, the system is making predictions about a user's rating for an item, which the user has not rated yet. These predictions are built upon the existing ratings of other users, who have similar ratings with the active user. For instance, in the image below the system has made a prediction, that the active user will not like the video.\n\n<br clear=\"all\"/>\n\n----\n\nFor movie recommendations, we start with a matrix whose entries are movie ratings by users (shown in red in the diagram below).  Each column represents a user (shown in green) and each row represents a particular movie (shown in blue).\n\nSince not all users have rated all movies, we do not know all of the entries in this matrix, which is precisely why we need collaborative filtering.  For each user, we have ratings for only a subset of the movies.  With collaborative filtering, the idea is to approximate the ratings matrix by factorizing it as the product of two matrices: one that describes properties of each user (shown in green), and one that describes properties of each movie (shown in blue).\n\n<img alt=\"factorization\" src=\"http://spark-mooc.github.io/web-assets/images/matrix_factorization.png\" style=\"width: 885px\"/>\n<br clear=\"all\"/>\n\nWe want to select these two matrices such that the error for the users/movie pairs where we know the correct ratings is minimized.  The [Alternating Least Squares][als] algorithm does this by first randomly filling the users matrix with values and then optimizing the value of the movies such that the error is minimized.  Then, it holds the movies matrix constant and optimizes the value of the user's matrix.  This alternation between which matrix to optimize is the reason for the \"alternating\" in the name.\n\nThis optimization is what's being shown on the right in the image above.  Given a fixed set of user factors (i.e., values in the users matrix), we use the known ratings to find the best values for the movie factors using the optimization written at the bottom of the figure.  Then we \"alternate\" and pick the best user factors given fixed movie factors.\n\nFor a simple example of what the users and movies matrices might look like, check out the [videos from Lecture 2][videos] or the [slides from Lecture 8][slides]\n[videos]: https://courses.edx.org/courses/course-v1:BerkeleyX+CS110x+2T2016/courseware/9d251397874d4f0b947b606c81ccf83c/3cf61a8718fe4ad5afcd8fb35ceabb6e/\n[slides]: https://d37djvu3ytnwxt.cloudfront.net/assets/courseware/v1/fb269ff9a53b669a46d59e154b876d78/asset-v1:BerkeleyX+CS110x+2T2016+type@asset+block/Lecture2s.pdf\n[als]: https://en.wikiversity.org/wiki/Least-Squares_Method\n[mllib]: http://spark.apache.org/docs/1.6.2/mllib-guide.html\n[collab]: https://en.wikipedia.org/?title=Collaborative_filtering\n[collab2]: http://recommender-systems.org/collaborative-filtering/"],"metadata":{}},{"cell_type":"markdown","source":["### (2a) Creating a Training Set\n\nBefore we jump into using machine learning, we need to break up the `ratings_df` dataset into three pieces:\n* A training set (DataFrame), which we will use to train models\n* A validation set (DataFrame), which we will use to choose the best model\n* A test set (DataFrame), which we will use for our experiments\n\nTo randomly split the dataset into the multiple groups, we can use the pySpark [randomSplit()](http://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.DataFrame.randomSplit) transformation. `randomSplit()` takes a set of splits and a seed and returns multiple DataFrames."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL_IN> with the appropriate code.\n\n# We'll hold out 60% for training, 20% of our data for validation, and leave 20% for testing\nseed = 1800009193L\n(split_60_df, split_a_20_df, split_b_20_df) = ratings_df.randomSplit(weights = [0.6,0.2,0.2],seed = seed)\n\n# Let's cache these datasets for performance\ntraining_df = split_60_df.cache()\nvalidation_df = split_a_20_df.cache()\ntest_df = split_b_20_df.cache()\n\nprint('Training: {0}, validation: {1}, test: {2}\\n'.format(\n  training_df.count(), validation_df.count(), test_df.count())\n)\ntraining_df.show(3)\nvalidation_df.show(3)\ntest_df.show(3)"],"metadata":{},"outputs":[],"execution_count":28},{"cell_type":"code","source":["# TEST Creating a Training Set (2a)\nTest.assertEquals(training_df.count(), 12001389, \"Incorrect training_df count. Expected 12001389\")\nTest.assertEquals(validation_df.count(), 4003694, \"Incorrect validation_df count. Expected 4003694\")\nTest.assertEquals(test_df.count(), 3995180, \"Incorrect test_df count. Expected 3995180\")\n\nTest.assertEquals(training_df.filter((ratings_df.userId == 1) & (ratings_df.movieId == 5952) & (ratings_df.rating == 5.0)).count(), 1)\nTest.assertEquals(training_df.filter((ratings_df.userId == 1) & (ratings_df.movieId == 1193) & (ratings_df.rating == 3.5)).count(), 1)\nTest.assertEquals(training_df.filter((ratings_df.userId == 1) & (ratings_df.movieId == 1196) & (ratings_df.rating == 4.5)).count(), 1)\n\nTest.assertEquals(validation_df.filter((ratings_df.userId == 1) & (ratings_df.movieId == 296) & (ratings_df.rating == 4.0)).count(), 1)\nTest.assertEquals(validation_df.filter((ratings_df.userId == 1) & (ratings_df.movieId == 32) & (ratings_df.rating == 3.5)).count(), 1)\nTest.assertEquals(validation_df.filter((ratings_df.userId == 1) & (ratings_df.movieId == 6888) & (ratings_df.rating == 3.0)).count(), 1)\n\nTest.assertEquals(test_df.filter((ratings_df.userId == 1) & (ratings_df.movieId == 4993) & (ratings_df.rating == 5.0)).count(), 1)\nTest.assertEquals(test_df.filter((ratings_df.userId == 1) & (ratings_df.movieId == 4128) & (ratings_df.rating == 4.0)).count(), 1)\nTest.assertEquals(test_df.filter((ratings_df.userId == 1) & (ratings_df.movieId == 4915) & (ratings_df.rating == 3.0)).count(), 1)"],"metadata":{},"outputs":[],"execution_count":29},{"cell_type":"markdown","source":["After splitting the dataset, your training set has about 12 million entries and the validation and test sets each have about 4 million entries. (The exact number of entries in each dataset varies slightly due to the random nature of the `randomSplit()` transformation.)"],"metadata":{}},{"cell_type":"markdown","source":["### (2b) Alternating Least Squares\n\nIn this part, we will use the Apache Spark ML Pipeline implementation of Alternating Least Squares, [ALS](http://spark.apache.org/docs/1.6.2/api/python/pyspark.ml.html#pyspark.ml.recommendation.ALS). ALS takes a training dataset (DataFrame) and several parameters that control the model creation process. To determine the best values for the parameters, we will use ALS to train several models, and then we will select the best model and use the parameters from that model in the rest of this lab exercise.\n\nThe process we will use for determining the best model is as follows:\n1. Pick a set of model parameters. The most important parameter to model is the *rank*, which is the number of columns in the Users matrix (green in the diagram above) or the number of rows in the Movies matrix (blue in the diagram above). In general, a lower rank will mean higher error on the training dataset, but a high rank may lead to [overfitting](https://en.wikipedia.org/wiki/Overfitting).  We will train models with ranks of 4, 8, and 12 using the `training_df` dataset.\n\n2. Set the appropriate parameters on the `ALS` object:\n    * The \"User\" column will be set to the values in our `userId` DataFrame column.\n    * The \"Item\" column will be set to the values in our `movieId` DataFrame column.\n    * The \"Rating\" column will be set to the values in our `rating` DataFrame column.\n    * We'll using a regularization parameter of 0.1.\n\n   **Note**: Read the documentation for the [ALS](http://spark.apache.org/docs/1.6.2/api/python/pyspark.ml.html#pyspark.ml.recommendation.ALS) class **carefully**. It will help you accomplish this step.\n3. Have the ALS output transformation (i.e., the result of [ALS.fit()](http://spark.apache.org/docs/1.6.2/api/python/pyspark.ml.html#pyspark.ml.recommendation.ALS.fit)) produce a _new_ column\n   called \"prediction\" that contains the predicted value.\n\n4. Create multiple models using [ALS.fit()](http://spark.apache.org/docs/1.6.2/api/python/pyspark.ml.html#pyspark.ml.recommendation.ALS.fit), one for each of our rank values. We'll fit\n   against the training data set (`training_df`).\n\n5. For each model, we'll run a prediction against our validation data set (`validation_df`) and check the error.\n\n6. We'll keep the model with the best error rate.\n\n#### Why are we doing our own cross-validation?\n\nA challenge for collaborative filtering is how to provide ratings to a new user (a user who has not provided *any* ratings at all). Some recommendation systems choose to provide new users with a set of default ratings (e.g., an average value across all ratings), while others choose to provide no ratings for new users. Spark's ALS algorithm yields a NaN (`Not a Number`) value when asked to provide a rating for a new user.\n\nUsing the ML Pipeline's [CrossValidator](http://spark.apache.org/docs/1.6.2/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator) with ALS is thus problematic, because cross validation involves dividing the training data into a set of folds (e.g., three sets) and then using those folds for testing and evaluating the parameters during the parameter grid search process. It is likely that some of the folds will contain users that are not in the other folds, and, as a result, ALS produces NaN values for those new users. When the CrossValidator uses the Evaluator (RMSE) to compute an error metric, the RMSE algorithm will return NaN. This will make *all* of the parameters in the parameter grid appear to be equally good (or bad).\n\nYou can read the discussion on [Spark JIRA 14489](https://issues.apache.org/jira/browse/SPARK-14489) about this issue. There are proposed workarounds of having ALS provide default values or having RMSE drop NaN values. Both introduce potential issues. We have chosen to have RMSE drop NaN values. While this does not solve the underlying issue of ALS not predicting a value for a new user, it does provide some evaluation value. We manually implement the parameter grid search process using a for loop (below) and remove the NaN values before using RMSE.\n\nFor a production application, you would want to consider the tradeoffs in how to handle new users.\n\n**Note**: This cell will likely take a couple of minutes to run."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\n# This step is broken in ML Pipelines: https://issues.apache.org/jira/browse/SPARK-14489\nfrom pyspark.ml.recommendation import ALS\n\n# Let's initialize our ALS learner\nals = ALS()\n\n# Now we set the parameters for the method\nals.setMaxIter(5)\\\n   .setSeed(seed)\\\n   .setRegParam(0.1)\\\n   .setUserCol('userId')\\\n   .setItemCol('movieId')\\\n   .setRatingCol('rating')\n\n# Now let's compute an evaluation metric for our test dataset\nfrom pyspark.ml.evaluation import RegressionEvaluator\n\n# Create an RMSE evaluator using the label and predicted columns\nreg_eval = RegressionEvaluator(predictionCol=\"prediction\", labelCol=\"rating\", metricName=\"rmse\")\n\ntolerance = 0.03\nranks = [4, 8, 12]\nerrors = [0, 0, 0]\nmodels = [0, 0, 0]\nerr = 0\nmin_error = float('inf')\nbest_rank = -1\nfor rank in ranks:\n  # Set the rank here:\n  als.setRank(rank)\n  # Create the model with these parameters.\n  model = als.fit(training_df)\n  # Run the model to create a prediction. Predict against the validation_df.\n  predict_df = model.transform(validation_df)\n\n  # Remove NaN values from prediction (due to SPARK-14489)\n  predicted_ratings_df = predict_df.filter(predict_df.prediction != float('nan'))\n\n  # Run the previously created RMSE evaluator, reg_eval, on the predicted_ratings_df DataFrame\n  error = reg_eval.evaluate(predicted_ratings_df)\n  errors[err] = error\n  models[err] = model\n  print 'For rank %s the RMSE is %s' % (rank, error)\n  if error < min_error:\n    min_error = error\n    best_rank = err\n  err += 1\n\nals.setRank(ranks[best_rank])\nprint 'The best model was trained with rank %s' % ranks[best_rank]\nmy_model = models[best_rank]"],"metadata":{},"outputs":[],"execution_count":32},{"cell_type":"code","source":["# TEST\nTest.assertEquals(round(min_error, 2), 0.81, \"Unexpected value for best RMSE. Expected rounded value to be 0.81. Got {0}\".format(round(min_error, 2)))\nTest.assertEquals(ranks[best_rank], 12, \"Unexpected value for best rank. Expected 12. Got {0}\".format(ranks[best_rank]))\nTest.assertEqualsHashed(als.getItemCol(), \"18f0e2357f8829fe809b2d95bc1753000dd925a6\", \"Incorrect choice of {0} for ALS item column.\".format(als.getItemCol()))\nTest.assertEqualsHashed(als.getUserCol(), \"db36668fa9a19fde5c9676518f9e86c17cabf65a\", \"Incorrect choice of {0} for ALS user column.\".format(als.getUserCol()))\nTest.assertEqualsHashed(als.getRatingCol(), \"3c2d687ef032e625aa4a2b1cfca9751d2080322c\", \"Incorrect choice of {0} for ALS rating column.\".format(als.getRatingCol()))"],"metadata":{},"outputs":[],"execution_count":33},{"cell_type":"markdown","source":["### (2c) Testing Your Model\n\nSo far, we used the `training_df` and `validation_df` datasets to select the best model.  Since we used these two datasets to determine what model is best, we cannot use them to test how good the model is; otherwise, we would be very vulnerable to [overfitting](https://en.wikipedia.org/wiki/Overfitting).  To decide how good our model is, we need to use the `test_df` dataset.  We will use the `best_rank` you determined in part (2b) to create a model for predicting the ratings for the test dataset and then we will compute the RMSE.\n\nThe steps you should perform are:\n* Run a prediction, using `my_model` as created above, on the test dataset (`test_df`), producing a new `predict_df` DataFrame.\n* Filter out unwanted NaN values (necessary because of [a bug in Spark](https://issues.apache.org/jira/browse/SPARK-14489)). We've supplied this piece of code for you.\n* Use the previously created RMSE evaluator, `reg_eval` to evaluate the filtered DataFrame."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL_IN> with the appropriate code\n# In ML Pipelines, this next step has a bug that produces unwanted NaN values. We\n# have to filter them out. See https://issues.apache.org/jira/browse/SPARK-14489\npredict_df = my_model.transform(test_df)\n\n# Remove NaN values from prediction (due to SPARK-14489)\npredicted_test_df = predict_df.filter(predict_df.prediction != float('nan'))\n\n# Run the previously created RMSE evaluator, reg_eval, on the predicted_test_df DataFrame\ntest_RMSE = reg_eval.evaluate(predicted_test_df)\n\nprint('The model had a RMSE on the test set of {0}'.format(test_RMSE))"],"metadata":{},"outputs":[],"execution_count":35},{"cell_type":"code","source":["# TEST Testing Your Model (2c)\nTest.assertTrue(abs(test_RMSE - 0.809624038485) < tolerance, 'incorrect test_RMSE: {0:.11f}'.format(test_RMSE))"],"metadata":{},"outputs":[],"execution_count":36},{"cell_type":"markdown","source":["### (2d) Comparing Your Model\n\nLooking at the RMSE for the results predicted by the model versus the values in the test set is one way to evalute the quality of our model. Another way to evaluate the model is to evaluate the error from a test set where every rating is the average rating for the training set.\n\nThe steps you should perform are:\n* Use the `training_df` to compute the average rating across all movies in that training dataset.\n* Use the average rating that you just determined and the `test_df` to create a DataFrame (`test_for_avg_df`) with a `prediction` column containing the average rating. **HINT**: You'll want to use the `lit()` function,\n  from `pyspark.sql.functions`, available here as `F.lit()`.\n* Use our previously created `reg_eval` object to evaluate the `test_for_avg_df` and calculate the RMSE."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL_IN> with the appropriate code.\n# Compute the average rating\navg_rating_df = training_df.select(F.avg(training_df.rating).alias(\"average\"))\n\n# Extract the average rating value. (This is row 0, column 0.)\ntraining_avg_rating = avg_rating_df.collect()[0][0]\n\nprint('The average rating for movies in the training set is {0}'.format(training_avg_rating))\n\n# Add a column with the average rating\ntest_for_avg_df = test_df.withColumn('prediction', F.lit(training_avg_rating))\n\n# Run the previously created RMSE evaluator, reg_eval, on the test_for_avg_df DataFrame\ntest_avg_RMSE = reg_eval.evaluate(test_for_avg_df)\n\nprint(\"The RMSE on the average set is {0}\".format(test_avg_RMSE))"],"metadata":{},"outputs":[],"execution_count":38},{"cell_type":"code","source":["# TEST Comparing Your Model (2d)\nTest.assertTrue(abs(training_avg_rating - 3.52547984237) < 0.000001,\n                'incorrect training_avg_rating (expected 3.52547984237): {0:.11f}'.format(training_avg_rating))\nTest.assertTrue(abs(test_avg_RMSE - 1.05190953037) < 0.000001,\n                'incorrect test_avg_RMSE (expected 1.0519743756): {0:.11f}'.format(test_avg_RMSE))"],"metadata":{},"outputs":[],"execution_count":39},{"cell_type":"markdown","source":["You now have code to predict how users will rate movies!"],"metadata":{}},{"cell_type":"markdown","source":["## Part 3: Predictions for Yourself\nThe ultimate goal of this lab exercise is to predict what movies to recommend to yourself.  In order to do that, you will first need to add ratings for yourself to the `ratings_df` dataset."],"metadata":{}},{"cell_type":"markdown","source":["**(3a) Your Movie Ratings**\n\nTo help you provide ratings for yourself, we have included the following code to list the names and movie IDs of the 50 highest-rated movies from `movies_with_500_ratings_or_more` which we created in part 1 the lab."],"metadata":{}},{"cell_type":"code","source":["print 'Most rated movies:'\nprint '(average rating, movie name, number of reviews, movie ID)'\ndisplay(movies_with_500_ratings_or_more.orderBy(movies_with_500_ratings_or_more['average'].desc()).take(50))"],"metadata":{},"outputs":[],"execution_count":43},{"cell_type":"markdown","source":["The user ID 0 is unassigned, so we will use it for your ratings. We set the variable `my_user_ID` to 0 for you. Next, create a new DataFrame called `my_ratings_df`, with your ratings for at least 10 movie ratings. Each entry should be formatted as `(my_user_id, movieID, rating)`.  As in the original dataset, ratings should be between 1 and 5 (inclusive). If you have not seen at least 10 of these movies, you can increase the parameter passed to `take()` in the above cell until there are 10 movies that you have seen (or you can also guess what your rating would be for movies you have not seen)."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\nfrom pyspark.sql import Row\nmy_user_id = 0\n\n# Note that the movie IDs are the *last* number on each line. A common error was to use the number of ratings as the movie ID.\nmy_rated_movies = [\n     Row(my_user_id, 260, 5),\n  Row(my_user_id, 26, 7),\n  Row(my_user_id, 2, 5),\n  Row(my_user_id, 23, 5),\n  Row(my_user_id, 28, 5),\n  Row(my_user_id, 30, 6),\n  Row(my_user_id, 250, 5),\n  Row(my_user_id, 120, 4),\n  Row(my_user_id, 10, 5),\n  Row(my_user_id, 30, 3)\n       # The format of each line is (my_user_id, movie ID, your rating)\n     # For example, to give the movie \"Star Wars: Episode IV - A New Hope (1977)\" a five rating, you would add the following line:\n     #   (my_user_id, 260, 5),\n]\n\nmy_ratings_df = sqlContext.createDataFrame(my_rated_movies, ['userId','movieId','rating'])\nprint 'My movie ratings:'\ndisplay(my_ratings_df.limit(10))"],"metadata":{},"outputs":[],"execution_count":45},{"cell_type":"markdown","source":["### (3b) Add Your Movies to Training Dataset\n\nNow that you have ratings for yourself, you need to add your ratings to the `training` dataset so that the model you train will incorporate your preferences.  Spark's [unionAll()](http://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.DataFrame.unionAll) transformation combines two DataFrames; use `unionAll()` to create a new training dataset that includes your ratings and the data in the original training dataset."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\ntraining_with_my_ratings_df = my_ratings_df.unionAll(training_df)\nprint ('The training dataset now has %s more entries than the original training dataset' %\n       (training_with_my_ratings_df.count() - training_df.count()))\nassert (training_with_my_ratings_df.count() - training_df.count()) == my_ratings_df.count()"],"metadata":{},"outputs":[],"execution_count":47},{"cell_type":"markdown","source":["### (3c) Train a Model with Your Ratings\n\nNow, train a model with your ratings added and the parameters you used in in part (2b) and (2c). Mke sure you include **all** of the parameters.\n\n**Note**: This cell will take about 30 seconds to run."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\n\n# Reset the parameters for the ALS object.\nals.setPredictionCol(\"prediction\")\\\n   .setMaxIter(5)\\\n   .setSeed(seed)\\\n   .setRegParam(0.1)\\\n   .setUserCol('userId')\\\n   .setItemCol('movieId')\\\n   .setRatingCol('rating')\n\n# Create the model with these parameters.\nmy_ratings_model = als.fit(training_with_my_ratings_df)"],"metadata":{},"outputs":[],"execution_count":49},{"cell_type":"markdown","source":["### (3d) Check RMSE for the New Model with Your Ratings\n\nCompute the RMSE for this new model on the test set.\n* Run your model (the one you just trained) against the test data set in `test_df`.\n* Then, use our previously-computed `reg_eval` object to compute the RMSE of your ratings."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\nmy_predict_df = my_ratings_model.transform(test_df)\n\n# Remove NaN values from prediction (due to SPARK-14489)\npredicted_test_my_ratings_df = my_predict_df.filter(my_predict_df.prediction != float('nan'))\n\n# Run the previously created RMSE evaluator, reg_eval, on the predicted_test_my_ratings_df DataFrame\ntest_RMSE_my_ratings = reg_eval.evaluate(predicted_test_my_ratings_df)\nprint('The model had a RMSE on the test set of {0}'.format(test_RMSE_my_ratings))"],"metadata":{},"outputs":[],"execution_count":51},{"cell_type":"markdown","source":["### (3e) Predict Your Ratings\n\nSo far, we have only computed the error of the model.  Next, let's predict what ratings you would give to the movies that you did not already provide ratings for.\n\nThe steps you should perform are:\n* Filter out the movies you already rated manually. (Use the `my_rated_movie_ids` variable.) Put the results in a new `not_rated_df`.\n\n   **Hint**: The [Column.isin()](http://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.Column.isin)\n   method, as well as the `~` (\"not\") DataFrame logical operator, may come in handy here. Here's an example of using `isin()`:\n\n```\n    > df1 = sqlContext.createDataFrame([(\"Jim\", 10), (\"Julie\", 9), (\"Abdul\", 20), (\"Mireille\", 19)], [\"name\", \"age\"])\n    > df1.show()\n    +--------+---+\n    |    name|age|\n    +--------+---+\n    |     Jim| 10|\n    |   Julie|  9|\n    |   Abdul| 20|\n    |Mireille| 19|\n    +--------+---+\n\n    > names_to_delete = (\"Julie\", \"Abdul\") # this is just a Python tuple\n    > df2 = df1.filter(~ df1[\"name\"].isin(names_to_delete)) # \"NOT IN\"\n    > df2.show()\n    +--------+---+\n    |    name|age|\n    +--------+---+\n    |     Jim| 10|\n    |   Julie|  9|\n    +--------+---+\n```\n\n* Transform `not_rated_df` into `my_unrated_movies_df` by:\n    - renaming the \"ID\" column to \"movieId\"\n    - adding a \"userId\" column with the value contained in the `my_user_id` variable defined above.\n\n* Create a `predicted_ratings_df` DataFrame by applying `my_ratings_model` to `my_unrated_movies_df`."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL_IN> with the appropriate code\n\n# Create a list of my rated movie IDs\nmy_rated_movie_ids = [x[1] for x in my_rated_movies]\n\n# Filter out the movies I already rated.\nnot_rated_df = movies_df.filter(~ movies_df[\"ID\"].isin(my_rated_movie_ids))\n\n# Rename the \"ID\" column to be \"movieId\", and add a column with my_user_id as \"userId\".\nmy_unrated_movies_df = not_rated_df.select('title',not_rated_df.ID.alias(\"movieId\"),F.lit(my_user_id).alias('userId'))\n\n# Use my_rating_model to predict ratings for the movies that I did not manually rate.\nraw_predicted_ratings_df = my_ratings_model.transform(my_unrated_movies_df)\n\npredicted_ratings_df = raw_predicted_ratings_df.filter(raw_predicted_ratings_df['prediction'] != float('nan'))"],"metadata":{},"outputs":[],"execution_count":53},{"cell_type":"markdown","source":["### (3f) Predict Your Ratings\n\nWe have our predicted ratings. Now we can print out the 25 movies with the highest predicted ratings.\n\nThe steps you should perform are:\n* Join your `predicted_ratings_df` DataFrame with the `movie_names_with_avg_ratings_df` DataFrame to obtain the ratings counts for each movie.\n* Sort the resulting DataFrame (`predicted_with_counts_df`) by predicted rating (highest ratings first), and remove any ratings with a count of 75 or less.\n* Print the top 25 movies that remain."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL_IN> with the appropriate code\n\npredicted_with_counts_df = predicted_ratings_df.join(movie_names_with_avg_ratings_df,movie_names_with_avg_ratings_df.movieId == predicted_ratings_df.movieId)\npredicted_highest_rated_movies_df = predicted_with_counts_df.sort('prediction') \n\nprint ('My 25 highest rated movies as predicted (for movies with more than 75 reviews):')\npredicted_highest_rated_movies_df.take(75)"],"metadata":{},"outputs":[],"execution_count":55},{"cell_type":"markdown","source":["## Appendix A: Submitting Your Exercises to the Autograder\n\nThis section guides you through Step 2 of the grading process (\"Submit to Autograder\").\n\nOnce you confirm that your lab notebook is passing all tests, you can submit it first to the course autograder and then second to the edX website to receive a grade.\n\n** Note that you can only submit to the course autograder once every 1 minute. **"],"metadata":{}},{"cell_type":"markdown","source":["### Step 2(a): Restart your cluster by clicking on the dropdown next to your cluster name and selecting \"Restart Cluster\".\n\nYou can do this step in either notebook, since there is one cluster for your notebooks.\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/submit_restart.png\" alt=\"Drawing\" />"],"metadata":{}},{"cell_type":"markdown","source":["### Step 2(b): _IN THIS NOTEBOOK_, click on \"Run All\" to run all of the cells.\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/submit_runall.png\" alt=\"Drawing\" style=\"height: 80px\"/>\n\nThis step will take some time.\n\nWait for your cluster to finish running the cells in your lab notebook before proceeding."],"metadata":{}},{"cell_type":"markdown","source":["### Step 2(c): Publish this notebook\n\nPublish _this_ notebook by clicking on the \"Publish\" button at the top.\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/Lab0_Publish0.png\" alt=\"Drawing\" style=\"height: 150px\"/>\n\nWhen you click on the button, you will see the following popup.\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/Lab0_Publish1.png\" alt=\"Drawing\" />\n\nWhen you click on \"Publish\", you will see a popup with your notebook's public link. **Copy the link and set the `notebook_URL` variable in the AUTOGRADER notebook (not this notebook).**\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/Lab0_Publish2.png\" alt=\"Drawing\" />"],"metadata":{}},{"cell_type":"markdown","source":["### Step 2(d): Set the notebook URL and Lab ID in the Autograder notebook, and run it\n\nGo to the Autograder notebook and paste the link you just copied into it, so that it is assigned to the `notebook_url` variable.\n\n```\nnotebook_url = \"...\" # put your URL here\n```\n\nThen, find the line that looks like this:\n\n```\nlab = <FILL IN>\n```\nand change `<FILL IN>` to \"CS110x-lab2\":\n\n```\nlab = \"CS110x-lab2\"\n```\n\nThen, run the Autograder notebook to submit your lab."],"metadata":{}},{"cell_type":"markdown","source":["### <img src=\"http://spark-mooc.github.io/web-assets/images/oops.png\" style=\"height: 200px\"/> If things go wrong\n\nIt's possible that your notebook looks fine to you, but fails in the autograder. (This can happen when you run cells out of order, as you're working on your notebook.) If that happens, just try again, starting at the top of Appendix A."],"metadata":{}}],"metadata":{"name":"cs110_lab2_als_prediction","notebookId":836562159920632},"nbformat":4,"nbformat_minor":0}
2 | 


--------------------------------------------------------------------------------
/cs120_lab1a_math_review.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"cell_type":"markdown","source":["<a rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-nd/4.0/\"><img alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png\" /></a><br />This work is licensed under a <a rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-nd/4.0/\">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>."],"metadata":{}},{"cell_type":"markdown","source":["![ML Logo](http://spark-mooc.github.io/web-assets/images/CS190.1x_Banner_300.png)\n# Math and Python review\n\nThis notebook reviews vector and matrix math, the [NumPy](http://www.numpy.org/) Python package, and Python lambda expressions.  Part 1 covers vector and matrix math, and you'll do a few exercises by hand.  In Part 2, you'll learn about NumPy and use `ndarray` objects to solve the math exercises.   Part 3 provides additional information about NumPy and how it relates to array usage in Spark's [MLlib](https://spark.apache.org/mllib/).  Part 4 provides an overview of lambda expressions.\n\nTo move through the notebook just run each of the cells.  You can run a cell by pressing \"shift-enter\", which will compute the current cell and advance to the next cell, or by clicking in a cell and pressing \"control-enter\", which will compute the current cell and remain in that cell.  You should move through the notebook from top to bottom and run all of the cells.  If you skip some cells, later cells might not work as expected.\nNote that there are several exercises within this notebook.  You will need to provide solutions for cells that start with: `# TODO: Replace <FILL IN> with appropriate code`.\n\n** This notebook covers: **\n* *Part 1:* Math review\n* *Part 2:* NumPy\n* *Part 3:* Additional NumPy and Spark linear algebra\n* *Part 4:* Python lambda expressions\n* *Appendix A:* Submitting your exercises to the Autograder"],"metadata":{}},{"cell_type":"code","source":["labVersion = 'cs120x-lab1a-1.0.0'"],"metadata":{},"outputs":[],"execution_count":3},{"cell_type":"markdown","source":["## Part 1: Math review"],"metadata":{}},{"cell_type":"markdown","source":["### (1a) Scalar multiplication: vectors\n\nIn this exercise, you will calculate the product of a scalar and a vector by hand and enter the result in the code cell below.  Scalar multiplication is straightforward.  The resulting vector equals the product of the scalar, which is a single value, and each item in the original vector.\nIn the example below, \\\\( a \\\\) is the scalar (constant) and \\\\( \\mathbf{v} \\\\) is the vector.  \\\\[ a \\mathbf{v} = \\begin{bmatrix} a v_1 \\\\\\ a v_2 \\\\\\ \\vdots \\\\\\ a v_n \\end{bmatrix} \\\\]\n\nCalculate the value of \\\\( \\mathbf{x} \\\\): \\\\[ \\mathbf{x} = 3 \\begin{bmatrix} 1 \\\\\\ -2 \\\\\\ 0 \\end{bmatrix} \\\\]\nCalculate the value of \\\\( \\mathbf{y} \\\\): \\\\[ \\mathbf{y} = 2 \\begin{bmatrix} 2 \\\\\\ 4 \\\\\\ 8 \\end{bmatrix} \\\\]"],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\n# Manually calculate your answer and represent the vector as a list of integers.\n# For example, [2, 4, 8].\nvectorX = [3,-6,0]\nvectorY = [4,8,16]"],"metadata":{},"outputs":[],"execution_count":6},{"cell_type":"code","source":["# TEST Scalar multiplication: vectors (1a)\n# Import test library\nfrom databricks_test_helper import Test\n\nTest.assertEqualsHashed(vectorX, 'e460f5b87531a2b60e0f55c31b2e49914f779981',\n                        'incorrect value for vectorX')\nTest.assertEqualsHashed(vectorY, 'e2d37ff11427dbac7f833a5a7039c0de5a740b1e',\n                        'incorrect value for vectorY')"],"metadata":{},"outputs":[],"execution_count":7},{"cell_type":"markdown","source":["### (1b) Element-wise multiplication: vectors\n\nIn this exercise, you will calculate the element-wise multiplication of two vectors by hand and enter the result in the code cell below.  You'll later see that element-wise multiplication is the default method when two NumPy arrays are multiplied together.  Note we won't be performing element-wise multiplication in future labs, but we are introducing it here to distinguish it from other vector operators. It is also a common operation in NumPy, as we will discuss in Part (2b).\n\nThe element-wise calculation is as follows: \\\\[ \\mathbf{x} \\odot \\mathbf{y} =  \\begin{bmatrix} x_1 y_1 \\\\\\  x_2 y_2 \\\\\\ \\vdots \\\\\\ x_n y_n \\end{bmatrix} \\\\]\n\nCalculate the value of \\\\( \\mathbf{z} \\\\): \\\\[ \\mathbf{z} = \\begin{bmatrix} 1 \\\\\\  2 \\\\\\ 3 \\end{bmatrix} \\odot \\begin{bmatrix} 4 \\\\\\  5 \\\\\\ 6 \\end{bmatrix} \\\\]"],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\n# Manually calculate your answer and represent the vector as a list of integers.\nz = [4,10,18]"],"metadata":{},"outputs":[],"execution_count":9},{"cell_type":"code","source":["# TEST Element-wise multiplication: vectors (1b)\nTest.assertEqualsHashed(z, '4b5fe28ee2d274d7e0378bf993e28400f66205c2',\n                        'incorrect value for z')"],"metadata":{},"outputs":[],"execution_count":10},{"cell_type":"markdown","source":["### (1c) Dot product\n\nIn this exercise, you will calculate the dot product of two vectors by hand and enter the result in the code cell below.  Note that the dot product is equivalent to performing element-wise multiplication and then summing the result.\n\nBelow, you'll find the calculation for the dot product of two vectors, where each vector has length \\\\( n \\\\): \\\\[ \\mathbf{w} \\cdot \\mathbf{x} = \\sum_{i=1}^n w_i x_i \\\\]\n\nNote that you may also see \\\\( \\mathbf{w} \\cdot \\mathbf{x} \\\\) represented as \\\\( \\mathbf{w}^\\top \\mathbf{x} \\\\)\n\nCalculate the value for \\\\( c_1 \\\\) based on the dot product of the following two vectors:\n\\\\[ c_1 = \\begin{bmatrix} 1 & -3 \\end{bmatrix} \\cdot \\begin{bmatrix} 4 \\\\\\ 5 \\end{bmatrix}\\\\]\n\nCalculate the value for \\\\( c_2 \\\\) based on the dot product of the following two vectors:\n\\\\[ c_2 = \\begin{bmatrix} 3 & 4 & 5 \\end{bmatrix} \\cdot \\begin{bmatrix} 1 \\\\\\ 2 \\\\\\ 3 \\end{bmatrix}\\\\]"],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\n# Manually calculate your answer and set the variables to their appropriate integer values.\nc1 = -11\nc2 = 26"],"metadata":{},"outputs":[],"execution_count":12},{"cell_type":"code","source":["# TEST Dot product (1c)\nTest.assertEqualsHashed(c1, '8d7a9046b6a6e21d66409ad0849d6ab8aa51007c', 'incorrect value for c1')\nTest.assertEqualsHashed(c2, '887309d048beef83ad3eabf2a79a64a389ab1c9f', 'incorrect value for c2')"],"metadata":{},"outputs":[],"execution_count":13},{"cell_type":"markdown","source":["### (1d) Matrix multiplication\n\nIn this exercise, you will calculate the result of multiplying two matrices together by hand and enter the result in the code cell below.\nRefer to the slides for the formula for multiplying two matrices together.\n\nFirst, you'll calculate the value for \\\\( \\mathbf{X} \\\\).\n\\\\[ \\mathbf{X} = \\begin{bmatrix} 1 & 2 & 3 \\\\\\ 4 & 5 & 6 \\end{bmatrix} \\begin{bmatrix} 1 & 2 \\\\\\ 3 & 4 \\\\\\ 5 & 6 \\end{bmatrix} \\\\]\n\nNext, you'll perform an outer product and calculate the value for \\\\( \\mathbf{Y} \\\\).\n\n\\\\[ \\mathbf{Y} = \\begin{bmatrix} 1 \\\\\\ 2 \\\\\\ 3 \\end{bmatrix} \\begin{bmatrix} 1 & 2 & 3 \\end{bmatrix} \\\\]\n\nThe resulting matrices should be stored row-wise (see [row-major order](https://en.wikipedia.org/wiki/Row-major_order)). This means that the matrix is organized by rows. For instance, a 2x2 row-wise matrix would be represented as: \\\\( [[r_1c_1, r_1c_2], [r_2c_1, r_2c_2]] \\\\) where r stands for row and c stands for column.\n\nNote that outer product is just a special case of general matrix multiplication and follows the same rules as normal matrix multiplication."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\n# Represent matrices as lists within lists. For example, [[1,2,3], [4,5,6]] represents a matrix with\n# two rows and three columns. Use integer values.\nmatrixX = [[22,28],[49,64]]\nmatrixY = [[1,2,3],[2,4,6],[3,6,9]]"],"metadata":{},"outputs":[],"execution_count":15},{"cell_type":"code","source":["# TEST Matrix multiplication (1d)\nTest.assertEqualsHashed(matrixX, 'c2ada2598d8a499e5dfb66f27a24f444483cba13',\n                        'incorrect value for matrixX')\nTest.assertEqualsHashed(matrixY, 'f985daf651531b7d776523836f3068d4c12e4519',\n                        'incorrect value for matrixY')"],"metadata":{},"outputs":[],"execution_count":16},{"cell_type":"markdown","source":["##  Part 2: NumPy"],"metadata":{}},{"cell_type":"markdown","source":["### (2a) Scalar multiplication\n\n[NumPy](http://docs.scipy.org/doc/numpy/reference/) is a Python library for working with arrays.  NumPy provides abstractions that make it easy to treat these underlying arrays as vectors and matrices.  The library is optimized to be fast and memory efficient, and we'll be using it throughout the course.  The building block for NumPy is the [ndarray](http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html), which is a multidimensional array of fixed-size that contains elements of one type (e.g. array of floats).\n\nFor this exercise, you'll create a `ndarray` consisting of the elements \\[1, 2, 3\\] and multiply this array by 5.  Use [np.array()](http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html) to create the array.  Note that you can pass a Python list into `np.array()`.  To perform scalar multiplication with an `ndarray` just use `*`.\n\nNote that if you create an array from a Python list of integers you will obtain a one-dimensional array, *which is equivalent to a vector for our purposes*."],"metadata":{}},{"cell_type":"code","source":["# It is convention to import NumPy with the alias np\nimport numpy as np"],"metadata":{},"outputs":[],"execution_count":19},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\n# Create a numpy array with the values 1, 2, 3\nsimpleArray = np.array([1,2,3])\n# Perform the scalar product of 5 and the numpy array\ntimesFive = 5 * simpleArray\nprint 'simpleArray\\n{0}'.format(simpleArray)\nprint '\\ntimesFive\\n{0}'.format(timesFive)"],"metadata":{},"outputs":[],"execution_count":20},{"cell_type":"code","source":["# TEST Scalar multiplication (2a)\nTest.assertTrue(np.all(timesFive == [5, 10, 15]), 'incorrect value for timesFive')"],"metadata":{},"outputs":[],"execution_count":21},{"cell_type":"markdown","source":["### (2b) Element-wise multiplication and dot product\n\nNumPy arrays support both element-wise multiplication and dot product.  Element-wise multiplication occurs automatically when you use the `*` operator to multiply two `ndarray` objects of the same length.\n\nTo perform the dot product you can use either [np.dot()](http://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html#numpy.dot) or [np.ndarray.dot()](http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.dot.html).  For example, if you had NumPy arrays `x` and `y`, you could compute their dot product four ways: `np.dot(x, y)`, `np.dot(y, x)`, `x.dot(y)`, or `y.dot(x)`.\n\nFor this exercise, multiply the arrays `u` and `v` element-wise and compute their dot product."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\n# Create a ndarray based on a range and step size.\nu = np.arange(0, 5, .5)\nv = np.arange(5, 10, .5)\n\nelementWise = u * v\ndotProduct = np.dot(u,v)\nprint 'u: {0}'.format(u)\nprint 'v: {0}'.format(v)\nprint '\\nelementWise\\n{0}'.format(elementWise)\nprint '\\ndotProduct\\n{0}'.format(dotProduct)"],"metadata":{},"outputs":[],"execution_count":23},{"cell_type":"code","source":["# TEST Element-wise multiplication and dot product (2b)\nTest.assertTrue(np.all(elementWise == [ 0., 2.75, 6., 9.75, 14., 18.75, 24., 29.75, 36., 42.75]),\n                'incorrect value for elementWise')\nTest.assertEquals(dotProduct, 183.75, 'incorrect value for dotProduct')"],"metadata":{},"outputs":[],"execution_count":24},{"cell_type":"markdown","source":["### (2c) Matrix math\nWith NumPy it is very easy to perform matrix math.  You can use [np.matrix()](http://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html) to generate a NumPy matrix.  Just pass a two-dimensional `ndarray` or a list of lists to the function.  You can perform matrix math on NumPy matrices using `*`.\n\nYou can transpose a matrix by calling [numpy.matrix.transpose()](http://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.transpose.html) or by using `.T` on the matrix object (e.g. `myMatrix.T`).  Transposing a matrix produces a matrix where the new rows are the columns from the old matrix. For example: \\\\[  \\begin{bmatrix} 1 & 2 & 3 \\\\\\ 4 & 5 & 6 \\end{bmatrix}^\\top = \\begin{bmatrix} 1 & 4 \\\\\\ 2 & 5 \\\\\\ 3 & 6 \\end{bmatrix} \\\\]\n\nInverting a matrix can be done using [numpy.linalg.inv()](http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.inv.html).  Note that only square matrices can be inverted, and square matrices are not guaranteed to have an inverse.  If the inverse exists, then multiplying a matrix by its inverse will produce the identity matrix.  \\\\( \\scriptsize ( A^{-1} A = I_n ) \\\\)  The identity matrix \\\\( \\scriptsize I_n \\\\) has ones along its diagonal and zeros elsewhere. \\\\[ I_n = \\begin{bmatrix} 1 & 0 & 0 & ... & 0 \\\\\\ 0 & 1 & 0 & ... & 0 \\\\\\ 0 & 0 & 1 & ... & 0 \\\\\\ ... & ... & ... & ... & ... \\\\\\ 0 & 0 & 0 & ... & 1 \\end{bmatrix} \\\\]\n\nFor this exercise, multiply \\\\( A \\\\) times its transpose \\\\( ( A^\\top ) \\\\) and then calculate the inverse of the result \\\\( (  [ A A^\\top ]^{-1}  ) \\\\)."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\nfrom numpy.linalg import inv\n\nA = np.matrix([[1,2,3,4],[5,6,7,8]])\nprint 'A:\\n{0}'.format(A)\n# Print A transpose\nprint '\\nA transpose:\\n{0}'.format(A.T)\n\n# Multiply A by A transpose\nAAt = A * A.T\nprint '\\nAAt:\\n{0}'.format(AAt)\n\n# Invert AAt with np.linalg.inv()\nAAtInv = np.linalg.inv(AAt)\nprint '\\nAAtInv:\\n{0}'.format(AAtInv)\n\n# Show inverse times matrix equals identity\n# We round due to numerical precision\nprint '\\nAAtInv * AAt:\\n{0}'.format((AAtInv * AAt).round(4))"],"metadata":{},"outputs":[],"execution_count":26},{"cell_type":"code","source":["# TEST Matrix math (2c)\nTest.assertTrue(np.all(AAt == np.matrix([[30, 70], [70, 174]])), 'incorrect value for AAt')\nTest.assertTrue(np.allclose(AAtInv, np.matrix([[0.54375, -0.21875], [-0.21875, 0.09375]])),\n                'incorrect value for AAtInv')"],"metadata":{},"outputs":[],"execution_count":27},{"cell_type":"markdown","source":["### Part 3: Additional NumPy and Spark linear algebra"],"metadata":{}},{"cell_type":"markdown","source":["### (3a) Slices\n\nYou can select a subset of a one-dimensional NumPy `ndarray`'s elements by using slices.  These slices operate the same way as slices for Python lists.  For example, `[0, 1, 2, 3][:2]` returns the first two elements `[0, 1]`.  NumPy, additionally, has more sophisticated slicing that allows slicing across multiple dimensions; however, you'll only need to use basic slices in future labs for this course.\n\nNote that if no index is placed to the left of a `:`, it is equivalent to starting at 0, and hence `[0, 1, 2, 3][:2]` and `[0, 1, 2, 3][0:2]` yield the same result.  Similarly, if no index is placed to the right of a `:`, it is equivalent to slicing to the end of the object.  Also, you can use negative indices to index relative to the end of the object, so `[-2:]` would return the last two elements of the object.\n\nFor this exercise, return the last 3 elements of the array `features`."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\nfeatures = np.array([1, 2, 3, 4])\nprint 'features:\\n{0}'.format(features)\n\n# The last three elements of features\nlastThree = features[-3:]\n\nprint '\\nlastThree:\\n{0}'.format(lastThree)"],"metadata":{},"outputs":[],"execution_count":30},{"cell_type":"code","source":["# TEST Slices (3a)\nTest.assertTrue(np.all(lastThree == [2, 3, 4]), 'incorrect value for lastThree')"],"metadata":{},"outputs":[],"execution_count":31},{"cell_type":"markdown","source":["### (3b) Combining `ndarray` objects\n\nNumPy provides many functions for creating new arrays from existing arrays.  We'll explore two functions: [np.hstack()](http://docs.scipy.org/doc/numpy/reference/generated/numpy.hstack.html), which allows you to combine arrays column-wise, and [np.vstack()](http://docs.scipy.org/doc/numpy/reference/generated/numpy.vstack.html), which allows you to combine arrays row-wise.  Note that both `np.hstack()` and `np.vstack()` take in a tuple of arrays as their first argument.  To horizontally combine three arrays `a`, `b`, and `c`, you would run `np.hstack((a, b, c))`.\nIf we had two arrays: `a = [1, 2, 3, 4]` and `b = [5, 6, 7, 8]`, we could use `np.vstack((a, b))` to produce the two-dimensional array: \\\\[  \\begin{bmatrix} 1 & 2 & 3 & 4 \\\\\\ 5 & 6 & 7 & 8 \\end{bmatrix} \\\\]\n\nFor this exercise, you'll combine the `zeros` and `ones` arrays both horizontally (column-wise) and vertically (row-wise).\nNote that the result of stacking two arrays is an `ndarray`.  If you need the result to be a matrix, you can call `np.matrix()` on the result, which will return a NumPy matrix."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\nzeros = np.zeros(8)\nones = np.ones(8)\nprint 'zeros:\\n{0}'.format(zeros)\nprint '\\nones:\\n{0}'.format(ones)\n\nzerosThenOnes = np.hstack((zeros, ones))   # A 1 by 16 array\nzerosAboveOnes = np.vstack((zeros, ones))  # A 2 by 8 array\n\nprint '\\nzerosThenOnes:\\n{0}'.format(zerosThenOnes)\nprint '\\nzerosAboveOnes:\\n{0}'.format(zerosAboveOnes)"],"metadata":{},"outputs":[],"execution_count":33},{"cell_type":"code","source":["# TEST Combining ndarray objects (3b)\nTest.assertTrue(np.all(zerosThenOnes == [0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]),\n                'incorrect value for zerosThenOnes')\nTest.assertTrue(np.all(zerosAboveOnes == [[0,0,0,0,0,0,0,0],[1,1,1,1,1,1,1,1]]),\n                'incorrect value for zerosAboveOnes')"],"metadata":{},"outputs":[],"execution_count":34},{"cell_type":"markdown","source":["### (3c) PySpark's DenseVector\n\nIn frequent ML scenarios, you may end up with very long vectors, possibly 100k's to millions, where most of the values are zeroes.  PySpark provides a [DenseVector](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.DenseVector) class (in module the module [pyspark.mllib.linalg](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.linalg)), which allows you to more efficiently operate and store these sparse vectors.\n\n`DenseVector` is used to store arrays of values for use in PySpark.  `DenseVector` actually stores values in a NumPy array and delegates calculations to that object.  You can create a new `DenseVector` using `DenseVector()` and passing in a NumPy array or a Python list.\n\n`DenseVector` implements several functions.  The only function needed for this course is `DenseVector.dot()`, which operates just like `np.ndarray.dot()`.\nNote that `DenseVector` stores all values as `np.float64`, so even if you pass in an NumPy array of integers, the resulting `DenseVector` will contain floating-point numbers. Also, `DenseVector` objects exist locally and are not inherently distributed.  `DenseVector` objects can be used in the distributed setting by either passing functions that contain them to resilient distributed dataset (RDD) transformations or by distributing them directly as RDDs.  You'll learn more about RDDs in the spark tutorial.\n\nFor this exercise, create a `DenseVector` consisting of the values `[3.0, 4.0, 5.0]` and compute the dot product of this vector with `numpyVector`."],"metadata":{}},{"cell_type":"code","source":["from pyspark.mllib.linalg import DenseVector"],"metadata":{},"outputs":[],"execution_count":36},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\nnumpyVector = np.array([-3, -4, 5])\nprint '\\nnumpyVector:\\n{0}'.format(numpyVector)\n\n# Create a DenseVector consisting of the values [3.0, 4.0, 5.0]\nmyDenseVector = DenseVector([3.0,4.0,5.0])\n# Calculate the dot product between the two vectors.\ndenseDotProduct = np.dot(numpyVector, myDenseVector)\n\nprint 'myDenseVector:\\n{0}'.format(myDenseVector)\nprint '\\ndenseDotProduct:\\n{0}'.format(denseDotProduct)"],"metadata":{},"outputs":[],"execution_count":37},{"cell_type":"code","source":["# TEST PySpark's DenseVector (3c)\nTest.assertTrue(isinstance(myDenseVector, DenseVector), 'myDenseVector is not a DenseVector')\nTest.assertTrue(np.allclose(myDenseVector, np.array([3., 4., 5.])),\n                'incorrect value for myDenseVector')\nTest.assertTrue(np.allclose(denseDotProduct, 0.0), 'incorrect value for denseDotProduct')"],"metadata":{},"outputs":[],"execution_count":38},{"cell_type":"markdown","source":["## Part 4: Python lambda expressions"],"metadata":{}},{"cell_type":"markdown","source":["### (4a) Lambda is an anonymous function\n\nWe can use a lambda expression to create a function.  To do this, you type `lambda` followed by the names of the function's parameters separated by commas, followed by a `:`, and then the expression statement that the function will evaluate.  For example, `lambda x, y: x + y` is an anonymous function that computes the sum of its two inputs.\n\nLambda expressions return a function when evaluated.  The function is not bound to any variable, which is why lambdas are associated with anonymous functions.  However, it is possible to assign the function to a variable.  Lambda expressions are particularly useful when you need to pass a simple function into another function.  In that case, the lambda expression generates a function that is bound to the parameter being passed into the function.\n\nBelow, we'll see an example of how we can bind the function returned by a lambda expression to a variable named `addSLambda`.  From this example, we can see that `lambda` provides a shortcut for creating a simple function.  Note that the behavior of the function created using `def` and the function created using `lambda` is equivalent.  Both functions have the same type and return the same results.  The only differences are the names and the way they were created.\nFor this exercise, first run the two cells below to compare a function created using `def` with a corresponding anonymous function.  Next, write your own lambda expression that creates a function that multiplies its input (a single parameter) by 10.\n\nHere are some additional references that explain lambdas: [Lambda Functions](http://www.secnetix.de/olli/Python/lambda_functions.hawk), [Lambda Tutorial](https://pythonconquerstheuniverse.wordpress.com/2011/08/29/lambda_tutorial/), and [Python Functions](http://www.bogotobogo.com/python/python_functions_lambda.php)."],"metadata":{}},{"cell_type":"code","source":["# Example function\ndef addS(x):\n    return x + 's'\nprint type(addS)\nprint addS\nprint addS('cat')"],"metadata":{},"outputs":[],"execution_count":41},{"cell_type":"code","source":["# As a lambda\naddSLambda = lambda x: x + 's'\nprint type(addSLambda)\nprint addSLambda\nprint addSLambda('cat')"],"metadata":{},"outputs":[],"execution_count":42},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\n# Recall that: \"lambda x, y: x + y\" creates a function that adds together two numbers\nmultiplyByTen = lambda x: x *10\nprint multiplyByTen(5)\n\n# Note that the function still shows its name as <lambda>\nprint '\\n', multiplyByTen"],"metadata":{},"outputs":[],"execution_count":43},{"cell_type":"code","source":["# TEST Python lambda expressions (4a)\nTest.assertEquals(multiplyByTen(10), 100, 'incorrect definition for multiplyByTen')"],"metadata":{},"outputs":[],"execution_count":44},{"cell_type":"markdown","source":["### (4b) `lambda` fewer steps than `def`\n\n`lambda` generates a function and returns it, while `def` generates a function and assigns it to a name.  The function returned by `lambda` also automatically returns the value of its expression statement, which reduces the amount of code that needs to be written.\n\nFor this exercise, recreate the `def` behavior using `lambda`.  Note that since a lambda expression returns a function, it can be used anywhere an object is expected. For example, you can create a list of functions where each function in the list was generated by a lambda expression."],"metadata":{}},{"cell_type":"code","source":["# Code using def that we will recreate with lambdas\ndef plus(x, y):\n    return x + y\n\ndef minus(x, y):\n    return x - y\n\nfunctions = [plus, minus]\nprint functions[0](4, 5)\nprint functions[1](4, 5)"],"metadata":{},"outputs":[],"execution_count":46},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\n# The first function should add two values, while the second function should subtract the second\n# value from the first value.\nlambdaFunctions = [lambda x,y: x + y ,  lambda x,y: x - y]\nprint lambdaFunctions[0](4, 5)\nprint lambdaFunctions[1](4, 5)"],"metadata":{},"outputs":[],"execution_count":47},{"cell_type":"code","source":["# TEST lambda fewer steps than def (4b)\nTest.assertEquals(lambdaFunctions[0](10, 10), 20, 'incorrect first lambdaFunction')\nTest.assertEquals(lambdaFunctions[1](10, 10), 0, 'incorrect second lambdaFunction')"],"metadata":{},"outputs":[],"execution_count":48},{"cell_type":"markdown","source":["### (4c) Lambda expression arguments\n\nLambda expressions can be used to generate functions that take in zero or more parameters.  The syntax for `lambda` allows for multiple ways to define the same function.  For example, we might want to create a function that takes in a single parameter, where the parameter is a tuple consisting of two values, and the function adds the two values together.  The syntax could be either: `lambda x: x[0] + x[1]` or `lambda (x0, x1): x0 + x1`.  If we called either function on the tuple `(3, 4)` it would return `7`.  Note that the second `lambda` relies on the tuple `(3, 4)` being unpacked automatically, which means that `x0` is assigned the value `3` and `x1` is assigned the value `4`.\n\nAs an other example, consider the following parameter lambda expressions: `lambda x, y: (x[0] + y[0], x[1] + y[1])` and `lambda (x0, x1), (y0, y1): (x0 + y0, x1 + y1)`.  The result of applying either of these functions to tuples  `(1, 2)` and `(3, 4)` would be the tuple `(4, 6)`.\n\nFor this exercise: you'll create one-parameter functions `swap1` and `swap2` that swap the order of a tuple; a one-parameter function `swapOrder` that takes in a tuple with three values and changes the order to: second element, third element, first element; and finally, a three-parameter function `sumThree` that takes in three tuples, each with two values, and returns a tuple containing two values: the sum of the first element of each tuple and the sum of second element of each tuple."],"metadata":{}},{"cell_type":"code","source":["# Examples.  Note that the spacing has been modified to distinguish parameters from tuples.\n\n# One-parameter function\na1 = lambda x: x[0] + x[1]\na2 = lambda (x0, x1): x0 + x1\nprint 'a1( (3,4) ) = {0}'.format( a1( (3,4) ) )\nprint 'a2( (3,4) ) = {0}'.format( a2( (3,4) ) )\n\n# Two-parameter function\nb1 = lambda x, y: (x[0] + y[0], x[1] + y[1])\nb2 = lambda (x0, x1), (y0, y1): (x0 + y0, x1 + y1)\nprint '\\nb1( (1,2), (3,4) ) = {0}'.format( b1( (1,2), (3,4) ) )\nprint 'b2( (1,2), (3,4) ) = {0}'.format( b2( (1,2), (3,4) ) )"],"metadata":{},"outputs":[],"execution_count":50},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\n# Use both syntaxes to create a function that takes in a tuple of two values and swaps their order\n# E.g. (1, 2) => (2, 1)\nswap1 = lambda x: (x[1],x[0])\nswap2 = lambda (x0, x1): (x1,x0)\nprint 'swap1((1, 2)) = {0}'.format(swap1((1, 2)))\nprint 'swap2((1, 2)) = {0}'.format(swap2((1, 2)))\n\n# Using either syntax, create a function that takes in a tuple with three values and returns a tuple\n# of (2nd value, 3rd value, 1st value).  E.g. (1, 2, 3) => (2, 3, 1)\nswapOrder = lambda (x0, x1,x2): (x1,x2,x0)\nprint 'swapOrder((1, 2, 3)) = {0}'.format(swapOrder((1, 2, 3)))\n\n# Using either syntax, create a function that takes in three tuples each with two values.  The\n# function should return a tuple with the values in the first position summed and the values in the\n# second position summed. E.g. (1, 2), (3, 4), (5, 6) => (1 + 3 + 5, 2 + 4 + 6) => (9, 12)\nsumThree = lambda x0,x1,x2: (x0[0]+x1[0]+x2[0],x0[1]+x1[1]+x2[1])\nprint 'sumThree((1, 2), (3, 4), (5, 6)) = {0}'.format(sumThree((1, 2), (3, 4), (5, 6)))"],"metadata":{},"outputs":[],"execution_count":51},{"cell_type":"code","source":["# TEST Lambda expression arguments (4c)\nTest.assertEquals(swap1((1, 2)), (2, 1), 'incorrect definition for swap1')\nTest.assertEquals(swap2((1, 2)), (2, 1), 'incorrect definition for swap2')\nTest.assertEquals(swapOrder((1, 2, 3)), (2, 3, 1), 'incorrect definition for swapOrder')\nTest.assertEquals(sumThree((1, 2), (3, 4), (5, 6)), (9, 12), 'incorrect definition for sumThree')"],"metadata":{},"outputs":[],"execution_count":52},{"cell_type":"markdown","source":["### (4d) Restrictions on lambda expressions\n\n[Lambda expressions](https://docs.python.org/2/reference/expressions.html#lambda) consist of a single [expression statement](https://docs.python.org/2/reference/simple_stmts.html#expression-statements) and cannot contain other [simple statements](https://docs.python.org/2/reference/simple_stmts.html).  In short, this means that the lambda expression needs to evaluate to a value and exist on a single logical line.  If more complex logic is necessary, use `def` in place of `lambda`.\n\nExpression statements evaluate to a value (sometimes that value is None).  Lambda expressions automatically return the value of their expression statement.  In fact, a `return` statement in a `lambda` would raise a `SyntaxError`.\n\n The following Python keywords refer to simple statements that cannot be used in a lambda expression: `assert`, `pass`, `del`, `print`, `return`, `yield`, `raise`, `break`, `continue`, `import`, `global`, and `exec`.  Also, note that assignment statements (`=`) and augmented assignment statements (e.g. `+=`) cannot be used either."],"metadata":{}},{"cell_type":"code","source":["# Just run this code\n# This code will fail with a syntax error, as we can't use print in a lambda expression\nimport traceback\ntry:\n    exec \"lambda x: print x\"\nexcept:\n    traceback.print_exc()"],"metadata":{},"outputs":[],"execution_count":54},{"cell_type":"markdown","source":["### (4e) Functional programming\n\nThe `lambda` examples we have shown so far have been somewhat contrived.  This is because they were created to demonstrate the differences and similarities between `lambda` and `def`.  An excellent use case for lambda expressions is functional programming.  In functional programming, you will often pass functions to other functions as parameters, and `lambda` can be used to reduce the amount of code necessary and to make the code more readable.\nSome commonly used functions in functional programming are map, filter, and reduce.  Map transforms a series of elements by applying a function individually to each element in the series.  It then returns the series of transformed elements.  Filter also applies a function individually to each element in a series; however, with filter, this function evaluates to `True` or `False` and only elements that evaluate to `True` are retained.  Finally, reduce operates on pairs of elements in a series.  It applies a function that takes in two values and returns a single value.  Using this function, reduce is able to, iteratively, \"reduce\" a series to a single value.\n\nFor this exercise, you'll create three simple `lambda` functions, one each for use in map, filter, and reduce.  The map `lambda` will multiply its input by 5, the filter `lambda` will evaluate to `True` for even numbers, and the reduce `lambda` will add two numbers.\n\n> Note:\n> * We have created a class called `FunctionalWrapper` so that the syntax for this exercise matches the syntax you'll see in PySpark.\n> * Map requires a one parameter function that returns a new value, filter requires a one parameter function that returns `True` or `False`, and reduce requires a two parameter function that combines the two parameters and returns a new value."],"metadata":{}},{"cell_type":"code","source":["# Create a class to give our examples the same syntax as PySpark\nclass FunctionalWrapper(object):\n    def __init__(self, data):\n        self.data = data\n    def map(self, function):\n        \"\"\"Call `map` on the items in `data` using the provided `function`\"\"\"\n        return FunctionalWrapper(map(function, self.data))\n    def reduce(self, function):\n        \"\"\"Call `reduce` on the items in `data` using the provided `function`\"\"\"\n        return reduce(function, self.data)\n    def filter(self, function):\n        \"\"\"Call `filter` on the items in `data` using the provided `function`\"\"\"\n        return FunctionalWrapper(filter(function, self.data))\n    def __eq__(self, other):\n        return (isinstance(other, self.__class__)\n            and self.__dict__ == other.__dict__)\n    def __getattr__(self, name):  return getattr(self.data, name)\n    def __getitem__(self, k):  return self.data.__getitem__(k)\n    def __repr__(self):  return 'FunctionalWrapper({0})'.format(repr(self.data))\n    def __str__(self):  return 'FunctionalWrapper({0})'.format(str(self.data))"],"metadata":{},"outputs":[],"execution_count":56},{"cell_type":"code","source":["# Map example\n\n# Create some data\nmapData = FunctionalWrapper(range(5))\n\n# Define a function to be applied to each element\nf = lambda x: x + 3\n\n# Imperative programming: loop through and create a new object by applying f\nmapResult = FunctionalWrapper([])  # Initialize the result\nfor element in mapData:\n    mapResult.append(f(element))  # Apply f and save the new value\nprint 'Result from for loop: {0}'.format(mapResult)\n\n# Functional programming: use map rather than a for loop\nprint 'Result from map call: {0}'.format(mapData.map(f))\n\n# Note that the results are the same but that the map function abstracts away the implementation\n# and requires less code"],"metadata":{},"outputs":[],"execution_count":57},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\ndataset = FunctionalWrapper(range(10))\n\n# Multiply each element by 5\nmapResult = dataset.map(lambda x: x*5)\n# Keep the even elements\n# Note that \"x % 2\" evaluates to the remainder of x divided by 2\nfilterResult = dataset.filter(lambda x: x%2==0)\n# Sum the elements\nreduceResult = dataset.reduce(lambda x,y: x+y)\n\nprint 'mapResult: {0}'.format(mapResult)\nprint '\\nfilterResult: {0}'.format(filterResult)\nprint '\\nreduceResult: {0}'.format(reduceResult)"],"metadata":{},"outputs":[],"execution_count":58},{"cell_type":"code","source":["# TEST Functional programming (4e)\nTest.assertEquals(mapResult, FunctionalWrapper([0, 5, 10, 15, 20, 25, 30, 35, 40, 45]),\n                  'incorrect value for mapResult')\nTest.assertEquals(filterResult, FunctionalWrapper([0, 2, 4, 6, 8]),\n                  'incorrect value for filterResult')\nTest.assertEquals(reduceResult, 45, 'incorrect value for reduceResult')"],"metadata":{},"outputs":[],"execution_count":59},{"cell_type":"markdown","source":["### (4f) Composability\n\nSince our methods for map and filter in the `FunctionalWrapper` class return `FunctionalWrapper` objects, we can compose (or chain) together our function calls.  For example, `dataset.map(f1).filter(f2).reduce(f3)`, where `f1`, `f2`, and `f3` are functions or lambda expressions, first applies a map operation to `dataset`, then filters the result from map, and finally reduces the result from the first two operations.\n\n Note that when we compose (chain) an operation, the output of one operation becomes the input for the next operation, and operations are applied from left to right.  It's likely you've seen chaining used with Python strings.  For example, `'Split this'.lower().split(' ')` first returns a new string object `'split this'` and then `split(' ')` is called on that string to produce `['split', 'this']`.\n\nFor this exercise, reuse your lambda expressions from (4e) but apply them to `dataset` in the sequence: map, filter, reduce.\n\n> Note:\n> * Since we are composing the operations our result will be different than in (4e).\n> * We can write our operations on separate lines to improve readability."],"metadata":{}},{"cell_type":"code","source":["# Example of a multi-line expression statement\n# Note that placing parentheses around the expression allows it to exist on multiple lines without\n# causing a syntax error.\n(dataset\n .map(lambda x: x + 2)\n .reduce(lambda x, y: x * y))"],"metadata":{},"outputs":[],"execution_count":61},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\n# Multiply the elements in dataset by five, keep just the even values, and sum those values\nfinalSum = dataset.map(lambda x: x*5).filter(lambda x: x%2 ==0).reduce(lambda x,y: x+y)\nprint finalSum"],"metadata":{},"outputs":[],"execution_count":62},{"cell_type":"code","source":["# TEST Composability (4f)\nTest.assertEquals(finalSum, 100, 'incorrect value for finalSum')"],"metadata":{},"outputs":[],"execution_count":63},{"cell_type":"markdown","source":["## Appendix A: Submitting Your Exercises to the Autograder\n\nThis section guides you through Step 2 of the grading process (\"Submit to Autograder\").\n\nOnce you confirm that your lab notebook is passing all tests, you can submit it first to the course autograder and then second to the edX website to receive a grade.\n\n** Note that you can only submit to the course autograder once every 1 minute. **"],"metadata":{}},{"cell_type":"markdown","source":["### Step 2(a): Restart your cluster by clicking on the dropdown next to your cluster name and selecting \"Restart Cluster\".\n\nYou can do this step in either notebook, since there is one cluster for your notebooks.\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/submit_restart.png\" alt=\"Drawing\" />"],"metadata":{}},{"cell_type":"markdown","source":["### Step 2(b): _IN THIS NOTEBOOK_, click on \"Run All\" to run all of the cells.\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/submit_runall.png\" alt=\"Drawing\" style=\"height: 80px\"/>\n\nThis step will take some time.\n\nWait for your cluster to finish running the cells in your lab notebook before proceeding."],"metadata":{}},{"cell_type":"markdown","source":["### Step 2(c): Publish this notebook\n\nPublish _this_ notebook by clicking on the \"Publish\" button at the top.\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/Lab0_Publish0.png\" alt=\"Drawing\" style=\"height: 150px\"/>\n\nWhen you click on the button, you will see the following popup.\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/Lab0_Publish1.png\" alt=\"Drawing\" />\n\nWhen you click on \"Publish\", you will see a popup with your notebook's public link. **Copy the link and set the `notebook_URL` variable in the AUTOGRADER notebook (not this notebook).**\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/Lab0_Publish2.png\" alt=\"Drawing\" />"],"metadata":{}},{"cell_type":"markdown","source":["### Step 2(d): Set the notebook URL and Lab ID in the Autograder notebook, and run it\n\nGo to the Autograder notebook and paste the link you just copied into it, so that it is assigned to the `notebook_url` variable.\n\n```\nnotebook_url = \"...\" # put your URL here\n```\n\nThen, find the line that looks like this:\n\n```\nlab = <FILL IN>\n```\nand change `<FILL IN>` to \"CS120x-lab1a\":\n\n```\nlab = \"CS120x-lab1a\"\n```\n\nThen, run the Autograder notebook to submit your lab."],"metadata":{}},{"cell_type":"markdown","source":["### <img src=\"http://spark-mooc.github.io/web-assets/images/oops.png\" style=\"height: 200px\"/> If things go wrong\n\nIt's possible that your notebook looks fine to you, but fails in the autograder. (This can happen when you run cells out of order, as you're working on your notebook.) If that happens, just try again, starting at the top of Appendix A."],"metadata":{}}],"metadata":{"name":"cs120_lab1a_math_review","notebookId":3356982636598526},"nbformat":4,"nbformat_minor":0}
2 | 


--------------------------------------------------------------------------------
/cs120_lab1b_word_count_rdd.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"cell_type":"markdown","source":["<a rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-nd/4.0/\"><img alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png\" /></a><br />This work is licensed under a <a rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-nd/4.0/\">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>."],"metadata":{}},{"cell_type":"markdown","source":["#![Spark Logo](http://spark-mooc.github.io/web-assets/images/ta_Spark-logo-small.png) + ![Python Logo](http://spark-mooc.github.io/web-assets/images/python-logo-master-v3-TM-flattened_small.png)\n# Word Count Lab: Building a word count application\n\nThis lab will build on the techniques covered in the Spark tutorial to develop a simple word count application.  The volume of unstructured text in existence is growing dramatically, and Spark is an excellent tool for analyzing this type of data.  In this lab, we will write code that calculates the most common words in the [Complete Works of William Shakespeare](http://www.gutenberg.org/ebooks/100) retrieved from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page).\n\nThis could also be scaled to find the most common words in Wikipedia.\n\n## During this lab we will cover:\n* *Part 1:* Creating a base RDD and pair RDDs\n* *Part 2:* Counting with pair RDDs\n* *Part 3:* Finding unique words and a mean value\n* *Part 4:* Apply word count to a file\n* *Appendix A:* Submitting your exercises to the Autograder\n\n> Note that for reference, you can look up the details of the relevant methods in:\n> * [Spark's Python API](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD)"],"metadata":{}},{"cell_type":"code","source":["labVersion = 'cs120x-lab1b-1.0.0'"],"metadata":{},"outputs":[],"execution_count":3},{"cell_type":"markdown","source":["## Part 1: Creating a base RDD and pair RDDs"],"metadata":{}},{"cell_type":"markdown","source":["In this part of the lab, we will explore creating a base RDD with `parallelize` and using pair RDDs to count words."],"metadata":{}},{"cell_type":"markdown","source":["### (1a) Create a base RDD\nWe'll start by generating a base RDD by using a Python list and the `sc.parallelize` method.  Then we'll print out the type of the base RDD."],"metadata":{}},{"cell_type":"code","source":["wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']\nwordsRDD = sc.parallelize(wordsList, 4)\n# Print out the type of wordsRDD\nprint type(wordsRDD)"],"metadata":{},"outputs":[],"execution_count":7},{"cell_type":"markdown","source":["### (1b) Pluralize and test\n\nLet's use a `map()` transformation to add the letter 's' to each string in the base RDD we just created. We'll define a Python function that returns the word with an 's' at the end of the word.  Please replace `<FILL IN>` with your solution.  If you have trouble, the next cell has the solution.  After you have defined `makePlural` you can run the third cell which contains a test.  If you implementation is correct it will print `1 test passed`.\n\nThis is the general form that exercises will take, except that no example solution will be provided.  Exercises will include an explanation of what is expected, followed by code cells where one cell will have one or more `<FILL IN>` sections.  The cell that needs to be modified will have `# TODO: Replace <FILL IN> with appropriate code` on its first line.  Once the `<FILL IN>` sections are updated and the code is run, the test cell can then be run to verify the correctness of your solution.  The last code cell before the next markdown section will contain the tests."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\ndef makePlural(word):\n    \"\"\"Adds an 's' to `word`.\n    \n\n    Note:\n        This is a simple function that only adds an 's'.  No attempt is made to follow proper\n        pluralization rules.\n\n    Args:\n        word (str): A string.\n\n    Returns:\n        str: A string with 's' added to it.\n    \"\"\"\n    word = word + 's'\n    return word\n\nprint makePlural('cat')"],"metadata":{},"outputs":[],"execution_count":9},{"cell_type":"code","source":["# One way of completing the function\ndef makePlural(word):\n    return word + 's'\n\nprint makePlural('cat')"],"metadata":{},"outputs":[],"execution_count":10},{"cell_type":"code","source":["# Load in the testing code and check to see if your answer is correct\n# If incorrect it will report back '1 test failed' for each failed test\n# Make sure to rerun any cell you change before trying the test again\nfrom databricks_test_helper import Test\n# TEST Pluralize and test (1b)\nTest.assertEquals(makePlural('rat'), 'rats', 'incorrect result: makePlural does not add an s')"],"metadata":{},"outputs":[],"execution_count":11},{"cell_type":"markdown","source":["### (1c) Apply `makePlural` to the base RDD\n\nNow pass each item in the base RDD into a [map()](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.map) transformation that applies the `makePlural()` function to each element. And then call the [collect()](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.collect) action to see the transformed RDD."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\npluralRDD = wordsRDD.map(makePlural\n                        )\nprint pluralRDD.collect()"],"metadata":{},"outputs":[],"execution_count":13},{"cell_type":"code","source":["# TEST Apply makePlural to the base RDD(1c)\nTest.assertEquals(pluralRDD.collect(), ['cats', 'elephants', 'rats', 'rats', 'cats'],\n                  'incorrect values for pluralRDD')"],"metadata":{},"outputs":[],"execution_count":14},{"cell_type":"markdown","source":["### (1d) Pass a `lambda` function to `map`\n\nLet's create the same RDD using a `lambda` function."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\npluralLambdaRDD = wordsRDD.map(lambda x:x+'s')\nprint pluralLambdaRDD.collect()"],"metadata":{},"outputs":[],"execution_count":16},{"cell_type":"code","source":["# TEST Pass a lambda function to map (1d)\nTest.assertEquals(pluralLambdaRDD.collect(), ['cats', 'elephants', 'rats', 'rats', 'cats'],\n                  'incorrect values for pluralLambdaRDD (1d)')"],"metadata":{},"outputs":[],"execution_count":17},{"cell_type":"markdown","source":["### (1e) Length of each word\n\nNow use `map()` and a `lambda` function to return the number of characters in each word.  We'll `collect` this result directly into a variable."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\npluralLengths = (pluralRDD\n                 .map(lambda x: len(x))\n                 .collect())\nprint pluralLengths"],"metadata":{},"outputs":[],"execution_count":19},{"cell_type":"code","source":["# TEST Length of each word (1e)\n\nTest.assertEquals(pluralLengths, [4, 9, 4, 4, 4],\n                  'incorrect values for pluralLengths')"],"metadata":{},"outputs":[],"execution_count":20},{"cell_type":"markdown","source":["### (1f) Pair RDDs\n\nThe next step in writing our word counting program is to create a new type of RDD, called a pair RDD. A pair RDD is an RDD where each element is a pair tuple `(k, v)` where `k` is the key and `v` is the value. In this example, we will create a pair consisting of `('<word>', 1)` for each word element in the RDD.\nWe can create the pair RDD using the `map()` transformation with a `lambda()` function to create a new RDD."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\nwordPairs = wordsRDD.map(lambda x: (x, 1))\nprint wordPairs.collect()"],"metadata":{},"outputs":[],"execution_count":22},{"cell_type":"code","source":["# TEST Pair RDDs (1f)\nTest.assertEquals(wordPairs.collect(),\n                  [('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)],\n                  'incorrect value for wordPairs')"],"metadata":{},"outputs":[],"execution_count":23},{"cell_type":"markdown","source":["## Part 2: Counting with pair RDDs"],"metadata":{}},{"cell_type":"markdown","source":["Now, let's count the number of times a particular word appears in the RDD. There are multiple ways to perform the counting, but some are much less efficient than others.\n\nA naive approach would be to `collect()` all of the elements and count them in the driver program. While this approach could work for small datasets, we want an approach that will work for any size dataset including terabyte- or petabyte-sized datasets. In addition, performing all of the work in the driver program is slower than performing it in parallel in the workers. For these reasons, we will use data parallel operations."],"metadata":{}},{"cell_type":"markdown","source":["### (2a) `groupByKey()` approach\nAn approach you might first consider (we'll see shortly that there are better ways) is based on using the [groupByKey()](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.groupByKey) transformation. As the name implies, the `groupByKey()` transformation groups all the elements of the RDD with the same key into a single list in one of the partitions.\n\nThere are two problems with using `groupByKey()`:\n  + The operation requires a lot of data movement to move all the values into the appropriate partitions.\n  + The lists can be very large. Consider a word count of English Wikipedia: the lists for common words (e.g., the, a, etc.) would be huge and could exhaust the available memory in a worker.\n\nUse `groupByKey()` to generate a pair RDD of type `('word', iterator)`."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\n# Note that groupByKey requires no parameters\nwordsGrouped = wordPairs.groupByKey()\nfor key, value in wordsGrouped.collect():\n    print '{0}: {1}'.format(key, list(value))"],"metadata":{},"outputs":[],"execution_count":27},{"cell_type":"code","source":["# TEST groupByKey() approach (2a)\nTest.assertEquals(sorted(wordsGrouped.mapValues(lambda x: list(x)).collect()),\n                  [('cat', [1, 1]), ('elephant', [1]), ('rat', [1, 1])],\n                  'incorrect value for wordsGrouped')"],"metadata":{},"outputs":[],"execution_count":28},{"cell_type":"markdown","source":["### (2b) Use `groupByKey()` to obtain the counts\n\nUsing the `groupByKey()` transformation creates an RDD containing 3 elements, each of which is a pair of a word and a Python iterator.\n\nNow sum the iterator using a `map()` transformation.  The result should be a pair RDD consisting of (word, count) pairs."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\nwordCountsGrouped = wordPairs.groupByKey().map(lambda x:(x[0],sum(x[1])))\nprint wordCountsGrouped.collect()"],"metadata":{},"outputs":[],"execution_count":30},{"cell_type":"code","source":["# TEST Use groupByKey() to obtain the counts (2b)\nTest.assertEquals(sorted(wordCountsGrouped.collect()),\n                  [('cat', 2), ('elephant', 1), ('rat', 2)],\n                  'incorrect value for wordCountsGrouped')"],"metadata":{},"outputs":[],"execution_count":31},{"cell_type":"markdown","source":["** (2c) Counting using `reduceByKey` **\n\nA better approach is to start from the pair RDD and then use the [reduceByKey()](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.reduceByKey) transformation to create a new pair RDD. The `reduceByKey()` transformation gathers together pairs that have the same key and applies the function provided to two values at a time, iteratively reducing all of the values to a single value. `reduceByKey()` operates by applying the function first within each partition on a per-key basis and then across the partitions, allowing it to scale efficiently to large datasets."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\n# Note that reduceByKey takes in a function that accepts two values and returns a single value\nwordCounts = wordPairs.reduceByKey(lambda x,y:x+y)\nprint wordCounts.collect()"],"metadata":{},"outputs":[],"execution_count":33},{"cell_type":"code","source":["# TEST Counting using reduceByKey (2c)\nTest.assertEquals(sorted(wordCounts.collect()), [('cat', 2), ('elephant', 1), ('rat', 2)],\n                  'incorrect value for wordCounts')"],"metadata":{},"outputs":[],"execution_count":34},{"cell_type":"markdown","source":["### (2d) All together\n\nThe expert version of the code performs the `map()` to pair RDD, `reduceByKey()` transformation, and `collect` in one statement."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\nwordCountsCollected = (wordsRDD\n                       .map(lambda x: (x,1))\n                       .reduceByKey(lambda x,y:x+y)\n                       .collect())\nprint wordCountsCollected"],"metadata":{},"outputs":[],"execution_count":36},{"cell_type":"code","source":["# TEST All together (2d)\nTest.assertEquals(sorted(wordCountsCollected), [('cat', 2), ('elephant', 1), ('rat', 2)],\n                  'incorrect value for wordCountsCollected')"],"metadata":{},"outputs":[],"execution_count":37},{"cell_type":"markdown","source":["## Part 3: Finding unique words and a mean value"],"metadata":{}},{"cell_type":"markdown","source":["### (3a) Unique words\n\nCalculate the number of unique words in `wordsRDD`.  You can use other RDDs that you have already created to make this easier."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\nuniqueWords = wordsRDD.map(lambda x: (x,1)).groupByKey().map(lambda x:x[0]).collect()\nprint uniqueWords"],"metadata":{},"outputs":[],"execution_count":40},{"cell_type":"code","source":["## ANSWER\nuniqueWords = wordCounts.count()\nprint uniqueWords"],"metadata":{},"outputs":[],"execution_count":41},{"cell_type":"code","source":["# TEST Unique words (3a)\nTest.assertEquals(uniqueWords, 3, 'incorrect count of uniqueWords')"],"metadata":{},"outputs":[],"execution_count":42},{"cell_type":"code","source":["wordCounts.collect()"],"metadata":{},"outputs":[],"execution_count":43},{"cell_type":"markdown","source":["### (3b) Mean using `reduce`\n\nFind the mean number of words per unique word in `wordCounts`.\n\nUse a `reduce()` action to sum the counts in `wordCounts` and then divide by the number of unique words.  First `map()` the pair RDD `wordCounts`, which consists of (key, value) pairs, to an RDD of values."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\nfrom operator import add\ntotalCount = wordCounts.map(lambda x:x[1]).reduce(add)\naverage = totalCount / float(wordCounts.count())\nprint totalCount\nprint round(average, 2)"],"metadata":{},"outputs":[],"execution_count":45},{"cell_type":"code","source":["# TEST Mean using reduce (3b)\nTest.assertEquals(round(average, 2), 1.67, 'incorrect value of average')"],"metadata":{},"outputs":[],"execution_count":46},{"cell_type":"markdown","source":["## Part 4: Apply word count to a file"],"metadata":{}},{"cell_type":"markdown","source":["In this section we will finish developing our word count application.  We'll have to build the `wordCount` function, deal with real world problems like capitalization and punctuation, load in our data source, and compute the word count on the new data."],"metadata":{}},{"cell_type":"markdown","source":["### (4a) `wordCount` function\n\nFirst, define a function for word counting.  You should reuse the techniques that have been covered in earlier parts of this lab.  This function should take in an RDD that is a list of words like `wordsRDD` and return a pair RDD that has all of the words and their associated counts."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\ndef wordCount(wordListRDD):\n    \"\"\"Creates a pair RDD with word counts from an RDD of words.\n\n    Args:\n        wordListRDD (RDD of str): An RDD consisting of words.\n\n    Returns:\n        RDD of (str, int): An RDD consisting of (word, count) tuples.\n    \"\"\"\n    return wordListRDD.map(lambda x:(x,1)).reduceByKey(add)\nprint wordCount(wordsRDD).collect()"],"metadata":{},"outputs":[],"execution_count":50},{"cell_type":"code","source":["# TEST wordCount function (4a)\nTest.assertEquals(sorted(wordCount(wordsRDD).collect()),\n                  [('cat', 2), ('elephant', 1), ('rat', 2)],\n                  'incorrect definition for wordCount function')"],"metadata":{},"outputs":[],"execution_count":51},{"cell_type":"markdown","source":["### (4b) Capitalization and punctuation\n\nReal world files are more complicated than the data we have been using in this lab. Some of the issues we have to address are:\n  + Words should be counted independent of their capitialization (e.g., Spark and spark should be counted as the same word).\n  + All punctuation should be removed.\n  + Any leading or trailing spaces on a line should be removed.\n\nDefine the function `removePunctuation` that converts all text to lower case, removes any punctuation, and removes leading and trailing spaces.  Use the Python [re](https://docs.python.org/2/library/re.html) module to remove any text that is not a letter, number, or space. Reading `help(re.sub)` might be useful.\nIf you are unfamiliar with regular expressions, you may want to review [this tutorial](https://developers.google.com/edu/python/regular-expressions) from Google.  Also, [this website](https://regex101.com/#python) is  a great resource for debugging your regular expression."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\nimport re\nimport string\ndef removePunctuation(text):\n    \"\"\"Removes punctuation, changes to lower case, and strips leading and trailing spaces.\n\n    Note:\n        Only spaces, letters, and numbers should be retained.  Other characters should should be\n        eliminated (e.g. it's becomes its).  Leading and trailing spaces should be removed after\n        punctuation is removed.\n\n    Args:\n        text (str): A string.\n\n    Returns:\n        str: The cleaned up string.\n    \"\"\"\n    return text.lower().translate(None, string.punctuation).strip()\n  \nprint removePunctuation('Hi, you!')\nprint removePunctuation(' No under_score!')\nprint removePunctuation(' *      Remove punctuation then spaces  * ')"],"metadata":{},"outputs":[],"execution_count":53},{"cell_type":"code","source":["# TEST Capitalization and punctuation (4b)\nTest.assertEquals(removePunctuation(\" The Elephant's 4 cats. \"),\n                  'the elephants 4 cats',\n                  'incorrect definition for removePunctuation function')"],"metadata":{},"outputs":[],"execution_count":54},{"cell_type":"markdown","source":["### (4c) Load a text file\n\nFor the next part of this lab, we will use the [Complete Works of William Shakespeare](http://www.gutenberg.org/ebooks/100) from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page). To convert a text file into an RDD, we use the `SparkContext.textFile()` method. We also apply the recently defined `removePunctuation()` function using a `map()` transformation to strip out the punctuation and change all text to lower case.  Since the file is large we use `take(15)`, so that we only print 15 lines."],"metadata":{}},{"cell_type":"code","source":["%fs"],"metadata":{},"outputs":[],"execution_count":56},{"cell_type":"code","source":["# Just run this code\nimport os.path\nfileName = \"dbfs:/\" + os.path.join('databricks-datasets', 'cs100', 'lab1', 'data-001', 'shakespeare.txt')\n\nshakespeareRDD = sc.textFile(fileName, 8).map(removePunctuation)\nprint '\\n'.join(shakespeareRDD.zipWithIndex().map(lambda (l, num): '{0}: {1}'.format(num, l))).take(15)"],"metadata":{},"outputs":[],"execution_count":57},{"cell_type":"markdown","source":["### (4d) Words from lines\n\nBefore we can use the `wordcount()` function, we have to address two issues with the format of the RDD:\n  + The first issue is that  that we need to split each line by its spaces. ** Performed in (4d). **\n  + The second issue is we need to filter out empty lines. ** Performed in (4e). **\n\nApply a transformation that will split each element of the RDD by its spaces. For each element of the RDD, you should apply Python's string [split()](https://docs.python.org/2/library/string.html#string.split) function. You might think that a `map()` transformation is the way to do this, but think about what the result of the `split()` function will be.\n\n> Note:\n> * Do not use the default implemenation of `split()`, but pass in a separator value.  For example, to split `line` by commas you would use `line.split(',')`."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\nshakespeareWordsRDD = shakespeareRDD.map(lambda x: x.split('\\n'))\nshakespeareWordCount = shakespeareWordsRDD.count()\nprint shakespeareWordsRDD.top(5)\nprint shakespeareWordCount"],"metadata":{},"outputs":[],"execution_count":59},{"cell_type":"code","source":["# TEST Words from lines (4d)\n# This test allows for leading spaces to be removed either before or after\n# punctuation is removed.\nTest.assertTrue(shakespeareWordCount == 927631 or shakespeareWordCount == 928908,\n                'incorrect value for shakespeareWordCount')\nTest.assertEquals(shakespeareWordsRDD.top(5),\n                  [u'zwaggerd', u'zounds', u'zounds', u'zounds', u'zounds'],\n                  'incorrect value for shakespeareWordsRDD')"],"metadata":{},"outputs":[],"execution_count":60},{"cell_type":"markdown","source":["** (4e) Remove empty elements **\n\nThe next step is to filter out the empty elements.  Remove all entries where the word is `''`."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\nshakeWordsRDD = shakespeareWordsRDD.filter(lambda x:x=='')\nshakeWordCount = shakeWordsRDD.count()\nprint shakeWordCount"],"metadata":{},"outputs":[],"execution_count":62},{"cell_type":"code","source":["# TEST Remove empty elements (4e)\nTest.assertEquals(shakeWordCount, 882996, 'incorrect value for shakeWordCount')"],"metadata":{},"outputs":[],"execution_count":63},{"cell_type":"markdown","source":["### (4f) Count the words\n\nWe now have an RDD that is only words.  Next, let's apply the `wordCount()` function to produce a list of word counts. We can view the top 15 words by using the `takeOrdered()` action; however, since the elements of the RDD are pairs, we need a custom sort function that sorts using the value part of the pair.\n\nYou'll notice that many of the words are common English words. These are called stopwords. In a later lab, we will see how to eliminate them from the results.\nUse the `wordCount()` function and `takeOrdered()` to obtain the fifteen most common words and their counts."],"metadata":{}},{"cell_type":"code","source":["# TODO: Replace <FILL IN> with appropriate code\ntop15WordsAndCounts = <FILL IN>\nprint '\\n'.join(map(lambda (w, c): '{0}: {1}'.format(w, c), top15WordsAndCounts))"],"metadata":{},"outputs":[],"execution_count":65},{"cell_type":"code","source":["# TEST Count the words (4f)\nTest.assertEquals(top15WordsAndCounts,\n                  [(u'the', 27361), (u'and', 26028), (u'i', 20681), (u'to', 19150), (u'of', 17463),\n                   (u'a', 14593), (u'you', 13615), (u'my', 12481), (u'in', 10956), (u'that', 10890),\n                   (u'is', 9134), (u'not', 8497), (u'with', 7771), (u'me', 7769), (u'it', 7678)],\n                  'incorrect value for top15WordsAndCounts')"],"metadata":{},"outputs":[],"execution_count":66},{"cell_type":"markdown","source":["## Appendix A: Submitting Your Exercises to the Autograder\n\nThis section guides you through Step 2 of the grading process (\"Submit to Autograder\").\n\nOnce you confirm that your lab notebook is passing all tests, you can submit it first to the course autograder and then second to the edX website to receive a grade.\n\n** Note that you can only submit to the course autograder once every 1 minute. **"],"metadata":{}},{"cell_type":"markdown","source":["### Step 2(a): Restart your cluster by clicking on the dropdown next to your cluster name and selecting \"Restart Cluster\".\n\nYou can do this step in either notebook, since there is one cluster for your notebooks.\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/submit_restart.png\" alt=\"Drawing\" />"],"metadata":{}},{"cell_type":"markdown","source":["### Step 2(b): _IN THIS NOTEBOOK_, click on \"Run All\" to run all of the cells.\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/submit_runall.png\" alt=\"Drawing\" style=\"height: 80px\"/>\n\nThis step will take some time.\n\nWait for your cluster to finish running the cells in your lab notebook before proceeding."],"metadata":{}},{"cell_type":"markdown","source":["### Step 2(c): Publish this notebook\n\nPublish _this_ notebook by clicking on the \"Publish\" button at the top.\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/Lab0_Publish0.png\" alt=\"Drawing\" style=\"height: 150px\"/>\n\nWhen you click on the button, you will see the following popup.\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/Lab0_Publish1.png\" alt=\"Drawing\" />\n\nWhen you click on \"Publish\", you will see a popup with your notebook's public link. **Copy the link and set the `notebook_URL` variable in the AUTOGRADER notebook (not this notebook).**\n\n<img src=\"http://spark-mooc.github.io/web-assets/images/Lab0_Publish2.png\" alt=\"Drawing\" />"],"metadata":{}},{"cell_type":"markdown","source":["### Step 2(d): Set the notebook URL and Lab ID in the Autograder notebook, and run it\n\nGo to the Autograder notebook and paste the link you just copied into it, so that it is assigned to the `notebook_url` variable.\n\n```\nnotebook_url = \"...\" # put your URL here\n```\n\nThen, find the line that looks like this:\n\n```\nlab = <FILL IN>\n```\nand change `<FILL IN>` to \"CS120x-lab1b\":\n\n```\nlab = \"CS120x-lab1b\"\n```\n\nThen, run the Autograder notebook to submit your lab."],"metadata":{}},{"cell_type":"markdown","source":["### <img src=\"http://spark-mooc.github.io/web-assets/images/oops.png\" style=\"height: 200px\"/> If things go wrong\n\nIt's possible that your notebook looks fine to you, but fails in the autograder. (This can happen when you run cells out of order, as you're working on your notebook.) If that happens, just try again, starting at the top of Appendix A."],"metadata":{}}],"metadata":{"name":"cs120_lab1b_word_count_rdd","notebookId":1536738296898082},"nbformat":4,"nbformat_minor":0}
2 | 


--------------------------------------------------------------------------------