├── .gitignore
├── LICENSE
├── Lectures
    ├── Week 10 - Streaming.pdf
    ├── Week 11 - Distributed Message Queues.pdf
    ├── Week 12 - Cloud Functions.pdf
    ├── Week 14 - Data Oriented Querying.pdf
    ├── Week 6 - Advanced Spark.pdf
    ├── Week 8 - NoSQL.pdf
    ├── Week 9 - Clouds.pdf
    ├── Week1.pdf
    ├── Week2 - MapReduce.pdf
    ├── Week3 - Mapreduce Cloud Style.pdf
    ├── Week4 - Data.pdf
    ├── Week5 - Spark.pdf
    └── Week7 - Databases.pdf
├── MPs
    ├── MP0
    │   └── README.md
    ├── MP1
    │   ├── README.md
    │   ├── bigram_count.py
    │   ├── common_friends.py
    │   ├── friend_graph.txt
    │   ├── friend_graph_example.txt
    │   ├── map_reducer.py
    │   ├── sherlock.txt
    │   ├── utils.py
    │   └── word_count.py
    ├── MP2
    │   ├── README.md
    │   ├── book.txt
    │   ├── twitter_follower_mapper.py
    │   ├── twitter_follower_reducer.py
    │   ├── wikipedia_links_mapper.py
    │   ├── wikipedia_links_reducer.py
    │   ├── word_count_mapper.py
    │   └── word_count_reducer.py
    ├── MP3
    │   ├── README.md
    │   ├── prob1_mapper.py
    │   ├── prob1_reducer.py
    │   ├── prob2_mapper.py
    │   ├── prob2_reducer.py
    │   ├── prob3_mapper.py
    │   └── prob3_reducer.py
    ├── MP4
    │   ├── README.md
    │   ├── best_reviews.py
    │   ├── descriptors_of_good_business.py
    │   ├── least_expensive_cities.py
    │   └── yelp_reviewer_accuracy.py
    ├── MP5
    │   ├── README.md
    │   ├── amazon_helpfulness_regression.py
    │   ├── amazon_review_classification.py
    │   ├── bayes_binary_tfidf.py
    │   ├── bayes_tfidf.py
    │   └── yelp_clustering.py
    ├── MP6
    │   ├── README.md
    │   ├── aggregation_aggravation.py
    │   ├── jaunting_with_joins.py
    │   └── quizzical_queries.py
    ├── MP7
    │   ├── README.md
    │   ├── graph.png
    │   ├── main.tf
    │   └── startup.sh
    ├── MP8
    │   ├── README.md
    │   ├── problem1.py
    │   ├── problem2.py
    │   └── problem3.py
    └── MP9
    │   ├── README.md
    │   └── assignment.py
└── README.md


/.gitignore:
--------------------------------------------------------------------------------
1 | __pycache__
2 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | University of Illinois/NCSA Open Source License
 2 | 
 3 | Copyright (c) 2017 LCDM@UIUC
 4 | All rights reserved.
 5 | 
 6 | Developed by: 		LCDM@UIUC - Professor Robert J. Brunner and CS199: ACC Course Staffs
 7 |                     http://lcdm.illinois.edu
 8 | 
 9 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal with the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
10 | 
11 |     * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimers.
12 | 
13 |     * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimers in the documentation and/or other materials provided with the distribution.
14 | 
15 |     * Neither the names of the course development team, LCDM@UIUC, nor the names of its contributors may be used to endorse or promote products derived from this Software without specific prior written permission.
16 | 
17 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE CONTRIBUTORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS WITH THE SOFTWARE.
18 | 


--------------------------------------------------------------------------------
/Lectures/Week 10 - Streaming.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lcdm-uiuc/cs199-fa17/840d8ddf0264bf67f1251121c0f572c4b438901b/Lectures/Week 10 - Streaming.pdf


--------------------------------------------------------------------------------
/Lectures/Week 11 - Distributed Message Queues.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lcdm-uiuc/cs199-fa17/840d8ddf0264bf67f1251121c0f572c4b438901b/Lectures/Week 11 - Distributed Message Queues.pdf


--------------------------------------------------------------------------------
/Lectures/Week 12 - Cloud Functions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lcdm-uiuc/cs199-fa17/840d8ddf0264bf67f1251121c0f572c4b438901b/Lectures/Week 12 - Cloud Functions.pdf


--------------------------------------------------------------------------------
/Lectures/Week 14 - Data Oriented Querying.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lcdm-uiuc/cs199-fa17/840d8ddf0264bf67f1251121c0f572c4b438901b/Lectures/Week 14 - Data Oriented Querying.pdf


--------------------------------------------------------------------------------
/Lectures/Week 6 - Advanced Spark.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lcdm-uiuc/cs199-fa17/840d8ddf0264bf67f1251121c0f572c4b438901b/Lectures/Week 6 - Advanced Spark.pdf


--------------------------------------------------------------------------------
/Lectures/Week 8 - NoSQL.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lcdm-uiuc/cs199-fa17/840d8ddf0264bf67f1251121c0f572c4b438901b/Lectures/Week 8 - NoSQL.pdf


--------------------------------------------------------------------------------
/Lectures/Week 9 - Clouds.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lcdm-uiuc/cs199-fa17/840d8ddf0264bf67f1251121c0f572c4b438901b/Lectures/Week 9 - Clouds.pdf


--------------------------------------------------------------------------------
/Lectures/Week1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lcdm-uiuc/cs199-fa17/840d8ddf0264bf67f1251121c0f572c4b438901b/Lectures/Week1.pdf


--------------------------------------------------------------------------------
/Lectures/Week2 - MapReduce.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lcdm-uiuc/cs199-fa17/840d8ddf0264bf67f1251121c0f572c4b438901b/Lectures/Week2 - MapReduce.pdf


--------------------------------------------------------------------------------
/Lectures/Week3 - Mapreduce Cloud Style.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lcdm-uiuc/cs199-fa17/840d8ddf0264bf67f1251121c0f572c4b438901b/Lectures/Week3 - Mapreduce Cloud Style.pdf


--------------------------------------------------------------------------------
/Lectures/Week4 - Data.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lcdm-uiuc/cs199-fa17/840d8ddf0264bf67f1251121c0f572c4b438901b/Lectures/Week4 - Data.pdf


--------------------------------------------------------------------------------
/Lectures/Week5 - Spark.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lcdm-uiuc/cs199-fa17/840d8ddf0264bf67f1251121c0f572c4b438901b/Lectures/Week5 - Spark.pdf


--------------------------------------------------------------------------------
/Lectures/Week7 - Databases.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lcdm-uiuc/cs199-fa17/840d8ddf0264bf67f1251121c0f572c4b438901b/Lectures/Week7 - Databases.pdf


--------------------------------------------------------------------------------
/MPs/MP0/README.md:
--------------------------------------------------------------------------------
  1 | # MP 0: Introduction to Docker
  2 | 
  3 | ## Introduction
  4 | 
  5 | This lab will introduce Docker and Dockerfiles. You will be creating a container that runs Ubuntu, and python3.
  6 | 
  7 | ### Windows Users Only
  8 | **Disclaimer:** Docker is available for windows, but we recommend using a UNIX-based system.
  9 | 
 10 | Start off by downloading VirtualBox. VirtualBox can be downloaded from this link:
 11 | 
 12 | https://www.virtualbox.org/wiki/Downloads
 13 | 
 14 | For instructions on how to setup Linux on a virtual machine to start the MP, follow this link:
 15 | 
 16 | http://www.psychocats.net/ubuntu/virtualbox
 17 | 
 18 | ## Docker Setup
 19 | Create an account with Docker and install docker community (not enterprise). This can be found under these two links:
 20 | 
 21 | https://hub.docker.com/
 22 | 
 23 | https://docs.docker.com/engine/installation/#supported-platforms
 24 | 
 25 | In this MP, you will create a Dockerfile that has the following properties:
 26 | 
 27 | #### Goals
 28 | 1. Inherits from the Ubuntu public image.
 29 | 2. Implements labels for a maintainer and a class.
 30 | 3. Installs python3.
 31 | 4. Creates a few environment variables.
 32 | 5. Writes to a file in the container
 33 | 6. Sets the current user to root.
 34 | 7. Runs bash whenever the container is run.
 35 | 
 36 | For information on building a Dockerfile, look at the [docs](https://docs.docker.com/engine/reference/builder/).
 37 | 
 38 | ### Problem 1
 39 | * Open up your terminal
 40 | * Create a new directory and `cd` into it.
 41 | * Create a file called `Dockerfile` and open it in your favorite editor.
 42 | 
 43 | _These questions should be viewed as guiding questions, you don't need to submit answers for these_
 44 | 
 45 | Inherit from the standard `ubuntu` image on dockerhub. (Why do we do this?)
 46 | 
 47 | You can build and test your Dockerfile by running these commands:
 48 | 
 49 | ```
 50 | $ docker build -t <container name> .
 51 | ```
 52 | Now, you have a container named whatever you specified. Try running.
 53 | ```
 54 | $ docker run -i -t <container name>
 55 | ```
 56 | 
 57 | If you see something like:
 58 | 
 59 | ```
 60 | root@8a788562c667:/# 
 61 | ```
 62 | 
 63 | You've been launched into bash, and you have made your first container!
 64 | 
 65 | ### Problem 2
 66 | Once you have your initial Dockerfile up and running, you can exit out of it by typing `exit`, or `Ctrl-D`. Now, we are going to make some modifications to the Dockerfile. Include the following docker LABELs in the file:
 67 | * `NETID` as your net-id
 68 | * `CLASS` as CS199
 69 | 
 70 | Now, if you `docker build ...` and `docker run ...` and then run `docker inspect <container name>` outside your container you should be able to see the two labels like:
 71 | ```
 72 | "NETID":<your net-id>
 73 | "CLASS":"CS199"
 74 | ```
 75 | 
 76 | ### Problem 3
 77 | Stop your container instance again and with your labels set up, download python 3.6.1 in the container. A simple way of doing this is by using `apt-get` commands.
 78 | 
 79 | You should be able to do the following once python3 is setup in your container:
 80 | 
 81 | ```
 82 | root@8a788562c667:/# python3
 83 | >>>
 84 | ```
 85 | 
 86 | ##### Note:
 87 | You might need to look into `conda`, `apt-get update` and `pip` to install python3.
 88 | 
 89 | ### Problem 4
 90 | Finally, Create `/data/info.txt` and put `Karl the fog is the best cloud computing platform`. You should do this in the Dockerfile, not in the container (Hint: use the `RUN` command with cat). When you run your container, you should be able to see this file in the data folder. Also, specify the user that the container should run
 91 | 
 92 | You should now be able to run `cat /data/info.txt` in your container to see your text message.
 93 | 
 94 | ### Problem 5
 95 | Create a new user called `cs199`. Change the user who logs in to the default shell to be the `cs199` user. Also, set and environment variable called `NAME` to be your first name. When you log run the dockerfile, you should see something like.
 96 | 
 97 | ```
 98 | cs199@8a788562c667:/# whoami
 99 | cs199
100 | cs199@8a788562c667:/# echo $NAME
101 | me
102 | ```
103 | 
104 | ## Deliverables
105 | By the end of the Lab, you should have **one** Dockerfile that solves all 4 of the problems. We will check off if you have all **7** of the requirements listed above for one point each, totaling 7 points for this lab. Please, if you are having any troubles setting up, let us know during office hours.
106 | 
107 | Submit your Dockerfile as MP0 on Moodle. That's it!
108 | 


--------------------------------------------------------------------------------
/MPs/MP1/README.md:
--------------------------------------------------------------------------------
  1 | # MP 1: Introduction to MapReduce
  2 | 
  3 | ## Introduction
  4 | 
  5 | This MP will introduce the map/reduce computing paradigm. In essence, map/reduce breaks tasks down into a map phase (where an algorithm is mapped onto data) and a reduce phase, where the outputs of the map phase are aggregated into a concise output. The map phase is designed to be parallel, so as to allow wide distribution of computation.
  6 | 
  7 | The map phase identifies keys and associates with them a value. The reduce phase collects keys and aggregates their values. The standard example used to demonstrate this programming approach is a word count problem, where words (or tokens) are the keys and the number of occurrences of each word (or token) is the value.
  8 | 
  9 | As this technique was popularized by large web search companies like Google and Yahoo who were processing large quantities of unstructured text data, this approach quickly became popular for a wide range of problems. The standard MapReduce approach uses Hadoop, which was built using Java. However, to introduce you to this topic without adding the extra overhead of learning Hadoop's idiosyncrasies, we will be 'simulating' a map/reduce workload in pure Python.
 10 | 
 11 | ## Example: Word Count
 12 | 
 13 | This example displays the type of programs we can build from simple map/reduce functions. Suppose our task is to come up with a count of the occurrences of each word in a large set of text. We could simply iterate through the text and count the words as we saw them, but this would be slow and non-parallelizable.
 14 | 
 15 | Instead, we break the text up into chunks, and then split those chunks into words. This is the ‘map’ phase (i.e. the input text is mapped to a list of words). Then, we can ‘reduce’ this data into a coherent word count that holds for the entire text set. We do this by accumulating the count of each word in each chunk using our reduce function.
 16 | 
 17 | Take a look at `map_reducer.py` and `word_count.py` to see the example we’ve constructed for you. Notice that the `map` stage is being run on a multiprocess pool. This is functionally analogous to a cloud computing application, the difference being in the cloud, this work would be distributed amongst multiple nodes, whereas in our toy MapReduce, all the processes run on a single machine.
 18 | 
 19 | Run `python word_count.py` to see our simple map/reduce example. You can adjust `NUM_WORKERS` in `map_reducer.py` to see how we make (fairly small) performance gains from parallelizing the work. (Hint: running `time python word_count.py` will give you a better idea of the runtime).
 20 | 
 21 | ## Exercise: Bigram Count
 22 | 
 23 | Suppose now that instead of trying to count the individual words, we want to get counts of the occurrences word [bigrams](https://en.wikipedia.org/wiki/Bigram) - that is, pairs of words that are *adjacent* to each other in the text (Bigrams are **not** just all the pairs of the words in the text).
 24 | 
 25 | For example, if our line of text was `“cat dog sheep horse”`, we’d have the bigrams `(“cat”, “dog”)`, `(“dog, “sheep”)` and `(“sheep”, “horse”)`.
 26 | 
 27 | Construct a map function and reduce function that will accomplish this goal.
 28 | 
 29 | **Note:** For the purposes of this exercise, we’ll only consider bigrams that occur on the same line. So, you don’t need to worry about pairs that occur between line breaks.
 30 | 
 31 | ### Example:
 32 | Input: (stdin)
 33 | ```
 34 | the dog in the tree
 35 | the cat in the hat
 36 | in summer the dog swam
 37 | ```
 38 | 
 39 | Output: (stdout)
 40 | ```
 41 | (the, dog): 2
 42 | (dog, in): 1
 43 | (in, the): 2
 44 | (the, tree): 1
 45 | (the, cat): 1
 46 | (cat, in): 1
 47 | (the, hat): 1
 48 | (in, summer): 1
 49 | (summer, the): 1
 50 | (dog, swam): 1
 51 | ```
 52 | 
 53 | Note that the order of the output is not important in this exercise.
 54 | 
 55 | ## Exercise: Common Friends
 56 | 
 57 | Suppose we’re running a social network and we want a fast way to calculate a list of common friends for pairs of users in our site. This can be done with a map/reduce procedure.
 58 | 
 59 | You’ll be given input of a friend ‘graph’ that looks like this:
 60 | 
 61 | ```
 62 | A|B
 63 | B|A,C,D
 64 | C|B,D
 65 | D|B,C,E
 66 | E|D
 67 | ```
 68 | The graph can be visualized as
 69 | ``` 
 70 | A-B - D-E
 71 |    \ /
 72 |     C
 73 | ```
 74 | Read this as: A is friends with B, B is friends with A, C and D, and so on. Our desired output is as follows:
 75 | 
 76 | ```
 77 | (A,C): [B]
 78 | (A,D): [B]
 79 | (B,C): [D]
 80 | (B,D): [C]
 81 | (B,E): [D]
 82 | (C,D): [B]
 83 | (C,E): [D]
 84 | ```
 85 | 
 86 | Read this as: A and C have B in common as a friend, A and D have B in common as a friend, and B and C have D in common as a friend, and so on. None of the other relationships have common friends.
 87 | (For example, A and E have no common friends)
 88 | 
 89 | Note that each list of common friends should only be outputted once. (i.e. There should not be entries for `(A, C)` and `(C, A)`)
 90 | 
 91 | #### Suggestions:
 92 | Your mapper stage should take each line of the friend graph and produce a multiple key/value pairs keyed with a relationship, with value being immediately known common friends.
 93 | The reducer phase should take all of these relationships and output common friends for each pair. (Hint: Lookup set union)
 94 | 
 95 | **Note:** This problem is a bit challenging. Try your best to figure it out before asking for help. :)
 96 | 
 97 | ## Submission
 98 | 
 99 | MP 1 is due on **Wednesday, September 13nd, 2017** at 11:55PM.
100 | 
101 | Please zip the files and upload it to [Moodle](learn.illinois.edu). Place all your source files into a single folder and zip it. When we unzip your files, we should be left with a single folder that contains your source files.
102 | 


--------------------------------------------------------------------------------
/MPs/MP1/bigram_count.py:
--------------------------------------------------------------------------------
 1 | from map_reducer import MapReduce
 2 | from operator import itemgetter
 3 | 
 4 | 
 5 | def bigram_mapper(line):
 6 |     ''' write your code here! '''
 7 |     pass
 8 | 
 9 | 
10 | def bigram_reducer(bigram_tuples):
11 |     ''' write your code here! '''
12 |     pass
13 | 
14 | if __name__ == '__main__':
15 |     with open('sherlock.txt') as f:
16 |         lines = f.readlines()
17 |     mr = MapReduce(bigram_mapper, bigram_reducer)
18 |     bigram_counts = mr(lines)
19 |     sorted_bgc = sorted(bigram_counts, key=itemgetter(1), reverse=True)
20 |     for word, count in sorted_bgc[:100]:
21 |         print('{}\t{}'.format(word, count))
22 | 


--------------------------------------------------------------------------------
/MPs/MP1/common_friends.py:
--------------------------------------------------------------------------------
 1 | from map_reducer import MapReduce
 2 | 
 3 | 
 4 | def friend_mapper(line):
 5 |     ''' write your code here! '''
 6 |     pass
 7 | 
 8 | 
 9 | def friend_reducer(friend_tuples):
10 |     ''' write your code here! '''
11 |     pass
12 | 
13 | 
14 | def _run_common_friend_finder(filename):
15 |     with open(filename) as f:
16 |         lines = f.readlines()
17 |     mr = MapReduce(friend_mapper, friend_reducer)
18 |     common_friends = mr(lines)
19 |     for relationship, friends in common_friends:
20 |         print('{}\t{}'.format(relationship, friends))
21 | 
22 | if __name__ == '__main__':
23 |     print('friend_graph_example.txt')
24 |     _run_common_friend_finder('friend_graph_example.txt')
25 | 
26 |     print('friend_graph.txt')
27 |     _run_common_friend_finder('friend_graph.txt')
28 | 


--------------------------------------------------------------------------------
/MPs/MP1/friend_graph.txt:
--------------------------------------------------------------------------------
1 | A|B,D,E,F
2 | B|A,C,E
3 | C|B,F
4 | D|A,E
5 | E|A,B,D
6 | F|A,C
7 | 


--------------------------------------------------------------------------------
/MPs/MP1/friend_graph_example.txt:
--------------------------------------------------------------------------------
1 | A|B
2 | B|A,C,D
3 | C|B,D
4 | D|B,C,E
5 | E|D
6 | 


--------------------------------------------------------------------------------
/MPs/MP1/map_reducer.py:
--------------------------------------------------------------------------------
 1 | import multiprocessing
 2 | import itertools
 3 | from operator import itemgetter
 4 | 
 5 | NUM_WORKERS = 10
 6 | 
 7 | 
 8 | class MapReduce(object):
 9 |     def __init__(self, map_func, reduce_func):
10 |         # Function for the map phase
11 |         self.map_func = map_func
12 | 
13 |         # Function for the reduce phase
14 |         self.reduce_func = reduce_func
15 | 
16 |         # Pool of processes to parallelize computation
17 |         self.proccess_pool = multiprocessing.Pool(NUM_WORKERS)
18 | 
19 |     def kv_sort(self, mapped_values):
20 |         return sorted(list(mapped_values), key=itemgetter(0))
21 | 
22 |     def __call__(self, data_in):
23 |         # Run the map phase in our process pool
24 |         map_phase = self.proccess_pool.map(self.map_func, data_in)
25 | 
26 |         # Sort the resulting mapped data
27 |         sorted_map = self.kv_sort(itertools.chain(*map_phase))
28 | 
29 |         # Run our reduce function
30 |         reduce_phase = self.reduce_func(sorted_map)
31 | 
32 |         # Return the results
33 |         return reduce_phase
34 | 


--------------------------------------------------------------------------------
/MPs/MP1/utils.py:
--------------------------------------------------------------------------------
1 | import re
2 | import string
3 | 
4 | 
5 | def strip_punctuation(str_in):
6 |     # Strip punctuation from word (don't worry too much about this)
7 |     return re.sub('[%s]' % re.escape(string.punctuation), '', str_in)
8 | 


--------------------------------------------------------------------------------
/MPs/MP1/word_count.py:
--------------------------------------------------------------------------------
 1 | from map_reducer import MapReduce
 2 | from operator import itemgetter
 3 | from utils import strip_punctuation
 4 | 
 5 | 
 6 | def string_to_words(str_in):
 7 |     ''' str_in a line of text '''
 8 |     words = []
 9 |     # Split string into words
10 |     for word in str_in.strip().split():
11 |         # Strip punctuation
12 |         word = strip_punctuation(word)
13 | 
14 |         # Note each individual instance of a word
15 |         words.append((word, 1))
16 |     return words
17 | 
18 | 
19 | def word_count_reducer(word_tuples):
20 |     # Dict to count the instances of each word
21 |     words = {}
22 | 
23 |     for entry in word_tuples:
24 |         word, count = entry
25 | 
26 |         # Add 1 to our word counts for each word we see
27 |         if word in words:
28 |             words[word] += 1
29 |         else:
30 |             words[word] = 1
31 | 
32 |     return words.items()
33 | 
34 | if __name__ == '__main__':
35 |     with open('sherlock.txt') as f:
36 |         lines = f.readlines()
37 | 
38 |     # Construct our MapReducer
39 |     mr = MapReduce(string_to_words, word_count_reducer)
40 |     # Call MapReduce on our input set
41 |     word_counts = mr(lines)
42 |     sorted_wc = sorted(word_counts, key=itemgetter(1), reverse=True)
43 |     for word, count in sorted_wc[:100]:
44 |         print('{}\t{}'.format(word, count))
45 | 


--------------------------------------------------------------------------------
/MPs/MP2/README.md:
--------------------------------------------------------------------------------
  1 | # MP 2: Introduction to Map/Reduce on Hadoop
  2 | 
  3 | ## Introduction
  4 | 
  5 | In this MP, we introduce the map/reduce programming
  6 | paradigm. Simply put, this approach to computing breaks tasks down into
  7 | a map phase (where an algorithm is mapped onto data) and a reduce phase,
  8 | where the outputs of the map phase are aggregated into a concise output.
  9 | The map phase is designed to be parallel, and to move the computation to
 10 | the data, which, when using HDFS, can be widely distributed. In this
 11 | case, a map phase can be executed against a large quantity of data very
 12 | quickly. The map phase identifies keys and associates with them a value.
 13 | The reduce phase collects keys and aggregates their values. The standard
 14 | example used to demonstrate this programming approach is a word count
 15 | problem, where words (or tokens) are the keys) and the number of
 16 | occurrences of each word (or token) is the value.
 17 | 
 18 | As this technique was popularized by large web search companies like
 19 | Google and Yahoo who were processing large quantities of unstructured
 20 | text data, this approach quickly became popular for a wide range of
 21 | problems.  Of course, not every problem can be transformed into a
 22 | map-reduce approach, which is why we will explore Spark in several
 23 | weeks. The standard MapReduce approach uses Hadoop, which was built
 24 | using Java. Rather than switching to a new language, however, we will
 25 | use Hadoop Streaming to execute Python code. In the rest of this
 26 | MP, we introduce a simple Python WordCount example code. We first
 27 | demonstrate this code running at the Unix command line, before switching to running the code by using Hadoop Streaming.
 28 | 
 29 | ### Mapper: Word Count
 30 | 
 31 | The first Python code we will write is the map Python program. This
 32 | program simply reads data from `STDIN`, tokenizes each line into words and
 33 | outputs each word on a separate line along with a count of one. Thus our
 34 | map program generates a list of word tokens as the keys and the value is
 35 | always one.
 36 | 
 37 | ```python
 38 | #!/usr/bin/python
 39 | 
 40 | # These examples are based off the blog post by Michale Noll:
 41 | # 
 42 | # http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
 43 | #
 44 | 
 45 | import sys
 46 | 
 47 | # We explicitly define the word/count separator token.
 48 | sep = '\t'
 49 | 
 50 | # We open STDIN and STDOUT
 51 | with sys.stdin as fin:
 52 |     with sys.stdout as fout:
 53 |     
 54 |         # For every line in STDIN
 55 |         for line in fin:
 56 |         
 57 |             # Strip off leading and trailing whitespace
 58 |             line = line.strip()
 59 |             
 60 |             # We split the line into word tokens. Use whitespace to split.
 61 |             # Note we don't deal with punctuation.
 62 |             
 63 |             words = line.split()
 64 |             
 65 |             # Now loop through all words in the line and output
 66 | 
 67 |             for word in words:
 68 |                 fout.write("{0}{1}1\n".format(word, sep))
 69 | ```
 70 | 
 71 | ### Reducer: Word Count
 72 | 
 73 | The second Python program we write is our reduce program. In this code,
 74 | we read key-value pairs from `STDIN` and use the fact that the Hadoop
 75 | process first sorts all key-value pairs before sending the map output to
 76 | the reduce process to accumulate the cumulative count of each word. The
 77 | following code could easily be made more sophisticated by using `yield`
 78 | statements and iterators, but for clarity we use the simple approach of
 79 | tracking when the current word becomes different than the previous word
 80 | to output the key-cumulative count pairs.
 81 | 
 82 | ```python
 83 | #!/usr/bin/python
 84 | 
 85 | import sys
 86 | 
 87 | # We explicitly define the word/count separator token.
 88 | sep = '\t'
 89 | 
 90 | # We open STDIN and STDOUT
 91 | with sys.stdin as fin:
 92 |     with sys.stdout as fout:
 93 |     
 94 |         # Keep track of current word and count
 95 |         cword = None
 96 |         ccount = 0
 97 |         word = None
 98 |    
 99 |         # For every line in STDIN
100 |         for line in fin:
101 |         
102 |             # Strip off leading and trailing whitespace
103 |             # Note by construction, we should have no leading white space
104 |             line = line.strip()
105 |             
106 |             # We split the line into a word and count, based on predefined
107 |             # separator token.
108 |             #
109 |             # Note we haven't dealt with punctuation.
110 |             
111 |             word, scount = line.split('\t', 1)
112 |             
113 |             # We will assume count is always an integer value
114 |             
115 |             count = int(scount)
116 |             
117 |             # word is either repeated or new
118 |             
119 |             if cword == word:
120 |                 ccount += count
121 |             else:
122 |                 # We have to handle first word explicitly
123 |                 if cword != None:
124 |                     fout.write("{0:s}{1:s}{2:d}\n".format(cword, sep, ccount))
125 |                 
126 |                 # New word, so reset variables
127 |                 cword = word
128 |                 ccount = count
129 |         else:
130 |             # Output final word count
131 |             if cword == word:
132 |                 fout.write("{0:s}{1:s}{2:d}\n".format(word, sep, ccount))
133 | ```
134 | 
135 | ### Testing Python Map-Reduce
136 | 
137 | Before we begin using Hadoop, we should first test our Python codes out
138 | to ensure they work as expected. First, we should change the permissions
139 | of the two programs to be executable, which we can do with the Unix
140 | `chmod` command.
141 | 
142 | ```sh
143 | chmod u+x /path/to/mp2/word_count_mapper.py
144 | chmod u+x /path/to/mp2/word_count_reducer.py
145 | ```
146 | 
147 | #### Testing Mapper.py
148 | 
149 | To test out the map Python code, we can run the Python `word_count_mapper.py` code
150 | and specify that the code should redirect STDIN to read the book text
151 | data. This is done in the following code cell, we pipe the output into
152 | the Unix `head` command in order to restrict the output, which would be
153 | one line per word found in the book text file. In the second code cell,
154 | we next pipe the output of  `word_count_mapper.py` into the Unix `sort` command,
155 | which is done automatically by Hadoop. To see the result of this
156 | operation, we next pipe the result into the Unix `uniq` command to count
157 | duplicates, pipe this result into a new sort routine to sort the output
158 | by the number of occurrences of a word, and finally display the last few
159 | lines with the Unix `tail` command to verify the program is operating
160 | correctly.
161 | 
162 | With these sequence of Unix commands, we have (in a single-node)
163 | replicated the steps performed by Hadoop MapReduce: Map, Sort, and
164 | Reduce.
165 | 
166 | 
167 | 
168 | 
169 | ```sh
170 | cd /path/to/mp2
171 | 
172 | ./word_count_mapper.py <  book.txt | wc -l
173 | ```
174 | 
175 | ```sh
176 | cd /path/to/mp2
177 | 
178 | ./word_count_mapper.py <  book.txt | sort -n -k 1 | \
179 |  uniq -c | sort -n -k 1 | tail -10
180 | ```
181 | 
182 | #### Testing Reducer.py
183 | 
184 | To test out the reduce Python code, we run the previous code cell, but
185 | rather than piping the result into the Unix `tail` command, we pipe the
186 | result of the sort command into the Python `word_count_reducer.py` code. This
187 | simulates the Hadoop model, where the map output is key sorted before
188 | being passed into the reduce process. First, we will simply count the
189 | number of lines displayed by the reduce process, which will indicate the
190 | number of  unique _word tokens_ in the book. Next, we will sort the
191 | output by the number of times each word token appears and display the
192 | last few lines to compare with the previous results.
193 | 
194 | 
195 | ```sh
196 | cd /path/to/mp2
197 | 
198 | ./word_count_mapper.py <  book.txt | sort -n -k 1 | \
199 | ./word_count_reducer.py | wc -l
200 | ```
201 | 
202 | ```sh
203 | cd /path/to/mp2
204 | 
205 | ./word_count_mapper.py <  book.txt | sort -n -k 1 | \
206 | ./word_count_reducer.py | sort -n -k 2 | tail -10
207 | ```
208 | 
209 | ## Python Hadoop Streaming
210 | 
211 | ### Introduction
212 | 
213 | We are now ready to actually run our Python codes via Hadoop Streaming.
214 | The main command to perform this task is `hadoop`.
215 | 
216 | Running this Hadoop command by supplying the `-help` flag will provide
217 | a useful summary of the different options. Note that `jar` is short for
218 | Java Archive, which is a compressed archive of compiled Java code that
219 | can be executed to perform different operations. In this case, we will
220 | run the Java Hadoop streaming jar file to enable our Python code to work
221 | within Hadoop.
222 | 
223 | 
224 | ```sh
225 | # Run the Map Reduce task within Hadoop
226 | hadoop --help
227 | ```
228 | 
229 |     Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
230 |       CLASSNAME            run the class named CLASSNAME
231 |      or
232 |       where COMMAND is one of:
233 |       fs                   run a generic filesystem user client
234 |       version              print the version
235 |       jar <jar>            run a jar file
236 |                            note: please use "yarn jar" to launch
237 |                                  YARN applications, not this command.
238 |       checknative [-a|-h]  check native hadoop and compression libraries availability
239 |       distcp <srcurl> <desturl> copy file or directories recursively
240 |       archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
241 |       classpath            prints the class path needed to get the
242 |       credential           interact with credential providers
243 |                            Hadoop jar and the required libraries
244 |       daemonlog            get/set the log level for each daemon
245 |       trace                view and modify Hadoop tracing settings
246 |     
247 |     Most commands print help when invoked w/o parameters.
248 | 
249 | 
250 | For our map/reduce Python example to
251 | run successfully, we will need to specify five flags:
252 | 
253 | 1. `-files`: a comma separated list of files to be copied to the Hadoop cluster.
254 | 2. `-input`: the HDFS input file(s) to be used for the map task.
255 | 3. `-output`: the HDFS output directory, used for the reduce task.
256 | 4. `-mapper`: the command to run for the map task.
257 | 5. `-reducer`: the command to run for the reduce task.
258 | 
259 | Given our previous setup, we will eventually run the full command as follows:
260 | 
261 | ```
262 | 	# DON'T RUN ME YET!
263 |     hadoop $STREAMING -files word_count_mapper.py,word_count_reducer.py -input wc/in \
264 |         -output wc/out -mapper word_count_mapper.py -reducer word_count_reducer.py 
265 | ```
266 | When this command is run, a series of messages will be displayed to the
267 | screen (via STDERR) showing the progress of our Hadoop Streaming task.
268 | At the end of the stream of information messages will be a statement
269 | indicating the location of the output directory as shown below. Note, we
270 | can append Bash redirection to ignore the Hadoop messages, simply by
271 | appending `2> /dev/null` to the end of any Hadoop command, which sends
272 | all STDERR messages to a non-existent Unix device, which is akin to
273 | nothing. 
274 | 
275 | For example, to ignore any messages from the `hdfs dfs -rm -r -f wc/out`
276 | command, we would use the following syntax:
277 | 
278 | ```bash
279 | hdfs dfs -rm -r -f wc/out 2> /dev/null
280 | ```
281 | 
282 | Doing this, however, does hide all messages, which can make debugging
283 | problems more difficult. As a result, you should only do this when your
284 | commands work correctly.
285 | 
286 | ### Putting files in HDFS
287 | In order for Hadoop to be able to access our raw data (the book text) we first have to copy it into the file system that Hadoop uses natively, HDFS.
288 | 
289 | To do this, we'll run a series of HDFS commands that will copy our local `book.txt` into the distributed file system.
290 | 
291 | ```
292 | # Make a directory for our book data
293 | hdfs dfs -mkdir -p wc/in
294 | 
295 | # Copy our book to our new folder
296 | hdfs dfs -copyFromLocal book.txt wc/in/book.txt
297 | 
298 | # Check to see that our book has made it to the folder
299 | hdfs dfs -ls wc/in
300 | hdfs dfs -tail wc/in/book.txt
301 | ```
302 | 
303 | ### Running the Hadoop Job
304 | Now that our data is in Hadoop HDFS, we can actually execute the streaming job that will run our word count map/reduce.
305 | 
306 | ```sh
307 | # Delete output directory (if it exists)
308 | hdfs dfs -rm -r -f wc/out
309 | 
310 | # Run the Map Reduce task within Hadoop
311 | hadoop jar $STREAMING \
312 |     -files word_count_mapper.py,word_count_reducer.py -input wc/in \
313 |     -output wc/out -mapper word_count_mapper.py -reducer word_count_reducer.py
314 | ```
315 | 
316 | ### Hadoop Results
317 | 
318 | In order to view the results of our Hadoop Streaming task, we must use
319 | HDFS DFS commands to examine the directory and files generated by our
320 | Python Map/Reduce programs. The following list of DFS commands might
321 | prove useful to view the results of this map/reduce job.
322 | 
323 | ```bash
324 | # List the wc directory
325 | hdfs dfs -ls wc
326 | 
327 | # List the output directory
328 | hdfs dfs -ls wc/out
329 | 
330 | # Do a line count on our output
331 | hdfs dfs -count -h wc/out/part-00000
332 | 
333 | # Tail the output
334 | hdfs dfs -tail wc/out/part-00000
335 | ```
336 | 
337 | Note that these
338 | Hadoop HDFS commands can be intermixed with Unix commands to perform
339 | additional text processing. The important point is that direct file I/O
340 | operations must use HDFS commands to work with the HDFS file system.
341 | 
342 | The output should match the Python
343 | only map-reduce approach.
344 | 
345 | ### Hadoop Cleanup
346 | 
347 | Following the successful run of our map/reduce Python programs, we have
348 | created a new directory `wc/out` in the HDFS, which contains two files. If we wish
349 | to rerun this Hadoop Streaming map/reduce task, we must either specify a
350 | different output directory, or else we must clean up the results of the
351 | previous run. To remove the output directory, we can simply use the HDFS
352 | `-rm -r -f wc/out` command, which will immediately delete the `wc/out`
353 | directory. The successful completion of this command is indicated by
354 | Hadoop, and this can also be verified by listing the contents of the
355 | `wc` directory.
356 | 
357 | ```sh
358 | hdfs dfs -ls wc
359 | ```
360 | 
361 | 
362 | ## MP Assignments
363 | 
364 | In the preceding activity, we introduced Hadoop map/reduce by using a
365 | simple word count task. Now that you've seen how Hadoop mappers/reducers
366 | work, write streaming map/reduce programs that accomplish the following tasks.
367 | 
368 | 
369 | **Note:** These programs should be able to run on large datasets in a decentralized fashion.
370 | You must keep this in consideration for your output to be correct.
371 | 
372 | ### Starting Your Streaming Jobs on the Cluster
373 | **IMPORTANT:** We've made it easier for you to run Hadoop Streaming jobs on the cluster.
374 | 
375 | Run the following command to start your job:
376 | 
377 | ```
378 | mapreduce_streaming MAPPER REDUCER INPUT OUTPUT
379 | ```
380 | 
381 | `MAPPER` and `REDUCER` refer to your python files, `INPUT` is the HDFS path of the input data file(s), and `OUTPUT` is the directory to place the results of your job in.
382 | 
383 | ### Assignments
384 | 
385 | #### Assignment 1: Finding Mutual Followers
386 | In `/shared/twitter_followers` on HDFS you will find a list of Twitter follower relationships.
387 | 
388 | A row in this dataset, `a b`, implies that user with id `a` follows the user with id `b`.
389 | Example data:
390 | 
391 | ```
392 | 1 2
393 | 1 3
394 | 1 4
395 | 2 1
396 | 2 3
397 | 3 6
398 | 4 1
399 | ```
400 | 
401 | Your task is to find all *mutual follower relationships*. Write a Mapper / Reducer that outputs
402 | all pairs `a b` such that `a` follows `b` and `b` follows `a`.
403 | 
404 | Your output should be formatted the same as the input data. Note that for each pair of mutual followers
405 | `a b`, you should only output 1 record. Output each pair in sorted order (`a < b`).
406 | 
407 | #### Assignment 2: Most Linked-to Wikipedia Articles
408 | In `/shared/wikipedia_paths_parsed` on HDFS you will find a list of user-generated Wikipedia sessions.
409 | Each line represents a user session, and shows the links that they clicked through to visit various articles.
410 | 
411 | For example:
412 | 
413 | ```
414 | Computer_programming;Linguistics;Culture;Popular_culture
415 | ```
416 | 
417 | Interprit this as: The user started on the `Computer_programming` article, then clicked a link to view
418 | `Linguistics`, then clicked a link to view `Culture`, and so on.
419 | 
420 | Your task is to find the pages on Wikipedia that have the most links to them. For the purposes of this
421 | problem, we assume that if a user clicks from page `A` to page `B`, then there exists a link on Wikipedia
422 | from page `A -> B`. For every page `B`, we want to find the number of **unique** pages `A` that link to `B`.
423 | 
424 | Do not count initial pages as having links to them.
425 | 
426 | Output your results in the following format: (i.e. space-separated)
427 | `ARTICLE_NAME LINK_COUNT`
428 | 
429 | ### Accessing the Hadoop Job UI
430 | 
431 | The Haddop Web UIs provide information about current and past Hadoop jobs. They're also useful in debugging failed jobs.
432 | 
433 | Because of the way our networking is setup, you will need to use SSH or PuTTY to access these web interfaces.
434 | 
435 | #### Mac / Linux Users
436 | 
437 | Use the additional `-L` flag when you SSH to the cluster to tunnel ports to the destinations listed below.
438 | 
439 | For example:
440 | 
441 | `ssh <netid>@<cluster_ip> -i <ssh_key_file> -L 8000:192-168-100-17.local:19888 -L 8001:192-168-100-17.local:8088`
442 | 
443 | While this command is running, you can visit `localhost:8000` and `localhost:8001` in your browser to view the Hadoop web UIs.
444 | 
445 | #### Windows Users
446 | 
447 | Follow [this](http://realprogrammers.com/how_to/set_up_an_ssh_tunnel_with_putty.html) tutorial, and forward the destinations listed below.
448 | 
449 | While PuTTY is active, you can visit `localhost:<PORT>` in your browser, where `<PORT>` is the Source Port you registered in PuTTY.
450 | 
451 | #### Destinations
452 | 
453 | (These are IP addresses accessible only within the cluster's network, which is why you need to use SSH or PuTTY to open a "tunnel" to view them)
454 | 
455 | - `192-168-100-17.local:19888` - Hadoop JobHistory UI
456 | - `192-168-100-17.local:8088` - Cluset / Scheduler Metrics UI
457 | 
458 | ### Suggested Workflow
459 | We also include the data files for this MP in our normal non-HDFS file system for your use. You can find them in `/mnt/datasetvolume`
460 | 
461 | 1. Write your map/reduce and test it with regular unix commands:
462 | 
463 |   ```
464 |   cat /mnt/datasetvolume/<DATASET> | ./<MAPPER>.py | sort | ./<REDUCER>.py
465 |   ```
466 | 
467 | 2. Test your map/reduce program on Hadoop:
468 | 
469 |   ```
470 |   hdfs dfs -mkdir -p twitter
471 |   hdfs dfs -rm -r twitter/out
472 |   mapreduce <MAPPER>.py <REDUCER>.py /shared/twitter_followers twitter/out
473 |   ```
474 | 
475 | ### Attribution
476 | 
477 | Data for this MP is from the [Stanford Large Network Dataset Collection](https://snap.stanford.edu/data/index.html)
478 | 


--------------------------------------------------------------------------------
/MPs/MP2/twitter_follower_mapper.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | 
3 | import sys
4 | 
5 | with sys.stdin as fin:
6 |     with sys.stdout as fout:
7 |         ''' write your code here '''
8 |         pass
9 | 


--------------------------------------------------------------------------------
/MPs/MP2/twitter_follower_reducer.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | 
3 | import sys
4 | 
5 | with sys.stdin as fin:
6 |     with sys.stdout as fout:
7 |         ''' write your code here '''
8 |         pass
9 | 


--------------------------------------------------------------------------------
/MPs/MP2/wikipedia_links_mapper.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | 
3 | import sys
4 | 
5 | with sys.stdin as fin:
6 |     with sys.stdout as fout:
7 |         ''' write your code here '''
8 |         pass
9 | 


--------------------------------------------------------------------------------
/MPs/MP2/wikipedia_links_reducer.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | 
3 | import sys
4 | 
5 | with sys.stdin as fin:
6 |     with sys.stdout as fout:
7 |         ''' write your code here '''
8 |         pass
9 | 


--------------------------------------------------------------------------------
/MPs/MP2/word_count_mapper.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | 
 3 | # These examples are based off the blog post by Michale Noll:
 4 | #
 5 | # http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
 6 | #
 7 | 
 8 | import sys
 9 | 
10 | # We explicitly define the word/count separator token.
11 | sep = '\t'
12 | 
13 | # We open STDIN and STDOUT
14 | with sys.stdin as fin:
15 |     with sys.stdout as fout:
16 | 
17 |         # For every line in STDIN
18 |         for line in fin:
19 | 
20 |             # Strip off leading and trailing whitespace
21 |             line = line.strip()
22 | 
23 |             # We split the line into word tokens. Use whitespace to split.
24 |             # Note we don't deal with punctuation.
25 | 
26 |             words = line.split()
27 | 
28 |             # Now loop through all words in the line and output
29 | 
30 |             for word in words:
31 |                 fout.write("{0}{1}1\n".format(word, sep))
32 | 


--------------------------------------------------------------------------------
/MPs/MP2/word_count_reducer.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | 
 3 | import sys
 4 | 
 5 | # We explicitly define the word/count separator token.
 6 | sep = '\t'
 7 | 
 8 | # We open STDIN and STDOUT
 9 | with sys.stdin as fin:
10 |     with sys.stdout as fout:
11 | 
12 |         # Keep track of current word and count
13 |         cword = None
14 |         ccount = 0
15 |         word = None
16 | 
17 |         # For every line in STDIN
18 |         for line in fin:
19 | 
20 |             # Strip off leading and trailing whitespace
21 |             # Note by construction, we should have no leading white space
22 |             line = line.strip()
23 | 
24 |             # We split the line into a word and count, based on predefined
25 |             # separator token.
26 |             #
27 |             # Note we haven't dealt with punctuation.
28 | 
29 |             word, scount = line.split('\t', 1)
30 | 
31 |             # We will assume count is always an integer value
32 | 
33 |             count = int(scount)
34 | 
35 |             # word is either repeated or new
36 | 
37 |             if cword == word:
38 |                 ccount += count
39 |             else:
40 |                 # We have to handle first word explicitly
41 |                 if cword is not None:
42 |                     fout.write("{0:s}{1:s}{2:d}\n".format(cword, sep, ccount))
43 | 
44 |                 # New word, so reset variables
45 |                 cword = word
46 |                 ccount = count
47 |         else:
48 |             # Output final word count
49 |             if cword == word:
50 |                 fout.write("{0:s}{1:s}{2:d}\n".format(word, sep, ccount))
51 | 


--------------------------------------------------------------------------------
/MPs/MP3/README.md:
--------------------------------------------------------------------------------
  1 | # MP 3: Hadoop Map/Reduce on "Real" Data
  2 | 
  3 | ## Introduction
  4 | In the past 2 labs, you were introduced to the concept of Map/Reduce and how we can execute Map/Reduce Python scripts on Hadoop with Hadoop Streaming.
  5 | 
  6 | This week, we'll give you access to a modestly large (~60GB) Twitter dataset. You'll be using the skills you learned in the last two weeks to perform some more complex transformations on this dataset.
  7 | 
  8 | Like in the last MP, we'll be using Python to write mappers and reducers. We've included a helpful shell script to make running mapreduce jobs easier. The script is in your `bin` folder, so you can just run it as `mapreduce`, but we've included the source in this MP so you can see what it's doing.
  9 | 
 10 | Here's the usage for that command:
 11 | 
 12 | ```
 13 | Usage: ./mapreduce map-script reduce-scripe hdfs-input-path hdfs-output-path
 14 | 
 15 | Example: ./mapreduce mapper.py reducer.py /tmp/helloworld.txt /user/quinnjarr
 16 | ```
 17 | 
 18 | ## The Dataset
 19 | 
 20 | The dataset is located in `/shared/snapTwitterData` in HDFS. You'll find these files: 
 21 | 
 22 | ```
 23 | /shared/snapTwitterData/tweets2009-06.tsv
 24 | /shared/snapTwitterData/tweets2009-07.tsv
 25 | /shared/snapTwitterData/tweets2009-08.tsv
 26 | /shared/snapTwitterData/tweets2009-09.tsv
 27 | /shared/snapTwitterData/tweets2009-10.tsv
 28 | /shared/snapTwitterData/tweets2009-11.tsv
 29 | /shared/snapTwitterData/tweets2009-12.tsv
 30 | ```
 31 | 
 32 | Each file is a `tsv` (tab-separated value) file. The schema of the file is as follows:
 33 | 
 34 | ```
 35 | POST_DATETIME <tab> TWITTER_USER_URL <tab> TWEET_TEXT
 36 | ```
 37 | 
 38 | Example:
 39 | 
 40 | ```
 41 | 2009-10-31 23:59:58	http://twitter.com/sometwitteruser	Wow, CS199 is really a good course
 42 | ```
 43 | 	
 44 | ## MP Activities
 45 | **MP 3 is due on Wednesday, September 27th, 2017 at 11:55PM.**
 46 | 
 47 | Please zip your source files for the following exercises and upload it to Moodle (learn.illinois.edu).
 48 | 
 49 | 1. Write a map/reduce program to determine the the number of @-replies each user received. You may assume a tweet is an @-reply iff the tweet starts with an `@` character. You may also assume that each @-reply is only a reply to a single user. (i.e. `@foo @bar hello world` is an @-reply to `@foo`, but not `@bar`) Your output should be in this format (space-separated):
 50 | 
 51 | 	```
 52 | 	<USER_HANDLE  (Including @ symbol)> <NUMBER_REPLIES_RECEIVED>
 53 | 
 54 | 	Example:
 55 | 	@jack 123
 56 | 	```
 57 | 
 58 | 2. Write a map/reduce program to determine the user with the most Tweets for every given day in the dataset. (If there's a tie, break the tie by sorting alphabetically on users' handles) Your output should be in this format (space-separated):
 59 | 
 60 | 	```
 61 | 	<YYYY-MM-DD> <USER_HANDLE (Including @ symbol)>
 62 | 
 63 | 	Example:
 64 | 	2016-01-01 @jack
 65 | 	```
 66 | 
 67 | 3. Write a map/reduce program to determine the size of each users' vocabulary -- that is, determine the number of unique words used by each user accross all their Tweets. For the purposes of this problem, you should use the `.split()` method to split each Tweet into words. Your output should be in this format (space-separated):
 68 | 
 69 | 	```
 70 | 	<USER_HANDLE> <NUMBER_OF_WORDS>
 71 | 
 72 | 	Example:
 73 | 	@jack 123
 74 | 	```
 75 | 
 76 | 
 77 | ### Don't lose your progress!
 78 | 
 79 | Hadoop jobs can take a very long time to complete. If you don't take precautions, you'll lose all your progress if something happens to your SSH session.
 80 | 
 81 | To mitigate this, we have installed `tmux` on the cluster. Tmux is a tool that lets us persist shell sessions even when we lose SSH connection.
 82 | 
 83 | 1. Run `tmux` to enter into a tmux session.
 84 | 2. Run some command that will take a long time (`ping google.com`)
 85 | 3. Exit out of your SSH session.
 86 | 4. Log back into the server and run `tmux attach` and you should find your session undisturbed.
 87 | 
 88 | ### Suggested Workflow
 89 | 
 90 | 1. Write your map/reduce and test it with regular unix commands:
 91 | 
 92 | 	```
 93 | 	head -n 10000 /mnt/volume/snapTwitterData/tweets2009-06.tsv | ./<MAPPER>.py | sort | ./<REDUCER>.py
 94 | 	```
 95 | 
 96 | 2. Test your map/reduce with a single Tweet file on Hadoop:
 97 | 
 98 | 	```
 99 | 	hdfs dfs -mkdir -p twitter
100 | 	hdfs dfs -rm -r twitter/out
101 | 	mapreduce_streaming <MAPPER>.py <REDUCER>.py /shared/snapTwitterData/tweets2009-06.tsv twitter/out
102 | 	```
103 | 	
104 | 3. Run your map/reduce on the full dataset:
105 | 	```
106 | 	hdfs dfs -mkdir -p twitter
107 | 	hdfs dfs -rm -r twitter/out
108 | 	mapreduce_streaming <MAPPER>.py <REDUCER>.py /shared/snapTwitterData/*.tsv twitter/out
109 | 	```
110 | 


--------------------------------------------------------------------------------
/MPs/MP3/prob1_mapper.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | 
 3 | import sys
 4 | 
 5 | # We open STDIN and STDOUT
 6 | with sys.stdin as fin:
 7 |     with sys.stdout as fout:
 8 | 
 9 |         # For every line in STDIN
10 |         for line in fin:
11 |             # Your code here!
12 |             pass
13 | 


--------------------------------------------------------------------------------
/MPs/MP3/prob1_reducer.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | 
 3 | import sys
 4 | 
 5 | # We open STDIN and STDOUT
 6 | with sys.stdin as fin:
 7 |     with sys.stdout as fout:
 8 | 
 9 |         # For every line in STDIN
10 |         for line in fin:
11 |             # Your code here!
12 |             pass
13 | 


--------------------------------------------------------------------------------
/MPs/MP3/prob2_mapper.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | 
 3 | import sys
 4 | 
 5 | # We open STDIN and STDOUT
 6 | with sys.stdin as fin:
 7 |     with sys.stdout as fout:
 8 | 
 9 |         # For every line in STDIN
10 |         for line in fin:
11 |             # Your code here!
12 |             pass
13 | 


--------------------------------------------------------------------------------
/MPs/MP3/prob2_reducer.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | 
 3 | import sys
 4 | 
 5 | # We open STDIN and STDOUT
 6 | with sys.stdin as fin:
 7 |     with sys.stdout as fout:
 8 | 
 9 |         # For every line in STDIN
10 |         for line in fin:
11 |             # Your code here!
12 |             pass
13 | 


--------------------------------------------------------------------------------
/MPs/MP3/prob3_mapper.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | 
 3 | import sys
 4 | 
 5 | # We open STDIN and STDOUT
 6 | with sys.stdin as fin:
 7 |     with sys.stdout as fout:
 8 | 
 9 |         # For every line in STDIN
10 |         for line in fin:
11 |             # Your code here!
12 |             pass
13 | 


--------------------------------------------------------------------------------
/MPs/MP3/prob3_reducer.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | 
 3 | import sys
 4 | 
 5 | # We open STDIN and STDOUT
 6 | with sys.stdin as fin:
 7 |     with sys.stdout as fout:
 8 | 
 9 |         # For every line in STDIN
10 |         for line in fin:
11 |             # Your code here!
12 |             pass
13 | 


--------------------------------------------------------------------------------
/MPs/MP4/README.md:
--------------------------------------------------------------------------------
  1 | # MP 4: Spark
  2 | 
  3 | ## Introduction
  4 | 
  5 | As we have talked about in lecture, Spark is built on the concept of a *Resilient Distributed Dataset* (RDD). 
  6 | 
  7 | PySpark allows us to interface with these RDD’s in Python. Think of it as an API. In fact, it is an API: it even has its own [documentation](http://spark.apache.org/docs/latest/api/python/)! It’s built on top of the Spark’s Java API and exposes the Spark programming model to Python.
  8 | 
  9 | PySpark makes use of a library called `Py4J`, which enables Python programs to dynamically access Java objects in a Java Virtual Machine.
 10 | 
 11 | This allows data to be processed in Python and cached in the JVM. This model lets us combine the performance of the JVM with the expressiveness of Python.
 12 | 
 13 | 
 14 | ## Running your Jobs
 15 | 
 16 | We'll be using `spark-submit` to run our spark jobs on the cluster. `spark-submit` has a couple command line options that you can tweak.
 17 | 
 18 | #### `--master`
 19 | This option tells `spark-submit` where to run your job, as Spark can run in several modes.
 20 | 
 21 | * `local`
 22 |     * The Spark job runs locally, *without* using any compute resources from the cluster.
 23 | * `yarn-client`
 24 |     * The spark job runs on our YARN cluster, but the driver is local to the machine, so it 'appears' that you're running the job locally, but you still get the compute resources from the cluster. You'll see the logs spark provides as the program executes.
 25 |     * When the cluster is busy, you *will not* be able to use this mode, because it imposes too much of a memory footprint on the node that everyone is SSH'ing into.
 26 | * `yarn-cluster`
 27 |     * The spark job runs on our YARN cluster, and the spark driver is in some arbitrary location on the cluster. This option doesn't give you logs directly, so you'll have to get the logs manually.
 28 |     * In the output of `spark-submit --master yarn-cluster` you'll find an `applicationId`. (This is similar to when you ran jobs on Hadoop). You can issue this command to get the logs for your job:
 29 | 
 30 |         ```
 31 |         yarn logs -applicationId <YOUR_APPLICATION_ID> | less
 32 |         ```
 33 |     * When debugging Python applications, it's useful to `grep` for `Traceback` in your logs, as this will likely be the actual debug information you're looking for.
 34 | 
 35 |         ```
 36 |         yarn logs -applicationId <YOUR_APPLICATION_ID> | grep -A 50 Traceback
 37 |         ```
 38 |         
 39 |     * *NOTE*: In cluster mode, normal "local" IO operations like opening files will behave unexpectedly! This is because you're not guaranteed which node the driver will run on. You must use the PySpark API for saving files to get reliable results. You also have to coalesce your RDD into one partition before asking PySpark to write to a file (why do you think this is?). Additionally, you should save your results to HDFS.
 40 | 
 41 |         ```python
 42 |         <my_rdd>.coalesce(1).saveAsTextFile('hdfs:///user/MY_USERNAME/foo')
 43 |         ```
 44 | 
 45 | #### `--num-executors`
 46 | This option lets you set the number of executors that your job will have. A good rule of thumb is to have as many executors as the maximum number of partitions an RDD will have during a Spark job (this heuristic holds better for simple jobs, but falls apart as the complexity of your job increases).
 47 | 
 48 | The number of executors is a trade off. Too few, and you might not be taking full advantage of Spark's parallelism. However, there is also an upper bound on the number of executors (for obvious reasons), as they have a fairly large memory footprint. (Don't set this too high or we'll terminate your job.)
 49 | 
 50 | You can tweak executors more granularly by setting the amount of memory and number of cores they're allocated, but for our purposes the default values are sufficient.
 51 | 
 52 | ### Putting it all together
 53 | 
 54 | Submitting a spark job will usually look something like this:
 55 | 
 56 | ```
 57 | spark-submit --master yarn-cluster --num-executors 10 <my_spark_job>.py <job_arguments>
 58 | ```
 59 | 
 60 | Be sure to include the `--master` flag, or else your code will only run locally, and you won't get the benefits of the cluster's parallelism.
 61 | 
 62 | ### Interactive Shell
 63 | 
 64 | While `spark-submit` is the way we'll be endorsing to run PySpark jobs, there is an option to run jobs in an interactive shell. Use the `pyspark` command to load into the PySpark interactive shell. You can use many of the same options listed above to tweak `pyspark` settings, such as `--num-executors` and `--master`.
 65 | 
 66 | Note: If you start up the normal `python` interpreter on the cluster, you won't be able to use any of the PySpark features.
 67 | 
 68 | ### Helpful Hints
 69 | 
 70 | * You'll find the [PySpark documentation](https://spark.apache.org/docs/2.0.0/api/python/pyspark.html#pyspark.RDD) (especially the section on RDDs) **very** useful.
 71 | * Run your Spark jobs on a subset of the data when you're debugging. Even though Spark is very fast, jobs can still take a long time - especially when you're working with the review dataset. When you are experimenting, always use a subset of the data. The best way to use a subset of data is through the [take](https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/rdd/RDD.html#take(int)) command.
 72 | 
 73 | Specifically the most common pattern to sample data looks like
 74 | `rdd = sc.parallelize(rdd.take(100))`
 75 | This converts an rdd into a list of 100 items and then back into an RDD through the `.parallelize` function.
 76 | 
 77 | * [Programming Guide](http://spark.apache.org/docs/latest/programming-guide.html) -- This documentation by itself could be used to solve the entire MP. It is a great quick-start guide about Spark.
 78 | 
 79 | ### Spark Web Interface
 80 | 
 81 | Similar to what we saw in the past weeks with the Hadoop Job web interface, Spark has a really useful web interface to view how jobs are executed. Accessing this web interface is similar to accessing the Hadoop interface. The destination for this web UI is `192-168-100-15.local:18081`.
 82 | 
 83 | UNIX users can access this UI by appending `-L 8002:192-168-100-15.local:18081` to their ssh command, and then accessing `localhost:8002` in their local web browser. Windows users can use PuTTY in a similar fashion.
 84 | 
 85 | 
 86 | ## The Dataset
 87 | 
 88 | This week, we'll be working off of a set of released Yelp data.
 89 | 
 90 | The dataset is located in `/shared/yelp` in HDFS. We'll be using the following files for this MP:
 91 | 
 92 | ```
 93 | /shared/yelp/yelp_academic_dataset_business.json
 94 | /shared/yelp/yelp_academic_dataset_checkin.json
 95 | /shared/yelp/yelp_academic_dataset_review.json
 96 | /shared/yelp/yelp_academic_dataset_user.json
 97 | ```
 98 | 
 99 | We'll give more details about the data in these files as we continue with the MP, but the general schema is this: each line in each of these JSON files is an independent JSON object that represents a *distinct entity*, whether it be a business, a review, or a user.
100 | 
101 | *Hint:* JSON is parsed with `json.loads`
102 | 
103 | ## MP Activities
104 | **MP 4 is due on Wednesday, October 11th, 2017 at 11:55PM.**
105 | 
106 | Please zip your source files for the following exercises and upload it to Moodle (learn.illinois.edu).
107 | 
108 | **IMPORTANT:** To grade your code, we may use different datasets than the ones we have provided you. In order to receive credit for this assignment, you **must** accept an argument in your job. This single argument will be the location of the dataset. Note that the starting code for this MP already has this implemented.
109 | 
110 | ### 1. Least Expensive Cities
111 | 
112 | This problem uses the `yelp_academic_dataset_business.json` dataset.
113 | 
114 | In planning your next road trip, you want to find the cities that will be the least expensive to dine at.
115 | 
116 | It turns out that Yelp keeps track of a handy metric for this, and many restaurants have the attribute `RestaurantsPriceRange2` that gives the business a score from 1-4 as far as 'priciness'.
117 | 
118 | Write a PySpark application that sorts cities by the average 'priciness' of their businesses/restaurants.
119 | 
120 | Notes:
121 | 
122 | * Discard any business that does not have the `RestaurantsPriceRange2` attribute
123 | * Discard any business that does not have a valid city and state
124 | * Your output should be sorted descending by average price (highest at top, lowest at bottom). Your average restaurant price should be rounded to 2 decimal places. Each city should get a row in the output and look like:
125 | 
126 |     ```
127 |     CITY, STATE: PRICE
128 | 
129 |     Example:
130 | 
131 |     Champaign, IL: 1.23
132 |     ```
133 | 
134 | ### 2. Best Reviews
135 | 
136 | This problem uses the `yelp_academic_dataset_review.json` dataset.
137 | 
138 | In selecting a restaurant, you might want to find the review that Yelp users have decided is the best review. For each business in the `review` dataset, find the review that has the greatest engagement. That is, find the review that has the greatest `useful` + `funny` + `cool` interactions. These are fields of the review
139 | 
140 | Your output should should look like this:
141 | 
142 |     ```
143 |     BUSINESS_ID REVIEW_ID
144 | 
145 |     Example: 
146 |     KYasaF1nov1bn7phfSgWeg EmuqmSacByt96t8G5GK0KQ
147 |     ```
148 | 
149 | ### 3. Yelp Reviewer Accuracy
150 | 
151 | For this activity, we'll be looking at [Yelp reviews](https://www.youtube.com/watch?v=QEdXhH97Z7E). 😱 Namely, we want to find out which Yelp reviewers are... more harsh than they should be.
152 | 
153 | To do this we will calculate the average review score of each business in our dataset, and find how far away users' ratings are from the average of a business.
154 |  
155 | The `average_business_rating` of a business is the sum of the ratings of the business divided by the count of the ratings for that business. A user's review offset score is the sum of the differences between their rating and the average business rating. A positive score indicates that a user tends to give higher ratings than the average; a negative score indicates that the user tends to give lower ratings than the average.
156 | 
157 | Your output should look like this:
158 | ```
159 | user_id: average_review_offset
160 | ```
161 | 
162 | Your output should first list the users with the top 100 positive review offsets, then the users with the top 100 negative review offsets (by magnitude).
163 | 
164 | Notes:
165 | 
166 | * Business have "average rating" as a property. We **will not** be using this. Instead - to have greater precision - we will be manually calculating a business' average reviews by averaging all the review scores given in `yelp_academic_dataset_review.json`.
167 | * Discard any reviews that do not have a rating, a `user_id`, and a `business_id`.
168 | 
169 | ### 4. Descriptors of a Good Business
170 | 
171 | Suppose we want to predict a review score from its text. There are many ways we could do this, but a simple way would be to find words that are indicative of either a positive or negative review.
172 | 
173 | In this activity, we want to find the words that are the most positively 'charged'. We can think about the probability that a word shows up in a review as depending on the type of a review. For example, it is more likely that "delicious" would show up in a positive review than a negative one.
174 | 
175 | Calculate the probability of each word appearing to be the number of occurrences of the word in the category tested (positive/negative) divided by the number of reviews in that category.
176 | 
177 | For example, if we had a dataset of the following reviews:
178 | 
179 | ```
180 | 1 star: The food was delicious, but everything else was bad.
181 | 5 star: Everything was delicious!
182 | 4 star: I liked the food, it was delicious.
183 | 3 star: Meh, it was OK.
184 | 2 star: The food really was pretty gross
185 | ```
186 | 
187 | We see that `delicious` appears in `2` positive reviews, and `1` negative review. We have `3` total positive reviews, and `2` total negative reviews. `P(positive) = (2/3) = 0.666` and `P(negative) = (1/2) = 0.5` (not realistic, of course). Our `probability_diff` for 'delicious' would therefore be `P(positive) - P(negative) = 0.666 - 0.5 = 0.166`.
188 | 
189 | Output the **top 250** words that are most likely to be in positive reviews, but not in negative reviews (maximize `P(positive) - P(negative)`).
190 | 
191 | Notes:
192 | 
193 | * Consider a review to be positive if it has `>=3` stars, and consider a review negative if it has `<3` stars.
194 | * Your output should be as follows, where `probability_diff` is `P(positive) - P(negative)` rounded to **4** decimal places and sorted in descending order:
195 | 
196 |     ```
197 |     word: probability_diff
198 | 
199 |     Example:
200 |     delicious 0.1234
201 |     ```
202 | 


--------------------------------------------------------------------------------
/MPs/MP4/best_reviews.py:
--------------------------------------------------------------------------------
 1 | from pyspark import SparkContext, SparkConf
 2 | import sys
 3 | if len(sys.argv) != 3:
 4 |     print("Usage: best_reviews.py INPUT OUTPUT")
 5 |     sys.exit()
 6 | 
 7 | input_file = sys.argv[1]
 8 | output_file = sys.argv[2]
 9 | 
10 | conf = SparkConf().setAppName("best_reviews")
11 | sc = SparkContext(conf=conf)
12 | 
13 | reviews = sc.textFile(input_file)
14 | 
15 | # After you're done:
16 | # <RDD>.saveAsTextFile(output_file)
17 | 


--------------------------------------------------------------------------------
/MPs/MP4/descriptors_of_good_business.py:
--------------------------------------------------------------------------------
 1 | from pyspark import SparkContext, SparkConf
 2 | import sys
 3 | if len(sys.argv) != 3:
 4 |     print("Usage: descriptors_of_good_business.py INPUT OUTPUT")
 5 |     sys.exit()
 6 | 
 7 | input_file = sys.argv[1]
 8 | output_file = sys.argv[2]
 9 | 
10 | conf = SparkConf().setAppName("descriptors_of_good_business")
11 | sc = SparkContext(conf=conf)
12 | 
13 | reviews = sc.textFile(input_file)
14 | 
15 | # After you're done:
16 | # <RDD>.saveAsTextFile(output_file)
17 | 


--------------------------------------------------------------------------------
/MPs/MP4/least_expensive_cities.py:
--------------------------------------------------------------------------------
 1 | from pyspark import SparkContext, SparkConf
 2 | import sys
 3 | if len(sys.argv) != 3:
 4 |     print("Usage: least_expensive_cities.py INPUT OUTPUT")
 5 |     sys.exit()
 6 | 
 7 | input_file = sys.argv[1]
 8 | output_file = sys.argv[2]
 9 | 
10 | conf = SparkConf().setAppName("least_expensive_cities")
11 | sc = SparkContext(conf=conf)
12 | 
13 | reviews = sc.textFile(input_file)
14 | 
15 | # After you're done:
16 | # <RDD>.saveAsTextFile(output_file)
17 | 


--------------------------------------------------------------------------------
/MPs/MP4/yelp_reviewer_accuracy.py:
--------------------------------------------------------------------------------
 1 | from pyspark import SparkContext, SparkConf
 2 | import sys
 3 | if len(sys.argv) != 3:
 4 |     print("Usage: yelp_reviewer_accuracy.py INPUT OUTPUT")
 5 |     sys.exit()
 6 | 
 7 | input_file = sys.argv[1]
 8 | output_file = sys.argv[2]
 9 | 
10 | conf = SparkConf().setAppName("yelp_reviewer_accuracy")
11 | sc = SparkContext(conf=conf)
12 | 
13 | reviews = sc.textFile(input_file)
14 | 
15 | # After you're done:
16 | # <RDD>.saveAsTextFile(output_file)
17 | 


--------------------------------------------------------------------------------
/MPs/MP5/README.md:
--------------------------------------------------------------------------------
  1 | # MP 5: Spark MLlib
  2 | 
  3 | ## Introduction
  4 | 
  5 | This week we'll be diving into another aspect of PySpark: MLlib. Spark MLlib provides an API to run machine learning algorithms on RDDs so that we can do ML on the cluster with the benefit of parallelism / distributed computing.
  6 | 
  7 | ## Machine Learning Crash Course
  8 | 
  9 | We'll be considering 3 types of machine learning in the MP this week:
 10 | 
 11 | * Classification
 12 | * Regression
 13 | * Clustering
 14 | 
 15 | However, for most of the algorithms / feature extractors that MLlib provides, there is a common pattern:
 16 | 
 17 | 1) Fit - Trains the model, using training data to adjust the model's internal parameters.
 18 | 
 19 | 2) Transform - Use the fitted model to predict the MPel/value of novel data (data not used in the training of the model).
 20 | 
 21 | If you go on to do more data science work, you'll see that this 2-phase ML pattern is common in other ML libraries, like `scikit-learn`.
 22 | 
 23 | Things are a bit more complicated in PySpark, because part of the way RDDs are handled (i.e. lazy evaluation), we often have to explicitly note when we want to predict data, and other instances when we're piping data through different steps of our model's setup.
 24 | 
 25 | It'll be extremely valuable to look up PySpark's documentation and example when working on this week's MP. The MP examples we'll be giving you do not require deep knowledge of Machine Learning concepts to complete. However, you will need to be good documentation-readers to navigate MLlib's nuances.
 26 | 
 27 | ## Examples
 28 | 
 29 | ### TF-IDF Naive Bayes Yelp Review Classification
 30 | 
 31 | #### Extracting Features
 32 | 
 33 | Remember last week when you found out which words were correlated with negative reviews by calculating the probability of a word occurring in a review? PySpark lets you do something like this extremely easily to calculate the Term Frequency - Inverse Document Frequency ([tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)) characteristics of a set of texts.
 34 | 
 35 | Let's define some terms:
 36 | 
 37 | * Term frequency - The number of times a word appears in a text
 38 | * Inverse document frequency - Weights a word's importance in a text by seeing if that word is rare in the collection of all texts. So, if we have a sentence contains the only reference to "cats" in an entire book, that sentence will have "cats" ranked as highly relevant.
 39 | 
 40 | TF-IDF combines the previous two concepts. Suppose we had a few sentences that refer to "cats" in a large book. We'd rank those rare sentences then by the frequency of the "cats" in each of those sentences.
 41 | 
 42 | There's a fair amount of math behind calculating TF-IDF, but for this MP it is sufficient to know that it is a relatively reliable way of guessing the relevance of a word in the context of a large body of data.
 43 | 
 44 | You'll also note that we're making use of a `HashingTF`. This is just a really quick way to compute the term-frequency of words. It uses a hash function to represent a long string with a shorter hash, and can use a data structure like a hash map to quickly count the frequency with which words appear.
 45 | 
 46 | #### Classifying Features
 47 | 
 48 | We'll also be using a [Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) classifier. This type of classifier looks at a set of data and labels, and constructs a model to predict the label given the data using probabilistic means.
 49 | 
 50 | Again, it's not necessary to know the inner workings of Naive Bayes, just that we'll be using it to classify data.
 51 | 
 52 | #### Constructing a model
 53 | 
 54 | To construct a model, we'll need to construct an RDD that has (key, value) pairs with keys as our labels, and values as our features. First, however, we'll need to extract those features from the text. We're going to use TF-IDF as our feature, so we'll calculate that for all of our text first.
 55 | 
 56 | We'll start with the assumption that you've transformed the data so that we have `(label, array_of_words)` as the RDD. To start with, we'll have label be `0` if the review is negative and `1` if the review is positive. You practiced how to do this last week.
 57 | 
 58 | Here's how we'll extract the TF-IDF features:
 59 | 
 60 | ```python
 61 | # Feed HashingTF just the array of words
 62 | tf = HashingTF().transform(labeled_data.map(lambda x: x[1]))
 63 | 
 64 | # Pipe term frequencies into the IDF
 65 | idf = IDF(minDocFreq=5).fit(tf)
 66 | 
 67 | # Transform the IDF into a TF-IDF
 68 | tfidf = idf.transform(tf)
 69 | 
 70 | # Reassemble the data into (label, feature) K,V pairs
 71 | zipped_data = (labels.zip(tfidf)
 72 |                      .map(lambda x: LabeledPoint(x[0], x[1]))
 73 |                      .cache())
 74 | ```
 75 | 
 76 | Now that we have our labels and our features in one RDD, we can train our model:
 77 | 
 78 | ```
 79 | # Do a random split so we can test our model on non-trained data
 80 | training, test = zipped_data.randomSplit([0.7, 0.3])
 81 | 
 82 | # Train our model with the training data
 83 | model = NaiveBayes.train(training)
 84 | ```
 85 | 
 86 | Then, we can use this model to predict new data:
 87 | ```python
 88 | # Use the test data and get predicted labels from our model
 89 | test_preds = (test.map(lambda x: x.label)
 90 |                   .zip(model.predict(test.map(lambda x: x.features))))
 91 | ```
 92 | 
 93 | If we look at this `test_preds` RDD, we'll see our text, and the label the model predicted.
 94 | 
 95 | However, if we want a more precise measurement of how our model faired, PySpark gives us `MulticlassMetrics`, which we can use to measure our model's performance.
 96 | 
 97 | ```python
 98 | trained_metrics = MulticlassMetrics(train_preds.map(lambda x: (x[0], float(x[1]))))
 99 | test_metrics = MulticlassMetrics(test_preds.map(lambda x: (x[0], float(x[1]))))
100 | 
101 | print trained_metrics.confusionMatrix().toArray()
102 | print trained_metrics.precision()
103 | 
104 | print test_metrics.confusionMatrix().toArray()
105 | print test_metrics.precision()
106 | ```
107 | 
108 | #### Analyzing our Results
109 | `MulticlassMetrics` let's us see the ["confusion matrix"](https://en.wikipedia.org/wiki/Confusion_matrix) of our model, which shows us how many times our model chose each label given the actual label of the data point.
110 | 
111 | The meaning of the columns is the _predicted_ value, and the meaning of the rows is the _actual_ value. So, we read that `confusion_matrix[0][1]` is the number of items predicted as having `label[1]` that were in actuality `label[0]`.
112 | 
113 | Thus, we want our confusion matrix to have as many items on the diagonals as possible, as these represent items that were correctly predicted.
114 | 
115 | We can also get precision, which is a more simple metric of "how many items we predicted correctly".
116 | 
117 | Here's our results for this example:
118 | 
119 | ```
120 | # Training Data Confusion Matrix:
121 | [[ 2019245.   115503.]
122 |  [  258646.   513539.]]
123 | # Training Data Accuracy:
124 | 0.8712908071840665
125 | 
126 | # Testing Data Confusion Matrix:
127 | [[ 861056.   55386.]
128 |  [ 115276.  214499.]]
129 | #Testing Data Accuracy:
130 | 0.8630559525347512
131 | ```
132 | 
133 | Not terrible. As you see, our training data gets slightly better prediction precision, because it's the data used to train the model.
134 | 
135 | #### Extending the Example
136 | 
137 | What if instead of just classifying on positive and negative, we try to classify reviews based on their 1-5 stars review score?
138 | 
139 | ```
140 | # Training Data Confusion Matrix:
141 | [[ 130042.   38058.   55682.  115421.  193909.]
142 |  [  27028.   71530.   26431.   55381.   95007.]
143 |  [  35787.   22641.  102753.   71802.  122539.]
144 |  [  72529.   45895.   69174.  254838.  246081.]
145 |  [ 113008.   73249.  108349.  225783.  535850.]]
146 | # Training Data Accuracy:
147 | 0.37645263439801124
148 | 
149 | # Testing Data Confusion Matrix:
150 | [[  33706.   20317.   27553.   54344.   90325.]
151 |  [  15384.   10373.   14875.   28413.   46173.]
152 |  [  18958.   13288.   19389.   37813.   59746.]
153 |  [  36921.   25382.   37791.   76008.  120251.]
154 |  [  57014.   37817.   55372.  112851.  194319.]]
155 | #Testing Data Accuracy:
156 | 0.268241369417615
157 | ```
158 | 
159 | Ouch. What went wrong? Well, a couple things. One thing that hurts us is that Naive Bayes is, well, Naive. While we intuitively know that the meanings 1, 2, 3, 4, 5 have a specific value, NB doesn't have any concept that items labeled 4 and 5 are probably going to be closer than a pair labeled 1 and 5.
160 | 
161 | Also, in this example we see a case where testing out training data doesn't have much utility. While an accuracy of `0.376` isn't great, it's still a lot better than `0.268`. Validating on the training data would lead us to think that our model is substantially more accurate than it actually is.
162 | 
163 | #### Conclusion
164 | 
165 | The full code of the first example is in `bayes_binary_tfidf.py`, and the second "extended" example is in `bayes_tfidf.py`.
166 | 
167 | ## MP Activities
168 | **MP 5 is due on Wednesday, October 18th, 2017 at 11:55PM.**
169 | 
170 | Please zip your source files for the following exercises and upload it to Moodle (learn.illinois.edu).
171 | 
172 | **NOTE:** 
173 | 
174 | * For each problem you may only use, at most, 80% of the dataset to train on. The other 20% should be used for testing your model. (i.e. use `rdd.randomSplit([0.8, 0.2])`)
175 | * Our cluster has PySpark version 2.1.1. Use [this](https://spark.apache.org/docs/2.1.1/api/python/pyspark.html#) documentation in your research.
176 | * This MP is a bit more "flexible" than previous MPs. We want you to experiment with these tools, and learn how to adjust the parameters to improve accuracy. Have fun! :) 
177 | 
178 | ### 1. Amazon Review Score Classification
179 | This week, we'll be using an Amazon dataset of food reviews. You can find this dataset in HDFS at `/shared/amazon/amazon_food_reviews.csv`. The dataset has the following columns:
180 | 
181 | ```
182 | Id, ProductId, UserId, ProfileName, HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary, Text
183 | ```
184 | Similar to the Yelp Dataset, Amazon's food review dataset provides you with some review text and a review score. Use MLlib to classify these reviews by score. You can use any classifiers and feature extractors that are available. You may also choose to classify either on positive/negative or the more granular stars rating.
185 | 
186 | Notes:
187 | 
188 | * You can use the any fields other than `HelpfulnessNumerator` or `HelpfulnessDenominator` for feature extraction.
189 | * Use `MulticlassMetrics` to output the `confusionMatrix` and `precision` of your model. You want to maximize the precision. Include this output in your submission.
190 | 
191 | ### 2. Amazon Review Helpfulness Regression
192 | 
193 | Amazon also gives a metric of "helpfulness". The dataset has the number of users who marked a review as helpful, and the number of users who voted either up or down on the review.
194 | 
195 | Define a review's helpfulness score as `HelpfulnessNumerator / HelpfulnessDenominator`.
196 | 
197 | Construct and train a model that uses a regression algorithm to predict a review's helpfulness score from it's text. 
198 | 
199 | Notes:
200 | 
201 | * You can use the any fields other than `Score` for feature extraction.
202 | * We suggest that at for a starting point, you use `pyspark.mllib.regression.LinearRegressionWithSGD` as your regression model.
203 | * Use `pyspark.mllib.evaluation.RegressionMetrics` to output the `explainedVariance` and `rootMeanSquaredError`. You want to minimize the error.
204 | 
205 | ### 3. Yelp Business Clustering
206 | 
207 | Going back to the Yelp dataset, suppose we want to find clusters of business in the Urbana/Champaign area. Where do businesses aggregate geographically? Could we predict from a set of coordinates which cluster of business a given business is in? Use K-Means to come up with a clustering model for the U-C area.
208 | 
209 | How can we determine how good our model is? The simplest way is to just graph it, and see if the clusters match what we would expect. More formally, we can use Within Set Sum of Squared Error ([WSSSE](https://spark.apache.org/docs/1.5.0/mllib-clustering.html#k-means)) to determine the optimal number of clusters. If we plot the error for multiple values of k, we can see the point of diminishing returns to adding more clusters. You should pick a value of k that is around this point of diminishing return.
210 | 
211 | Notes:
212 | 
213 | * Use `pyspark.mllib.clustering.KMeans` as your clustering algorithm.
214 | * Your task is to:
215 |     1. Extract the business that are in the U-C area and use their coordinates as features for your KMeans clustering model.
216 |     2. Select a proper K such that you get a good approximation of the "actual" clusters of businesses. (May require trial-and-error)
217 |     3. Plot the businesses with `matplotlib.pyplot.scatter` and have each point on the scatter plot be color-keyed by their cluster. You can plot your clustering error in a separate script (it doesn't have to be part of your Spark job).
218 |     4. Include both the plot as a PNG and a short justification for your k value (either in comments in your code or in a separate `.txt`) in your submission.
219 | 


--------------------------------------------------------------------------------
/MPs/MP5/amazon_helpfulness_regression.py:
--------------------------------------------------------------------------------
1 | from pyspark import SparkContext, SparkConf
2 | conf = SparkConf().setAppName("Amazon Helpfulness Regression")
3 | sc = SparkContext(conf=conf)
4 | 
5 | reviews = sc.textFile("gs://dataproc-3ba9e17b-802e-4fec-8f2d-4e0d4167cadb-us-central1/Datasets/amazon/amazon_food_reviews.csv")
6 | 
7 | with open('amazon_helpfulness_regression.txt', 'w+') as f:
8 |     pass
9 | 


--------------------------------------------------------------------------------
/MPs/MP5/amazon_review_classification.py:
--------------------------------------------------------------------------------
1 | from pyspark import SparkContext, SparkConf
2 | conf = SparkConf().setAppName("Amazon Review Classification")
3 | sc = SparkContext(conf=conf)
4 | 
5 | reviews = sc.textFile("gs://dataproc-3ba9e17b-802e-4fec-8f2d-4e0d4167cadb-us-central1/Datasets/amazon/amazon_food_reviews.csv")
6 | 
7 | with open('amazon_review_classification.txt', 'w+') as f:
8 |     pass
9 | 


--------------------------------------------------------------------------------
/MPs/MP5/bayes_binary_tfidf.py:
--------------------------------------------------------------------------------
 1 | from pyspark.mllib.feature import HashingTF, IDF
 2 | from pyspark.mllib.regression import LabeledPoint
 3 | from pyspark.mllib.classification import NaiveBayes
 4 | from pyspark.mllib.evaluation import MulticlassMetrics
 5 | import json
 6 | import nltk
 7 | from pyspark import SparkContext, SparkConf
 8 | conf = SparkConf().setAppName("Bayes Binary TFIDF")
 9 | sc = SparkContext(conf=conf)
10 | 
11 | 
12 | def get_labeled_review(x):
13 |     return x.get('stars'), x.get('text')
14 | 
15 | 
16 | def categorize_review(x):
17 |     return (0 if x[0] > 2.5 else 1), x[1]
18 | 
19 | 
20 | def format_prediction(x):
21 |     return "actual: {0}, predicted: {1}".format(x[0], float(x[1]))
22 | 
23 | 
24 | def produce_tfidf(x):
25 |     tf = HashingTF().transform(x)
26 |     idf = IDF(minDocFreq=5).fit(tf)
27 |     tfidf = idf.transform(tf)
28 |     return tfidf
29 | 
30 | # Load in reviews
31 | reviews = sc.textFile("gs://dataproc-3ba9e17b-802e-4fec-8f2d-4e0d4167cadb-us-central1/Datasets/yelp/review.json")
32 | # Parse to json
33 | json_payloads = reviews.map(json.loads)
34 | # Tokenize and weed out bad data
35 | labeled_data = (json_payloads.map(get_labeled_review)
36 |                              .filter(lambda x: x[0] and x[1])
37 |                              .map(lambda x: (float(x[0]), x[1]))
38 |                              .map(categorize_review)
39 |                              .mapValues(nltk.word_tokenize))
40 | labels = labeled_data.map(lambda x: x[0])
41 | 
42 | tf = HashingTF().transform(labeled_data.map(lambda x: x[1]))
43 | idf = IDF(minDocFreq=5).fit(tf)
44 | tfidf = idf.transform(tf)
45 | zipped_data = (labels.zip(tfidf)
46 |                      .map(lambda x: LabeledPoint(x[0], x[1]))
47 |                      .cache())
48 | 
49 | # Do a random split so we can test our model on non-trained data
50 | training, test = zipped_data.randomSplit([0.7, 0.3])
51 | 
52 | # Train our model
53 | model = NaiveBayes.train(training)
54 | 
55 | # Use our model to predict
56 | train_preds = (training.map(lambda x: x.label)
57 |                        .zip(model.predict(training.map(lambda x: x.features))))
58 | test_preds = (test.map(lambda x: x.label)
59 |                   .zip(model.predict(test.map(lambda x: x.features))))
60 | 
61 | # Ask PySpark for some metrics on how our model predictions performed
62 | trained_metrics = MulticlassMetrics(train_preds.map(lambda x: (x[0], float(x[1]))))
63 | test_metrics = MulticlassMetrics(test_preds.map(lambda x: (x[0], float(x[1]))))
64 | 
65 | with open('output_binary.txt', 'w+') as f:
66 |     f.write(str(trained_metrics.confusionMatrix().toArray()) + '\n')
67 |     f.write(str(trained_metrics.precision()) + '\n')
68 |     f.write(str(test_metrics.confusionMatrix().toArray()) + '\n')
69 |     f.write(str(test_metrics.precision()) + '\n')
70 | 


--------------------------------------------------------------------------------
/MPs/MP5/bayes_tfidf.py:
--------------------------------------------------------------------------------
 1 | from pyspark.mllib.feature import HashingTF, IDF
 2 | from pyspark.mllib.regression import LabeledPoint
 3 | from pyspark.mllib.classification import NaiveBayes
 4 | from pyspark.mllib.evaluation import MulticlassMetrics
 5 | import json
 6 | import nltk
 7 | from pyspark import SparkContext, SparkConf
 8 | conf = SparkConf().setAppName("Bayes TFIDF")
 9 | sc = SparkContext(conf=conf)
10 | 
11 | 
12 | def get_labeled_review(x):
13 |     return x.get('stars'), x.get('text')
14 | 
15 | 
16 | def format_prediction(x):
17 |     return "actual: {0}, predicted: {1}".format(x[0], float(x[1]))
18 | 
19 | 
20 | def produce_tfidf(x):
21 |     tf = HashingTF().transform(x)
22 |     idf = IDF(minDocFreq=5).fit(tf)
23 |     tfidf = idf.transform(tf)
24 |     return tfidf
25 | 
26 | # Load in reviews
27 | reviews = sc.textFile("gs://dataproc-3ba9e17b-802e-4fec-8f2d-4e0d4167cadb-us-central1/Datasets/yelp/review.json")
28 | # Parse to json
29 | json_payloads = reviews.map(json.loads)
30 | # Tokenize and weed out bad data
31 | labeled_data = (json_payloads.map(get_labeled_review)
32 |                              .filter(lambda x: x[0] and x[1])
33 |                              .map(lambda x: (float(x[0]), x[1]))
34 |                              .mapValues(nltk.word_tokenize))
35 | labels = labeled_data.map(lambda x: x[0])
36 | 
37 | tfidf = produce_tfidf(labeled_data.map(lambda x: x[1]))
38 | zipped_data = (labels.zip(tfidf)
39 |                      .map(lambda x: LabeledPoint(x[0], x[1]))
40 |                      .cache())
41 | 
42 | # Do a random split so we can test our model on non-trained data
43 | training, test = zipped_data.randomSplit([0.7, 0.3])
44 | 
45 | # Train our model
46 | model = NaiveBayes.train(training)
47 | 
48 | # Use our model to predict
49 | train_preds = (training.map(lambda x: x.label)
50 |                        .zip(model.predict(training.map(lambda x: x.features))))
51 | test_preds = (test.map(lambda x: x.label)
52 |                   .zip(model.predict(test.map(lambda x: x.features))))
53 | 
54 | # Ask PySpark for some metrics on how our model predictions performed
55 | trained_metrics = MulticlassMetrics(train_preds.map(lambda x: (x[0], float(x[1]))))
56 | test_metrics = MulticlassMetrics(test_preds.map(lambda x: (x[0], float(x[1]))))
57 | 
58 | with open('output_discrete.txt', 'w+') as f:
59 |     f.write(str(trained_metrics.confusionMatrix().toArray()) + '\n')
60 |     f.write(str(trained_metrics.precision()) + '\n')
61 |     f.write(str(test_metrics.confusionMatrix().toArray()) + '\n')
62 |     f.write(str(test_metrics.precision()) + '\n')
63 | 


--------------------------------------------------------------------------------
/MPs/MP5/yelp_clustering.py:
--------------------------------------------------------------------------------
1 | from pyspark import SparkContext, SparkConf
2 | conf = SparkConf().setAppName("Yelp Clustering")
3 | sc = SparkContext(conf=conf)
4 | 
5 | businesses = sc.textFile("gs://dataproc-3ba9e17b-802e-4fec-8f2d-4e0d4167cadb-us-central1/Datasets/yelp/business.json")
6 | 
7 | with open('yelp_clustering.txt', 'w+') as f:
8 |     pass
9 | 


--------------------------------------------------------------------------------
/MPs/MP6/README.md:
--------------------------------------------------------------------------------
  1 | # MP 6: Spark SQL
  2 | 
  3 | ## Introduction
  4 | 
  5 | Spark SQL is a powerful way for interacting with large amounts of structured data. Spark SQL gives us the concept of "dataframes", which will be familiar if you've ever done work with Pandas or R. DataFrames can also be thought of as similar to tables in databases.
  6 | 
  7 | With Spark SQL Dataframes we can interact with our data using the Structured Query Language (SQL). This gives us a declarative way to query our data, as opposed to the imperative methods we've studied in past weeks (i.e. discrete operations on sets of RDDs)
  8 | 
  9 | ## SQL Crash Course
 10 | 
 11 | SQL is a declarative language used for querying data. The simplest SQL query is a `SELECT - FROM - WHERE` query. This selects a set of attributes (`SELECT`) from a specific table (`FROM`) where a given set of conditions holds (`WHERE`).
 12 | 
 13 | However, SQL also has a series of more advanced aggregation commands for grouping data. This is accomplished with the `GROUP BY` keyword. We can also join tables on attributes or conditions with the set of `JOIN ... ON` commands. We won't be expecting advanced knowledge of these topics, but developing a working understanding of how these work will be useful in completing this MP.
 14 | 
 15 | Spark SQL has a pretty good [Programming Guide](https://spark.apache.org/docs/2.1.0/sql-programming-guide.html) that's worth looking at.
 16 | 
 17 | Additionally, you may find [SQL tutorials](https://www.w3schools.com/sql/default.asp) online useful for this assignment.
 18 | 
 19 | ## Examples
 20 | 
 21 | ### Loading Tables
 22 | 
 23 | The easiest way to get data into Spark SQL is by registering a DataFrame as a table. A DataFrame is essentially an instance of a Table: it has a schema (columns with data types and names), and data.
 24 | 
 25 | We can create a DataFrame by passing an RDD of data tuples and a schema to `sqlContext.createDataFrame`:
 26 | 
 27 | ```
 28 | data = sc.parallelize([('Tyler', 1), ('Quinn', 2), ('Ben', 3)])
 29 | df = sqlContext.createDataFrame(data, ['name', 'instructor_id'])
 30 | ```
 31 | 
 32 | This creates a DataFrame with 2 columns: `name` and `instructor_id`.
 33 | 
 34 | We can then register this frame with the sqlContext to be able to query it generally:
 35 | 
 36 | ```
 37 | sqlContext.registerDataFrameAsTable(df, "instructors")
 38 | ```
 39 | 
 40 | Now we can query the table:
 41 | 
 42 | ```
 43 | sqlContext.sql("SELECT name FROM instructors WHERE instructor_id=3")
 44 | ```
 45 | 
 46 | ### Specific Business Subset
 47 | 
 48 | Suppose we want to find all the businesses located in Champaign, IL that have 5 star ratings. We can do this with a simple `SELECT - FROM - WHERE` query:
 49 | 
 50 | ```python
 51 | sqlContext.sql("SELECT * "
 52 |                "FROM businesses "
 53 |                "WHERE stars=5 "
 54 |                "AND city='Champaign' AND state='IL'").collect()
 55 | ```
 56 | 
 57 | This selects all the rows from the `businesses` table that match the criteria described in the `WHERE` clause.
 58 | 
 59 | ### Highest Number of Reviews
 60 | 
 61 | Suppose we want to rank users by how many reviews they've written. We can do this query with aggregation and grouping:
 62 | 
 63 | ```python
 64 | sqlContext.sql("SELECT user_id, COUNT(*) AS c"
 65 |                "FROM reviews "
 66 |                "GROUP BY user_id "
 67 |                "ORDER BY c DESC "
 68 |                "LIMIT 10").collect()
 69 | ```
 70 | 
 71 | This query groups rows by the `user_id` column, and collapses those rows into tuples of `(user_id, COUNT(*))`, where `COUNT(*)` is the number of collapsed rows per grouping. This gives us the review count of each user. We then do `ORDER BY c DESC` to show the top counts first, and `LIMIT 10` to only show the top 10 results.
 72 | 
 73 | ## MP Activities
 74 | **MP 6 is due on Saturday, November 4th, 2017 at 11:55PM.**
 75 | 
 76 | Please zip your source files for the following exercises and upload it to Moodle (learn.illinois.edu).
 77 | 
 78 | **NOTE:**
 79 | 
 80 | * For each of these problems you may use RDDs *only* for loading in and saving data to/from HDFS. All of your "computation" must be performed on DataFrames, either via the SQLContext or DataFrame interfaces.
 81 | * We _suggest_ using the SQLContext for most of these problems, as it's generally a more straight-forward interface.
 82 | 
 83 | ### 1. Quizzical Queries
 84 | 
 85 | For this problem, we'll construct some simple SQL queries on the Amazon Review dataset that we used last week. Your first task is to create a DataFrame from the CSV set. Once you've done this, write queries that get the requested information about the data. Format and save your output and include it in your submission.
 86 | 
 87 | **NOTE:** For this problem, you *must* use `sqlContext.sql` to run your queries. This means, you have to run `sqlContext.registerDataFrameAsTable` on your constructed DataFrame and write queries in raw SQL.
 88 | 
 89 | Queries:
 90 | 
 91 | 1. What is the review text of the review with id `22010`?
 92 | 2. How many 5-star ratings does product `B000E5C1YE` have?
 93 | 3. How any unique users have written reviews?
 94 | 
 95 | Notes:
 96 | 
 97 | * You'll want to use `csv.reader` to parse your data. Using `str.split(',')` is insufficient, as there will be commas in the Text field of the review.
 98 | 
 99 | ### 2. Aggregation Aggravation
100 | 
101 | For this problem, we'll use some more complicated parts of the SQL language. Often times, we'll want to learn aggregate statistics about our data. We'll use `GROUP BY` and aggregation methods like `COUNT`, `MAX`, `AVG` to find out more interesting information about our dataset.
102 | 
103 | Queries:
104 | 
105 | 1. How many reviews has the person who has written the most number of reviews written? What is that user's UserId?
106 | 2. List the ProductIds of the products with the top 10 highest average review scores of products that have more than 10 reviews, ordered by product score, with ties broken by number of reviews.
107 | 3. List the Ids of the reviews with the top 10 highest ratios between `HelpfulnessNumerator` and `HelpfulnessDenominator`, which have `HelpfulnessDenominator` more than 10, ordered by that ratio, with ties broken by `HelpfulnessDenominator`.
108 | 
109 | Notes: 
110 | 
111 | * You'll want to use `csv.reader` to parse your data. Using `str.split(',')` is insufficient, as there will be commas in the Text field of the review.
112 | * You may use DataFrame query methods other than `sqlContext.sql`, but you must still do all your computations on DataFrames.
113 | 
114 | ### 3. Jaunting with Joins
115 | 
116 | For this problem, we'll switch back to the Yelp dataset. Note that you can use the very handy [json DataFrame reader](https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json) method to load in the dataset as a DataFrame.
117 | 
118 | There are some times that we need to access data that is split accross multiple tables. For instance, when we look at a single Yelp review, we cannot directly get the user's name, because we only have their id. But, we can match users with their reviews by "joining" on their user id. The database does this by looking for rows with matching values for the join columns.
119 | 
120 | You'll want to look up the `JOIN` (specifically `INNER JOIN`) SQL commands for these problems.
121 | 
122 | Queries:
123 | 1) What state has had the most Yelp check-ins?
124 | 2) What is the maximum number of "funny" ratings left on a review created by someone who's been yelping since 2012?
125 | 3) List the user ids of anyone who has left a 1-star review, has created more than 250 reviews, and has left a review in Champaign, IL.
126 | 
127 | Notes: 
128 | 
129 | * You may use DataFrame query methods other than `sqlContext.sql`, but you must still do all your computations on DataFrames.


--------------------------------------------------------------------------------
/MPs/MP6/aggregation_aggravation.py:
--------------------------------------------------------------------------------
 1 | from pyspark import SparkContext, SparkConf
 2 | from pyspark.sql import SQLContext
 3 | import csv
 4 | conf = SparkConf().setAppName("Aggregation Aggravation")
 5 | sc = SparkContext(conf=conf)
 6 | sqlContext = SQLContext(sc)
 7 | schema = "Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text".split(',')
 8 | 
 9 | 
10 | def parse_csv(x):
11 |     x = x.encode('ascii', 'ignore').replace('\n', '')
12 |     d = csv.reader([x])
13 |     return next(d)
14 | 
15 | reviews = sc.textFile("hdfs:///shared/amazon/amazon_food_reviews.csv")
16 | first = reviews.first()
17 | csv_payloads = reviews.filter(lambda x: x != first).map(parse_csv)
18 | 
19 | df = sqlContext.createDataFrame(csv_payloads, schema)
20 | sqlContext.registerDataFrameAsTable(df, "amazon")
21 | 
22 | # Do your queries here
23 | 
24 | # CHANGE THESE TO YOUR ANSWERS
25 | query_1_num_reviews= 123
26 | query_1_user_id = 123
27 | query_2_product_ids = [123, 456]
28 | query_3_product_ids = [123, 456]
29 | 
30 | # DON'T EDIT ANYTHING BELOW THIS COMMENT
31 | with open('aggregation_aggravation.txt', 'w+') as f:
32 |     f.write("1: {}, {}\n".format(query_1_num_reviews, query_1_user_id))
33 |     f.write('3: {}\n'.format(','.join(map(str, query_2_product_ids))))
34 |     f.write('3: {}\n'.format(','.join(map(str, query_3_product_ids))))
35 | 


--------------------------------------------------------------------------------
/MPs/MP6/jaunting_with_joins.py:
--------------------------------------------------------------------------------
 1 | from pyspark import SparkContext, SparkConf
 2 | from pyspark.sql import SQLContext, SparkSession
 3 | conf = SparkConf().setAppName("Jaunting With Joins")
 4 | sc = SparkContext(conf=conf)
 5 | sqlContext = SQLContext(sc)
 6 | spark = SparkSession.builder.getOrCreate()
 7 | 
 8 | reviews = spark.read.json("hdfs:///shared/yelp/review.json")
 9 | businesses = spark.read.json("hdfs:///shared/yelp/business.json")
10 | checkins = spark.read.json("hdfs:///shared/yelp/checkin.json")
11 | users = spark.read.json("hdfs:///shared/yelp/user.json")
12 | 
13 | sqlContext.registerDataFrameAsTable(reviews, "reviews")
14 | sqlContext.registerDataFrameAsTable(businesses, "businesses")
15 | sqlContext.registerDataFrameAsTable(checkins, "checkins")
16 | sqlContext.registerDataFrameAsTable(users, "users")
17 | 
18 | # Do your queries here
19 | 
20 | # CHANGE THESE TO YOUR ANSWERS
21 | query_1_state = ""
22 | query_2_maximum_funny = 123
23 | query_3_user_ids = [123, 456, 789]
24 | 
25 | # DON'T EDIT ANYTHING BELOW THIS COMMENT
26 | with open('jaunting_with_joins.txt', 'w+') as f:
27 |     f.write('1: {}\n'.format(query_1_state))
28 |     f.write('2: {}\n'.format(query_2_maximum_funny))
29 |     f.write('3: {}\n'.format(','.join(map(str, query_3_user_ids))))


--------------------------------------------------------------------------------
/MPs/MP6/quizzical_queries.py:
--------------------------------------------------------------------------------
 1 | import csv
 2 | from pyspark import SparkContext, SparkConf
 3 | from pyspark.sql import SQLContext
 4 | conf = SparkConf().setAppName("Quizzical Queries")
 5 | sc = SparkContext(conf=conf)
 6 | sqlContext = SQLContext(sc)
 7 | schema = "Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text".split(',')
 8 | 
 9 | 
10 | def parse_csv(x):
11 |     x = x.encode('ascii', 'ignore').replace('\n', '')
12 |     d = csv.reader([x])
13 |     return next(d)
14 | 
15 | reviews = sc.textFile("hdfs:///shared/amazon/amazon_food_reviews.csv")
16 | first = reviews.first()
17 | csv_payloads = reviews.filter(lambda x: x != first).map(parse_csv)
18 | 
19 | # Do your queries here
20 | 
21 | # CHANGE THESE TO YOUR ANSWERS
22 | query_1_review_string = "FOOBAR"
23 | query_2_five_star_ratings = 123
24 | query_3_user_count = 123
25 | 
26 | # DON'T EDIT ANYTHING BELOW THIS COMMENT
27 | with open('quizzical_queries.txt', 'w+') as f:
28 |     f.write("1: '{}'\n".format(query_1_review_string))
29 |     f.write("2: {}\n".format(query_2_five_star_ratings))
30 |     f.write('3: {}\n'.format(query_3_user_count))


--------------------------------------------------------------------------------
/MPs/MP7/README.md:
--------------------------------------------------------------------------------
  1 | # MP 7: Introduction to Terraform
  2 | 
  3 | ## Introduction
  4 | 
  5 | This MP will introduce Terraform. You will be writing a Terraform configuration file that will deploy resources in [Google Cloud Platform](http://cloud.google.com/) (GCP).
  6 | 
  7 | ### Background
  8 | 
  9 | [Terraform](https://www.terraform.io/) is one of the Infrastructure-as-Code frameworks that we talked about in lecture. It has become quite popular in recent years because it's relatively easy to configure, and is platform agnostic. Terraform supports AWS, GCP, and Azure, among other cloud platforms.
 10 | 
 11 | The main idea behind Terraform is that infrastructure should be:
 12 | 
 13 | * Totally reproducable
 14 | * Able to be applied consistently and automatically
 15 | * Able to be version controlled
 16 | * Able to be collaborated on
 17 | 
 18 | Terraform does this by saving your infrastructure configuration in `*.tf` files, and saving the current state of your deployed infrastructure in a `*.tfstate` file. We won't be talking much about `*.tfstate` files in the MP, but know that they are very important to include in your source control so that state can be maintained between users of Terraform.
 19 | 
 20 | Internally, Terraform has the concept of resource dependency. As we'll see later in this MP, Terraform calculates an internal dependency graph for all cloud resources. This is very useful because cloud infrastructure often needs to be created in a specific order. For example, you may need to create virtual disks *before* you can create your VM instances and attach those disks to those instances.
 21 | 
 22 | Additionally, Terraform is smart about "applying" infrastructure changes. If you change a property (i.e. the name) of a resource, Terraform is usually able to simply edit that resource instead of deleting the old resource and recreating it. However, this is not perfect, and sometimes limitations in the API of your cloud provider will necessitate deleting and recreating resources.
 23 | 
 24 | ### Example
 25 | 
 26 | The basic terraform structure for a resource is as follows:
 27 | 
 28 | ```
 29 | resource "<RESOURCE_TYPE>" "<RESOURCE_IDENTIFIER" {
 30 |     ...
 31 |     <RESOURCE SETTINGS>
 32 |     ...
 33 | }
 34 | ```
 35 | 
 36 | The following is an example of what a simple GCP SQL database setup could look like.
 37 | 
 38 | ```
 39 | resource "google_sql_database_instance" "master" {
 40 |   // The name of the database on GCP
 41 |   name = "master-instance"
 42 | 
 43 |   settings {
 44 |     tier = "D0"
 45 |   }
 46 | }
 47 | 
 48 | resource "google_sql_database" "test-database" {
 49 |   name      = "test-db"
 50 | 
 51 |   // Notice that we can access attributes of other resources
 52 |   instance  = "${google_sql_database_instance.master.name}"
 53 | 
 54 |   charset   = "latin1"
 55 |   collation = "latin1_swedish_ci"
 56 | }
 57 | 
 58 | resource "google_sql_user" "test-database-student-user" {
 59 |   name     = "cs199student"
 60 |   instance = "${google_sql_database_instance.master.name}"
 61 |   host     = "0.0.0.0"
 62 |   password = "changeme"
 63 | }
 64 | ```
 65 | 
 66 | An important thing to node is the interpolation syntax. For example, when creating the SQL user, we need to reference the SQL database that the user should be connected to. We could manually populate this instance name, because we are in control of the databases name, however it is much better to use interpolation. Note that `${google_sql_database_instance.master.name}` will be interpolated to `master-instance` (the name of the database) at runtime. This is useful because if later on we decide to change the name of the database, we only have to change this in one place, and not look for references to that specific name all over our codebase.
 67 | 
 68 | For more examples of Terraform syntax, check the documentation linked to in the Resources section.
 69 | 
 70 | 
 71 | ### Resources
 72 | 
 73 | You will have a much easier time with this MP if you heavily consult the Terraform documentation for GCP. You can find that documentation [here](https://www.terraform.io/docs/providers/google/index.html).
 74 | 
 75 | Additionally, Terraform has some example GCP compatible examples that you can refer to [here](https://github.com/terraform-providers/terraform-provider-google/tree/master/examples). (Be warned though that these examples are a bit more complicated than the work we're asking you to do)
 76 | 
 77 | ### Setup
 78 | 
 79 | Download and install the version of Terraform for your system from [this page](https://www.terraform.io/downloads.html).
 80 | 
 81 | ## MP Activities
 82 | 
 83 | ### Problem Statement
 84 | 
 85 | In this MP, we'll be setting up a very basic cloud application configuration. We'll be addressing how to setup a VM instance, different options for persistent storage, and some basic networking.
 86 | 
 87 | Terraform can display what our infrastructure looks like once it's setup. Here's the graph of what we'll setup in the subsequent problems:
 88 | 
 89 | ![](graph.png)
 90 | 
 91 | (*Hint:* Once you've completed the MP, you can run `terraform graph | dot -Tpng > graph.png` if you have [graphviz](http://www.graphviz.org/) installed to double check that your solution matches this graph!)
 92 | 
 93 | 
 94 | ### Problem 1 - Getting Everything Setup
 95 | First, we need to get a GCP project and credentials setup, so that we have a place to work on deploying our cloud infrastructure. Here's what you need to do:
 96 | 
 97 | 1. Create a project in GCP titled something like "cs199-YOUR_NETID-mp7". You will need to change the "project" field in `main.tf` to reflect your project name.
 98 | 2. Create and download an "Authenticated JSON File" from GCP using [this tutorial](https://www.terraform.io/docs/providers/google/index.html#authentication-json-file). This is how Terraform will authenticate with GCP when it goes to deploy your infrastructure. **Important:** Place the contents of this file in a file called `account.json` in this MP folder.
 99 |     * **Note:** You may have to create a "Service Account". The name of this service account can be arbitrary, but make sure to give it the "Project Owner" Role so that it has permissions to manipulate your cloud infrastructure.
100 | 3. Enable the following APIs in the GCP console. Do this by going to the "API and Services" menu item in the GCP console, navigating to the "Dashboard" of this view, and clicking "Enable API and Services". Then, enable the following APIs:
101 |     * Google Compute Engine API
102 |     * Cloud Storage JSON API
103 | 4. Run `terraform init` in the MP directory. This will download the GCP Terraform plugin, and get everything setup.
104 | 5. Run `terraform get` in the MP directory. This will load and validate the essentially empty `main.tf` Terraform file.
105 | 6. Run `terraform plan` and then `terraform apply`. Terraform should report that there are no changes needed. This is a good thing! We're just getting started.
106 | 
107 | **Note**: After each step, it is helpful to run the `terraform get`/`terraform plan`/`terraform apply` sequence. This will confirm for you that your terraform code is valid, and is able to be applied.
108 | 
109 | 
110 | ### Problem 2 - Creating a Storage Bucket
111 | As we discussed in lecture, "Storage Buckets" are a useful way to store static resources. Often times, user data (like images) will be stored in storage buckets. These buckets are also sometimes used for large datasets, as they tend to scale elastically. You don't have to have a maximum size on your storage buckets, like you need to do with virtual disks. These Storage Buckets are usually backed by a service, so that people can access data inside these buckets from the wider Internet (if that's how you set them up). In GCP this service is called "Google Storage"; in AWS, this is called "S3" (Simple Storage Service). This problem will have you setup a storage bucket.
112 | 
113 | 1. Create a `google_storage_bucket` resource that is named `cs199-file-storage-<YOUR_NETID>`.
114 | 2. Make sure that the storage bucket located in the `US`.
115 | 3. Enable `versioning` for this bucket.
116 | 
117 | ### Problem 3 - Creating an Instance
118 | Instances are the core of many cloud computing infrastructures. An "instance" is essentially just a managed VM. You can specify the requested specifications of your VM (memory, number of VPCUs, etc) at the creation time of your VM. Furthermore, GCP and other cloud platforms have a variety of ready-made "images" to initialize your VM. These images contain copies of a specific operating system. We'll be using Ubuntu for this example, but there are also public images of other Linux distributions (Debian, CentOS, etc.) and Windows.
119 | 
120 | 1. Create a `google_compute_instance` resource named `nebula-in-the-cloud`
121 | 2. Set the machine type of this instance to `n1-standard-1`
122 | 3. Add the following tags to the instance: `cs199`, `mp7`
123 | 4. Add a boot disk to your image, and make sure that it initializes to the `ubuntu-os-cloud/ubuntu-1604-lts` image
124 | 5. Make sure that your image is in the `us-central1-a` zone. (This is a reference to the physical location of your VM)
125 | 6. Set the description of your instance to be `Look, we're cooking with clouds now!`
126 | 7. Set the `metadata_startup_script` on you instance to be `"${file("startup.sh")}"`. This loads the simple script we've provided you, and runs it when your instance starts up.
127 | 
128 | (*Note:* Your network interfaces can be empty to begin with)
129 | 
130 | ### Problem 4 - Creating a Disk and Attaching It
131 | Virtual disks are another way of managing persistent storage. These are most analogous to traditional hard drives (or SSDs). They don't usually have a service behind them, so data on these disks cannot be directly accessed from outside the VM it's attached to. You also have to provide a maximum capacity for drives. In recent years, you have also been able to choose the storage medium of these drives. Usually, you have a choice between spinning disks (hard drives) and solid state media. Solid state media tends to be faster, but will cost extra.
132 | 
133 | 1. Create a `google_compute_disk` resource named `dataset-disk`
134 | 2. Set the disk's size to be `10` Gigabytes
135 | 3. Set the zone of the disk to be `us-central1-a`
136 | 4. Attach this disk to the instance you created in Problem 3 by modifying the configuration for your `nebula_in_the_cloud` instance (*Note:* You will need to read up on resource "self_link"s and variable interpolation to do this problem correctly)
137 | 
138 | ### Problem 5 - Making Your Instance Accessible
139 | Now that we have an instance setup, we ant to make it accessible to the wider Internet. By default, your VM will only be accessible from within your private project within GCP. It will only be assigned a private "internal" IP address (like the `192.168.*.*` address that your computer probably has internally). First we will allocate an external IP address from GCP. This will give us the exclusive right to use that IP address for whatever purpose we want. Then, we will attach this IP address to our instance so that traffic that is routed to our external IP will be routed to our instance. This will involve editing the configuration of our instance. Finally, we will create a firewall rule to allow ingress traffic on port `8000` to be accepted by our instance. By default, the networking settings on instances will be very strict so that no one can access your VM without your permission.
140 | 
141 | 1. Create a `google_compute_address` named `nebula-in-the-cloud-address`
142 | 2. Attach this address to your `nebula_in_the_cloud` instance by adding it as a `nat_ip` to the instances `network_interface`. (*Note:* You will need to read up on resource "self_link"s and variable interpolation to do this problem correctly)
143 | 3. Create a `google_compute_firewall` named `nebula-in-the-cloud-firewall`. Set the network to `default`, and allow the `tcp` protocol on port `8000`.
144 | 
145 | Go into the GCP console, and then check under "VPC Network" > "External IP Addresses". Find the IP address for `nebula-in-the-cloud-address`. If you run `ping <THE_ADDRESS>` in your terminal, you should be able to ping your instance from the "outside world". Cool!
146 | 
147 | Now, go to `<THE_ADDRESS>:8000` in your browser. You should see a friendly "hello world" page if all of your networking is setup correctly. Nice job! (*Note:* It may take a couple minutes for your instance to boot up. Don't worry if it takes a couple minutes)
148 | 
149 | ### Problem N - Tear it all down!
150 | 
151 | 🚨 **This is an important step!** 🚨
152 | 
153 | Run `terraform destroy`, and then type in `yes`.
154 | 
155 | **Note:** This will delete all of your infrastructure in GCP. We're just experimenting with GCP, so we don't want to waste our credits on resources that we aren't using! `terraform destroy` is analogous to `rm -rf /`, so be very careful about it's use in non-experimental situations!
156 | 
157 | ## Deliverables
158 | By the end of the MP, you should have **one** Terraform file (`main.tf`) that solves all the activity problems. We will check off if you have all **17** of the requirements listed above for one point each, totaling 17 points for this MP. Please, if you are having any troubles setting up, let us know during office hours or through Piazza.
159 | 
160 | Submit your Terraform file as MP7 on Moodle. That's it! (**Note**: You won't lose any points if you do this, but you probably don't want to submit your `account.json` as part of your submission)
161 | 


--------------------------------------------------------------------------------
/MPs/MP7/graph.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lcdm-uiuc/cs199-fa17/840d8ddf0264bf67f1251121c0f572c4b438901b/MPs/MP7/graph.png


--------------------------------------------------------------------------------
/MPs/MP7/main.tf:
--------------------------------------------------------------------------------
 1 | provider "google" {
 2 |   credentials = "${file("account.json")}"
 3 |   // Change the following line to the correct GCP project!
 4 |   project     = "cs199-YOUR_NETID-mp7"
 5 |   region      = "us-central1"
 6 | }
 7 | 
 8 | // Problem 2 - Creating a storage bucket
 9 | 
10 | // Problem 3 - Creating an Instance
11 | 
12 | // Problem 4 - Creating a Disk and Attaching It
13 | 
14 | // Problem 5 - Making Your Instance Accessible
15 | 


--------------------------------------------------------------------------------
/MPs/MP7/startup.sh:
--------------------------------------------------------------------------------
1 | echo "<h1>Hello World!</h1><p>Your instance started up correctly!</p>" > index.html
2 | python -m SimpleHTTPServer 8000 &


--------------------------------------------------------------------------------
/MPs/MP8/README.md:
--------------------------------------------------------------------------------
  1 | # MP 8: Introduction to Spark Streaming
  2 | 
  3 | ## Introduction
  4 | 
  5 | This MP will introduce Spark Streaming. You will be writing some simple Spark jobs that process streaming data from Twitter and Reddit.
  6 | 
  7 | ### Example
  8 | 
  9 | Here's an example of a Spark Streaming application that does "word count" on incoming Tweets
 10 | 
 11 | ```python
 12 | import json
 13 | from pyspark import SparkContext, SparkConf
 14 | from pyspark.streaming import StreamingContext
 15 | 
 16 | # Initialize the spark streaming context
 17 | conf = SparkConf().setAppName("Word count")
 18 | sc = SparkContext(conf=conf)
 19 | ssc = StreamingContext(sc, 10)
 20 | ssc.checkpoint('streaming_checkpoints')
 21 | 
 22 | # Stream window tunable parameters
 23 | window_length = 900 # The size of each window "slice" to consider
 24 | slide_interval = 30 # How often the windowed function is executed
 25 | 
 26 | 
 27 | # Parse tweets and return the text's word count
 28 | def get_tweet_word_count(tweet_json):
 29 |     try:
 30 |         data = json.loads(tweet_json)
 31 |         return len(data['text'].split(' '))
 32 |     except:
 33 |         return 0
 34 | 
 35 | 
 36 | # Listen to the tweet stream
 37 | tweet_json_lines = ssc.socketTextStream("nebula-m", 8001)
 38 | 
 39 | # Map json tweets to word counts
 40 | tweet_word_counts = tweet_json_lines.map(get_tweet_word_count)
 41 | 
 42 | # Do a windowed aggregate sum on the word counts
 43 | windowed_word_count = tweet_word_counts.reduceByWindow(lambda x, y: x + y, None, window_length, slide_interval)
 44 | 
 45 | # Save output to Hadoop
 46 | windowed_word_count.saveAsTextFiles('tweet_word_count')
 47 | 
 48 | 
 49 | # Signal to spark streaming that we've setup our streaming application, and it's ready to be run
 50 | ssc.start()
 51 | 
 52 | # Run the streaming application until it's terminated externally
 53 | ssc.awaitTermination()
 54 | ```
 55 | 
 56 | ### Resources
 57 | 
 58 | For this MP, you'll find the following Resouces useful:
 59 | 
 60 | * [Spark Streaming Programming Guide](https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html)
 61 | * [Spark Streaming API Documentation](https://spark.apache.org/docs/2.1.0/api/python/pyspark.streaming.html)
 62 | 
 63 | ### Setup
 64 | 
 65 | Download and install the version of Terraform for your system from [this page](https://www.terraform.io/downloads.html).
 66 | 
 67 | ## MP Activities
 68 | 
 69 | ### Problem 1 - Popular Hashtags
 70 | 
 71 | Write a Spark Streaming application that listens to an incoming stream of Tweets and counts the number of times each Hashtag is used within a given stream interval.
 72 | 
 73 | For example, if we saw a stream like this:
 74 | ```
 75 | Testing my cool application #HelloWorld
 76 | #HelloWorld Foo #Bar
 77 | Here's another Tweet #HelloWorld
 78 | ```
 79 | 
 80 | Then our hashtag count would look like this:
 81 | 
 82 | ```
 83 | ("#HelloWorld", 3)
 84 | ("#Bar", 1)
 85 | ```
 86 | 
 87 | For this problem, use a window length of **60 seconds** with a slide interval of **30 seconds**.
 88 | 
 89 | ### Problem 2 - Subreddit Most Common Words
 90 | 
 91 | Note that Reddit comments have a field called "subreddit" that indicates on which forum on Reddit the comments were posted. Write a Spark Streaming application that listens to an incoming stream of Reddit comments and outputs the **top 10** most used word in comments **per each subreddit**. Normalize the comments by transforming them to all lowercase, and use the "stopwords" provided in the solution template to remove common words.
 92 | 
 93 | For this problem, use a window length of **900 seconds** with a slide interval of **300 seconds**.
 94 | 
 95 | ### Problem 3 - Reddit Bot Detection
 96 | 
 97 | Write a Spark Streaming application that listens to an incoming stream of Reddit comments and detects users that post multiple similar comments within a given stream period.
 98 | 
 99 | A user should be reported as a "bot" if their comments within a given stream window fulfill all these criteria:
100 | 
101 | - The user has made at least **5** comments within the given window.
102 | - The user has made a comment that is within **0.25** similarity to at least half of all other comments that user made in the given window. Use [difflib.get_close_matches](https://docs.python.org/2/library/difflib.html#difflib.get_close_matches) to determine string similarity
103 | 
104 | The output of your program should be a list of users that have comment histories matching the above criteria.
105 | 
106 | (To sanity-check your output, you should probably see the "AutoModerator" user as a bot after your job has been running for a few minutes)
107 | 
108 | For this problem, use a window length of **900 seconds** with a slide interval of **300 seconds**.
109 | 
110 | ## Stream Formats
111 | 
112 | ### Twitter
113 | 
114 | You will receive Tweets as a JSON blob with the following schema:
115 | 
116 | ```
117 | {
118 |     "text": "<tweet body>"
119 | }
120 | ```
121 | 
122 | This stream can be accessed via TCP on `nebula-m:8001` from within the cluster.
123 | 
124 | If you want to look at the raw stream, you can do so by running `telnet nebula-m 8001` on the GCP cluster
125 | 
126 | ### Reddit
127 | 
128 | You will receive Reddit comments as a JSON blob with the following schema:
129 | 
130 | ```
131 | {
132 |     "text": "<comment body>",
133 |     "subreddit": "<subreddit name>",
134 |     "author": "<comment author>"
135 | }
136 | ```
137 | 
138 | This stream can be accessed via TCP on `nebula-m:8000` from within the cluster.
139 | 
140 | If you want to look at the raw stream, you can do so by running `telnet nebula-m 8000` on the GCP cluster
141 | 
142 | ### Tips
143 | 
144 | You can "listen" to the stream by running `telnet <stream_address> <stream_port>`. Spark Streaming considers each line as a distinct record, so you'll see a JSON blob on each line.
145 | 
146 | ## Deliverables
147 | **MP 8 is due on Saturday, December 2nd, 2017 at 11:55PM.**
148 | 
149 | Please zip your source files for the following exercises and upload it to Moodle (learn.illinois.edu).
150 | 


--------------------------------------------------------------------------------
/MPs/MP8/problem1.py:
--------------------------------------------------------------------------------
 1 | from pyspark import SparkContext, SparkConf
 2 | from pyspark.streaming import StreamingContext
 3 | 
 4 | conf = SparkConf().setAppName("Popular hashtags")
 5 | sc = SparkContext(conf=conf)
 6 | ssc = StreamingContext(sc, 10)
 7 | ssc.checkpoint('streaming_checkpoints')
 8 | 
 9 | tweet_stream = ssc.socketTextStream("nebula-m", 8001)
10 | 
11 | # YOUR CODE HERE
12 | 
13 | ssc.start()
14 | ssc.awaitTermination()
15 | 


--------------------------------------------------------------------------------
/MPs/MP8/problem2.py:
--------------------------------------------------------------------------------
 1 | from pyspark import SparkContext, SparkConf
 2 | from pyspark.streaming import StreamingContext
 3 | 
 4 | conf = SparkConf().setAppName("Subreddit common words")
 5 | sc = SparkContext(conf=conf)
 6 | ssc = StreamingContext(sc, 10)
 7 | ssc.checkpoint('streaming_checkpoints')
 8 | 
 9 | stopwords_str = 'i,me,my,myself,we,our,ours,ourselves,you,your,yours,yourself,yourselves,he,him,his,himself,she,her,hers,herself,it,its,itself,they,them,their,theirs,themselves,what,which,who,whom,this,that,these,those,am,is,are,was,were,be,been,being,have,has,had,having,do,does,did,doing,a,an,the,and,but,if,or,because,as,until,while,of,at,by,for,with,about,against,between,into,through,during,before,after,above,below,to,from,up,down,in,out,on,off,over,under,again,further,then,once,here,there,when,where,why,how,all,any,both,each,few,more,most,other,some,such,no,nor,not,only,own,same,so,than,too,very,s,t,can,will,just,don,should,now,d,ll,m,o,re,ve,y,ain,aren,couldn,didn,doesn,hadn,hasn,haven,isn,ma,mightn,mustn,needn,shan,shouldn,wasn,weren,won,wouldn'
10 | stopwords = set(stopwords_str.split(','))
11 | 
12 | reddit_comment_stream = ssc.socketTextStream("nebula-m", 8000)
13 | 
14 | # YOUR CODE HERE
15 | 
16 | ssc.start()
17 | ssc.awaitTermination()
18 | 


--------------------------------------------------------------------------------
/MPs/MP8/problem3.py:
--------------------------------------------------------------------------------
 1 | from pyspark import SparkContext, SparkConf
 2 | from pyspark.streaming import StreamingContext
 3 | 
 4 | conf = SparkConf().setAppName("Reddit bot detector")
 5 | sc = SparkContext(conf=conf)
 6 | ssc = StreamingContext(sc, 10)
 7 | ssc.checkpoint('streaming_checkpoints')
 8 | 
 9 | reddit_comment_stream = ssc.socketTextStream("nebula-m", 8000)
10 | 
11 | # YOUR CODE HERE
12 | 
13 | ssc.start()
14 | ssc.awaitTermination()
15 | 


--------------------------------------------------------------------------------
/MPs/MP9/README.md:
--------------------------------------------------------------------------------
 1 | # MP 9: Introduction to NoSQL
 2 | 
 3 | ## Introduction
 4 | 
 5 | This MP will introduce the Open Source NoSQL database called [MongoDB](https://github.com/mongodb/mongo).
 6 | 
 7 | MongoDB is a Document-Oriented NoSQL database. This allows applications to persist large amounts of unstructured data. MongoDB's "atomic unit" is a document -- analogous to record in a SQL database. MongoDB's equivalent of a "table" is a "collection".
 8 | 
 9 | ### Resources
10 | 
11 | For this MP, you'll find the following Resources useful:
12 | 
13 | * [MongoDB Python Driver Documentation](https://api.mongodb.com/python/current/)
14 | * This MongoDB article explaining common operations: https://docs.mongodb.com/manual/crud/
15 | * Getting started with MongoDB using Python: https://docs.mongodb.com/getting-started/python/
16 | 
17 | ## MP Activities
18 | 
19 | In this MP we'll revisit the Yelp dataset we used earlier in the course. This MP will have you interact with MongoDB by loading a portion of the Yelp dataset into MongoDB. Then, you'll run some basic queries.
20 | 
21 | ### Problem 1 - Create and Populate the Collection
22 | 
23 | **Step 1:** Create a new collection called `businesses`.
24 | 
25 | **Step 2:** Add the Yelp Business data to the MongoDB collection you created. Persist all attributes in the original dataset. (*Hint:* Since MongoDB is schema-less, try to insert data into the collection without doing anything dependent on the data's structure)
26 | 
27 | **Step 3:** Return the number of inserted businesses.
28 | 
29 | ### Problem 2 - Retrieving Data
30 | 
31 | Complete the `find_urbana_businesses` function so that it queries the `businesses` collection and returns all the business that have their `city` value as "Urbana" and `state` value as "IL".
32 | 
33 | Return these businesses as `(name, address)` tuples.
34 | 
35 | ### Problem 3 - Updating Data
36 | 
37 | Now suppose that we want to update our database. We've decided that Yelp users underrate businesses on Green Street in Champaign and Urbana. To fix this, we decide to give an extra star to every business in Champaign and Urbana.
38 | 
39 | Complete the `add_stars_to_cu_businesses` so that every business in the collection that has a `city` value of "Urbana" or "Champaign" AND is has a `state` value of "IL" AND the word 'Green' is in it's `address` field (case insensitive) a new `stars` value that is one greater than its original value. Note that we'll stick with the conventions of our data and enforce that all businesses have `star` values `<=5`.
40 | 
41 | Return the number of updated businesses.
42 | 
43 | ### Problem 4 - Deleting Data
44 | 
45 | Now suppose we have decided that we only care about business in Illinois. We decide that we're going to delete all businesses from our collection that do not have `state` being 'IL'.
46 | 
47 | Complete the function so that the only remaining businesses in the database are businesses in IL. Return the number of deleted businesses.
48 | 
49 | ## Deliverables
50 | **MP 9 is due on Wednesday, December 13nd, 2017 at 11:55PM.**
51 | 
52 | Please zip your source files for the following exercises and upload it to Moodle (learn.illinois.edu).
53 | 


--------------------------------------------------------------------------------
/MPs/MP9/assignment.py:
--------------------------------------------------------------------------------
  1 | import pymongo as pm
  2 | import getpass
  3 | 
  4 | 
  5 | # Part 1
  6 | def setup_business_collection(db, data):
  7 |     '''
  8 |     Creates a new collection using the name "businesses"
  9 |     and adds new documents `data` to our MongoDB collection.
 10 | 
 11 |     Parameters
 12 |     ----------
 13 |     db: A pymongo.database.Database instance.
 14 |     data: A list of unparsed JSON strings.
 15 | 
 16 |     Returns
 17 |     -------
 18 |     int: The number of inserted businesses
 19 |     '''
 20 |     pass
 21 | 
 22 | 
 23 | # Part 2
 24 | def find_urbana_businesses(db):
 25 |     '''
 26 |     Queries the MongoDB collection for businesses in Urbana, IL
 27 | 
 28 |     Parameters
 29 |     ----------
 30 |     db: A pymongo.database.Database instance.
 31 | 
 32 |     Returns
 33 |     -------
 34 |     A list of (name, address) tuples
 35 |     '''
 36 |     pass
 37 | 
 38 | 
 39 | # Part 3
 40 | def add_stars_to_green_street_businesses(db):
 41 |     '''
 42 |     Adds one star to any businesses on Green Street in Champaign/Urbana IL
 43 |     Returns the number of updated businesses
 44 | 
 45 |     Parameters
 46 |     ----------
 47 |     db: A pymongo.database.Database instance.
 48 | 
 49 |     Returns
 50 |     -------
 51 |     int: The number of updated businesses
 52 |     '''
 53 |     pass
 54 | 
 55 | 
 56 | # Part 4
 57 | def delete_non_il_businesses(db):
 58 |     '''
 59 |     Deletes any businesses in the MongoDB colleciton that are not in Illinois
 60 | 
 61 |     Parameters
 62 |     ----------
 63 |     db: A pymongo.database.Database instance.
 64 | 
 65 |     Returns
 66 |     -------
 67 |     int: The number of deleted businesses
 68 |     '''
 69 |     pass
 70 | 
 71 | 
 72 | if __name__ == '__main__':
 73 |     username = getpass.getuser()
 74 |     client = pm.MongoClient("mongodb://localhost:27017")
 75 | 
 76 |     # We will delete our database if it exists before recreating
 77 |     if username in client.database_names():
 78 |         client.drop_database(username)
 79 | 
 80 |     db = client[username]
 81 | 
 82 |     # Load the data in from disk to MongoDB
 83 |     with open('/data/business.json') as f:
 84 |         num_businesses = setup_business_collection(db, f.readlines())
 85 | 
 86 |     print("Inserted {} businesses".format(num_businesses))
 87 | 
 88 |     # Query the table for urbana businesses
 89 |     print("Some Urbana Businesses:")
 90 |     print('\n'.join(
 91 |         ['{}\t{}'.format(*result) for result in find_urbana_businesses(db)[:10]]
 92 |     ))
 93 | 
 94 |     green_street = add_stars_to_green_street_businesses(db)
 95 |     print("Green Street Businesses Updated: {}".format(green_street))
 96 | 
 97 |     deleted_businesses = delete_non_il_businesses(db)
 98 |     print("Deleted non-Illinois businesses: {}".format(deleted_businesses))
 99 | 
100 |     print("There are {} CU businesses remaining in the DB".format(db['businesses'].count()))
101 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Welcome to CS 199
 2 |   > Welcome to CS 199 - Applied Cloud Computing at the University of Illinois Urbana-Champaign!
 3 |  
 4 | ## Informal Course Description
 5 | Cloud computing is increasingly becoming an area of interest in the tech industry. In this course, we will introduce service-level technologies including Hadoop, Spark, distributed databases and more. We will also introduce students to the concepts IaaS (Infrastructure as a Service), Cloud Functions, and containerization. This course will be hands-on and focussed on applications of cloud computing; students will be using a cluster to explore cloud based technologies.</p>
 6 | 
 7 | After taking this course, students will be experienced with cloud computing platforms, and will have been introduced to the processes necessary to build scalable cloud applications.
 8 | 
 9 | ## Formal Course Description
10 | This course is an introduction into cloud computing. We define cloud computing as the usage of services or infrastructure from the internet (what we refer to as the “cloud”) instead of self-hosted solutions.
11 | This course will address two levels of cloud computing: IaaS and PaaS. IaaS is short for infrastructure- as-a-service, meaning that we will talk about different cloud computing systems and platforms (like AWS or GCP). We will go in depth about what these different systems are and what differentiates them. We will also give students a chance to use these in practice to have a firmer, hands-on experience on what they do. We will also have a brief overview on virtualization.
12 | The second type of cloud computing we will cover is PaaS, which is short for platform-as-a-service. These are managed-infrastructure solutions that are hosted remotely, which can be used through an interface like SSH.
13 |  Throughout the course, we will also cover various common cloud computing applications. Some of the technologies that we cover are Hadoop, Spark, Messaging Queues, and SQL/NoSQL stores. While we won’t go through this software in detail, students will use all of these technologies in real-world settings through assignments.
14 | We hope that students will have an opportunity to learn and explore something that they could not do on a normal basis. We welcome ideas and suggestions about what to cover.
15 | 
16 | ## Learning Objectives
17 | By the end of the course, we wish students to know:
18 | 
19 | - A survey of popular cloud platforms.
20 | - A variety of cloud software applications
21 | - How to query SQL and NoSQL datastores
22 | - How to use Spark and Hadoop to run batch data jobs
23 | - How messaging queues and discovery services tie the cloud together
24 | 
25 | ## Grading
26 | Breakdown
27 | 
28 | - Attendance: 10%
29 | - MPs: 60%
30 | - Final Project: 30%
31 | 
32 | Cutoffs
33 | - 90% A-
34 | - 80% B-
35 | - 70% C-
36 | - 60% D-
37 | 
38 | ## MP Policy
39 | MPs make up the majority of the student work for CS199. As such, it is important that students stay on top of the MP. We suggest that students start MPs early, and go to office hours if they feel like they need help.
40 | 
41 | - All MPs will be released after lectures on Thursdays.
42 | - All MPs will be due on the following Wednesday at 11:59pm.
43 | - No late MPs will be accepted, unless a valid excuse is presented to the course staff.
44 | 
45 | **Note:** Many of the MPs will be run on the CS199 Nebula cluster. We do not guarantee 100% uptime of this cluster. Additionally, we notice that many students try to complete assignments right before the deadline, and this can lead to severely decreased cluster performance.
46 | Course staff also reserves the right to monitor and manage student submitted jobs on the cluster. If your code takes too much time to execute, or too much computing resources, course staff may terminate these jobs
47 | 
48 | ## Office Hours
49 | Course staff will host office hours throughtout the week to help students with course content. If a TA cannot attend a scheduled office hour section, they will usually announce the cancellation or rescheduling on the course Piazza. The office hours schedule is available <a href="/schedule">here</a>. Office hours will be hosted in the <a href="https://goo.gl/maps/CKopb13UiNr">NCSA lobby</a>.</p>
50 | 
51 | ## Academic Integrity
52 | The standard academic integrity policy applies here. If you believe that the solution to the assignment that you have written does not reflect how well you know the concepts being assessed, that is considered cheating. <a href="https://wiki.illinois.edu//wiki/display/undergradProg/Honor+Code">The academic policy</a> can be found here.</p>
53 | In addition to the standard academic integrity policy, we have an additional policy that if you leave the MP in “plain view” (public on github, pastebin, your blog), we reserve the right to reduce your final letter grade in the course once per instance, even after you are done taking the course. This means that if you received a A- in the course and you posted 2 assignments online, your final grade will be a C-.
54 | Just remember, we consider cheating a failure on our part. If you are ever in a situation where you are forced to cheat, that means that we have not done our jobs as instructors to give you adequate resources to work through the issues that you have. If it anytime you feel this, please shoot us an email and hopefully we can find an acceptable solution.
55 | 
56 | ## Absences
57 | If you have an emergency or unplanned absence you can talk to the course staff and we will deal with the absence.
58 | If you are sick, or have another similar urgent situation you must get approval from the emergency dean to get an excused absence. Otherwise, absences will not be accepted.
59 | 
60 | ## Attribution
61 | Lecture slides adapted from course materials created by previous CS199ACC instructor, Quinn Jarrell.
62 | 


--------------------------------------------------------------------------------