├── LICENSE ├── Labs ├── Lab1 │ ├── README.md │ ├── bigram_count.py │ ├── common_friends.py │ ├── friend_graph.txt │ ├── map_reducer.py │ ├── sherlock.txt │ ├── utils.py │ └── word_count.py ├── Lab2 │ ├── README.md │ ├── book.txt │ ├── mapper.py │ └── reducer.py ├── Lab3 │ ├── README.md │ ├── prob1_mapper.py │ ├── prob1_reducer.py │ ├── prob2_mapper.py │ ├── prob2_reducer.py │ ├── prob3_mapper.py │ └── prob3_reducer.py ├── Lab4 │ ├── README.md │ ├── descriptors_of_bad_business.py │ ├── most_expensive_city.py │ ├── pessimistic_users.py │ └── up_all_night.py ├── Lab5 │ ├── README.md │ ├── amazon_helpfulness_regression.py │ ├── amazon_review_classification.py │ ├── bayes_binary_tfidf.py │ ├── bayes_tfidf.py │ └── yelp_clustering.py └── Lab6 │ ├── README.md │ ├── aggregation_aggravation.py │ ├── jaunting_with_joins.py │ └── quizzical_queries.py ├── Lectures ├── Lecture 10.pdf ├── Lecture 11 - Clouds.pdf ├── Lecture 12 - Streaming.pdf ├── Lecture 13- Networking.pdf ├── Lecture 6- More Spark.pdf ├── Lecture 7- MLib.pdf ├── Lecture 8- SQL.pdf ├── Lecture 9- NoSQL.pdf ├── week1 │ └── Lecture 1.pdf ├── week2 │ ├── Lecture 2 - Git, Latex, and Other Intros.pdf │ └── Lecture 2.pdf ├── week3 │ └── Lecture 3.pdf ├── week4 │ ├── .DS_Store │ └── Lecture 4- Data.pdf └── week5 │ ├── .DS_Store │ └── Lecture 5- Spark.pdf └── README.md /LICENSE: -------------------------------------------------------------------------------- 1 | University of Illinois/NCSA Open Source License 2 | 3 | Copyright (c) 2017 LCDM@UIUC 4 | All rights reserved. 5 | 6 | Developed by: LCDM@UIUC - Professor Robert J. Brunner and CS199: ACC Course Staffs 7 | http://lcdm.illinois.edu 8 | 9 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal with the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 10 | 11 | * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimers. 12 | 13 | * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimers in the documentation and/or other materials provided with the distribution. 14 | 15 | * Neither the names of the course development team, LCDM@UIUC, nor the names of its contributors may be used to endorse or promote products derived from this Software without specific prior written permission. 16 | 17 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE CONTRIBUTORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS WITH THE SOFTWARE. 18 | -------------------------------------------------------------------------------- /Labs/Lab1/README.md: -------------------------------------------------------------------------------- 1 | # Lab 1: Introduction to MapReduce 2 | 3 | ## Introduction 4 | 5 | This lab will introduce the map/reduce computing paradigm. 
In essence, map/reduce breaks tasks down into a map phase (where an algorithm is mapped onto data) and a reduce phase, where the outputs of the map phase are aggregated into a concise output. The map phase is designed to be parallel, so as to allow wide distribution of computation. 6 | 7 | The map phase identifies keys and associates with them a value. The reduce phase collects keys and aggregates their values. The standard example used to demonstrate this programming approach is a word count problem, where words (or tokens) are the keys and the number of occurrences of each word (or token) is the value. 8 | 9 | As this technique was popularized by large web search companies like Google and Yahoo who were processing large quantities of unstructured text data, this approach quickly became popular for a wide range of problems. The standard MapReduce approach uses Hadoop, which was built using Java. However, to introduce you to this topic without adding the extra overhead of learning Hadoop's idiosyncrasies, we will be 'simulating' a map/reduce workload in pure Python. 10 | 11 | ## Example: Word Count 12 | 13 | This example displays the type of programs we can build from simple map/reduce functions. Suppose our task is to come up with a count of the occurrences of each word in a large set of text. We could simply iterate through the text and count the words as we saw them, but this would be slow and non-parallelizable. 14 | 15 | Instead, we break the text up into chunks, and then split those chunks into words. This is the ‘map’ phase (i.e. the input text is mapped to a list of words). Then, we can ‘reduce’ this data into a coherent word count that holds for the entire text set. We do this by accumulating the count of each word in each chunk using our reduce function. 16 | 17 | Take a look at `map_reducer.py` and `word_count.py` to see the example we’ve constructed for you. Notice that the `map` stage is being run on a multiprocess pool. This is functionally analogous to a cloud computing application, the difference being in the cloud, this work would be distributed amongst multiple nodes, whereas in our toy MapReduce, all the processes run on a single machine. 18 | 19 | Run `python word_count.py` to see our simple map/reduce example. You can adjust `NUM_WORKERS` in `map_reducer.py` to see how we make (fairly small) performance gains from parallelizing the work. (Hint: running `time python word_count.py` will give you a better idea of the runtime) 20 | 21 | ## Exercise: Bigram Count 22 | 23 | Suppose now that instead of trying to count the individual words, we want to get counts of the occurences word [bigrams](https://en.wikipedia.org/wiki/Bigram) - that is, pairs of words that are adjacent to each other in the text. It is not just all the pairs of the words in the text 24 | 25 | For example, if our line of text was `“cat dog sheep horse”`, we’d have the bigrams `(“cat”, “dog”)`, `(“dog, “sheep”)` and `(“sheep”, “horse”)`. 26 | 27 | Construct a map function and reduce function that will accomplish this goal. 28 | 29 | Note: For the purposes of this exercise, we’ll only consider bigrams that occur on the same line. So, you don’t need to worry about pairs that occur between line breaks. 30 | 31 | ## Exercise: Common Friends 32 | 33 | Suppose we’re running a social network and we want a fast way to calculate a list of common friends for pairs of users in our site. This can be done fairly easily with a map/reduce procedure. 
34 | 35 | You’ll be given input of a friend ‘graph’ that looks like this: 36 | 37 | ``` 38 | A|B 39 | B|A,C,D 40 | C|B,D 41 | D|C,B,E 42 | E|D 43 | ``` 44 | The graph can be visualized as 45 | ``` 46 | A-B - D-E 47 | \ / 48 | C 49 | ``` 50 | Read this as: A is friends with B, B is friends with A, C and D, and so on. Our desired output is as follows: 51 | 52 | ``` 53 | (B,C): [D] 54 | (B,D): [C] 55 | (C,D): [B] 56 | ``` 57 | Read this as: B and C have D in common as a friend, B and D have C in common as a friend, and C and D have B in common as a friend. None of the other relationships have common friends. 58 | 59 | Your mapper stage should take each line of the friend graph and produce a list of relationships: 60 | 61 | `A|B` -> `(A,B): A, B` 62 | 63 | `B|A, C, D` -> `(B,A): A, C, D`, `(B,C): A, C, D`, `(B,D): A, C, D` 64 | 65 | `C|B, D` -> `(C,B): B, D`, `(C, D): B, D` 66 | 67 | *et cetera* 68 | 69 | The reducer phase should take all of these relationships and output common friends for each pair. (Hint: Lookup set intersection) 70 | 71 | ##Submission 72 | Lab 1 is due on Thursday, Febuary 2nd, 2017 at 11:55PM. 73 | 74 | Please zip the files and upload it to Moodle (learn.illinois.edu). 75 | -------------------------------------------------------------------------------- /Labs/Lab1/bigram_count.py: -------------------------------------------------------------------------------- 1 | from map_reducer import MapReduce 2 | from operator import itemgetter 3 | 4 | 5 | def bigram_mapper(line): 6 | pass 7 | 8 | 9 | def bigram_reducer(bigram_tuples): 10 | pass 11 | 12 | if __name__ == '__main__': 13 | with open('sherlock.txt') as f: 14 | lines = f.readlines() 15 | mr = MapReduce(bigram_mapper, bigram_reducer) 16 | bigram_counts = mr(lines) 17 | sorted_bgc = sorted(bigram_counts, key=itemgetter(1), reverse=True) 18 | for word, count in sorted_bgc[:100]: 19 | print '{}\t{}'.format(word, count) 20 | -------------------------------------------------------------------------------- /Labs/Lab1/common_friends.py: -------------------------------------------------------------------------------- 1 | from map_reducer import MapReduce 2 | 3 | 4 | def friend_mapper(line): 5 | pass 6 | 7 | 8 | def friend_reducer(friend_tuples): 9 | pass 10 | 11 | if __name__ == '__main__': 12 | with open('friend_graph.txt') as f: 13 | lines = f.readlines() 14 | mr = MapReduce(friend_mapper, friend_reducer) 15 | common_friends = mr(lines) 16 | for relationship, friends in common_friends: 17 | print '{}\t{}'.format(relationship, friends) 18 | -------------------------------------------------------------------------------- /Labs/Lab1/friend_graph.txt: -------------------------------------------------------------------------------- 1 | A|B,D,E,F 2 | B|A,C,E 3 | C|B,F 4 | D|A,E 5 | E|A,B,D 6 | F|A,C 7 | -------------------------------------------------------------------------------- /Labs/Lab1/map_reducer.py: -------------------------------------------------------------------------------- 1 | import multiprocessing 2 | import itertools 3 | from operator import itemgetter 4 | 5 | NUM_WORKERS = 10 6 | 7 | 8 | class MapReduce(object): 9 | def __init__(self, map_func, reduce_func): 10 | # Function for the map phase 11 | self.map_func = map_func 12 | 13 | # Function for the reduce phase 14 | self.reduce_func = reduce_func 15 | 16 | # Pool of processes to parallelize computation 17 | self.proccess_pool = multiprocessing.Pool(NUM_WORKERS) 18 | 19 | def kv_sort(self, mapped_values): 20 | return sorted(list(mapped_values), key=itemgetter(0)) 
21 | 22 | def __call__(self, data_in): 23 | # Run the map phase in our process pool 24 | map_phase = self.proccess_pool.map(self.map_func, data_in) 25 | 26 | # Sort the resulting mapped data 27 | sorted_map = self.kv_sort(itertools.chain(*map_phase)) 28 | 29 | # Run our reduce function 30 | reduce_phase = self.reduce_func(sorted_map) 31 | 32 | # Return the results 33 | return reduce_phase 34 | -------------------------------------------------------------------------------- /Labs/Lab1/utils.py: -------------------------------------------------------------------------------- 1 | import re 2 | import string 3 | 4 | 5 | def strip_punctuation(str_in): 6 | # Strip punctuation from word (don't worry too much about this) 7 | return re.sub('[%s]' % re.escape(string.punctuation), '', str_in) 8 | -------------------------------------------------------------------------------- /Labs/Lab1/word_count.py: -------------------------------------------------------------------------------- 1 | from map_reducer import MapReduce 2 | from operator import itemgetter 3 | from utils import strip_punctuation 4 | 5 | 6 | def string_to_words(str_in): 7 | words = [] 8 | # Split string into words 9 | for word in str_in.strip().split(): 10 | # Strip punctuation 11 | word = strip_punctuation(word) 12 | 13 | # Note each individual instance of a word 14 | words.append((word, 1)) 15 | return words 16 | 17 | 18 | def word_count_reducer(word_tuples): 19 | # Dict to count the instances of each word 20 | words = {} 21 | 22 | for entry in word_tuples: 23 | word, count = entry 24 | 25 | # Add 1 to our word counts for each word we see 26 | if word in words: 27 | words[word] += 1 28 | else: 29 | words[word] = 1 30 | 31 | return words.items() 32 | 33 | if __name__ == '__main__': 34 | with open('sherlock.txt') as f: 35 | lines = f.readlines() 36 | 37 | # Construct our MapReducer 38 | mr = MapReduce(string_to_words, word_count_reducer) 39 | # Call MapReduce on our input set 40 | word_counts = mr(lines) 41 | sorted_wc = sorted(word_counts, key=itemgetter(1), reverse=True) 42 | for word, count in sorted_wc[:100]: 43 | print '{}\t{}'.format(word, count) 44 | -------------------------------------------------------------------------------- /Labs/Lab2/README.md: -------------------------------------------------------------------------------- 1 | # Lab 2: Introduction to Map/Reduce on Hadoop 2 | 3 | ## Introduction 4 | 5 | In this lab, we introduce the map/reduce programming 6 | paradigm. Simply put, this approach to computing breaks tasks down into 7 | a map phase (where an algorithm is mapped onto data) and a reduce phase, 8 | where the outputs of the map phase are aggregated into a concise output. 9 | The map phase is designed to be parallel, and to move the computation to 10 | the data, which, when using HDFS, can be widely distributed. In this 11 | case, a map phase can be executed against a large quantity of data very 12 | quickly. The map phase identifies keys and associates with them a value. 13 | The reduce phase collects keys and aggregates their values. The standard 14 | example used to demonstrate this programming approach is a word count 15 | problem, where words (or tokens) are the keys) and the number of 16 | occurrences of each word (or token) is the value. 17 | 18 | As this technique was popularized by large web search companies like 19 | Google and Yahoo who were processing large quantities of unstructured 20 | text data, this approach quickly became popular for a wide range of 21 | problems. 
Of course, not every problem can be transformed into a 22 | map-reduce approach, which is why we will explore Spark in several 23 | weeks. The standard MapReduce approach uses Hadoop, which was built 24 | using Java. Rather than switching to a new language, however, we will 25 | use Hadoop Streaming to execute Python code. In the rest of this 26 | lab, we introduce a simple Python WordCount example code. We first 27 | demonstrate this code running at the Unix command line, before switching to running the code by using Hadoop Streaming. 28 | 29 | ### Mapper: Word Count 30 | 31 | The first Python code we will write is the map Python program. This 32 | program simply reads data from `STDIN`, tokenizes each line into words and 33 | outputs each word on a separate line along with a count of one. Thus our 34 | map program generates a list of word tokens as the keys and the value is 35 | always one. 36 | 37 | ```python 38 | #!/usr/bin/python 39 | 40 | # These examples are based off the blog post by Michale Noll: 41 | # 42 | # http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ 43 | # 44 | 45 | import sys 46 | 47 | # We explicitly define the word/count separator token. 48 | sep = '\t' 49 | 50 | # We open STDIN and STDOUT 51 | with sys.stdin as fin: 52 | with sys.stdout as fout: 53 | 54 | # For every line in STDIN 55 | for line in fin: 56 | 57 | # Strip off leading and trailing whitespace 58 | line = line.strip() 59 | 60 | # We split the line into word tokens. Use whitespace to split. 61 | # Note we don't deal with punctuation. 62 | 63 | words = line.split() 64 | 65 | # Now loop through all words in the line and output 66 | 67 | for word in words: 68 | fout.write("{0}{1}1\n".format(word, sep)) 69 | ``` 70 | 71 | ### Reducer: Word Count 72 | 73 | The second Python program we write is our reduce program. In this code, 74 | we read key-value pairs from `STDIN` and use the fact that the Hadoop 75 | process first sorts all key-value pairs before sending the map output to 76 | the reduce process to accumulate the cumulative count of each word. The 77 | following code could easily be made more sophisticated by using `yield` 78 | statements and iterators, but for clarity we use the simple approach of 79 | tracking when the current word becomes different than the previous word 80 | to output the key-cumulative count pairs. 81 | 82 | ```python 83 | #!/usr/bin/python 84 | 85 | import sys 86 | 87 | # We explicitly define the word/count separator token. 88 | sep = '\t' 89 | 90 | # We open STDIN and STDOUT 91 | with sys.stdin as fin: 92 | with sys.stdout as fout: 93 | 94 | # Keep track of current word and count 95 | cword = None 96 | ccount = 0 97 | word = None 98 | 99 | # For every line in STDIN 100 | for line in fin: 101 | 102 | # Strip off leading and trailing whitespace 103 | # Note by construction, we should have no leading white space 104 | line = line.strip() 105 | 106 | # We split the line into a word and count, based on predefined 107 | # separator token. 108 | # 109 | # Note we haven't dealt with punctuation. 
110 | 111 | word, scount = line.split('\t', 1) 112 | 113 | # We will assume count is always an integer value 114 | 115 | count = int(scount) 116 | 117 | # word is either repeated or new 118 | 119 | if cword == word: 120 | ccount += count 121 | else: 122 | # We have to handle first word explicitly 123 | if cword != None: 124 | fout.write("{0:s}{1:s}{2:d}\n".format(cword, sep, ccount)) 125 | 126 | # New word, so reset variables 127 | cword = word 128 | ccount = count 129 | else: 130 | # Output final word count 131 | if cword == word: 132 | fout.write("{0:s}{1:s}{2:d}\n".format(word, sep, ccount)) 133 | ``` 134 | 135 | ### Testing Python Map-Reduce 136 | 137 | Before we begin using Hadoop, we should first test our Python codes out 138 | to ensure they work as expected. First, we should change the permissions 139 | of the two programs to be executable, which we can do with the Unix 140 | `chmod` command. 141 | 142 | ```sh 143 | chmod u+x /path/to/lab2/mapper.py 144 | chmod u+x /path/to/lab2/reducer.py 145 | ``` 146 | 147 | #### Testing Mapper.py 148 | 149 | To test out the map Python code, we can run the Python `mapper.py` code 150 | and specify that the code should redirect STDIN to read the book text 151 | data. This is done in the following code cell, we pipe the output into 152 | the Unix `head` command in order to restrict the output, which would be 153 | one line per word found in the book text file. In the second code cell, 154 | we next pipe the output of `mapper.py` into the Unix `sort` command, 155 | which is done automatically by Hadoop. To see the result of this 156 | operation, we next pipe the result into the Unix `uniq` command to count 157 | duplicates, pipe this result into a new sort routine to sort the output 158 | by the number of occurrences of a word, and finally display the last few 159 | lines with the Unix `tail` command to verify the program is operating 160 | correctly. 161 | 162 | With these sequence of Unix commands, we have (in a single-node) 163 | replicated the steps performed by Hadoop MapReduce: Map, Sort, and 164 | Reduce. 165 | 166 | 167 | 168 | 169 | ```sh 170 | cd /path/to/lab2 171 | 172 | ./mapper.py < book.txt | wc -l 173 | ``` 174 | 175 | ```sh 176 | cd /path/to/lab2 177 | 178 | ./mapper.py < book.txt | sort -n -k 1 | \ 179 | uniq -c | sort -n -k 1 | tail -10 180 | ``` 181 | 182 | #### Testing Reducer.py 183 | 184 | To test out the reduce Python code, we run the previous code cell, but 185 | rather than piping the result into the Unix `tail` command, we pipe the 186 | result of the sort command into the Python `reducer.py` code. This 187 | simulates the Hadoop model, where the map output is key sorted before 188 | being passed into the reduce process. First, we will simply count the 189 | number of lines displayed by the reduce process, which will indicate the 190 | number of unique _word tokens_ in the book. Next, we will sort the 191 | output by the number of times each word token appears and display the 192 | last few lines to compare with the previous results. 193 | 194 | 195 | ```sh 196 | cd /path/to/lab2 197 | 198 | ./mapper.py < book.txt | sort -n -k 1 | \ 199 | ./reducer.py | wc -l 200 | ``` 201 | 202 | ```sh 203 | cd /path/to/lab2 204 | 205 | ./mapper.py < book.txt | sort -n -k 1 | \ 206 | ./reducer.py | sort -n -k 2 | tail -10 207 | ``` 208 | 209 | ## Python Hadoop Streaming 210 | 211 | **IMPORTANT:** Before doing the following activities, run the following command to setup the Hadoop environment correctly. 
If you don't, it's likely that these instructions **will not work**. 212 | 213 | ``` 214 | source ~/hadoop.env 215 | ``` 216 | 217 | ### Introduction 218 | 219 | We are now ready to actually run our Python codes via Hadoop Streaming. 220 | The main command to perform this task is `hadoop`. 221 | 222 | Running this Hadoop command by supplying the `-help` flag will provide 223 | a useful summary of the different options. Note that `jar` is short for 224 | Java Archive, which is a compressed archive of compiled Java code that 225 | can be executed to perform different operations. In this case, we will 226 | run the Java Hadoop streaming jar file to enable our Python code to work 227 | within Hadoop. 228 | 229 | 230 | ```sh 231 | # Run the Map Reduce task within Hadoop 232 | hadoop --help 233 | ``` 234 | 235 | Usage: hadoop [--config confdir] [COMMAND | CLASSNAME] 236 | CLASSNAME run the class named CLASSNAME 237 | or 238 | where COMMAND is one of: 239 | fs run a generic filesystem user client 240 | version print the version 241 | jar run a jar file 242 | note: please use "yarn jar" to launch 243 | YARN applications, not this command. 244 | checknative [-a|-h] check native hadoop and compression libraries availability 245 | distcp copy file or directories recursively 246 | archive -archiveName NAME -p * create a hadoop archive 247 | classpath prints the class path needed to get the 248 | credential interact with credential providers 249 | Hadoop jar and the required libraries 250 | daemonlog get/set the log level for each daemon 251 | trace view and modify Hadoop tracing settings 252 | 253 | Most commands print help when invoked w/o parameters. 254 | 255 | 256 | For our map/reduce Python example to 257 | run successfully, we will need to specify five flags: 258 | 259 | 1. `-files`: a comma separated list of files to be copied to the Hadoop cluster. 260 | 2. `-input`: the HDFS input file(s) to be used for the map task. 261 | 3. `-output`: the HDFS output directory, used for the reduce task. 262 | 4. `-mapper`: the command to run for the map task. 263 | 5. `-reducer`: the command to run for the reduce task. 264 | 265 | Given our previous setup, we will eventually run the full command as follows: 266 | 267 | ``` 268 | # DON'T RUN ME YET! 269 | hadoop $STREAMING -files mapper.py,reducer.py -input wc/in \ 270 | -output wc/out -mapper mapper.py -reducer reducer.py 271 | ``` 272 | When this command is run, a series of messages will be displayed to the 273 | screen (via STDERR) showing the progress of our Hadoop Streaming task. 274 | At the end of the stream of information messages will be a statement 275 | indicating the location of the output directory as shown below. Note, we 276 | can append Bash redirection to ignore the Hadoop messages, simply by 277 | appending `2> /dev/null` to the end of any Hadoop command, which sends 278 | all STDERR messages to a non-existent Unix device, which is akin to 279 | nothing. 280 | 281 | For example, to ignore any messages from the `hdfs dfs -rm -r -f wc/out` 282 | command, we would use the following syntax: 283 | 284 | ```bash 285 | hdfs dfs -rm -r -f wc/out 2> /dev/null 286 | ``` 287 | 288 | Doing this, however, does hide all messages, which can make debugging 289 | problems more difficult. As a result, you should only do this when your 290 | commands work correctly. 291 | 292 | ### Putting files in HDFS 293 | In order for Hadoop to be able to access our raw data (the book text) we first have to copy it into the file system that Hadoop uses natively, HDFS. 
294 | 295 | To do this, we'll run a series of HDFS commands that will copy our local `book.txt` into the distributed file system. 296 | 297 | ``` 298 | # Make a directory for our book data 299 | hdfs dfs -mkdir -p wc/in 300 | 301 | # Copy our book to our new folder 302 | hdfs dfs -copyFromLocal book.txt wc/in/book.txt 303 | 304 | # Check to see that our book has made it to the folder 305 | hdfs dfs -ls wc/in 306 | hdfs dfs -tail wc/in/book.txt 307 | ``` 308 | 309 | ### Running the Hadoop Job 310 | Now that our data is in Hadoop HDFS, we can actually execute the streaming job that will run our word count map/reduce. 311 | 312 | ```sh 313 | # Delete output directory (if it exists) 314 | hdfs dfs -rm -r -f wc/out 315 | 316 | # Run the Map Reduce task within Hadoop 317 | hadoop jar $STREAMING \ 318 | -files mapper.py,reducer.py -input wc/in \ 319 | -output wc/out -mapper mapper.py -reducer reducer.py 320 | ``` 321 | 322 | ### Hadoop Results 323 | 324 | In order to view the results of our Hadoop Streaming task, we must use 325 | HDFS DFS commands to examine the directory and files generated by our 326 | Python Map/Reduce programs. The following list of DFS commands might 327 | prove useful to view the results of this map/reduce job. 328 | 329 | ```bash 330 | # List the wc directory 331 | hdfs dfs -ls wc 332 | 333 | # List the output directory 334 | hdfs dfs -ls wc/out 335 | 336 | # Do a line count on our output 337 | hdfs dfs -count -h wc/out/part-00000 338 | 339 | # Tail the output 340 | hdfs dfs -tail wc/out/part-00000 341 | ``` 342 | 343 | Note that these 344 | Hadoop HDFS commands can be intermixed with Unix commands to perform 345 | additional text processing. The important point is that direct file I/O 346 | operations must use HDFS commands to work with the HDFS file system. 347 | 348 | The output should match the Python 349 | only map-reduce approach. 350 | 351 | ### Hadoop Cleanup 352 | 353 | Following the successful run of our map/reduce Python programs, we have 354 | created a new directory `wc/out` in the HDFS, which contains two files. If we wish 355 | to rerun this Hadoop Streaming map/reduce task, we must either specify a 356 | different output directory, or else we must clean up the results of the 357 | previous run. To remove the output directory, we can simply use the HDFS 358 | `-rm -r -f wc/out` command, which will immediately delete the `wc/out` 359 | directory. The successful completion of this command is indicated by 360 | Hadoop, and this can also be verified by listing the contents of the 361 | `wc` directory. 362 | 363 | ```sh 364 | hdfs dfs -ls wc 365 | ``` 366 | 367 | ## Lab Assignment 368 | 369 | Lab 2 is due on Thursday, Febuary 9nd, 2017 at 11:55PM. 370 | Please zip your source files for the following exersizes and upload it to Moodle (learn.illinois.edu). 371 | 372 | 373 | In the preceding cells, we introduced Hadoop map/reduce by using a 374 | simple word count task. Now that you have run the lab, go back and 375 | make the following changes to see how the results change. 376 | 377 | 1. We ignored punctuation, modify the original mapper Python code to 378 | token on white space or punctuation. 379 | 2. Try downloading a different text from Project Gutenberg. Write a map-reduce application that can run across multiple texts. 380 | 3. 
Write a map-reduce application to compute bi-grams instead of 381 | unigrams (combinations of two adjacent words as they appear in the text) 382 | -------------------------------------------------------------------------------- /Labs/Lab2/mapper.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # These examples are based off the blog post by Michale Noll: 4 | # 5 | # http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ 6 | # 7 | 8 | import sys 9 | 10 | # We explicitly define the word/count separator token. 11 | sep = '\t' 12 | 13 | # We open STDIN and STDOUT 14 | with sys.stdin as fin: 15 | with sys.stdout as fout: 16 | 17 | # For every line in STDIN 18 | for line in fin: 19 | 20 | # Strip off leading and trailing whitespace 21 | line = line.strip() 22 | 23 | # We split the line into word tokens. Use whitespace to split. 24 | # Note we don't deal with punctuation. 25 | 26 | words = line.split() 27 | 28 | # Now loop through all words in the line and output 29 | 30 | for word in words: 31 | fout.write("{0}{1}1\n".format(word, sep)) 32 | -------------------------------------------------------------------------------- /Labs/Lab2/reducer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import sys 4 | 5 | # We explicitly define the word/count separator token. 6 | sep = '\t' 7 | 8 | # We open STDIN and STDOUT 9 | with sys.stdin as fin: 10 | with sys.stdout as fout: 11 | 12 | # Keep track of current word and count 13 | cword = None 14 | ccount = 0 15 | word = None 16 | 17 | # For every line in STDIN 18 | for line in fin: 19 | 20 | # Strip off leading and trailing whitespace 21 | # Note by construction, we should have no leading white space 22 | line = line.strip() 23 | 24 | # We split the line into a word and count, based on predefined 25 | # separator token. 26 | # 27 | # Note we haven't dealt with punctuation. 28 | 29 | word, scount = line.split('\t', 1) 30 | 31 | # We will assume count is always an integer value 32 | 33 | count = int(scount) 34 | 35 | # word is either repeated or new 36 | 37 | if cword == word: 38 | ccount += count 39 | else: 40 | # We have to handle first word explicitly 41 | if cword is not None: 42 | fout.write("{0:s}{1:s}{2:d}\n".format(cword, sep, ccount)) 43 | 44 | # New word, so reset variables 45 | cword = word 46 | ccount = count 47 | else: 48 | # Output final word count 49 | if cword == word: 50 | fout.write("{0:s}{1:s}{2:d}\n".format(word, sep, ccount)) 51 | -------------------------------------------------------------------------------- /Labs/Lab3/README.md: -------------------------------------------------------------------------------- 1 | # Lab 3: Hadoop Map/Reduce on "Real" Data 2 | 3 | ## Introduction 4 | In the past 2 labs, you were introduced to the concept of Map/Reduce and how we can execute Map/Reduce Python scripts on Hadoop with Hadoop Streaming. 5 | 6 | This week, we'll give you access to a modestly large (~60GB) Twitter dataset. You'll be using the skills you learned in the last two weeks to perform some more complex transformations on this dataset. 7 | 8 | Like in the last lab, we'll be using Python to write mappers and reducers. We've included a helpful shell script to make running mapreduce jobs easier. The script is in your `bin` folder, so you can just run it as `mapreduce`, but we've included the source in this lab so you can see what it's doing. 
9 | 10 | Here's the usage for that command: 11 | 12 | ``` 13 | Usage: ./mapreduce map-script reduce-scripe hdfs-input-path hdfs-output-path 14 | 15 | Example: ./mapreduce mapper.py reducer.py /tmp/helloworld.txt /user/quinnjarr 16 | ``` 17 | 18 | ## The Dataset 19 | 20 | The dataset is located in `/shared/snapTwitterData` in HDFS. You'll find these files: 21 | 22 | ``` 23 | /shared/snapTwitterData/tweets2009-06.tsv 24 | /shared/snapTwitterData/tweets2009-07.tsv 25 | /shared/snapTwitterData/tweets2009-08.tsv 26 | /shared/snapTwitterData/tweets2009-09.tsv 27 | /shared/snapTwitterData/tweets2009-10.tsv 28 | /shared/snapTwitterData/tweets2009-11.tsv 29 | /shared/snapTwitterData/tweets2009-12.tsv 30 | ``` 31 | 32 | Each file is a `tsv` (tab-separated value) file. The schema of the file is as follows: 33 | 34 | ``` 35 | POST_DATETIME TWITTER_USER)URL TWEET_TEXT 36 | ``` 37 | 38 | Example: 39 | 40 | ``` 41 | 2009-10-31 23:59:58 http://twitter.com/sometwitteruser Wow, CS199 is really a good course 42 | ``` 43 | 44 | ## Lab Activities 45 | **Lab 3 is due on Thursday, Febuary 16nd, 2017 at 11:55PM.** 46 | 47 | Please zip your source files for the following exercises and upload it to Moodle (learn.illinois.edu). 48 | 49 | **NOTE:** Place your Hadoop output in your HDFS home directory under the folder `~/twitter/`. (i.e. Problem 1 output should map to `~/twitter/` 50 | 51 | **EDIT:** Due to resource constraints on the cluster, please run your map/reduce jobs on only 1 twitter file (`tweets2009-06.tsv`). If you've already run your code on the whole dataset, that is fine. 52 | 53 | 1. Write a map/reduce program to determine the the number of @ replies each user received. 54 | 2. Write a map/reduce program to determine the user with the most Tweets for every given day in the dataset. (If there's a tie, break the tie by sorting alphabetically on users' handles) 55 | 3. Write a map reduce program to determine which Twitter users have the largest vocabulary - that is, users whose number of unique words in their tweets is maximized. 56 | 57 | ### Don't lose your progress! 58 | 59 | Hadoop jobs can take a very long time to complete. If you don't take precautions, you'll lose all your progress if something happens to your SSH session. 60 | 61 | To mitigate this, we have installed `tmux` on the cluster. Tmux is a tool that lets us persist shell sessions even when we lose SSH connection. 62 | 63 | 1. Run `tmux` to enter into a tmux session. 64 | 2. Run some command that will take a long time (`ping google.com`) 65 | 3. Exit out of your SSH session. 66 | 4. Log back into the server and run `tmux attach` and you should find your session undisturbed. 67 | 68 | ### Suggested Workflow 69 | 70 | We've provided you with a `sample` command that streams out a random 1% sample of the text file. This is useful for testing, as you won't want to use the entire dataset while developing your map/reduce scripts. 71 | 72 | 1. Write your map/reduce and test it with regular unix commands: 73 | 74 | ``` 75 | sample /mnt/volume/snapTwitterData/tweets2009-06.tsv | ./.py | sort | ./.py 76 | ``` 77 | 78 | 2. Test your map/reduce with a single Tweet file on Hadoop: 79 | 80 | ``` 81 | hdfs dfs -mkdir -p twitter 82 | hdfs dfs -rm -r twitter/out 83 | mapreduce .py .py /shared/snapTwitterData/tweets2009-06.tsv twitter/out 84 | ``` 85 | 86 | 3. 
Run your map/reduce on the full dataset: 87 | ``` 88 | hdfs dfs -mkdir -p twitter 89 | hdfs dfs -rm -r twitter/out 90 | mapreduce .py .py /shared/snapTwitterData/*.tsv twitter/out 91 | ``` 92 | -------------------------------------------------------------------------------- /Labs/Lab3/prob1_mapper.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import sys 4 | 5 | # We open STDIN and STDOUT 6 | with sys.stdin as fin: 7 | with sys.stdout as fout: 8 | 9 | # For every line in STDIN 10 | for line in fin: 11 | # Your code here! 12 | pass 13 | -------------------------------------------------------------------------------- /Labs/Lab3/prob1_reducer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import sys 4 | 5 | # We open STDIN and STDOUT 6 | with sys.stdin as fin: 7 | with sys.stdout as fout: 8 | 9 | # For every line in STDIN 10 | for line in fin: 11 | # Your code here! 12 | pass 13 | -------------------------------------------------------------------------------- /Labs/Lab3/prob2_mapper.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import sys 4 | 5 | # We open STDIN and STDOUT 6 | with sys.stdin as fin: 7 | with sys.stdout as fout: 8 | 9 | # For every line in STDIN 10 | for line in fin: 11 | # Your code here! 12 | pass 13 | -------------------------------------------------------------------------------- /Labs/Lab3/prob2_reducer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import sys 4 | 5 | # We open STDIN and STDOUT 6 | with sys.stdin as fin: 7 | with sys.stdout as fout: 8 | 9 | # For every line in STDIN 10 | for line in fin: 11 | # Your code here! 12 | pass 13 | . -------------------------------------------------------------------------------- /Labs/Lab3/prob3_mapper.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import sys 4 | 5 | # We open STDIN and STDOUT 6 | with sys.stdin as fin: 7 | with sys.stdout as fout: 8 | 9 | # For every line in STDIN 10 | for line in fin: 11 | # Your code here! 12 | pass 13 | -------------------------------------------------------------------------------- /Labs/Lab3/prob3_reducer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import sys 4 | 5 | # We open STDIN and STDOUT 6 | with sys.stdin as fin: 7 | with sys.stdout as fout: 8 | 9 | # For every line in STDIN 10 | for line in fin: 11 | # Your code here! 12 | pass 13 | -------------------------------------------------------------------------------- /Labs/Lab4/README.md: -------------------------------------------------------------------------------- 1 | # Lab 4: Spark 2 | 3 | ## Introduction 4 | 5 | As we have talked about in lecture, Spark is built on what's called a Resilient Distributed Dataset (RDD). 6 | 7 | PySpark allows us to interface with these RDD’s in Python. Think of it as an API. In fact, it is an API; it even has its own [documentation](http://spark.apache.org/docs/latest/api/python/)! It’s built on top of the Spark’s Java API and exposes the Spark programming model to Python. 8 | 9 | 10 | PySpark makes use of a library called `Py4J`, which enables Python programs to dynamically access Java objects in a Java Virtual Machine. 
11 | 12 | This allows data to be processed in Python and cached in the JVM. 13 | 14 | 15 | ## Running your Jobs 16 | 17 | We'll be using `spark-submit` to run our spark jobs on the cluster. `spark-submit` has a couple command line options that you can tweak. 18 | 19 | #### `--master` 20 | This option tells `spark-submit` where to run your job, as spark can run in several modes. 21 | 22 | * `local` 23 | * The spark job runs locally, without using any compute resources from the cluster. 24 | * `yarn-client` 25 | * The spark job runs on our YARN cluster, but the driver is local to the machine, so it 'appears' that you're running the job locally, but you still get the compute resources from the cluster. You'll see the logs spark provides as the program executes. 26 | * When the cluster is busy, you *will not* be able to use this mode, because it imposes too much of a memory footprint. 27 | * `yarn-cluster` 28 | * The spark job runs on our YARN cluster, and the spark driver is in some arbitrary location on the cluster. This option doesn't give you logs directly, so you'll have to get the logs manually. 29 | * In the output of `spark-submit --master yarn-cluster` you'll find an `applicationId`. (This is similar to when you ran jobs on Hadoop). You can issue this command to get the logs for your job: 30 | 31 | ``` 32 | yarn logs -applicationId | less 33 | ``` 34 | * When debugging Python applications, it's useful to `grep` for `Traceback` in your logs, as this will likely be the actual debug information you're looking for. 35 | 36 | ``` 37 | yarn logs -applicationId | grep -A 50 Traceback 38 | ``` 39 | 40 | * *NOTE*: In cluster mode, normal IO operations like opening files will behave unexpectedly! This is because you're not guaranteed which node the driver will run on. You must use the PySpark API for saving files to get reliable results. You also have to coalesce your RDD into one partition before asking PySpark to write to a file (why do you think this is?). Additionally, you should save your results to HDFS. 41 | 42 | ```python 43 | .coalesce(1).saveAsTextFile('hdfs:///user/MY_USERNAME/foo') 44 | ``` 45 | 46 | #### `--num-executors` 47 | This option lets you set the number of executors that your job will have. A good rule of thumb is to have as many executors as the maximum number of partitions an RDD will have during a Spark job (this heuristic holds better for simple jobs, but falls apart as the complexity of your job increases). 48 | 49 | The number of executors is a tradeoff. Too few, and you might not be taking full advantage of Sparks parallelism. However, there is also an upper bound on the number of executors (for obvious reasons), as they have a fairly large memory footprint. (Don't set this too high or we'll terminate your job.) 50 | 51 | You can tweak executors more granularly by setting the amount of memory and number of cores they're allocated, but for our purposes the default values are sufficient. 52 | 53 | ### Putting it all together 54 | 55 | Submitting a spark job will ususally look something like this: 56 | 57 | ``` 58 | spark-submit --master yarn-cluster --num-executors 10 MY_PYTHON_FILE.py 59 | ``` 60 | 61 | Be sure to include the `--master` flag, or else your code will only run locally, and you won't get the benefits of the cluster's parallelism. 62 | 63 | You can track the progress of your application by looking running this command 64 | `ssh -L 127.0.0.1:9002:192-168-100-234.local:18080 username@141.142.210.245` 65 | 66 | where username is your username. 
Then look at http://127.0.0.1:9002 . If you scroll to the bottom and click Show incomplete applications, you can see the current progress of your script 67 | 68 | ### Interactive Shell 69 | 70 | While `spark-submit` is the way we'll be endorsing to run PySpark jobs, there is an option to run jobs in an interactive shell. Use the `pyspark` command to load into the PySpark interactive shell. You can use many of the same options listed above to tweak `pyspark` settings, such as `--num-executors` and `--master`. 71 | 72 | Note: If you start up the normal `python` interpreter, you probably won't be able to use any of the PySpark features. 73 | 74 | ### Helpful Hints 75 | 76 | * You'll find the [PySpark documentation](https://spark.apache.org/docs/2.0.0/api/python/pyspark.html#pyspark.RDD) (especially the section on RDDs) very useful. 77 | * Run your Spark jobs on a subset of the data when you're debugging. Even though Spark is very fast, jobs can still take a long time - especially when you're working with the review dataset. When you are experimenting, always use a subset of the data. The best way to use a subset of data is through the [take](https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/rdd/RDD.html#take(int)) command. 78 | 79 | Specifically the most common pattern to sample data looks like 80 | `rdd = sc.parallelize(rdd.take(100))` 81 | This converts an rdd into a list of 100 items and then back into an rdd through the parallelize function. 82 | 83 | * [Programming Guide](http://spark.apache.org/docs/latest/programming-guide.html) -- This documentation by itself could be used to solve the entire lab. It is a great quickstart guide about Spark. 84 | 85 | 86 | 87 | 88 | ## The Dataset 89 | 90 | This week, we'll be working off of a set of released Yelp data. 91 | 92 | The dataset is located in `/shared/yelp` in HDFS. We'll be using the following files for this lab: 93 | 94 | ``` 95 | /shared/yelp/yelp_academic_dataset_business.json 96 | /shared/yelp/yelp_academic_dataset_checkin.json 97 | /shared/yelp/yelp_academic_dataset_review.json 98 | /shared/yelp/yelp_academic_dataset_user.json 99 | ``` 100 | 101 | We'll give more details about the data in these files as we continue with the lab, but the general schema is this: each line in each of these JSON files is an independently parsable JSON object that represents a distinct entity, whether it be a business, a review, or a user. 102 | 103 | *Hint:* JSON is parsed with `json.loads` 104 | 105 | ## Lab Activities 106 | **Lab 4 is due on Thursday, March 2nd, 2017 at 11:55PM.** 107 | 108 | Please zip your source files **and your output text files** for the following exercises and upload it to Moodle (learn.illinois.edu). 109 | 110 | ### 1. Least Expensive Cities 111 | 112 | In planning your next road trip, you want to find the cities that, overall, will be the least expensive to dine at. 113 | 114 | It turns out that Yelp keeps track of a handy metric for this, and many restaurants have the attribute `RestaurantsPriceRange2` that gives the business a score from 1-4 as far as 'priciness'. 115 | 116 | Write a PySpark application that sorts cities by the average price of their businesses/restaurants. 117 | 118 | Notes: 119 | 120 | * Discard any business that does not have the `RestaurantsPriceRange2` attribute 121 | * Discard any business that does not have a valid city and state 122 | * Your output should be sorted descending by average price (highest at top, lowest at bottom). 
Your average restaurant price should be rounded to 2 decimal places. Each city should get a row in the output and look like: 123 | 124 | `CITY, STATE: PRICE` 125 | 126 | ### 2. Up All Night 127 | 128 | You also expect on this road trip that you'll be out pretty late. Yelp also lists the hours that businesses are open, so lets find out where you'll be likely to find something to eat late at night. 129 | 130 | Write a PySpark application that sorts cities by the median closing time of their businesses/restaurants to find the cities that are open latest. 131 | 132 | Notes: 133 | 134 | * Discard any business that doesn't have a valid `hours` property, or an `hours` property that does not include the closing time of the business 135 | * Discard any invalid times (some business have `DAY 0:0-0:0` as their hours, which we consider to be invalid), and for simplicities sake, assume that all businesses close before midnight. 136 | * Use the **median** closing time of businesses in each city as the "city closing time". If you have to tie break (i.e. `num_business_hours % 2 == 1`), choose the lower of the two so you can avoid doing datetime math 137 | * Your output should be in the following format, with median closing time in `HH:MM` (24 hour clock) format and should be sorted descending by time (latest cities first). 138 | 139 | `CITY, STATE: HH:MM` 140 | 141 | ### 3. Pessimistic Yelp Reviewers 142 | 143 | For this activity, we'll be looking at [Yelp reviews](https://www.youtube.com/watch?v=QEdXhH97Z7E). 😱 Namely, we want to find out which Yelp reviewers are... more harsh than they should be. 144 | 145 | To do this we will calculate the average review score of each business in our dataset, and find the users that most often under-rate businesses. 146 | 147 | Use the following to calculate which users are pessimistic: 148 | 149 | * The `average_business_rating` of a business is the sum of the ratings of the business divided by the count of the ratings for that business. 150 | * A user's pessimism score is the sum of the differences between their rating and the average business rating *if and only if* their rating is lower than the average divided by the number of times their rating was less than the average. 151 | 152 | Your output should contain the top 100 pessimistic users in the following format, where `pessimism_score` is rounded to 2 decimal places, and users are sorted in descending order by `pessimism_score`: 153 | 154 | ``` 155 | user_id: pessimism_score 156 | ``` 157 | 158 | Notes: 159 | 160 | * Business have "average rating" as a property. We **will not** be using this. Instead - to have greater precision - we will be manually calculating a business' average reviews by averaging all the review scores given in `yelp_academic_dataset_review.json`. 161 | * Discard any reviews that do not have a rating, a `user_id`, and a `business_id`. 162 | 163 | ### 4. Descriptors of a Bad Business 164 | 165 | Suppose we want to predict a review's score from its text. There are many ways we could do this, but a simple way would be to find words that are indicative of either a positive or negative review. 166 | 167 | In this activity, we want to find the words that are the most 'charged'. We can think about the probability that a word shows up in a review as depending on the type of a review. For example, it is more likely that "delicious" would show up in a positive review than a negative one. 
168 | 169 | Calculate the probability of each word appearing to be the number of occurrences of the word in the category tested (positive/negative) divided by the number of reviews in that category. 170 | 171 | Output the **top 250** words that are most likely to be in negative reviews, but not in positive reviews (maximize `P(negative) - P(positive)`). 172 | 173 | Notes: 174 | 175 | * Remove any words listed in `nltk`'s list of [English stopwords](http://www.nltk.org/book/ch02.html#wordlist-corpora) and remove all punctuation. We also encourage you to use `nltk.tokenize.word_tokenize` to split reviews into words. 176 | * Consider a review to be positive if it has >=3 stars, and consider a review negative if it has <3 stars. 177 | * Your output should be as follows, where `probability_diff` is `P(negative) - P(positive)` rounded to **5** decimal places and sorted in descending order: 178 | 179 | `word: probability_diff` 180 | -------------------------------------------------------------------------------- /Labs/Lab4/descriptors_of_bad_business.py: -------------------------------------------------------------------------------- 1 | from pyspark import SparkContext, SparkConf 2 | conf = SparkConf().setAppName("Descriptors of a Bad Business") 3 | sc = SparkContext(conf=conf) 4 | 5 | reviews = sc.textFile("hdfs:///shared/yelp/yelp_academic_dataset_review.json") 6 | 7 | with open('descriptors_of_bad_business.txt', 'w+') as f: 8 | pass 9 | -------------------------------------------------------------------------------- /Labs/Lab4/most_expensive_city.py: -------------------------------------------------------------------------------- 1 | from pyspark import SparkContext, SparkConf 2 | conf = SparkConf().setAppName("Most Expensive City") 3 | sc = SparkContext(conf=conf) 4 | 5 | businesses = sc.textFile("hdfs:///shared/yelp/yelp_academic_dataset_business.json") 6 | 7 | with open('most_expensive_city.txt', 'w+') as f: 8 | f.write('Champaign, IL: 1.23') 9 | -------------------------------------------------------------------------------- /Labs/Lab4/pessimistic_users.py: -------------------------------------------------------------------------------- 1 | from pyspark import SparkContext, SparkConf 2 | conf = SparkConf().setAppName("Pessimistic Users") 3 | sc = SparkContext(conf=conf) 4 | 5 | reviews = sc.textFile("hdfs:///shared/yelp/yelp_academic_dataset_review.json") 6 | 7 | with open('pessimistic_users.txt', 'w+') as f: 8 | f.write('taeyoung_kim: 1.23') 9 | -------------------------------------------------------------------------------- /Labs/Lab4/up_all_night.py: -------------------------------------------------------------------------------- 1 | from pyspark import SparkContext, SparkConf 2 | conf = SparkConf().setAppName("Up All Night") 3 | sc = SparkContext(conf=conf) 4 | 5 | businesses = sc.textFile("hdfs:///shared/yelp/yelp_academic_dataset_business.json") 6 | 7 | with open('up_all_night.txt', 'w+') as f: 8 | f.write('Champaign, IL: 11:59') 9 | -------------------------------------------------------------------------------- /Labs/Lab5/README.md: -------------------------------------------------------------------------------- 1 | # Lab 5: Spark MLlib 2 | 3 | ## Introduction 4 | 5 | This week we'll be diving into another aspect of PySpark: MLlib. Spark MLlib provides an API to run machine learning algorithms on RDDs so that we can do ML on the cluster with the benefit of parallelism / distributed computing. 
6 | 7 | ## Machine Learning Crash Course 8 |
9 | We'll be considering 3 types of machine learning in the lab this week: 10 |
11 | * Classification 12 | * Regression 13 | * Clustering 14 |
15 | However, for most of the algorithms / feature extractors that MLlib provides, there is a common pattern: 16 |
17 | 1) Fit - Trains the model, using training data to adjust the model's internal parameters. 18 |
19 | 2) Transform - Uses the fitted model to predict the label/value of novel data (data not used in the training of the model). 20 |
21 | If you go on to do more data science work, you'll see that this 2-phase ML pattern is common in other ML libraries, like `scikit-learn`. 22 |
23 | Things are a bit more complicated in PySpark because of the way RDDs are handled (i.e. lazy evaluation): we often have to explicitly note when we want to predict data, and other instances when we're piping data through different steps of our model's setup. 24 |
25 | It'll be extremely valuable to look up PySpark's documentation and examples when working on this week's lab. The lab examples we'll be giving you do not require deep knowledge of machine learning concepts to complete. However, you will need to be good documentation readers to navigate MLlib's nuances. 26 |
27 | ## Examples 28 |
29 | ### TF-IDF Naive Bayes Yelp Review Classification 30 |
31 | #### Extracting Features 32 |
33 | Remember last week when you found out which words were correlated with negative reviews by calculating the probability of a word occurring in a review? PySpark lets you do something similar extremely easily by calculating the Term Frequency - Inverse Document Frequency ([tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)) characteristics of a set of texts. 34 |
35 | Let's define some terms: 36 |
37 | * Term frequency - The number of times a word appears in a text 38 | * Inverse document frequency - Weights a word's importance in a text by seeing if that word is rare in the collection of all texts. So, if we have a sentence that contains the only reference to "cats" in an entire book, that sentence will have "cats" ranked as highly relevant. 39 |
40 | TF-IDF combines the previous two concepts. Suppose we had a few sentences that refer to "cats" in a large book. We'd then rank those rare sentences by the frequency of "cats" in each of those sentences. 41 |
42 | There's a fair amount of math behind calculating TF-IDF, but for this lab it is sufficient to know that it is a relatively reliable way of guessing the relevance of a word in the context of a large body of data. 43 |
44 | You'll also note that we're making use of a `HashingTF`. This is just a really quick way to compute the term frequency of words. It uses a hash function to represent a long string with a shorter hash, and can use a data structure like a hashmap to quickly count the frequency with which words appear. 45 |
46 | #### Classifying Features 47 |
48 | We'll also be using a [Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) classifier. This type of classifier looks at a set of data and labels, and constructs a model to predict the label given the data using probabilistic means. 49 |
50 | Again, it's not necessary to know the inner workings of Naive Bayes, just that we'll be using it to classify data. 51 |
52 | #### Constructing a Model 53 |
54 | To construct a model, we'll need to build an RDD that has (key, value) pairs, with keys as our labels and values as our features. First, however, we'll need to extract those features from the text.
We're going to use TF-IDF as our feature, so we'll calculate that for all of our text first. 55 | 56 | We'll start with the assumption that you've transformed the data so that we have `(label, array_of_words)` as the RDD. To start with, we'll have label be `0` if the review is negative and `1` if the review is positive. You practiced how to do this last week. 57 | 58 | Here's how we'll extract the TF-IDF features: 59 | 60 | ```python 61 | # Feed HashingTF just the array of words 62 | tf = HashingTF().transform(labeled_data.map(lambda x: x[1])) 63 | 64 | # Pipe term frequencies into the IDF 65 | idf = IDF(minDocFreq=5).fit(tf) 66 | 67 | # Transform the IDF into a TF-IDF 68 | tfidf = idf.transform(tf) 69 | 70 | # Reassemble the data into (label, feature) K,V pairs 71 | zipped_data = (labels.zip(tfidf) 72 | .map(lambda x: LabeledPoint(x[0], x[1])) 73 | .cache()) 74 | ``` 75 | 76 | Now that we have our labels and our features in one RDD, we can train our model: 77 | 78 | ``` 79 | # Do a random split so we can test our model on non-trained data 80 | training, test = zipped_data.randomSplit([0.7, 0.3]) 81 | 82 | # Train our model with the training data 83 | model = NaiveBayes.train(training) 84 | ``` 85 | 86 | Then, we can use this model to predict new data: 87 | ```python 88 | # Use the test data and get predicted labels from our model 89 | test_preds = (test.map(lambda x: x.label) 90 | .zip(model.predict(test.map(lambda x: x.features)))) 91 | ``` 92 | 93 | If we look at this `test_preds` RDD, we'll see our text, and the label the model predicted. 94 | 95 | However, if we want a more precise measurement of how our model faired, PySpark gives us `MulticlassMetrics`, which we can use to measure our model's performance. 96 | 97 | ```python 98 | trained_metrics = MulticlassMetrics(train_preds.map(lambda x: (x[0], float(x[1])))) 99 | test_metrics = MulticlassMetrics(test_preds.map(lambda x: (x[0], float(x[1])))) 100 | 101 | print trained_metrics.confusionMatrix().toArray() 102 | print trained_metrics.precision() 103 | 104 | print test_metrics.confusionMatrix().toArray() 105 | print test_metrics.precision() 106 | ``` 107 | 108 | #### Analyzing our Results 109 | `MulticlassMetrics` let's us see the ["confusion matrix"](https://en.wikipedia.org/wiki/Confusion_matrix) of our model, which shows us how many times our model chose each label given the actual label of the data point. 110 | 111 | The meaning of the columns is the _predicted_ value, and the meaning of the rows is the _actual_ value. So, we read that `confusion_matrix[0][1]` is the number of items predicted as having `label[1]` that were in actuality `label[0]`. 112 | 113 | Thus, we want our confusion matrix to have as many items on the diagonals as possible, as these represent items that were correctly predicted. 114 | 115 | We can also get precision, which is a more simple metric of "how many items we predicted correctly". 116 | 117 | Here's our results for this example: 118 | 119 | ``` 120 | # Training Data Confusion Matrix: 121 | [[ 2019245. 115503.] 122 | [ 258646. 513539.]] 123 | # Training Data Accuracy: 124 | 0.8712908071840665 125 | 126 | # Testing Data Confusion Matrix: 127 | [[ 861056. 55386.] 128 | [ 115276. 214499.]] 129 | #Testing Data Accuracy: 130 | 0.8630559525347512 131 | ``` 132 | 133 | Not terrible. As you see, our training data get's slightly better prediction precision, because it's the data used to train the model. 
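To make the link between the confusion matrix and the precision figure explicit: the overall precision reported by `MulticlassMetrics` is simply the fraction of predictions that land on the diagonal of the matrix. Here is a minimal sketch (plain NumPy, outside of Spark) that reproduces the training figure from the matrix printed above:

```python
import numpy as np

# Training confusion matrix from above: rows are actual labels,
# columns are predicted labels.
cm = np.array([[2019245., 115503.],
               [258646., 513539.]])

# Overall precision is the number of correct predictions (the diagonal)
# divided by the total number of predictions.
precision = np.trace(cm) / cm.sum()
print precision  # ~0.8713, matching trained_metrics.precision()
```

The same arithmetic applied to the test matrix gives the ~0.863 testing figure.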
134 | 135 | #### Extending the Example 136 | 137 | What if instead of just classifying on positive and negative, we try to classify reviews based on their 1-5 star review score? 138 | 139 | ``` 140 | # Training Data Confusion Matrix: 141 | [[ 130042. 38058. 55682. 115421. 193909.] 142 | [ 27028. 71530. 26431. 55381. 95007.] 143 | [ 35787. 22641. 102753. 71802. 122539.] 144 | [ 72529. 45895. 69174. 254838. 246081.] 145 | [ 113008. 73249. 108349. 225783. 535850.]] 146 | # Training Data Accuracy: 147 | 0.37645263439801124 148 | 149 | # Testing Data Confusion Matrix: 150 | [[ 33706. 20317. 27553. 54344. 90325.] 151 | [ 15384. 10373. 14875. 28413. 46173.] 152 | [ 18958. 13288. 19389. 37813. 59746.] 153 | [ 36921. 25382. 37791. 76008. 120251.] 154 | [ 57014. 37817. 55372. 112851. 194319.]] 155 | # Testing Data Accuracy: 156 | 0.268241369417615 157 | ``` 158 | 159 | Ouch. What went wrong? Well, a couple of things. One thing that hurts us is that Naive Bayes is, well, naive. While we intuitively know that the labels 1 through 5 are ordered and carry relative meaning, NB has no concept that items labeled 4 and 5 are probably going to be closer than a pair labeled 1 and 5. 160 | 161 | Also, in this example we see a case where testing on training data doesn't have much utility. While an accuracy of `0.376` isn't great, it's still a lot better than `0.268`. Validating on the training data would lead us to think that our model is substantially more accurate than it actually is. 162 | 163 | #### Conclusion 164 | 165 | The full code of the first example is in `bayes_binary_tfidf.py`, and the second "extended" example is in `bayes_tfidf.py`. 166 | 167 | ## Lab Activities 168 | **Lab 5 is due on Thursday, March 9th, 2017 at 11:55PM.** 169 | 170 | Please zip your source files **and your output text files** for the following exercises and upload the zip to Moodle (learn.illinois.edu). 171 | 172 | **NOTE:** 173 | 174 | * For each problem you may only use, at most, 80% of the dataset to train on. The other 20% should be used for testing your model. (i.e. use `rdd.randomSplit([0.8, 0.2])`) 175 | * Our cluster has PySpark version 1.5.2. This is a slightly older version, so we don't have a couple of the cutting-edge ML tools. Use [this](https://spark.apache.org/docs/1.5.0/api/python/pyspark.html#) documentation in your research. 176 | 177 | #### Precision Competition 178 | 179 | Lab Problems 1 and 2 have an aspect of competition this week: we'll be awarding 10% extra credit on this lab to the top 3 students with the highest average precision across the two problems. Make sure that your Spark jobs output the precision of your models as given by the appropriate metrics class, and that your results are reproducible, to be eligible for credit. 180 | 181 | ### 1. Amazon Review Score Classification 182 | This week, we'll be using an Amazon dataset of food reviews. You can find this dataset in HDFS at `/shared/amazon_food_reviews.csv`. The dataset has the following columns: 183 | 184 | ``` 185 | Id, ProductId, UserId, ProfileName, HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary, Text 186 | ``` 187 | Similar to the Yelp Dataset, Amazon's food review dataset provides you with some review text and a review score. Use MLlib to classify these reviews by score. You can use any classifiers and feature extractors that are available. You may also choose to classify either on positive/negative or the more granular star rating.
You'll only be eligible for the precision contest if you classify on stars, not just positive/negative. 188 | 189 | Notes: 190 | 191 | * You can use any fields other than `HelpfulnessNumerator` or `HelpfulnessDenominator` for feature extraction. 192 | * Use `MulticlassMetrics` to output the `confusionMatrix` and `precision` of your model. You want to maximize the precision. Include this output in your submission. 193 | 194 | ### 2. Amazon Review Helpfulness Regression 195 | 196 | Amazon also gives a metric of "helpfulness". The dataset has the number of users who marked a review as helpful, and the number of users who voted either up or down on the review. 197 | 198 | Define a review's helpfulness score as `HelpfulnessNumerator / HelpfulnessDenominator`. 199 | 200 | Construct and train a model that uses a regression algorithm to predict a review's helpfulness score from its text. 201 | 202 | Notes: 203 | 204 | * You can use any fields other than `Score` for feature extraction. 205 | * We suggest that, as a starting point, you use `pyspark.mllib.regression.LinearRegressionWithSGD` as your regression model. 206 | * Use `pyspark.mllib.evaluation.RegressionMetrics` to output the `explainedVariance` and `rootMeanSquaredError`. You want to minimize the error. 207 | 208 | ### 3. Yelp Business Clustering 209 | 210 | Going back to the Yelp dataset, suppose we want to find clusters of businesses in the Urbana/Champaign area. Where do businesses aggregate geographically? Could we predict from a set of coordinates which cluster a given business belongs to? Use K-Means to come up with a clustering model for the U-C area. 211 | 212 | How can we determine how good our model is? The simplest way is to just graph it and see if the clusters match what we would expect. More formally, we can use the Within Set Sum of Squared Error ([WSSSE](https://spark.apache.org/docs/1.5.0/mllib-clustering.html#k-means)) to determine the optimal number of clusters. If we plot the error for multiple values of k, we can see the point of diminishing returns from adding more clusters. You should pick a value of k that is around this point of diminishing returns; a minimal sketch of this approach is given after the notes below. 213 | 214 | Notes: 215 | 216 | * Use `pyspark.mllib.clustering.KMeans` as your clustering algorithm. 217 | * Your task is to: 218 | 1. Extract the businesses that are in the U-C area and use their coordinates as features for your KMeans clustering model. 219 | 2. Select a proper K such that you get a good approximation of the "actual" clusters of businesses. (This may require trial and error.) 220 | 3. Plot the businesses with `matplotlib.pyplot.scatter` and have each point on the scatter plot be color-keyed by its cluster. 221 | 4. Include both the plot as a PNG and a short justification for your k value (either in comments in your code or in a separate `.txt`) in your submission.
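To make the "point of diminishing returns" idea concrete, here is a minimal sketch of computing WSSSE for several values of k. It assumes you've already built `coords`, an RDD of `[longitude, latitude]` pairs for the U-C businesses (building that RDD is part of the exercise), and the range of k values is purely illustrative, so treat this as a starting point rather than a full solution:

```python
from pyspark.mllib.clustering import KMeans

def wssse(model, data):
    # Within Set Sum of Squared Error: for each point, the squared distance
    # to the center of its assigned cluster, summed over all points.
    def squared_error(point):
        center = model.clusterCenters[model.predict(point)]
        return sum((x - c) ** 2 for x, c in zip(point, center))
    return data.map(squared_error).reduce(lambda a, b: a + b)

# `coords` is assumed to be an RDD of [longitude, latitude] pairs
for k in range(2, 11):
    model = KMeans.train(coords, k, maxIterations=20)
    print k, wssse(model, coords)  # look for the k where the error stops dropping quickly
```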
222 | -------------------------------------------------------------------------------- /Labs/Lab5/amazon_helpfulness_regression.py: -------------------------------------------------------------------------------- 1 | from pyspark import SparkContext, SparkConf 2 | conf = SparkConf().setAppName("Amazon Helpfulness Regression") 3 | sc = SparkContext(conf=conf) 4 | 5 | reviews = sc.textFile("hdfs:///shared/amazon_food_reviews.csv") 6 | 7 | with open('amazon_helpfulness_regression.txt', 'w+') as f: 8 | pass 9 | -------------------------------------------------------------------------------- /Labs/Lab5/amazon_review_classification.py: -------------------------------------------------------------------------------- 1 | from pyspark import SparkContext, SparkConf 2 | conf = SparkConf().setAppName("Amazon Review Classification") 3 | sc = SparkContext(conf=conf) 4 | 5 | reviews = sc.textFile("hdfs:///shared/amazon_food_reviews.csv") 6 | 7 | with open('amazon_review_classification.txt', 'w+') as f: 8 | pass 9 | -------------------------------------------------------------------------------- /Labs/Lab5/bayes_binary_tfidf.py: -------------------------------------------------------------------------------- 1 | from pyspark.mllib.feature import HashingTF, IDF 2 | from pyspark.mllib.regression import LabeledPoint 3 | from pyspark.mllib.classification import NaiveBayes 4 | from pyspark.mllib.evaluation import MulticlassMetrics 5 | import json 6 | import nltk 7 | from pyspark import SparkContext, SparkConf 8 | conf = SparkConf().setAppName("Bayes Binary TFIDF") 9 | sc = SparkContext(conf=conf) 10 | 11 | 12 | def get_labeled_review(x): 13 | return x.get('stars'), x.get('text') 14 | 15 | 16 | def categorize_review(x): 17 | return (0 if x[0] > 2.5 else 1), x[1] 18 | 19 | 20 | def format_prediction(x): 21 | return "actual: {0}, predicted: {1}".format(x[0], float(x[1])) 22 | 23 | 24 | def produce_tfidf(x): 25 | tf = HashingTF().transform(x) 26 | idf = IDF(minDocFreq=5).fit(tf) 27 | tfidf = idf.transform(tf) 28 | return tfidf 29 | 30 | # Load in reviews 31 | reviews = sc.textFile("hdfs:///shared/yelp/yelp_academic_dataset_review.json") 32 | # Parse to json 33 | json_payloads = reviews.map(json.loads) 34 | # Tokenize and weed out bad data 35 | labeled_data = (json_payloads.map(get_labeled_review) 36 | .filter(lambda x: x[0] and x[1]) 37 | .map(lambda x: (float(x[0]), x[1])) 38 | .map(categorize_review) 39 | .mapValues(nltk.word_tokenize)) 40 | labels = labeled_data.map(lambda x: x[0]) 41 | 42 | tf = HashingTF().transform(labeled_data.map(lambda x: x[1])) 43 | idf = IDF(minDocFreq=5).fit(tf) 44 | tfidf = idf.transform(tf) 45 | zipped_data = (labels.zip(tfidf) 46 | .map(lambda x: LabeledPoint(x[0], x[1])) 47 | .cache()) 48 | 49 | # Do a random split so we can test our model on non-trained data 50 | training, test = zipped_data.randomSplit([0.7, 0.3]) 51 | 52 | # Train our model 53 | model = NaiveBayes.train(training) 54 | 55 | # Use our model to predict 56 | train_preds = (training.map(lambda x: x.label) 57 | .zip(model.predict(training.map(lambda x: x.features)))) 58 | test_preds = (test.map(lambda x: x.label) 59 | .zip(model.predict(test.map(lambda x: x.features)))) 60 | 61 | # Ask PySpark for some metrics on how our model predictions performed 62 | trained_metrics = MulticlassMetrics(train_preds.map(lambda x: (x[0], float(x[1])))) 63 | test_metrics = MulticlassMetrics(test_preds.map(lambda x: (x[0], float(x[1])))) 64 | 65 | with open('output_binary.txt', 'w+') as f: 66 | 
f.write(str(trained_metrics.confusionMatrix().toArray()) + '\n') 67 | f.write(str(trained_metrics.precision()) + '\n') 68 | f.write(str(test_metrics.confusionMatrix().toArray()) + '\n') 69 | f.write(str(test_metrics.precision()) + '\n') 70 | -------------------------------------------------------------------------------- /Labs/Lab5/bayes_tfidf.py: -------------------------------------------------------------------------------- 1 | from pyspark.mllib.feature import HashingTF, IDF 2 | from pyspark.mllib.regression import LabeledPoint 3 | from pyspark.mllib.classification import NaiveBayes 4 | from pyspark.mllib.evaluation import MulticlassMetrics 5 | import json 6 | import nltk 7 | from pyspark import SparkContext, SparkConf 8 | conf = SparkConf().setAppName("Bayes TFIDF") 9 | sc = SparkContext(conf=conf) 10 | 11 | 12 | def get_labeled_review(x): 13 | return x.get('stars'), x.get('text') 14 | 15 | 16 | def format_prediction(x): 17 | return "actual: {0}, predicted: {1}".format(x[0], float(x[1])) 18 | 19 | 20 | def produce_tfidf(x): 21 | tf = HashingTF().transform(x) 22 | idf = IDF(minDocFreq=5).fit(tf) 23 | tfidf = idf.transform(tf) 24 | return tfidf 25 | 26 | # Load in reviews 27 | reviews = sc.textFile("hdfs:///shared/yelp/yelp_academic_dataset_review.json") 28 | # Parse to json 29 | json_payloads = reviews.map(json.loads) 30 | # Tokenize and weed out bad data 31 | labeled_data = (json_payloads.map(get_labeled_review) 32 | .filter(lambda x: x[0] and x[1]) 33 | .map(lambda x: (float(x[0]), x[1])) 34 | .mapValues(nltk.word_tokenize)) 35 | labels = labeled_data.map(lambda x: x[0]) 36 | 37 | tfidf = produce_tfidf(labeled_data.map(lambda x: x[1])) 38 | zipped_data = (labels.zip(tfidf) 39 | .map(lambda x: LabeledPoint(x[0], x[1])) 40 | .cache()) 41 | 42 | # Do a random split so we can test our model on non-trained data 43 | training, test = zipped_data.randomSplit([0.7, 0.3]) 44 | 45 | # Train our model 46 | model = NaiveBayes.train(training) 47 | 48 | # Use our model to predict 49 | train_preds = (training.map(lambda x: x.label) 50 | .zip(model.predict(training.map(lambda x: x.features)))) 51 | test_preds = (test.map(lambda x: x.label) 52 | .zip(model.predict(test.map(lambda x: x.features)))) 53 | 54 | # Ask PySpark for some metrics on how our model predictions performed 55 | trained_metrics = MulticlassMetrics(train_preds.map(lambda x: (x[0], float(x[1])))) 56 | test_metrics = MulticlassMetrics(test_preds.map(lambda x: (x[0], float(x[1])))) 57 | 58 | with open('output_discrete.txt', 'w+') as f: 59 | f.write(str(trained_metrics.confusionMatrix().toArray()) + '\n') 60 | f.write(str(trained_metrics.precision()) + '\n') 61 | f.write(str(test_metrics.confusionMatrix().toArray()) + '\n') 62 | f.write(str(test_metrics.precision()) + '\n') 63 | -------------------------------------------------------------------------------- /Labs/Lab5/yelp_clustering.py: -------------------------------------------------------------------------------- 1 | from pyspark import SparkContext, SparkConf 2 | conf = SparkConf().setAppName("Yelp Clustering") 3 | sc = SparkContext(conf=conf) 4 | 5 | businesses = sc.textFile("hdfs:///shared/yelp/yelp_academic_dataset_business.json") 6 | 7 | with open('yelp_clustering.txt', 'w+') as f: 8 | pass -------------------------------------------------------------------------------- /Labs/Lab6/README.md: -------------------------------------------------------------------------------- 1 | # Lab 6: Spark SQL 2 | 3 | ## Introduction 4 | 5 | Spark SQL is a powerful way for interacting with 
large amounts of structured data. Spark SQL gives us the concept of "dataframes", which will be familiar if you've ever done work with Pandas or R. DataFrames can also be thought of as similar to tables in databases. 6 | 7 | With Spark SQL DataFrames we can interact with our data using the Structured Query Language (SQL). This gives us a declarative way to query our data, as opposed to the imperative methods we've studied in past weeks (i.e. discrete operations on sets of RDDs). 8 | 9 | ## SQL Crash Course 10 | 11 | SQL is a declarative language used for querying data. The simplest SQL query is a `SELECT - FROM - WHERE` query. This selects a set of attributes (SELECT) from a specific table (FROM) where a given set of conditions holds (WHERE). 12 | 13 | However, SQL also has a series of more advanced aggregation commands for grouping data. This is accomplished with the `GROUP BY` keyword. We can also join tables on attributes or conditions with the set of `JOIN ... ON` commands. We won't be expecting advanced knowledge of these more difficult topics, but developing a working understanding of how they work will be useful in completing this lab. 14 | 15 | Spark SQL has a pretty good [Programming Guide](https://spark.apache.org/docs/1.5.1/sql-programming-guide.html) that's worth looking at. 16 | 17 | Additionally, you may find [SQL tutorials](https://www.w3schools.com/sql/default.asp) online useful for this assignment. 18 | 19 | ## Examples 20 | 21 | ### Loading Tables 22 | 23 | The easiest way to get data into Spark SQL is by registering a DataFrame as a table. A DataFrame is essentially an instance of a table: it has a schema (columns with data types and names), and data. 24 | 25 | We can create a DataFrame by passing an RDD of data tuples and a schema to `sqlContext.createDataFrame`: 26 | 27 | ``` 28 | data = sc.parallelize([('Tyler', 1), ('Quinn', 2), ('Ben', 3)]) 29 | df = sqlContext.createDataFrame(data, ['name', 'instructor_id']) 30 | ``` 31 | 32 | This creates a DataFrame with 2 columns: `name` and `instructor_id`. 33 | 34 | We can then register this frame with the sqlContext to be able to query it generally: 35 | 36 | ``` 37 | sqlContext.registerDataFrameAsTable(df, "instructors") 38 | ``` 39 | 40 | Now we can query the table: 41 | 42 | ``` 43 | sqlContext.sql("SELECT name FROM instructors WHERE instructor_id=3") 44 | ``` 45 | 46 | ### Specific Business Subset 47 | 48 | Suppose we want to find all the businesses located in Champaign, IL that have 5-star ratings. We can do this with a simple `SELECT - FROM - WHERE` query: 49 | 50 | ```python 51 | sqlContext.sql("SELECT * " 52 | "FROM businesses " 53 | "WHERE stars=5 " 54 | "AND city='Champaign' AND state='IL'").collect() 55 | ``` 56 | 57 | This selects all the rows from the `businesses` table that match the criteria described in the `WHERE` clause. 58 | 59 | ### Highest Number of Reviews 60 | 61 | Suppose we want to rank users by how many reviews they've written. We can do this query with aggregation and grouping: 62 | 63 | ```python 64 | sqlContext.sql("SELECT user_id, COUNT(*) AS c " 65 | "FROM reviews " 66 | "GROUP BY user_id " 67 | "ORDER BY c DESC " 68 | "LIMIT 10").collect() 69 | ``` 70 | 71 | This query groups rows by the `user_id` column, and collapses those rows into tuples of `(user_id, COUNT(*))`, where `COUNT(*)` is the number of collapsed rows per grouping. This gives us the review count of each user. We then do `ORDER BY c DESC` to show the top counts first, and `LIMIT 10` to only show the top 10 results. (Note the trailing space inside each string literal; that's what keeps the concatenated query valid.)
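For comparison, the same ranking can also be expressed through the DataFrame method interface instead of raw SQL. A rough sketch, assuming `reviews_df` is the DataFrame that was registered as the `reviews` table:

```python
from pyspark.sql.functions import desc

# Group reviews by user, count them, and take the 10 largest counts
(reviews_df.groupBy("user_id")
           .count()                # adds a "count" column per user_id
           .orderBy(desc("count"))
           .limit(10)
           .collect())
```

Both interfaces are available to you, though note that some of the exercises below require the raw-SQL route.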
72 | 73 | ## Lab Activities 74 | **Lab 6 is due on Thursday, March 16th, 2017 at 11:55PM.** 75 | 76 | Please zip your source files **and your output text files** for the following exercises and upload the zip to Moodle (learn.illinois.edu). 77 | 78 | **NOTE:** 79 | 80 | * For each of these problems you may use RDDs *only* for loading in and saving data to/from HDFS. All of your "computation" must be performed on DataFrames, either via the SQLContext or DataFrame interfaces. 81 | * We _suggest_ using the SQLContext for most of these problems, as it's generally a more straightforward interface. 82 | 83 | ### 1. Quizzical Queries 84 | 85 | For this problem, we'll construct some simple SQL queries on the Amazon Review dataset that we used last week. Your first task is to create a DataFrame from the CSV set. Once you've done this, write queries that get the requested information about the data. Format and save your output and include it in your submission. 86 | 87 | **NOTE:** For this problem, you *must* use `sqlContext.sql` to run your queries. This means you have to run `sqlContext.registerDataFrameAsTable` on your constructed DataFrame and write queries in raw SQL. 88 | 89 | Queries: 90 | 91 | 1. What is the review text of the review with id `22010`? 92 | 2. How many 5-star ratings does product `B000E5C1YE` have? 93 | 3. How many unique users have written reviews? 94 | 95 | Notes: 96 | 97 | * You'll want to use `csv.reader` to parse your data. Using `str.split(',')` is insufficient, as there will be commas in the Text field of the review. 98 | 99 | ### 2. Aggregation Aggravation 100 | 101 | For this problem, we'll use some more complicated parts of the SQL language. Often, we'll want to learn aggregate statistics about our data. We'll use `GROUP BY` and aggregation methods like `COUNT`, `MAX`, and `AVG` to find out more interesting information about our dataset. 102 | 103 | Queries: 104 | 105 | 1. How many reviews has the person who has written the most reviews written? What is that user's UserId? 106 | 2. List the ProductIds of the 10 products with the highest average review scores, among products that have more than 10 reviews, sorted by average score, with ties broken by number of reviews. 107 | 3. List the Ids of the 10 reviews with the highest ratios of `HelpfulnessNumerator` to `HelpfulnessDenominator`, among reviews with `HelpfulnessDenominator` greater than 10, sorted by that ratio, with ties broken by `HelpfulnessDenominator`. 108 | 109 | Notes: 110 | 111 | * You'll want to use `csv.reader` to parse your data. Using `str.split(',')` is insufficient, as there will be commas in the Text field of the review. 112 | * You may use DataFrame query methods other than `sqlContext.sql`, but you must still do all your computations on DataFrames. 113 | 114 | ### 3. Jaunting with Joins 115 | 116 | For this problem, we'll switch back to the Yelp dataset. Note that you can use the very handy [jsonFile](https://spark.apache.org/docs/1.5.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.jsonFile) method to load in the dataset as a DataFrame. 117 | 118 | There are times when we need to access data that is split across multiple tables. For instance, when we look at a single Yelp review, we cannot directly get the user's name, because we only have their id. But, we can match users with their reviews by "joining" on their user id. The database does this by looking for rows with matching values for the join columns.
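As a sketch of what that looks like in practice (assuming the `reviews` and `users` tables are registered as in the provided skeleton, and using the Yelp schema's `user_id`, `name`, and `stars` fields), a join that pairs each review's star rating with the name of its author might look like:

```python
# Match each review with the user who wrote it by joining on user_id
sqlContext.sql("SELECT u.name, r.stars "
               "FROM reviews r "
               "JOIN users u ON r.user_id = u.user_id").collect()
```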
119 | 120 | You'll want to look up the JOIN (specifically INNER JOIN) SQL commands for these problems. 121 | 122 | Queries: 123 | 124 | 1. What state has had the most Yelp check-ins? 125 | 2. What is the maximum number of "funny" ratings left on a review created by someone who's been yelping since 2012? 126 | 3. List the user ids of anyone who has left a 1-star review, has created more than 250 reviews, and has left a review in Champaign, IL. 127 | 128 | Notes: 129 | 130 | * You may use DataFrame query methods other than `sqlContext.sql`, but you must still do all your computations on DataFrames. 131 | -------------------------------------------------------------------------------- /Labs/Lab6/aggregation_aggravation.py: -------------------------------------------------------------------------------- 1 | from pyspark import SparkContext, SparkConf 2 | from pyspark.sql import SQLContext 3 | import csv 4 | conf = SparkConf().setAppName("Aggregation Aggravation") 5 | sc = SparkContext(conf=conf) 6 | sqlContext = SQLContext(sc) 7 | schema = "Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text".split(',') 8 | 9 | 10 | def parse_csv(x): 11 | x = x.replace('\n', '') 12 | d = csv.reader([x]) 13 | return next(d) 14 | 15 | reviews = sc.textFile("hdfs:///shared/amazon_food_reviews.csv") 16 | first = reviews.first() 17 | csv_payloads = reviews.filter(lambda x: x != first).map(parse_csv) 18 | 19 | df = sqlContext.createDataFrame(csv_payloads, schema) 20 | sqlContext.registerDataFrameAsTable(df, "amazon") 21 | 22 | # Do your queries here 23 | 24 | with open('aggregation_aggravation.txt', 'w+') as f: 25 | pass 26 | -------------------------------------------------------------------------------- /Labs/Lab6/jaunting_with_joins.py: -------------------------------------------------------------------------------- 1 | from pyspark import SparkContext, SparkConf 2 | from pyspark.sql import SQLContext 3 | conf = SparkConf().setAppName("Jaunting With Joins") 4 | sc = SparkContext(conf=conf) 5 | sqlContext = SQLContext(sc) 6 | 7 | reviews = sqlContext.jsonFile("hdfs:///shared/yelp/yelp_academic_dataset_review.json") 8 | businesses = sqlContext.jsonFile("hdfs:///shared/yelp/yelp_academic_dataset_business.json") 9 | checkins = sqlContext.jsonFile("hdfs:///shared/yelp/yelp_academic_dataset_checkin.json") 10 | users = sqlContext.jsonFile("hdfs:///shared/yelp/yelp_academic_dataset_user.json") 11 | 12 | sqlContext.registerDataFrameAsTable(reviews, "reviews") 13 | sqlContext.registerDataFrameAsTable(businesses, "businesses") 14 | sqlContext.registerDataFrameAsTable(checkins, "checkins") 15 | sqlContext.registerDataFrameAsTable(users, "users") 16 | 17 | # Do your queries here 18 | 19 | with open('jaunting_with_joins.txt', 'w+') as f: 20 | pass 21 | -------------------------------------------------------------------------------- /Labs/Lab6/quizzical_queries.py: -------------------------------------------------------------------------------- 1 | import csv 2 | from pyspark import SparkContext, SparkConf 3 | from pyspark.sql import SQLContext 4 | conf = SparkConf().setAppName("Quizzical Queries") 5 | sc = SparkContext(conf=conf) 6 | sqlContext = SQLContext(sc) 7 | schema = "Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text".split(',') 8 | 9 | 10 | def parse_csv(x): 11 | x = x.replace('\n', '') 12 | d = csv.reader([x]) 13 | return next(d) 14 | 15 | reviews = sc.textFile("hdfs:///shared/amazon_food_reviews.csv") 16 | first = 
reviews.first() 17 | csv_payloads = reviews.filter(lambda x: x != first).map(parse_csv) 18 | 19 | # Do your queries here 20 | 21 | with open('quizzical_queries.txt', 'w+') as f: 22 | pass 23 | -------------------------------------------------------------------------------- /Lectures/Lecture 10.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/Lecture 10.pdf -------------------------------------------------------------------------------- /Lectures/Lecture 11 - Clouds.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/Lecture 11 - Clouds.pdf -------------------------------------------------------------------------------- /Lectures/Lecture 12 - Streaming.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/Lecture 12 - Streaming.pdf -------------------------------------------------------------------------------- /Lectures/Lecture 13- Networking.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/Lecture 13- Networking.pdf -------------------------------------------------------------------------------- /Lectures/Lecture 6- More Spark.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/Lecture 6- More Spark.pdf -------------------------------------------------------------------------------- /Lectures/Lecture 7- MLib.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/Lecture 7- MLib.pdf -------------------------------------------------------------------------------- /Lectures/Lecture 8- SQL.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/Lecture 8- SQL.pdf -------------------------------------------------------------------------------- /Lectures/Lecture 9- NoSQL.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/Lecture 9- NoSQL.pdf -------------------------------------------------------------------------------- /Lectures/week1/Lecture 1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/week1/Lecture 1.pdf -------------------------------------------------------------------------------- /Lectures/week2/Lecture 2 - Git, Latex, and Other Intros.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/week2/Lecture 2 - Git, Latex, and Other Intros.pdf -------------------------------------------------------------------------------- 
/Lectures/week2/Lecture 2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/week2/Lecture 2.pdf -------------------------------------------------------------------------------- /Lectures/week3/Lecture 3.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/week3/Lecture 3.pdf -------------------------------------------------------------------------------- /Lectures/week4/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/week4/.DS_Store -------------------------------------------------------------------------------- /Lectures/week4/Lecture 4- Data.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/week4/Lecture 4- Data.pdf -------------------------------------------------------------------------------- /Lectures/week5/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/week5/.DS_Store -------------------------------------------------------------------------------- /Lectures/week5/Lecture 5- Spark.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/week5/Lecture 5- Spark.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # CS199: Applied Cloud Computing 2 | 3 | Professor: Dr. Robert J. Brunner 4 | 5 | Course Staff: 6 | 7 | - Benjamin Congdon, [@bcongdon](https://github.com/bcongdon) 8 | 9 | - Quinn Jarrell, [@TheRushingWookie](https://github.com/TheRushingWookie) 10 | 11 | - Tyler Kim, [@tyler-thetyrant](https://github.com/tyler-thetyrant) 12 | 13 | - Sameet Sapra, [@sameetandpotatoes](https://github.com/sameetandpotatoes) 14 | 15 | - Bhuvan Venkatesh, [@bhuvan-venkatesh](https://github.com/bhuvan-venkatesh) 16 | 17 | ## Overview 18 | This course will introduce cloud computing with an emphasis on gaining hands-on experience in implementing big data technologies in a cloud computing environment. Students will be expected to work in small groups to develop and implement specific cloud computing solutions and to create technical reports that document these solutions and technologies. 19 | 20 | This course is intended for underclassmen in CS and ECE. 21 | 22 | ## Prerequisites 23 | Grade A- or higher in CS125. Previous experience in Python programming is required. 24 | 25 | ## Tentative list of Topics 26 | 1) Understand the motivation behind cloud computing. 27 | 28 | 2) Building a Hadoop cluster. 29 | 30 | 3) Building a Spark cluster. 31 | 32 | 4) Text analytics at scale. 33 | 34 | 5) Graph analytics at scale. 35 | 36 | 6) Spark analysis. 37 | 38 | 7) NoSQL data stores (installation and operation). 39 | 40 | 8) Writing technical reports.
41 | 42 | 43 | ## Grading 44 | 45 | | **Grading Item** | **Distribution** | 46 | | --------------------- | -------------- | 47 | | Attendance | 10% | 48 | | Labs | 30% | 49 | | Technical Report | 60% | 50 | 51 | 52 | ## Grading Scale 53 | | Percentage | Letter Grade | 54 | | ---------- | ------------ | 55 | | [98, 100] | A+ | 56 | | [92, 98) | A | 57 | | [90, 92) | A- | 58 | | [88, 90) | B+ | 59 | | [82, 88) | B | 60 | | [80, 82) | B- | 61 | | [78, 80) | C+ | 62 | | [72, 78) | C | 63 | | [70, 72) | C- | 64 | | [68, 70) | D+ | 65 | | [62, 68) | D | 66 | | [60, 62) | D- | 67 | | Below 60 | F | 68 | 69 | 70 | ## Labs and Late Submission 71 | There will be about six to seven labs throughout the course, which together account for 30% of the total grade. 72 | 73 | No late submissions will be allowed unless you have received permission from a member of the course staff ahead of time under special circumstances. 74 | 75 | 76 | ## Common Errors 77 | * If you reboot your VM, you WILL NEED TO REMOUNT THE SHARED FOLDER 78 | 79 | `sudo mount -t vboxsf -o rw,uid=1000,gid=1000 NAMEOFYOURSHAREDFOLDERONHOST NAMEOFFOLDERONVMTOSHARETO` 80 | 81 | * SSH 82 | `ssh user@localhost -p 2222` 83 | 84 | 85 | 86 | ## License 87 | This course is licensed under the University of Illinois/NCSA Open Source License. For a full copy of this license, take a look at the LICENSE file. 88 | --------------------------------------------------------------------------------