├── LICENSE ├── Labs ├── Lab1 │ ├── README.md │ ├── bigram_count.py │ ├── common_friends.py │ ├── friend_graph.txt │ ├── map_reducer.py │ ├── sherlock.txt │ ├── utils.py │ └── word_count.py ├── Lab2 │ ├── README.md │ ├── book.txt │ ├── mapper.py │ └── reducer.py ├── Lab3 │ ├── README.md │ ├── prob1_mapper.py │ ├── prob1_reducer.py │ ├── prob2_mapper.py │ ├── prob2_reducer.py │ ├── prob3_mapper.py │ └── prob3_reducer.py ├── Lab4 │ ├── README.md │ ├── descriptors_of_bad_business.py │ ├── most_expensive_city.py │ ├── pessimistic_users.py │ └── up_all_night.py ├── Lab5 │ ├── README.md │ ├── amazon_helpfulness_regression.py │ ├── amazon_review_classification.py │ ├── bayes_binary_tfidf.py │ ├── bayes_tfidf.py │ └── yelp_clustering.py └── Lab6 │ ├── README.md │ ├── aggregation_aggravation.py │ ├── jaunting_with_joins.py │ └── quizzical_queries.py ├── Lectures ├── Lecture 10.pdf ├── Lecture 11 - Clouds.pdf ├── Lecture 12 - Streaming.pdf ├── Lecture 13- Networking.pdf ├── Lecture 6- More Spark.pdf ├── Lecture 7- MLib.pdf ├── Lecture 8- SQL.pdf ├── Lecture 9- NoSQL.pdf ├── week1 │ └── Lecture 1.pdf ├── week2 │ ├── Lecture 2 - Git, Latex, and Other Intros.pdf │ └── Lecture 2.pdf ├── week3 │ └── Lecture 3.pdf ├── week4 │ ├── .DS_Store │ └── Lecture 4- Data.pdf └── week5 │ ├── .DS_Store │ └── Lecture 5- Spark.pdf └── README.md /LICENSE: -------------------------------------------------------------------------------- 1 | University of Illinois/NCSA Open Source License 2 | 3 | Copyright (c) 2017 LCDM@UIUC 4 | All rights reserved. 5 | 6 | Developed by: LCDM@UIUC - Professor Robert J. Brunner and CS199: ACC Course Staffs 7 | http://lcdm.illinois.edu 8 | 9 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal with the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 10 | 11 | * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimers. 12 | 13 | * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimers in the documentation and/or other materials provided with the distribution. 14 | 15 | * Neither the names of the course development team, LCDM@UIUC, nor the names of its contributors may be used to endorse or promote products derived from this Software without specific prior written permission. 16 | 17 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE CONTRIBUTORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS WITH THE SOFTWARE. 18 | -------------------------------------------------------------------------------- /Labs/Lab1/README.md: -------------------------------------------------------------------------------- 1 | # Lab 1: Introduction to MapReduce 2 | 3 | ## Introduction 4 | 5 | This lab will introduce the map/reduce computing paradigm. 
In essence, map/reduce breaks tasks down into a map phase (where an algorithm is mapped onto data) and a reduce phase, where the outputs of the map phase are aggregated into a concise output. The map phase is designed to be parallel, so as to allow wide distribution of computation. 6 | 7 | The map phase identifies keys and associates with them a value. The reduce phase collects keys and aggregates their values. The standard example used to demonstrate this programming approach is a word count problem, where words (or tokens) are the keys and the number of occurrences of each word (or token) is the value. 8 | 9 | As this technique was popularized by large web search companies like Google and Yahoo who were processing large quantities of unstructured text data, this approach quickly became popular for a wide range of problems. The standard MapReduce approach uses Hadoop, which was built using Java. However, to introduce you to this topic without adding the extra overhead of learning Hadoop's idiosyncrasies, we will be 'simulating' a map/reduce workload in pure Python. 10 | 11 | ## Example: Word Count 12 | 13 | This example displays the type of programs we can build from simple map/reduce functions. Suppose our task is to come up with a count of the occurrences of each word in a large set of text. We could simply iterate through the text and count the words as we saw them, but this would be slow and non-parallelizable. 14 | 15 | Instead, we break the text up into chunks, and then split those chunks into words. This is the ‘map’ phase (i.e. the input text is mapped to a list of words). Then, we can ‘reduce’ this data into a coherent word count that holds for the entire text set. We do this by accumulating the count of each word in each chunk using our reduce function. 16 | 17 | Take a look at `map_reducer.py` and `word_count.py` to see the example we’ve constructed for you. Notice that the `map` stage is being run on a multiprocess pool. This is functionally analogous to a cloud computing application, the difference being in the cloud, this work would be distributed amongst multiple nodes, whereas in our toy MapReduce, all the processes run on a single machine. 18 | 19 | Run `python word_count.py` to see our simple map/reduce example. You can adjust `NUM_WORKERS` in `map_reducer.py` to see how we make (fairly small) performance gains from parallelizing the work. (Hint: running `time python word_count.py` will give you a better idea of the runtime) 20 | 21 | ## Exercise: Bigram Count 22 | 23 | Suppose now that instead of trying to count the individual words, we want to get counts of the occurences word [bigrams](https://en.wikipedia.org/wiki/Bigram) - that is, pairs of words that are adjacent to each other in the text. It is not just all the pairs of the words in the text 24 | 25 | For example, if our line of text was `“cat dog sheep horse”`, we’d have the bigrams `(“cat”, “dog”)`, `(“dog, “sheep”)` and `(“sheep”, “horse”)`. 26 | 27 | Construct a map function and reduce function that will accomplish this goal. 28 | 29 | Note: For the purposes of this exercise, we’ll only consider bigrams that occur on the same line. So, you don’t need to worry about pairs that occur between line breaks. 30 | 31 | ## Exercise: Common Friends 32 | 33 | Suppose we’re running a social network and we want a fast way to calculate a list of common friends for pairs of users in our site. This can be done fairly easily with a map/reduce procedure. 
34 | 35 | You’ll be given input of a friend ‘graph’ that looks like this: 36 | 37 | ``` 38 | A|B 39 | B|A,C,D 40 | C|B,D 41 | D|C,B,E 42 | E|D 43 | ``` 44 | The graph can be visualized as 45 | ``` 46 | A-B - D-E 47 | \ / 48 | C 49 | ``` 50 | Read this as: A is friends with B, B is friends with A, C and D, and so on. Our desired output is as follows: 51 | 52 | ``` 53 | (B,C): [D] 54 | (B,D): [C] 55 | (C,D): [B] 56 | ``` 57 | Read this as: B and C have D in common as a friend, B and D have C in common as a friend, and C and D have B in common as a friend. None of the other relationships have common friends. 58 | 59 | Your mapper stage should take each line of the friend graph and produce a list of relationships: 60 | 61 | `A|B` -> `(A,B): A, B` 62 | 63 | `B|A, C, D` -> `(B,A): A, C, D`, `(B,C): A, C, D`, `(B,D): A, C, D` 64 | 65 | `C|B, D` -> `(C,B): B, D`, `(C, D): B, D` 66 | 67 | *et cetera* 68 | 69 | The reducer phase should take all of these relationships and output common friends for each pair. (Hint: Lookup set intersection) 70 | 71 | ##Submission 72 | Lab 1 is due on Thursday, Febuary 2nd, 2017 at 11:55PM. 73 | 74 | Please zip the files and upload it to Moodle (learn.illinois.edu). 75 | -------------------------------------------------------------------------------- /Labs/Lab1/bigram_count.py: -------------------------------------------------------------------------------- 1 | from map_reducer import MapReduce 2 | from operator import itemgetter 3 | 4 | 5 | def bigram_mapper(line): 6 | pass 7 | 8 | 9 | def bigram_reducer(bigram_tuples): 10 | pass 11 | 12 | if __name__ == '__main__': 13 | with open('sherlock.txt') as f: 14 | lines = f.readlines() 15 | mr = MapReduce(bigram_mapper, bigram_reducer) 16 | bigram_counts = mr(lines) 17 | sorted_bgc = sorted(bigram_counts, key=itemgetter(1), reverse=True) 18 | for word, count in sorted_bgc[:100]: 19 | print '{}\t{}'.format(word, count) 20 | -------------------------------------------------------------------------------- /Labs/Lab1/common_friends.py: -------------------------------------------------------------------------------- 1 | from map_reducer import MapReduce 2 | 3 | 4 | def friend_mapper(line): 5 | pass 6 | 7 | 8 | def friend_reducer(friend_tuples): 9 | pass 10 | 11 | if __name__ == '__main__': 12 | with open('friend_graph.txt') as f: 13 | lines = f.readlines() 14 | mr = MapReduce(friend_mapper, friend_reducer) 15 | common_friends = mr(lines) 16 | for relationship, friends in common_friends: 17 | print '{}\t{}'.format(relationship, friends) 18 | -------------------------------------------------------------------------------- /Labs/Lab1/friend_graph.txt: -------------------------------------------------------------------------------- 1 | A|B,D,E,F 2 | B|A,C,E 3 | C|B,F 4 | D|A,E 5 | E|A,B,D 6 | F|A,C 7 | -------------------------------------------------------------------------------- /Labs/Lab1/map_reducer.py: -------------------------------------------------------------------------------- 1 | import multiprocessing 2 | import itertools 3 | from operator import itemgetter 4 | 5 | NUM_WORKERS = 10 6 | 7 | 8 | class MapReduce(object): 9 | def __init__(self, map_func, reduce_func): 10 | # Function for the map phase 11 | self.map_func = map_func 12 | 13 | # Function for the reduce phase 14 | self.reduce_func = reduce_func 15 | 16 | # Pool of processes to parallelize computation 17 | self.proccess_pool = multiprocessing.Pool(NUM_WORKERS) 18 | 19 | def kv_sort(self, mapped_values): 20 | return sorted(list(mapped_values), key=itemgetter(0)) 
21 | 22 | def __call__(self, data_in): 23 | # Run the map phase in our process pool 24 | map_phase = self.proccess_pool.map(self.map_func, data_in) 25 | 26 | # Sort the resulting mapped data 27 | sorted_map = self.kv_sort(itertools.chain(*map_phase)) 28 | 29 | # Run our reduce function 30 | reduce_phase = self.reduce_func(sorted_map) 31 | 32 | # Return the results 33 | return reduce_phase 34 | -------------------------------------------------------------------------------- /Labs/Lab1/utils.py: -------------------------------------------------------------------------------- 1 | import re 2 | import string 3 | 4 | 5 | def strip_punctuation(str_in): 6 | # Strip punctuation from word (don't worry too much about this) 7 | return re.sub('[%s]' % re.escape(string.punctuation), '', str_in) 8 | -------------------------------------------------------------------------------- /Labs/Lab1/word_count.py: -------------------------------------------------------------------------------- 1 | from map_reducer import MapReduce 2 | from operator import itemgetter 3 | from utils import strip_punctuation 4 | 5 | 6 | def string_to_words(str_in): 7 | words = [] 8 | # Split string into words 9 | for word in str_in.strip().split(): 10 | # Strip punctuation 11 | word = strip_punctuation(word) 12 | 13 | # Note each individual instance of a word 14 | words.append((word, 1)) 15 | return words 16 | 17 | 18 | def word_count_reducer(word_tuples): 19 | # Dict to count the instances of each word 20 | words = {} 21 | 22 | for entry in word_tuples: 23 | word, count = entry 24 | 25 | # Add 1 to our word counts for each word we see 26 | if word in words: 27 | words[word] += 1 28 | else: 29 | words[word] = 1 30 | 31 | return words.items() 32 | 33 | if __name__ == '__main__': 34 | with open('sherlock.txt') as f: 35 | lines = f.readlines() 36 | 37 | # Construct our MapReducer 38 | mr = MapReduce(string_to_words, word_count_reducer) 39 | # Call MapReduce on our input set 40 | word_counts = mr(lines) 41 | sorted_wc = sorted(word_counts, key=itemgetter(1), reverse=True) 42 | for word, count in sorted_wc[:100]: 43 | print '{}\t{}'.format(word, count) 44 | -------------------------------------------------------------------------------- /Labs/Lab2/README.md: -------------------------------------------------------------------------------- 1 | # Lab 2: Introduction to Map/Reduce on Hadoop 2 | 3 | ## Introduction 4 | 5 | In this lab, we introduce the map/reduce programming 6 | paradigm. Simply put, this approach to computing breaks tasks down into 7 | a map phase (where an algorithm is mapped onto data) and a reduce phase, 8 | where the outputs of the map phase are aggregated into a concise output. 9 | The map phase is designed to be parallel, and to move the computation to 10 | the data, which, when using HDFS, can be widely distributed. In this 11 | case, a map phase can be executed against a large quantity of data very 12 | quickly. The map phase identifies keys and associates with them a value. 13 | The reduce phase collects keys and aggregates their values. The standard 14 | example used to demonstrate this programming approach is a word count 15 | problem, where words (or tokens) are the keys) and the number of 16 | occurrences of each word (or token) is the value. 17 | 18 | As this technique was popularized by large web search companies like 19 | Google and Yahoo who were processing large quantities of unstructured 20 | text data, this approach quickly became popular for a wide range of 21 | problems. 
Of course, not every problem can be transformed into a 22 | map-reduce approach, which is why we will explore Spark in several 23 | weeks. The standard MapReduce approach uses Hadoop, which was built 24 | using Java. Rather than switching to a new language, however, we will 25 | use Hadoop Streaming to execute Python code. In the rest of this 26 | lab, we introduce a simple Python WordCount example code. We first 27 | demonstrate this code running at the Unix command line, before switching to running the code by using Hadoop Streaming. 28 | 29 | ### Mapper: Word Count 30 | 31 | The first Python code we will write is the map Python program. This 32 | program simply reads data from `STDIN`, tokenizes each line into words and 33 | outputs each word on a separate line along with a count of one. Thus our 34 | map program generates a list of word tokens as the keys and the value is 35 | always one. 36 | 37 | ```python 38 | #!/usr/bin/python 39 | 40 | # These examples are based off the blog post by Michale Noll: 41 | # 42 | # http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ 43 | # 44 | 45 | import sys 46 | 47 | # We explicitly define the word/count separator token. 48 | sep = '\t' 49 | 50 | # We open STDIN and STDOUT 51 | with sys.stdin as fin: 52 | with sys.stdout as fout: 53 | 54 | # For every line in STDIN 55 | for line in fin: 56 | 57 | # Strip off leading and trailing whitespace 58 | line = line.strip() 59 | 60 | # We split the line into word tokens. Use whitespace to split. 61 | # Note we don't deal with punctuation. 62 | 63 | words = line.split() 64 | 65 | # Now loop through all words in the line and output 66 | 67 | for word in words: 68 | fout.write("{0}{1}1\n".format(word, sep)) 69 | ``` 70 | 71 | ### Reducer: Word Count 72 | 73 | The second Python program we write is our reduce program. In this code, 74 | we read key-value pairs from `STDIN` and use the fact that the Hadoop 75 | process first sorts all key-value pairs before sending the map output to 76 | the reduce process to accumulate the cumulative count of each word. The 77 | following code could easily be made more sophisticated by using `yield` 78 | statements and iterators, but for clarity we use the simple approach of 79 | tracking when the current word becomes different than the previous word 80 | to output the key-cumulative count pairs. 81 | 82 | ```python 83 | #!/usr/bin/python 84 | 85 | import sys 86 | 87 | # We explicitly define the word/count separator token. 88 | sep = '\t' 89 | 90 | # We open STDIN and STDOUT 91 | with sys.stdin as fin: 92 | with sys.stdout as fout: 93 | 94 | # Keep track of current word and count 95 | cword = None 96 | ccount = 0 97 | word = None 98 | 99 | # For every line in STDIN 100 | for line in fin: 101 | 102 | # Strip off leading and trailing whitespace 103 | # Note by construction, we should have no leading white space 104 | line = line.strip() 105 | 106 | # We split the line into a word and count, based on predefined 107 | # separator token. 108 | # 109 | # Note we haven't dealt with punctuation. 
110 | 111 | word, scount = line.split('\t', 1) 112 | 113 | # We will assume count is always an integer value 114 | 115 | count = int(scount) 116 | 117 | # word is either repeated or new 118 | 119 | if cword == word: 120 | ccount += count 121 | else: 122 | # We have to handle first word explicitly 123 | if cword != None: 124 | fout.write("{0:s}{1:s}{2:d}\n".format(cword, sep, ccount)) 125 | 126 | # New word, so reset variables 127 | cword = word 128 | ccount = count 129 | else: 130 | # Output final word count 131 | if cword == word: 132 | fout.write("{0:s}{1:s}{2:d}\n".format(word, sep, ccount)) 133 | ``` 134 | 135 | ### Testing Python Map-Reduce 136 | 137 | Before we begin using Hadoop, we should first test our Python codes out 138 | to ensure they work as expected. First, we should change the permissions 139 | of the two programs to be executable, which we can do with the Unix 140 | `chmod` command. 141 | 142 | ```sh 143 | chmod u+x /path/to/lab2/mapper.py 144 | chmod u+x /path/to/lab2/reducer.py 145 | ``` 146 | 147 | #### Testing Mapper.py 148 | 149 | To test out the map Python code, we can run the Python `mapper.py` code 150 | and specify that the code should redirect STDIN to read the book text 151 | data. This is done in the following code cell, we pipe the output into 152 | the Unix `head` command in order to restrict the output, which would be 153 | one line per word found in the book text file. In the second code cell, 154 | we next pipe the output of `mapper.py` into the Unix `sort` command, 155 | which is done automatically by Hadoop. To see the result of this 156 | operation, we next pipe the result into the Unix `uniq` command to count 157 | duplicates, pipe this result into a new sort routine to sort the output 158 | by the number of occurrences of a word, and finally display the last few 159 | lines with the Unix `tail` command to verify the program is operating 160 | correctly. 161 | 162 | With these sequence of Unix commands, we have (in a single-node) 163 | replicated the steps performed by Hadoop MapReduce: Map, Sort, and 164 | Reduce. 165 | 166 | 167 | 168 | 169 | ```sh 170 | cd /path/to/lab2 171 | 172 | ./mapper.py < book.txt | wc -l 173 | ``` 174 | 175 | ```sh 176 | cd /path/to/lab2 177 | 178 | ./mapper.py < book.txt | sort -n -k 1 | \ 179 | uniq -c | sort -n -k 1 | tail -10 180 | ``` 181 | 182 | #### Testing Reducer.py 183 | 184 | To test out the reduce Python code, we run the previous code cell, but 185 | rather than piping the result into the Unix `tail` command, we pipe the 186 | result of the sort command into the Python `reducer.py` code. This 187 | simulates the Hadoop model, where the map output is key sorted before 188 | being passed into the reduce process. First, we will simply count the 189 | number of lines displayed by the reduce process, which will indicate the 190 | number of unique _word tokens_ in the book. Next, we will sort the 191 | output by the number of times each word token appears and display the 192 | last few lines to compare with the previous results. 193 | 194 | 195 | ```sh 196 | cd /path/to/lab2 197 | 198 | ./mapper.py < book.txt | sort -n -k 1 | \ 199 | ./reducer.py | wc -l 200 | ``` 201 | 202 | ```sh 203 | cd /path/to/lab2 204 | 205 | ./mapper.py < book.txt | sort -n -k 1 | \ 206 | ./reducer.py | sort -n -k 2 | tail -10 207 | ``` 208 | 209 | ## Python Hadoop Streaming 210 | 211 | **IMPORTANT:** Before doing the following activities, run the following command to setup the Hadoop environment correctly. 
If you don't, it's likely that these instructions **will not work**. 212 | 213 | ``` 214 | source ~/hadoop.env 215 | ``` 216 | 217 | ### Introduction 218 | 219 | We are now ready to actually run our Python codes via Hadoop Streaming. 220 | The main command to perform this task is `hadoop`. 221 | 222 | Running this Hadoop command by supplying the `-help` flag will provide 223 | a useful summary of the different options. Note that `jar` is short for 224 | Java Archive, which is a compressed archive of compiled Java code that 225 | can be executed to perform different operations. In this case, we will 226 | run the Java Hadoop streaming jar file to enable our Python code to work 227 | within Hadoop. 228 | 229 | 230 | ```sh 231 | # Run the Map Reduce task within Hadoop 232 | hadoop --help 233 | ``` 234 | 235 | Usage: hadoop [--config confdir] [COMMAND | CLASSNAME] 236 | CLASSNAME run the class named CLASSNAME 237 | or 238 | where COMMAND is one of: 239 | fs run a generic filesystem user client 240 | version print the version 241 | jar run a jar file 242 | note: please use "yarn jar" to launch 243 | YARN applications, not this command. 244 | checknative [-a|-h] check native hadoop and compression libraries availability 245 | distcp copy file or directories recursively 246 | archive -archiveName NAME -p * create a hadoop archive 247 | classpath prints the class path needed to get the 248 | credential interact with credential providers 249 | Hadoop jar and the required libraries 250 | daemonlog get/set the log level for each daemon 251 | trace view and modify Hadoop tracing settings 252 | 253 | Most commands print help when invoked w/o parameters. 254 | 255 | 256 | For our map/reduce Python example to 257 | run successfully, we will need to specify five flags: 258 | 259 | 1. `-files`: a comma separated list of files to be copied to the Hadoop cluster. 260 | 2. `-input`: the HDFS input file(s) to be used for the map task. 261 | 3. `-output`: the HDFS output directory, used for the reduce task. 262 | 4. `-mapper`: the command to run for the map task. 263 | 5. `-reducer`: the command to run for the reduce task. 264 | 265 | Given our previous setup, we will eventually run the full command as follows: 266 | 267 | ``` 268 | # DON'T RUN ME YET! 269 | hadoop $STREAMING -files mapper.py,reducer.py -input wc/in \ 270 | -output wc/out -mapper mapper.py -reducer reducer.py 271 | ``` 272 | When this command is run, a series of messages will be displayed to the 273 | screen (via STDERR) showing the progress of our Hadoop Streaming task. 274 | At the end of the stream of information messages will be a statement 275 | indicating the location of the output directory as shown below. Note, we 276 | can append Bash redirection to ignore the Hadoop messages, simply by 277 | appending `2> /dev/null` to the end of any Hadoop command, which sends 278 | all STDERR messages to a non-existent Unix device, which is akin to 279 | nothing. 280 | 281 | For example, to ignore any messages from the `hdfs dfs -rm -r -f wc/out` 282 | command, we would use the following syntax: 283 | 284 | ```bash 285 | hdfs dfs -rm -r -f wc/out 2> /dev/null 286 | ``` 287 | 288 | Doing this, however, does hide all messages, which can make debugging 289 | problems more difficult. As a result, you should only do this when your 290 | commands work correctly. 291 | 292 | ### Putting files in HDFS 293 | In order for Hadoop to be able to access our raw data (the book text) we first have to copy it into the file system that Hadoop uses natively, HDFS. 
294 | 295 | To do this, we'll run a series of HDFS commands that will copy our local `book.txt` into the distributed file system. 296 | 297 | ``` 298 | # Make a directory for our book data 299 | hdfs dfs -mkdir -p wc/in 300 | 301 | # Copy our book to our new folder 302 | hdfs dfs -copyFromLocal book.txt wc/in/book.txt 303 | 304 | # Check to see that our book has made it to the folder 305 | hdfs dfs -ls wc/in 306 | hdfs dfs -tail wc/in/book.txt 307 | ``` 308 | 309 | ### Running the Hadoop Job 310 | Now that our data is in Hadoop HDFS, we can actually execute the streaming job that will run our word count map/reduce. 311 | 312 | ```sh 313 | # Delete output directory (if it exists) 314 | hdfs dfs -rm -r -f wc/out 315 | 316 | # Run the Map Reduce task within Hadoop 317 | hadoop jar $STREAMING \ 318 | -files mapper.py,reducer.py -input wc/in \ 319 | -output wc/out -mapper mapper.py -reducer reducer.py 320 | ``` 321 | 322 | ### Hadoop Results 323 | 324 | In order to view the results of our Hadoop Streaming task, we must use 325 | HDFS DFS commands to examine the directory and files generated by our 326 | Python Map/Reduce programs. The following list of DFS commands might 327 | prove useful to view the results of this map/reduce job. 328 | 329 | ```bash 330 | # List the wc directory 331 | hdfs dfs -ls wc 332 | 333 | # List the output directory 334 | hdfs dfs -ls wc/out 335 | 336 | # Do a line count on our output 337 | hdfs dfs -count -h wc/out/part-00000 338 | 339 | # Tail the output 340 | hdfs dfs -tail wc/out/part-00000 341 | ``` 342 | 343 | Note that these 344 | Hadoop HDFS commands can be intermixed with Unix commands to perform 345 | additional text processing. The important point is that direct file I/O 346 | operations must use HDFS commands to work with the HDFS file system. 347 | 348 | The output should match the Python 349 | only map-reduce approach. 350 | 351 | ### Hadoop Cleanup 352 | 353 | Following the successful run of our map/reduce Python programs, we have 354 | created a new directory `wc/out` in the HDFS, which contains two files. If we wish 355 | to rerun this Hadoop Streaming map/reduce task, we must either specify a 356 | different output directory, or else we must clean up the results of the 357 | previous run. To remove the output directory, we can simply use the HDFS 358 | `-rm -r -f wc/out` command, which will immediately delete the `wc/out` 359 | directory. The successful completion of this command is indicated by 360 | Hadoop, and this can also be verified by listing the contents of the 361 | `wc` directory. 362 | 363 | ```sh 364 | hdfs dfs -ls wc 365 | ``` 366 | 367 | ## Lab Assignment 368 | 369 | Lab 2 is due on Thursday, Febuary 9nd, 2017 at 11:55PM. 370 | Please zip your source files for the following exersizes and upload it to Moodle (learn.illinois.edu). 371 | 372 | 373 | In the preceding cells, we introduced Hadoop map/reduce by using a 374 | simple word count task. Now that you have run the lab, go back and 375 | make the following changes to see how the results change. 376 | 377 | 1. We ignored punctuation, modify the original mapper Python code to 378 | token on white space or punctuation. 379 | 2. Try downloading a different text from Project Gutenberg. Write a map-reduce application that can run across multiple texts. 380 | 3. 
Write a map-reduce application to compute bi-grams instead of 381 | unigrams (combinations of two adjacent words as they appear in the text) 382 | -------------------------------------------------------------------------------- /Labs/Lab2/mapper.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # These examples are based off the blog post by Michale Noll: 4 | # 5 | # http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ 6 | # 7 | 8 | import sys 9 | 10 | # We explicitly define the word/count separator token. 11 | sep = '\t' 12 | 13 | # We open STDIN and STDOUT 14 | with sys.stdin as fin: 15 | with sys.stdout as fout: 16 | 17 | # For every line in STDIN 18 | for line in fin: 19 | 20 | # Strip off leading and trailing whitespace 21 | line = line.strip() 22 | 23 | # We split the line into word tokens. Use whitespace to split. 24 | # Note we don't deal with punctuation. 25 | 26 | words = line.split() 27 | 28 | # Now loop through all words in the line and output 29 | 30 | for word in words: 31 | fout.write("{0}{1}1\n".format(word, sep)) 32 | -------------------------------------------------------------------------------- /Labs/Lab2/reducer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import sys 4 | 5 | # We explicitly define the word/count separator token. 6 | sep = '\t' 7 | 8 | # We open STDIN and STDOUT 9 | with sys.stdin as fin: 10 | with sys.stdout as fout: 11 | 12 | # Keep track of current word and count 13 | cword = None 14 | ccount = 0 15 | word = None 16 | 17 | # For every line in STDIN 18 | for line in fin: 19 | 20 | # Strip off leading and trailing whitespace 21 | # Note by construction, we should have no leading white space 22 | line = line.strip() 23 | 24 | # We split the line into a word and count, based on predefined 25 | # separator token. 26 | # 27 | # Note we haven't dealt with punctuation. 28 | 29 | word, scount = line.split('\t', 1) 30 | 31 | # We will assume count is always an integer value 32 | 33 | count = int(scount) 34 | 35 | # word is either repeated or new 36 | 37 | if cword == word: 38 | ccount += count 39 | else: 40 | # We have to handle first word explicitly 41 | if cword is not None: 42 | fout.write("{0:s}{1:s}{2:d}\n".format(cword, sep, ccount)) 43 | 44 | # New word, so reset variables 45 | cword = word 46 | ccount = count 47 | else: 48 | # Output final word count 49 | if cword == word: 50 | fout.write("{0:s}{1:s}{2:d}\n".format(word, sep, ccount)) 51 | -------------------------------------------------------------------------------- /Labs/Lab3/README.md: -------------------------------------------------------------------------------- 1 | # Lab 3: Hadoop Map/Reduce on "Real" Data 2 | 3 | ## Introduction 4 | In the past 2 labs, you were introduced to the concept of Map/Reduce and how we can execute Map/Reduce Python scripts on Hadoop with Hadoop Streaming. 5 | 6 | This week, we'll give you access to a modestly large (~60GB) Twitter dataset. You'll be using the skills you learned in the last two weeks to perform some more complex transformations on this dataset. 7 | 8 | Like in the last lab, we'll be using Python to write mappers and reducers. We've included a helpful shell script to make running mapreduce jobs easier. The script is in your `bin` folder, so you can just run it as `mapreduce`, but we've included the source in this lab so you can see what it's doing. 
9 | 10 | Here's the usage for that command: 11 | 12 | ``` 13 | Usage: ./mapreduce map-script reduce-scripe hdfs-input-path hdfs-output-path 14 | 15 | Example: ./mapreduce mapper.py reducer.py /tmp/helloworld.txt /user/quinnjarr 16 | ``` 17 | 18 | ## The Dataset 19 | 20 | The dataset is located in `/shared/snapTwitterData` in HDFS. You'll find these files: 21 | 22 | ``` 23 | /shared/snapTwitterData/tweets2009-06.tsv 24 | /shared/snapTwitterData/tweets2009-07.tsv 25 | /shared/snapTwitterData/tweets2009-08.tsv 26 | /shared/snapTwitterData/tweets2009-09.tsv 27 | /shared/snapTwitterData/tweets2009-10.tsv 28 | /shared/snapTwitterData/tweets2009-11.tsv 29 | /shared/snapTwitterData/tweets2009-12.tsv 30 | ``` 31 | 32 | Each file is a `tsv` (tab-separated value) file. The schema of the file is as follows: 33 | 34 | ``` 35 | POST_DATETIME TWITTER_USER)URL TWEET_TEXT 36 | ``` 37 | 38 | Example: 39 | 40 | ``` 41 | 2009-10-31 23:59:58 http://twitter.com/sometwitteruser Wow, CS199 is really a good course 42 | ``` 43 | 44 | ## Lab Activities 45 | **Lab 3 is due on Thursday, Febuary 16nd, 2017 at 11:55PM.** 46 | 47 | Please zip your source files for the following exercises and upload it to Moodle (learn.illinois.edu). 48 | 49 | **NOTE:** Place your Hadoop output in your HDFS home directory under the folder `~/twitter/`. (i.e. Problem 1 output should map to `~/twitter/` 50 | 51 | **EDIT:** Due to resource constraints on the cluster, please run your map/reduce jobs on only 1 twitter file (`tweets2009-06.tsv`). If you've already run your code on the whole dataset, that is fine. 52 | 53 | 1. Write a map/reduce program to determine the the number of @ replies each user received. 54 | 2. Write a map/reduce program to determine the user with the most Tweets for every given day in the dataset. (If there's a tie, break the tie by sorting alphabetically on users' handles) 55 | 3. Write a map reduce program to determine which Twitter users have the largest vocabulary - that is, users whose number of unique words in their tweets is maximized. 56 | 57 | ### Don't lose your progress! 58 | 59 | Hadoop jobs can take a very long time to complete. If you don't take precautions, you'll lose all your progress if something happens to your SSH session. 60 | 61 | To mitigate this, we have installed `tmux` on the cluster. Tmux is a tool that lets us persist shell sessions even when we lose SSH connection. 62 | 63 | 1. Run `tmux` to enter into a tmux session. 64 | 2. Run some command that will take a long time (`ping google.com`) 65 | 3. Exit out of your SSH session. 66 | 4. Log back into the server and run `tmux attach` and you should find your session undisturbed. 67 | 68 | ### Suggested Workflow 69 | 70 | We've provided you with a `sample` command that streams out a random 1% sample of the text file. This is useful for testing, as you won't want to use the entire dataset while developing your map/reduce scripts. 71 | 72 | 1. Write your map/reduce and test it with regular unix commands: 73 | 74 | ``` 75 | sample /mnt/volume/snapTwitterData/tweets2009-06.tsv | ./.py | sort | ./.py 76 | ``` 77 | 78 | 2. Test your map/reduce with a single Tweet file on Hadoop: 79 | 80 | ``` 81 | hdfs dfs -mkdir -p twitter 82 | hdfs dfs -rm -r twitter/out 83 | mapreduce .py .py /shared/snapTwitterData/tweets2009-06.tsv twitter/out 84 | ``` 85 | 86 | 3. 
Run your map/reduce on the full dataset: 87 | ``` 88 | hdfs dfs -mkdir -p twitter 89 | hdfs dfs -rm -r twitter/out 90 | mapreduce .py .py /shared/snapTwitterData/*.tsv twitter/out 91 | ``` 92 | -------------------------------------------------------------------------------- /Labs/Lab3/prob1_mapper.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import sys 4 | 5 | # We open STDIN and STDOUT 6 | with sys.stdin as fin: 7 | with sys.stdout as fout: 8 | 9 | # For every line in STDIN 10 | for line in fin: 11 | # Your code here! 12 | pass 13 | -------------------------------------------------------------------------------- /Labs/Lab3/prob1_reducer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import sys 4 | 5 | # We open STDIN and STDOUT 6 | with sys.stdin as fin: 7 | with sys.stdout as fout: 8 | 9 | # For every line in STDIN 10 | for line in fin: 11 | # Your code here! 12 | pass 13 | -------------------------------------------------------------------------------- /Labs/Lab3/prob2_mapper.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import sys 4 | 5 | # We open STDIN and STDOUT 6 | with sys.stdin as fin: 7 | with sys.stdout as fout: 8 | 9 | # For every line in STDIN 10 | for line in fin: 11 | # Your code here! 12 | pass 13 | -------------------------------------------------------------------------------- /Labs/Lab3/prob2_reducer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import sys 4 | 5 | # We open STDIN and STDOUT 6 | with sys.stdin as fin: 7 | with sys.stdout as fout: 8 | 9 | # For every line in STDIN 10 | for line in fin: 11 | # Your code here! 12 | pass 13 | . -------------------------------------------------------------------------------- /Labs/Lab3/prob3_mapper.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import sys 4 | 5 | # We open STDIN and STDOUT 6 | with sys.stdin as fin: 7 | with sys.stdout as fout: 8 | 9 | # For every line in STDIN 10 | for line in fin: 11 | # Your code here! 12 | pass 13 | -------------------------------------------------------------------------------- /Labs/Lab3/prob3_reducer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import sys 4 | 5 | # We open STDIN and STDOUT 6 | with sys.stdin as fin: 7 | with sys.stdout as fout: 8 | 9 | # For every line in STDIN 10 | for line in fin: 11 | # Your code here! 12 | pass 13 | -------------------------------------------------------------------------------- /Labs/Lab4/README.md: -------------------------------------------------------------------------------- 1 | # Lab 4: Spark 2 | 3 | ## Introduction 4 | 5 | As we have talked about in lecture, Spark is built on what's called a Resilient Distributed Dataset (RDD). 6 | 7 | PySpark allows us to interface with these RDD’s in Python. Think of it as an API. In fact, it is an API; it even has its own [documentation](http://spark.apache.org/docs/latest/api/python/)! It’s built on top of the Spark’s Java API and exposes the Spark programming model to Python. 8 | 9 | 10 | PySpark makes use of a library called `Py4J`, which enables Python programs to dynamically access Java objects in a Java Virtual Machine. 
11 | 12 | This allows data to be processed in Python and cached in the JVM. 13 | 14 | 15 | ## Running your Jobs 16 | 17 | We'll be using `spark-submit` to run our spark jobs on the cluster. `spark-submit` has a couple command line options that you can tweak. 18 | 19 | #### `--master` 20 | This option tells `spark-submit` where to run your job, as spark can run in several modes. 21 | 22 | * `local` 23 | * The spark job runs locally, without using any compute resources from the cluster. 24 | * `yarn-client` 25 | * The spark job runs on our YARN cluster, but the driver is local to the machine, so it 'appears' that you're running the job locally, but you still get the compute resources from the cluster. You'll see the logs spark provides as the program executes. 26 | * When the cluster is busy, you *will not* be able to use this mode, because it imposes too much of a memory footprint. 27 | * `yarn-cluster` 28 | * The spark job runs on our YARN cluster, and the spark driver is in some arbitrary location on the cluster. This option doesn't give you logs directly, so you'll have to get the logs manually. 29 | * In the output of `spark-submit --master yarn-cluster` you'll find an `applicationId`. (This is similar to when you ran jobs on Hadoop). You can issue this command to get the logs for your job: 30 | 31 | ``` 32 | yarn logs -applicationId | less 33 | ``` 34 | * When debugging Python applications, it's useful to `grep` for `Traceback` in your logs, as this will likely be the actual debug information you're looking for. 35 | 36 | ``` 37 | yarn logs -applicationId | grep -A 50 Traceback 38 | ``` 39 | 40 | * *NOTE*: In cluster mode, normal IO operations like opening files will behave unexpectedly! This is because you're not guaranteed which node the driver will run on. You must use the PySpark API for saving files to get reliable results. You also have to coalesce your RDD into one partition before asking PySpark to write to a file (why do you think this is?). Additionally, you should save your results to HDFS. 41 | 42 | ```python 43 | .coalesce(1).saveAsTextFile('hdfs:///user/MY_USERNAME/foo') 44 | ``` 45 | 46 | #### `--num-executors` 47 | This option lets you set the number of executors that your job will have. A good rule of thumb is to have as many executors as the maximum number of partitions an RDD will have during a Spark job (this heuristic holds better for simple jobs, but falls apart as the complexity of your job increases). 48 | 49 | The number of executors is a tradeoff. Too few, and you might not be taking full advantage of Sparks parallelism. However, there is also an upper bound on the number of executors (for obvious reasons), as they have a fairly large memory footprint. (Don't set this too high or we'll terminate your job.) 50 | 51 | You can tweak executors more granularly by setting the amount of memory and number of cores they're allocated, but for our purposes the default values are sufficient. 52 | 53 | ### Putting it all together 54 | 55 | Submitting a spark job will ususally look something like this: 56 | 57 | ``` 58 | spark-submit --master yarn-cluster --num-executors 10 MY_PYTHON_FILE.py 59 | ``` 60 | 61 | Be sure to include the `--master` flag, or else your code will only run locally, and you won't get the benefits of the cluster's parallelism. 62 | 63 | You can track the progress of your application by looking running this command 64 | `ssh -L 127.0.0.1:9002:192-168-100-234.local:18080 username@141.142.210.245` 65 | 66 | where username is your username. 
Then look at http://127.0.0.1:9002 . If you scroll to the bottom and click Show incomplete applications, you can see the current progress of your script 67 | 68 | ### Interactive Shell 69 | 70 | While `spark-submit` is the way we'll be endorsing to run PySpark jobs, there is an option to run jobs in an interactive shell. Use the `pyspark` command to load into the PySpark interactive shell. You can use many of the same options listed above to tweak `pyspark` settings, such as `--num-executors` and `--master`. 71 | 72 | Note: If you start up the normal `python` interpreter, you probably won't be able to use any of the PySpark features. 73 | 74 | ### Helpful Hints 75 | 76 | * You'll find the [PySpark documentation](https://spark.apache.org/docs/2.0.0/api/python/pyspark.html#pyspark.RDD) (especially the section on RDDs) very useful. 77 | * Run your Spark jobs on a subset of the data when you're debugging. Even though Spark is very fast, jobs can still take a long time - especially when you're working with the review dataset. When you are experimenting, always use a subset of the data. The best way to use a subset of data is through the [take](https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/rdd/RDD.html#take(int)) command. 78 | 79 | Specifically the most common pattern to sample data looks like 80 | `rdd = sc.parallelize(rdd.take(100))` 81 | This converts an rdd into a list of 100 items and then back into an rdd through the parallelize function. 82 | 83 | * [Programming Guide](http://spark.apache.org/docs/latest/programming-guide.html) -- This documentation by itself could be used to solve the entire lab. It is a great quickstart guide about Spark. 84 | 85 | 86 | 87 | 88 | ## The Dataset 89 | 90 | This week, we'll be working off of a set of released Yelp data. 91 | 92 | The dataset is located in `/shared/yelp` in HDFS. We'll be using the following files for this lab: 93 | 94 | ``` 95 | /shared/yelp/yelp_academic_dataset_business.json 96 | /shared/yelp/yelp_academic_dataset_checkin.json 97 | /shared/yelp/yelp_academic_dataset_review.json 98 | /shared/yelp/yelp_academic_dataset_user.json 99 | ``` 100 | 101 | We'll give more details about the data in these files as we continue with the lab, but the general schema is this: each line in each of these JSON files is an independently parsable JSON object that represents a distinct entity, whether it be a business, a review, or a user. 102 | 103 | *Hint:* JSON is parsed with `json.loads` 104 | 105 | ## Lab Activities 106 | **Lab 4 is due on Thursday, March 2nd, 2017 at 11:55PM.** 107 | 108 | Please zip your source files **and your output text files** for the following exercises and upload it to Moodle (learn.illinois.edu). 109 | 110 | ### 1. Least Expensive Cities 111 | 112 | In planning your next road trip, you want to find the cities that, overall, will be the least expensive to dine at. 113 | 114 | It turns out that Yelp keeps track of a handy metric for this, and many restaurants have the attribute `RestaurantsPriceRange2` that gives the business a score from 1-4 as far as 'priciness'. 115 | 116 | Write a PySpark application that sorts cities by the average price of their businesses/restaurants. 117 | 118 | Notes: 119 | 120 | * Discard any business that does not have the `RestaurantsPriceRange2` attribute 121 | * Discard any business that does not have a valid city and state 122 | * Your output should be sorted descending by average price (highest at top, lowest at bottom). 
Your average restaurant price should be rounded to 2 decimal places. Each city should get a row in the output and look like: 123 | 124 | `CITY, STATE: PRICE` 125 | 126 | ### 2. Up All Night 127 | 128 | You also expect on this road trip that you'll be out pretty late. Yelp also lists the hours that businesses are open, so lets find out where you'll be likely to find something to eat late at night. 129 | 130 | Write a PySpark application that sorts cities by the median closing time of their businesses/restaurants to find the cities that are open latest. 131 | 132 | Notes: 133 | 134 | * Discard any business that doesn't have a valid `hours` property, or an `hours` property that does not include the closing time of the business 135 | * Discard any invalid times (some business have `DAY 0:0-0:0` as their hours, which we consider to be invalid), and for simplicities sake, assume that all businesses close before midnight. 136 | * Use the **median** closing time of businesses in each city as the "city closing time". If you have to tie break (i.e. `num_business_hours % 2 == 1`), choose the lower of the two so you can avoid doing datetime math 137 | * Your output should be in the following format, with median closing time in `HH:MM` (24 hour clock) format and should be sorted descending by time (latest cities first). 138 | 139 | `CITY, STATE: HH:MM` 140 | 141 | ### 3. Pessimistic Yelp Reviewers 142 | 143 | For this activity, we'll be looking at [Yelp reviews](https://www.youtube.com/watch?v=QEdXhH97Z7E). 😱 Namely, we want to find out which Yelp reviewers are... more harsh than they should be. 144 | 145 | To do this we will calculate the average review score of each business in our dataset, and find the users that most often under-rate businesses. 146 | 147 | Use the following to calculate which users are pessimistic: 148 | 149 | * The `average_business_rating` of a business is the sum of the ratings of the business divided by the count of the ratings for that business. 150 | * A user's pessimism score is the sum of the differences between their rating and the average business rating *if and only if* their rating is lower than the average divided by the number of times their rating was less than the average. 151 | 152 | Your output should contain the top 100 pessimistic users in the following format, where `pessimism_score` is rounded to 2 decimal places, and users are sorted in descending order by `pessimism_score`: 153 | 154 | ``` 155 | user_id: pessimism_score 156 | ``` 157 | 158 | Notes: 159 | 160 | * Business have "average rating" as a property. We **will not** be using this. Instead - to have greater precision - we will be manually calculating a business' average reviews by averaging all the review scores given in `yelp_academic_dataset_review.json`. 161 | * Discard any reviews that do not have a rating, a `user_id`, and a `business_id`. 162 | 163 | ### 4. Descriptors of a Bad Business 164 | 165 | Suppose we want to predict a review's score from its text. There are many ways we could do this, but a simple way would be to find words that are indicative of either a positive or negative review. 166 | 167 | In this activity, we want to find the words that are the most 'charged'. We can think about the probability that a word shows up in a review as depending on the type of a review. For example, it is more likely that "delicious" would show up in a positive review than a negative one. 
168 | 169 | Calculate the probability of each word appearing to be the number of occurrences of the word in the category tested (positive/negative) divided by the number of reviews in that category. 170 | 171 | Output the **top 250** words that are most likely to be in negative reviews, but not in positive reviews (maximize `P(negative) - P(positive)`). 172 | 173 | Notes: 174 | 175 | * Remove any words listed in `nltk`'s list of [English stopwords](http://www.nltk.org/book/ch02.html#wordlist-corpora) and remove all punctuation. We also encourage you to use `nltk.tokenize.word_tokenize` to split reviews into words. 176 | * Consider a review to be positive if it has >=3 stars, and consider a review negative if it has <3 stars. 177 | * Your output should be as follows, where `probability_diff` is `P(negative) - P(positive)` rounded to **5** decimal places and sorted in descending order: 178 | 179 | `word: probability_diff` 180 | -------------------------------------------------------------------------------- /Labs/Lab4/descriptors_of_bad_business.py: -------------------------------------------------------------------------------- 1 | from pyspark import SparkContext, SparkConf 2 | conf = SparkConf().setAppName("Descriptors of a Bad Business") 3 | sc = SparkContext(conf=conf) 4 | 5 | reviews = sc.textFile("hdfs:///shared/yelp/yelp_academic_dataset_review.json") 6 | 7 | with open('descriptors_of_bad_business.txt', 'w+') as f: 8 | pass 9 | -------------------------------------------------------------------------------- /Labs/Lab4/most_expensive_city.py: -------------------------------------------------------------------------------- 1 | from pyspark import SparkContext, SparkConf 2 | conf = SparkConf().setAppName("Most Expensive City") 3 | sc = SparkContext(conf=conf) 4 | 5 | businesses = sc.textFile("hdfs:///shared/yelp/yelp_academic_dataset_business.json") 6 | 7 | with open('most_expensive_city.txt', 'w+') as f: 8 | f.write('Champaign, IL: 1.23') 9 | -------------------------------------------------------------------------------- /Labs/Lab4/pessimistic_users.py: -------------------------------------------------------------------------------- 1 | from pyspark import SparkContext, SparkConf 2 | conf = SparkConf().setAppName("Pessimistic Users") 3 | sc = SparkContext(conf=conf) 4 | 5 | reviews = sc.textFile("hdfs:///shared/yelp/yelp_academic_dataset_review.json") 6 | 7 | with open('pessimistic_users.txt', 'w+') as f: 8 | f.write('taeyoung_kim: 1.23') 9 | -------------------------------------------------------------------------------- /Labs/Lab4/up_all_night.py: -------------------------------------------------------------------------------- 1 | from pyspark import SparkContext, SparkConf 2 | conf = SparkConf().setAppName("Up All Night") 3 | sc = SparkContext(conf=conf) 4 | 5 | businesses = sc.textFile("hdfs:///shared/yelp/yelp_academic_dataset_business.json") 6 | 7 | with open('up_all_night.txt', 'w+') as f: 8 | f.write('Champaign, IL: 11:59') 9 | -------------------------------------------------------------------------------- /Labs/Lab5/README.md: -------------------------------------------------------------------------------- 1 | # Lab 5: Spark MLlib 2 | 3 | ## Introduction 4 | 5 | This week we'll be diving into another aspect of PySpark: MLlib. Spark MLlib provides an API to run machine learning algorithms on RDDs so that we can do ML on the cluster with the benefit of parallelism / distributed computing. 
6 | 7 | ## Machine Learning Crash Course 8 |
9 | We'll be considering 3 types of machine learning in the lab this week: 10 |
11 | * Classification 12 | * Regression 13 | * Clustering 14 |
15 | However, for most of the algorithms / feature extractors that MLlib provides, there is a common pattern: 16 |
17 | 1) Fit - Trains the model, using training data to adjust the model's internal parameters. 18 |
19 | 2) Transform - Uses the fitted model to predict the label/value of novel data (data not used in the training of the model). 20 |
21 | If you go on to do more data science work, you'll see that this 2-phase ML pattern is common in other ML libraries, like `scikit-learn`. 22 |
23 | Things are a bit more complicated in PySpark because of the way RDDs are handled (i.e. lazy evaluation): we often have to explicitly note when we want to predict data, and other instances when we're piping data through different steps of our model's setup. 24 |
25 | It'll be extremely valuable to look up PySpark's documentation and examples when working on this week's lab. The lab examples we'll be giving you do not require deep knowledge of machine learning concepts to complete. However, you will need to be good documentation readers to navigate MLlib's nuances. 26 |
27 | ## Examples 28 |
29 | ### TF-IDF Naive Bayes Yelp Review Classification 30 |
31 | #### Extracting Features 32 |
33 | Remember last week when you found out which words were correlated with negative reviews by calculating the probability of a word occurring in a review? PySpark lets you do something similar extremely easily by calculating the Term Frequency - Inverse Document Frequency ([tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)) characteristics of a set of texts. 34 |
35 | Let's define some terms: 36 |
37 | * Term frequency - The number of times a word appears in a text 38 | * Inverse document frequency - Weights a word's importance in a text by seeing if that word is rare in the collection of all texts. So, if we have a sentence that contains the only reference to "cats" in an entire book, that sentence will have "cats" ranked as highly relevant. 39 |
40 | TF-IDF combines the previous two concepts. Suppose we had a few sentences that refer to "cats" in a large book. We'd then rank those rare sentences by the frequency of "cats" in each of those sentences. 41 |
42 | There's a fair amount of math behind calculating TF-IDF, but for this lab it is sufficient to know that it is a relatively reliable way of guessing the relevance of a word in the context of a large body of data. 43 |
44 | You'll also note that we're making use of a `HashingTF`. This is just a really quick way to compute the term frequency of words. It uses a hash function to represent a long string with a shorter hash, and can use a data structure like a hashmap to quickly count the frequency with which words appear. 45 |
46 | #### Classifying Features 47 |
48 | We'll also be using a [Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) classifier. This type of classifier looks at a set of data and labels, and constructs a model to predict the label given the data using probabilistic means. 49 |
50 | Again, it's not necessary to know the inner workings of Naive Bayes, just that we'll be using it to classify data. 51 |
52 | #### Constructing a Model 53 |
54 | To construct a model, we'll need to build an RDD that has (key, value) pairs, with keys as our labels and values as our features. First, however, we'll need to extract those features from the text.
We're going to use TF-IDF as our feature, so we'll calculate that for all of our text first. 55 | 56 | We'll start with the assumption that you've transformed the data so that we have `(label, array_of_words)` as the RDD. To start with, we'll have label be `0` if the review is negative and `1` if the review is positive. You practiced how to do this last week. 57 | 58 | Here's how we'll extract the TF-IDF features: 59 | 60 | ```python 61 | # Feed HashingTF just the array of words 62 | tf = HashingTF().transform(labeled_data.map(lambda x: x[1])) 63 | 64 | # Pipe term frequencies into the IDF 65 | idf = IDF(minDocFreq=5).fit(tf) 66 | 67 | # Transform the IDF into a TF-IDF 68 | tfidf = idf.transform(tf) 69 | 70 | # Reassemble the data into (label, feature) K,V pairs 71 | zipped_data = (labels.zip(tfidf) 72 | .map(lambda x: LabeledPoint(x[0], x[1])) 73 | .cache()) 74 | ``` 75 | 76 | Now that we have our labels and our features in one RDD, we can train our model: 77 | 78 | ``` 79 | # Do a random split so we can test our model on non-trained data 80 | training, test = zipped_data.randomSplit([0.7, 0.3]) 81 | 82 | # Train our model with the training data 83 | model = NaiveBayes.train(training) 84 | ``` 85 | 86 | Then, we can use this model to predict new data: 87 | ```python 88 | # Use the test data and get predicted labels from our model 89 | test_preds = (test.map(lambda x: x.label) 90 | .zip(model.predict(test.map(lambda x: x.features)))) 91 | ``` 92 | 93 | If we look at this `test_preds` RDD, we'll see our text, and the label the model predicted. 94 | 95 | However, if we want a more precise measurement of how our model faired, PySpark gives us `MulticlassMetrics`, which we can use to measure our model's performance. 96 | 97 | ```python 98 | trained_metrics = MulticlassMetrics(train_preds.map(lambda x: (x[0], float(x[1])))) 99 | test_metrics = MulticlassMetrics(test_preds.map(lambda x: (x[0], float(x[1])))) 100 | 101 | print trained_metrics.confusionMatrix().toArray() 102 | print trained_metrics.precision() 103 | 104 | print test_metrics.confusionMatrix().toArray() 105 | print test_metrics.precision() 106 | ``` 107 | 108 | #### Analyzing our Results 109 | `MulticlassMetrics` let's us see the ["confusion matrix"](https://en.wikipedia.org/wiki/Confusion_matrix) of our model, which shows us how many times our model chose each label given the actual label of the data point. 110 | 111 | The meaning of the columns is the _predicted_ value, and the meaning of the rows is the _actual_ value. So, we read that `confusion_matrix[0][1]` is the number of items predicted as having `label[1]` that were in actuality `label[0]`. 112 | 113 | Thus, we want our confusion matrix to have as many items on the diagonals as possible, as these represent items that were correctly predicted. 114 | 115 | We can also get precision, which is a more simple metric of "how many items we predicted correctly". 116 | 117 | Here's our results for this example: 118 | 119 | ``` 120 | # Training Data Confusion Matrix: 121 | [[ 2019245. 115503.] 122 | [ 258646. 513539.]] 123 | # Training Data Accuracy: 124 | 0.8712908071840665 125 | 126 | # Testing Data Confusion Matrix: 127 | [[ 861056. 55386.] 128 | [ 115276. 214499.]] 129 | #Testing Data Accuracy: 130 | 0.8630559525347512 131 | ``` 132 | 133 | Not terrible. As you see, our training data get's slightly better prediction precision, because it's the data used to train the model. 
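To make the link between the confusion matrix and the precision figure explicit: the overall precision reported by `MulticlassMetrics` is simply the fraction of predictions that land on the diagonal of the matrix. Here is a minimal sketch (plain NumPy, outside of Spark) that reproduces the training figure from the matrix printed above:

```python
import numpy as np

# Training confusion matrix from above: rows are actual labels,
# columns are predicted labels.
cm = np.array([[2019245., 115503.],
               [258646., 513539.]])

# Overall precision is the number of correct predictions (the diagonal)
# divided by the total number of predictions.
precision = np.trace(cm) / cm.sum()
print precision  # ~0.8713, matching trained_metrics.precision()
```

The same arithmetic applied to the test matrix gives the ~0.863 testing figure.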
134 | 135 | #### Extending the Example 136 | 137 | What if instead of just classifying on positive and negative, we try to classify reviews based on their 1-5 star review score? 138 | 139 | ``` 140 | # Training Data Confusion Matrix: 141 | [[ 130042. 38058. 55682. 115421. 193909.] 142 | [ 27028. 71530. 26431. 55381. 95007.] 143 | [ 35787. 22641. 102753. 71802. 122539.] 144 | [ 72529. 45895. 69174. 254838. 246081.] 145 | [ 113008. 73249. 108349. 225783. 535850.]] 146 | # Training Data Accuracy: 147 | 0.37645263439801124 148 | 149 | # Testing Data Confusion Matrix: 150 | [[ 33706. 20317. 27553. 54344. 90325.] 151 | [ 15384. 10373. 14875. 28413. 46173.] 152 | [ 18958. 13288. 19389. 37813. 59746.] 153 | [ 36921. 25382. 37791. 76008. 120251.] 154 | [ 57014. 37817. 55372. 112851. 194319.]] 155 | # Testing Data Accuracy: 156 | 0.268241369417615 157 | ``` 158 | 159 | Ouch. What went wrong? Well, a couple of things. One thing that hurts us is that Naive Bayes is, well, naive. While we intuitively know that the labels 1 through 5 are ordered and carry relative meaning, NB has no concept that items labeled 4 and 5 are probably going to be closer than a pair labeled 1 and 5. 160 | 161 | Also, in this example we see a case where testing on training data doesn't have much utility. While an accuracy of `0.376` isn't great, it's still a lot better than `0.268`. Validating on the training data would lead us to think that our model is substantially more accurate than it actually is. 162 | 163 | #### Conclusion 164 | 165 | The full code of the first example is in `bayes_binary_tfidf.py`, and the second "extended" example is in `bayes_tfidf.py`. 166 | 167 | ## Lab Activities 168 | **Lab 5 is due on Thursday, March 9th, 2017 at 11:55PM.** 169 | 170 | Please zip your source files **and your output text files** for the following exercises and upload the zip to Moodle (learn.illinois.edu). 171 | 172 | **NOTE:** 173 | 174 | * For each problem you may only use, at most, 80% of the dataset to train on. The other 20% should be used for testing your model. (i.e. use `rdd.randomSplit([0.8, 0.2])`) 175 | * Our cluster has PySpark version 1.5.2. This is a slightly older version, so we don't have a couple of the cutting-edge ML tools. Use [this](https://spark.apache.org/docs/1.5.0/api/python/pyspark.html#) documentation in your research. 176 | 177 | #### Precision Competition 178 | 179 | Lab Problems 1 and 2 have an aspect of competition this week: we'll be awarding 10% extra credit on this lab to the top 3 students with the highest average precision across the two problems. Make sure that your Spark jobs output the precision of your models as given by the appropriate metrics class, and that your results are reproducible, to be eligible for credit. 180 | 181 | ### 1. Amazon Review Score Classification 182 | This week, we'll be using an Amazon dataset of food reviews. You can find this dataset in HDFS at `/shared/amazon_food_reviews.csv`. The dataset has the following columns: 183 | 184 | ``` 185 | Id, ProductId, UserId, ProfileName, HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary, Text 186 | ``` 187 | Similar to the Yelp Dataset, Amazon's food review dataset provides you with some review text and a review score. Use MLlib to classify these reviews by score. You can use any classifiers and feature extractors that are available. You may also choose to classify either on positive/negative or the more granular star rating.
You'll only be eligible for the precision contest if you classify on stars, not just positive/negative. 188 | 189 | Notes: 190 | 191 | * You can use any fields other than `HelpfulnessNumerator` or `HelpfulnessDenominator` for feature extraction. 192 | * Use `MulticlassMetrics` to output the `confusionMatrix` and `precision` of your model. You want to maximize the precision. Include this output in your submission. 193 | 194 | ### 2. Amazon Review Helpfulness Regression 195 | 196 | Amazon also gives a metric of "helpfulness". The dataset has the number of users who marked a review as helpful, and the number of users who voted either up or down on the review. 197 | 198 | Define a review's helpfulness score as `HelpfulnessNumerator / HelpfulnessDenominator`. 199 | 200 | Construct and train a model that uses a regression algorithm to predict a review's helpfulness score from its text. 201 | 202 | Notes: 203 | 204 | * You can use any fields other than `Score` for feature extraction. 205 | * We suggest that, as a starting point, you use `pyspark.mllib.regression.LinearRegressionWithSGD` as your regression model. 206 | * Use `pyspark.mllib.evaluation.RegressionMetrics` to output the `explainedVariance` and `rootMeanSquaredError`. You want to minimize the error. 207 | 208 | ### 3. Yelp Business Clustering 209 | 210 | Going back to the Yelp dataset, suppose we want to find clusters of businesses in the Urbana/Champaign area. Where do businesses aggregate geographically? Could we predict from a set of coordinates which cluster a given business belongs to? Use K-Means to come up with a clustering model for the U-C area. 211 | 212 | How can we determine how good our model is? The simplest way is to just graph it and see if the clusters match what we would expect. More formally, we can use the Within Set Sum of Squared Error ([WSSSE](https://spark.apache.org/docs/1.5.0/mllib-clustering.html#k-means)) to determine the optimal number of clusters. If we plot the error for multiple values of k, we can see the point of diminishing returns from adding more clusters. You should pick a value of k that is around this point of diminishing returns; a minimal sketch of this approach is given after the notes below. 213 | 214 | Notes: 215 | 216 | * Use `pyspark.mllib.clustering.KMeans` as your clustering algorithm. 217 | * Your task is to: 218 | 1. Extract the businesses that are in the U-C area and use their coordinates as features for your KMeans clustering model. 219 | 2. Select a proper K such that you get a good approximation of the "actual" clusters of businesses. (This may require trial and error.) 220 | 3. Plot the businesses with `matplotlib.pyplot.scatter` and have each point on the scatter plot be color-keyed by its cluster. 221 | 4. Include both the plot as a PNG and a short justification for your k value (either in comments in your code or in a separate `.txt`) in your submission.
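To make the "point of diminishing returns" idea concrete, here is a minimal sketch of computing WSSSE for several values of k. It assumes you've already built `coords`, an RDD of `[longitude, latitude]` pairs for the U-C businesses (building that RDD is part of the exercise), and the range of k values is purely illustrative, so treat this as a starting point rather than a full solution:

```python
from pyspark.mllib.clustering import KMeans

def wssse(model, data):
    # Within Set Sum of Squared Error: for each point, the squared distance
    # to the center of its assigned cluster, summed over all points.
    def squared_error(point):
        center = model.clusterCenters[model.predict(point)]
        return sum((x - c) ** 2 for x, c in zip(point, center))
    return data.map(squared_error).reduce(lambda a, b: a + b)

# `coords` is assumed to be an RDD of [longitude, latitude] pairs
for k in range(2, 11):
    model = KMeans.train(coords, k, maxIterations=20)
    print k, wssse(model, coords)  # look for the k where the error stops dropping quickly
```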
222 | -------------------------------------------------------------------------------- /Labs/Lab5/amazon_helpfulness_regression.py: -------------------------------------------------------------------------------- 1 | from pyspark import SparkContext, SparkConf 2 | conf = SparkConf().setAppName("Amazon Helpfulness Regression") 3 | sc = SparkContext(conf=conf) 4 | 5 | reviews = sc.textFile("hdfs:///shared/amazon_food_reviews.csv") 6 | 7 | with open('amazon_helpfulness_regression.txt', 'w+') as f: 8 | pass 9 | -------------------------------------------------------------------------------- /Labs/Lab5/amazon_review_classification.py: -------------------------------------------------------------------------------- 1 | from pyspark import SparkContext, SparkConf 2 | conf = SparkConf().setAppName("Amazon Review Classification") 3 | sc = SparkContext(conf=conf) 4 | 5 | reviews = sc.textFile("hdfs:///shared/amazon_food_reviews.csv") 6 | 7 | with open('amazon_review_classification.txt', 'w+') as f: 8 | pass 9 | -------------------------------------------------------------------------------- /Labs/Lab5/bayes_binary_tfidf.py: -------------------------------------------------------------------------------- 1 | from pyspark.mllib.feature import HashingTF, IDF 2 | from pyspark.mllib.regression import LabeledPoint 3 | from pyspark.mllib.classification import NaiveBayes 4 | from pyspark.mllib.evaluation import MulticlassMetrics 5 | import json 6 | import nltk 7 | from pyspark import SparkContext, SparkConf 8 | conf = SparkConf().setAppName("Bayes Binary TFIDF") 9 | sc = SparkContext(conf=conf) 10 | 11 | 12 | def get_labeled_review(x): 13 | return x.get('stars'), x.get('text') 14 | 15 | 16 | def categorize_review(x): 17 | return (0 if x[0] > 2.5 else 1), x[1] 18 | 19 | 20 | def format_prediction(x): 21 | return "actual: {0}, predicted: {1}".format(x[0], float(x[1])) 22 | 23 | 24 | def produce_tfidf(x): 25 | tf = HashingTF().transform(x) 26 | idf = IDF(minDocFreq=5).fit(tf) 27 | tfidf = idf.transform(tf) 28 | return tfidf 29 | 30 | # Load in reviews 31 | reviews = sc.textFile("hdfs:///shared/yelp/yelp_academic_dataset_review.json") 32 | # Parse to json 33 | json_payloads = reviews.map(json.loads) 34 | # Tokenize and weed out bad data 35 | labeled_data = (json_payloads.map(get_labeled_review) 36 | .filter(lambda x: x[0] and x[1]) 37 | .map(lambda x: (float(x[0]), x[1])) 38 | .map(categorize_review) 39 | .mapValues(nltk.word_tokenize)) 40 | labels = labeled_data.map(lambda x: x[0]) 41 | 42 | tf = HashingTF().transform(labeled_data.map(lambda x: x[1])) 43 | idf = IDF(minDocFreq=5).fit(tf) 44 | tfidf = idf.transform(tf) 45 | zipped_data = (labels.zip(tfidf) 46 | .map(lambda x: LabeledPoint(x[0], x[1])) 47 | .cache()) 48 | 49 | # Do a random split so we can test our model on non-trained data 50 | training, test = zipped_data.randomSplit([0.7, 0.3]) 51 | 52 | # Train our model 53 | model = NaiveBayes.train(training) 54 | 55 | # Use our model to predict 56 | train_preds = (training.map(lambda x: x.label) 57 | .zip(model.predict(training.map(lambda x: x.features)))) 58 | test_preds = (test.map(lambda x: x.label) 59 | .zip(model.predict(test.map(lambda x: x.features)))) 60 | 61 | # Ask PySpark for some metrics on how our model predictions performed 62 | trained_metrics = MulticlassMetrics(train_preds.map(lambda x: (x[0], float(x[1])))) 63 | test_metrics = MulticlassMetrics(test_preds.map(lambda x: (x[0], float(x[1])))) 64 | 65 | with open('output_binary.txt', 'w+') as f: 66 | 
f.write(str(trained_metrics.confusionMatrix().toArray()) + '\n') 67 | f.write(str(trained_metrics.precision()) + '\n') 68 | f.write(str(test_metrics.confusionMatrix().toArray()) + '\n') 69 | f.write(str(test_metrics.precision()) + '\n') 70 | -------------------------------------------------------------------------------- /Labs/Lab5/bayes_tfidf.py: -------------------------------------------------------------------------------- 1 | from pyspark.mllib.feature import HashingTF, IDF 2 | from pyspark.mllib.regression import LabeledPoint 3 | from pyspark.mllib.classification import NaiveBayes 4 | from pyspark.mllib.evaluation import MulticlassMetrics 5 | import json 6 | import nltk 7 | from pyspark import SparkContext, SparkConf 8 | conf = SparkConf().setAppName("Bayes TFIDF") 9 | sc = SparkContext(conf=conf) 10 | 11 | 12 | def get_labeled_review(x): 13 | return x.get('stars'), x.get('text') 14 | 15 | 16 | def format_prediction(x): 17 | return "actual: {0}, predicted: {1}".format(x[0], float(x[1])) 18 | 19 | 20 | def produce_tfidf(x): 21 | tf = HashingTF().transform(x) 22 | idf = IDF(minDocFreq=5).fit(tf) 23 | tfidf = idf.transform(tf) 24 | return tfidf 25 | 26 | # Load in reviews 27 | reviews = sc.textFile("hdfs:///shared/yelp/yelp_academic_dataset_review.json") 28 | # Parse to json 29 | json_payloads = reviews.map(json.loads) 30 | # Tokenize and weed out bad data 31 | labeled_data = (json_payloads.map(get_labeled_review) 32 | .filter(lambda x: x[0] and x[1]) 33 | .map(lambda x: (float(x[0]), x[1])) 34 | .mapValues(nltk.word_tokenize)) 35 | labels = labeled_data.map(lambda x: x[0]) 36 | 37 | tfidf = produce_tfidf(labeled_data.map(lambda x: x[1])) 38 | zipped_data = (labels.zip(tfidf) 39 | .map(lambda x: LabeledPoint(x[0], x[1])) 40 | .cache()) 41 | 42 | # Do a random split so we can test our model on non-trained data 43 | training, test = zipped_data.randomSplit([0.7, 0.3]) 44 | 45 | # Train our model 46 | model = NaiveBayes.train(training) 47 | 48 | # Use our model to predict 49 | train_preds = (training.map(lambda x: x.label) 50 | .zip(model.predict(training.map(lambda x: x.features)))) 51 | test_preds = (test.map(lambda x: x.label) 52 | .zip(model.predict(test.map(lambda x: x.features)))) 53 | 54 | # Ask PySpark for some metrics on how our model predictions performed 55 | trained_metrics = MulticlassMetrics(train_preds.map(lambda x: (x[0], float(x[1])))) 56 | test_metrics = MulticlassMetrics(test_preds.map(lambda x: (x[0], float(x[1])))) 57 | 58 | with open('output_discrete.txt', 'w+') as f: 59 | f.write(str(trained_metrics.confusionMatrix().toArray()) + '\n') 60 | f.write(str(trained_metrics.precision()) + '\n') 61 | f.write(str(test_metrics.confusionMatrix().toArray()) + '\n') 62 | f.write(str(test_metrics.precision()) + '\n') 63 | -------------------------------------------------------------------------------- /Labs/Lab5/yelp_clustering.py: -------------------------------------------------------------------------------- 1 | from pyspark import SparkContext, SparkConf 2 | conf = SparkConf().setAppName("Yelp Clustering") 3 | sc = SparkContext(conf=conf) 4 | 5 | businesses = sc.textFile("hdfs:///shared/yelp/yelp_academic_dataset_business.json") 6 | 7 | with open('yelp_clustering.txt', 'w+') as f: 8 | pass -------------------------------------------------------------------------------- /Labs/Lab6/README.md: -------------------------------------------------------------------------------- 1 | # Lab 6: Spark SQL 2 | 3 | ## Introduction 4 | 5 | Spark SQL is a powerful way for interacting with 
large amounts of structured data. Spark SQL gives us the concept of "dataframes", which will be familiar if you've ever done work with Pandas or R. DataFrames can also be thought of as similar to tables in databases. 6 | 7 | With Spark SQL DataFrames we can interact with our data using the Structured Query Language (SQL). This gives us a declarative way to query our data, as opposed to the imperative methods we've studied in past weeks (i.e. discrete operations on sets of RDDs). 8 | 9 | ## SQL Crash Course 10 | 11 | SQL is a declarative language used for querying data. The simplest SQL query is a `SELECT - FROM - WHERE` query. This selects a set of attributes (SELECT) from a specific table (FROM) where a given set of conditions holds (WHERE). 12 | 13 | However, SQL also has a series of more advanced aggregation commands for grouping data. This is accomplished with the `GROUP BY` keyword. We can also join tables on attributes or conditions with the set of `JOIN ... ON` commands. We won't be expecting advanced knowledge of these more difficult topics, but developing a working understanding of how they work will be useful in completing this lab. 14 | 15 | Spark SQL has a pretty good [Programming Guide](https://spark.apache.org/docs/1.5.1/sql-programming-guide.html) that's worth looking at. 16 | 17 | Additionally, you may find [SQL tutorials](https://www.w3schools.com/sql/default.asp) online useful for this assignment. 18 | 19 | ## Examples 20 | 21 | ### Loading Tables 22 | 23 | The easiest way to get data into Spark SQL is by registering a DataFrame as a table. A DataFrame is essentially an instance of a table: it has a schema (columns with data types and names), and data. 24 | 25 | We can create a DataFrame by passing an RDD of data tuples and a schema to `sqlContext.createDataFrame`: 26 | 27 | ``` 28 | data = sc.parallelize([('Tyler', 1), ('Quinn', 2), ('Ben', 3)]) 29 | df = sqlContext.createDataFrame(data, ['name', 'instructor_id']) 30 | ``` 31 | 32 | This creates a DataFrame with 2 columns: `name` and `instructor_id`. 33 | 34 | We can then register this frame with the sqlContext to be able to query it generally: 35 | 36 | ``` 37 | sqlContext.registerDataFrameAsTable(df, "instructors") 38 | ``` 39 | 40 | Now we can query the table: 41 | 42 | ``` 43 | sqlContext.sql("SELECT name FROM instructors WHERE instructor_id=3") 44 | ``` 45 | 46 | ### Specific Business Subset 47 | 48 | Suppose we want to find all the businesses located in Champaign, IL that have 5-star ratings. We can do this with a simple `SELECT - FROM - WHERE` query: 49 | 50 | ```python 51 | sqlContext.sql("SELECT * " 52 | "FROM businesses " 53 | "WHERE stars=5 " 54 | "AND city='Champaign' AND state='IL'").collect() 55 | ``` 56 | 57 | This selects all the rows from the `businesses` table that match the criteria described in the `WHERE` clause. 58 | 59 | ### Highest Number of Reviews 60 | 61 | Suppose we want to rank users by how many reviews they've written. We can do this query with aggregation and grouping: 62 | 63 | ```python 64 | sqlContext.sql("SELECT user_id, COUNT(*) AS c " 65 | "FROM reviews " 66 | "GROUP BY user_id " 67 | "ORDER BY c DESC " 68 | "LIMIT 10").collect() 69 | ``` 70 | 71 | This query groups rows by the `user_id` column, and collapses those rows into tuples of `(user_id, COUNT(*))`, where `COUNT(*)` is the number of collapsed rows per grouping. This gives us the review count of each user. We then do `ORDER BY c DESC` to show the top counts first, and `LIMIT 10` to only show the top 10 results. (Note the trailing space inside each string literal; that's what keeps the concatenated query valid.)
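For comparison, the same ranking can also be expressed through the DataFrame method interface instead of raw SQL. A rough sketch, assuming `reviews_df` is the DataFrame that was registered as the `reviews` table:

```python
from pyspark.sql.functions import desc

# Group reviews by user, count them, and take the 10 largest counts
(reviews_df.groupBy("user_id")
           .count()                # adds a "count" column per user_id
           .orderBy(desc("count"))
           .limit(10)
           .collect())
```

Both interfaces are available to you, though note that some of the exercises below require the raw-SQL route.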
72 | 73 | ## Lab Activities 74 | **Lab 6 is due on Thursday, March 16th, 2017 at 11:55PM.** 75 | 76 | Please zip your source files **and your output text files** for the following exercises and upload the zip to Moodle (learn.illinois.edu). 77 | 78 | **NOTE:** 79 | 80 | * For each of these problems you may use RDDs *only* for loading in and saving data to/from HDFS. All of your "computation" must be performed on DataFrames, either via the SQLContext or DataFrame interfaces. 81 | * We _suggest_ using the SQLContext for most of these problems, as it's generally a more straightforward interface. 82 | 83 | ### 1. Quizzical Queries 84 | 85 | For this problem, we'll construct some simple SQL queries on the Amazon Review dataset that we used last week. Your first task is to create a DataFrame from the CSV set. Once you've done this, write queries that get the requested information about the data. Format and save your output and include it in your submission. 86 | 87 | **NOTE:** For this problem, you *must* use `sqlContext.sql` to run your queries. This means you have to run `sqlContext.registerDataFrameAsTable` on your constructed DataFrame and write queries in raw SQL. 88 | 89 | Queries: 90 | 91 | 1. What is the review text of the review with id `22010`? 92 | 2. How many 5-star ratings does product `B000E5C1YE` have? 93 | 3. How many unique users have written reviews? 94 | 95 | Notes: 96 | 97 | * You'll want to use `csv.reader` to parse your data. Using `str.split(',')` is insufficient, as there will be commas in the Text field of the review. 98 | 99 | ### 2. Aggregation Aggravation 100 | 101 | For this problem, we'll use some more complicated parts of the SQL language. Often, we'll want to learn aggregate statistics about our data. We'll use `GROUP BY` and aggregation methods like `COUNT`, `MAX`, and `AVG` to find out more interesting information about our dataset. 102 | 103 | Queries: 104 | 105 | 1. How many reviews has the person who has written the most reviews written? What is that user's UserId? 106 | 2. List the ProductIds of the 10 products with the highest average review scores, among products that have more than 10 reviews, sorted by average score, with ties broken by number of reviews. 107 | 3. List the Ids of the 10 reviews with the highest ratios of `HelpfulnessNumerator` to `HelpfulnessDenominator`, among reviews with `HelpfulnessDenominator` greater than 10, sorted by that ratio, with ties broken by `HelpfulnessDenominator`. 108 | 109 | Notes: 110 | 111 | * You'll want to use `csv.reader` to parse your data. Using `str.split(',')` is insufficient, as there will be commas in the Text field of the review. 112 | * You may use DataFrame query methods other than `sqlContext.sql`, but you must still do all your computations on DataFrames. 113 | 114 | ### 3. Jaunting with Joins 115 | 116 | For this problem, we'll switch back to the Yelp dataset. Note that you can use the very handy [jsonFile](https://spark.apache.org/docs/1.5.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.jsonFile) method to load in the dataset as a DataFrame. 117 | 118 | There are times when we need to access data that is split across multiple tables. For instance, when we look at a single Yelp review, we cannot directly get the user's name, because we only have their id. But, we can match users with their reviews by "joining" on their user id. The database does this by looking for rows with matching values for the join columns.
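As a sketch of what that looks like in practice (assuming the `reviews` and `users` tables are registered as in the provided skeleton, and using the Yelp schema's `user_id`, `name`, and `stars` fields), a join that pairs each review's star rating with the name of its author might look like:

```python
# Match each review with the user who wrote it by joining on user_id
sqlContext.sql("SELECT u.name, r.stars "
               "FROM reviews r "
               "JOIN users u ON r.user_id = u.user_id").collect()
```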
119 | 120 | You'll want to look up the JOIN (specifically INNER JOIN) SQL commands for these problems. 121 | 122 | Queries: 123 | 124 | 1. What state has had the most Yelp check-ins? 125 | 2. What is the maximum number of "funny" ratings left on a review created by someone who's been yelping since 2012? 126 | 3. List the user ids of anyone who has left a 1-star review, has created more than 250 reviews, and has left a review in Champaign, IL. 127 | 128 | Notes: 129 | 130 | * You may use DataFrame query methods other than `sqlContext.sql`, but you must still do all your computations on DataFrames. 131 | -------------------------------------------------------------------------------- /Labs/Lab6/aggregation_aggravation.py: -------------------------------------------------------------------------------- 1 | from pyspark import SparkContext, SparkConf 2 | from pyspark.sql import SQLContext 3 | import csv 4 | conf = SparkConf().setAppName("Aggregation Aggravation") 5 | sc = SparkContext(conf=conf) 6 | sqlContext = SQLContext(sc) 7 | schema = "Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text".split(',') 8 | 9 | 10 | def parse_csv(x): 11 | x = x.replace('\n', '') 12 | d = csv.reader([x]) 13 | return next(d) 14 | 15 | reviews = sc.textFile("hdfs:///shared/amazon_food_reviews.csv") 16 | first = reviews.first() 17 | csv_payloads = reviews.filter(lambda x: x != first).map(parse_csv) 18 | 19 | df = sqlContext.createDataFrame(csv_payloads, schema) 20 | sqlContext.registerDataFrameAsTable(df, "amazon") 21 | 22 | # Do your queries here 23 | 24 | with open('aggregation_aggravation.txt', 'w+') as f: 25 | pass 26 | -------------------------------------------------------------------------------- /Labs/Lab6/jaunting_with_joins.py: -------------------------------------------------------------------------------- 1 | from pyspark import SparkContext, SparkConf 2 | from pyspark.sql import SQLContext 3 | conf = SparkConf().setAppName("Jaunting With Joins") 4 | sc = SparkContext(conf=conf) 5 | sqlContext = SQLContext(sc) 6 | 7 | reviews = sqlContext.jsonFile("hdfs:///shared/yelp/yelp_academic_dataset_review.json") 8 | businesses = sqlContext.jsonFile("hdfs:///shared/yelp/yelp_academic_dataset_business.json") 9 | checkins = sqlContext.jsonFile("hdfs:///shared/yelp/yelp_academic_dataset_checkin.json") 10 | users = sqlContext.jsonFile("hdfs:///shared/yelp/yelp_academic_dataset_user.json") 11 | 12 | sqlContext.registerDataFrameAsTable(reviews, "reviews") 13 | sqlContext.registerDataFrameAsTable(businesses, "businesses") 14 | sqlContext.registerDataFrameAsTable(checkins, "checkins") 15 | sqlContext.registerDataFrameAsTable(users, "users") 16 | 17 | # Do your queries here 18 | 19 | with open('jaunting_with_joins.txt', 'w+') as f: 20 | pass 21 | -------------------------------------------------------------------------------- /Labs/Lab6/quizzical_queries.py: -------------------------------------------------------------------------------- 1 | import csv 2 | from pyspark import SparkContext, SparkConf 3 | from pyspark.sql import SQLContext 4 | conf = SparkConf().setAppName("Quizzical Queries") 5 | sc = SparkContext(conf=conf) 6 | sqlContext = SQLContext(sc) 7 | schema = "Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text".split(',') 8 | 9 | 10 | def parse_csv(x): 11 | x = x.replace('\n', '') 12 | d = csv.reader([x]) 13 | return next(d) 14 | 15 | reviews = sc.textFile("hdfs:///shared/amazon_food_reviews.csv") 16 | first = 
reviews.first() 17 | csv_payloads = reviews.filter(lambda x: x != first).map(parse_csv) 18 | 19 | # Do your queries here 20 | 21 | with open('quizzical_queries.txt', 'w+') as f: 22 | pass 23 | -------------------------------------------------------------------------------- /Lectures/Lecture 10.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/Lecture 10.pdf -------------------------------------------------------------------------------- /Lectures/Lecture 11 - Clouds.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/Lecture 11 - Clouds.pdf -------------------------------------------------------------------------------- /Lectures/Lecture 12 - Streaming.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/Lecture 12 - Streaming.pdf -------------------------------------------------------------------------------- /Lectures/Lecture 13- Networking.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/Lecture 13- Networking.pdf -------------------------------------------------------------------------------- /Lectures/Lecture 6- More Spark.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/Lecture 6- More Spark.pdf -------------------------------------------------------------------------------- /Lectures/Lecture 7- MLib.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/Lecture 7- MLib.pdf -------------------------------------------------------------------------------- /Lectures/Lecture 8- SQL.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/Lecture 8- SQL.pdf -------------------------------------------------------------------------------- /Lectures/Lecture 9- NoSQL.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/Lecture 9- NoSQL.pdf -------------------------------------------------------------------------------- /Lectures/week1/Lecture 1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/week1/Lecture 1.pdf -------------------------------------------------------------------------------- /Lectures/week2/Lecture 2 - Git, Latex, and Other Intros.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/week2/Lecture 2 - Git, Latex, and Other Intros.pdf -------------------------------------------------------------------------------- 
/Lectures/week2/Lecture 2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/week2/Lecture 2.pdf -------------------------------------------------------------------------------- /Lectures/week3/Lecture 3.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/week3/Lecture 3.pdf -------------------------------------------------------------------------------- /Lectures/week4/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/week4/.DS_Store -------------------------------------------------------------------------------- /Lectures/week4/Lecture 4- Data.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/week4/Lecture 4- Data.pdf -------------------------------------------------------------------------------- /Lectures/week5/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/week5/.DS_Store -------------------------------------------------------------------------------- /Lectures/week5/Lecture 5- Spark.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcdm-uiuc/cs199-sp17/5a923bd3e99fc25b8affeceb3eb26be8dd435140/Lectures/week5/Lecture 5- Spark.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # CS199: Applied Cloud Computing 2 | 3 | Professor: Dr. Robert J. Brunner 4 | 5 | Course Staff: 6 | 7 | - Benjamin Congdon, [@bcongdon](https://github.com/bcongdon) 8 | 9 | - Quinn Jarrell, [@TheRushingWookie](https://github.com/TheRushingWookie) 10 | 11 | - Tyler Kim, [@tyler-thetyrant](https://github.com/tyler-thetyrant) 12 | 13 | - Sameet Sapra, [@sameetandpotatoes](https://github.com/sameetandpotatoes) 14 | 15 | - Bhuvan Venkatesh, [@bhuvan-venkatesh](https://github.com/bhuvan-venkatesh) 16 | 17 | ## Overview 18 | This course will introduce cloud computing with an emphasis on gaining hands-on experience in implementing big data technologies in a cloud computing environment. Students will be expected to work in small groups to develop and implement specific cloud computing solutions and to create technical reports that document these solutions and technologies. 19 | 20 | This course is intended for underclassmen in CS and ECE. 21 | 22 | ## Prerequisites 23 | Grade A- or higher in CS125. Previous experience in Python programming is required. 24 | 25 | ## Tentative list of Topics 26 | 1) Understand the motivation behind cloud computing. 27 | 28 | 2) Building a Hadoop cluster. 29 | 30 | 3) Building a Spark cluster. 31 | 32 | 4) Text analytics at scale. 33 | 34 | 5) Graph analytics at scale. 35 | 36 | 6) Spark analysis. 37 | 38 | 7) NoSQL data stores (installation and operation). 39 | 40 | 8) Writing technical reports.
41 | 42 | 43 | ## Grading 44 | 45 | | **Grading Item** | **Distribution** | 46 | | --------------------- | -------------- | 47 | | Attendance | 10% | 48 | | Labs | 30% | 49 | | Technical Report | 60% | 50 | 51 | 52 | ## Grading Scale 53 | | Percentage | Letter Grade | 54 | | ---------- | ------------ | 55 | | [98, 100] | A+ | 56 | | [92, 98) | A | 57 | | [90, 92) | A- | 58 | | [88, 90) | B+ | 59 | | [82, 88) | B | 60 | | [80, 82) | B- | 61 | | [78, 80) | C+ | 62 | | [72, 78) | C | 63 | | [70, 72) | C- | 64 | | [68, 70) | D+ | 65 | | [62, 68) | D | 66 | | [60, 62) | D- | 67 | | Below 60 | F | 68 | 69 | 70 | ## Labs and Late Submission 71 | There will be about six to seven labs throughout the course, which together account for 30% of the total grade. 72 | 73 | No late submissions will be allowed unless you have received permission from a member of the course staff ahead of time under special circumstances. 74 | 75 | 76 | ## Common Errors 77 | * If you reboot your VM, you WILL NEED TO REMOUNT THE SHARED FOLDER 78 | 79 | `sudo mount -t vboxsf -o rw,uid=1000,gid=1000 NAMEOFYOURSHAREDFOLDERONHOST NAMEOFFOLDERONVMTOSHARETO` 80 | 81 | * SSH 82 | `ssh user@localhost -p 2222` 83 | 84 | 85 | 86 | ## License 87 | This course is licensed under the University of Illinois/NCSA Open Source License. For a full copy of this license, take a look at the LICENSE file. 88 | --------------------------------------------------------------------------------