├── Assignment1.md
├── Assignment2.md
├── Assignment3.md
├── Assignment4.md
├── README.md
└── screens
    ├── 0-CreateRepo.png
    ├── 1-CreateBranchOnGithub.png
    ├── 1_5-Collaborators.png
    ├── 2-GitClone.png
    ├── 3-LocalNewBranch.png
    ├── 4-AddCommitHW.png
    ├── 5-PushNewBranchToGithub.png
    ├── 6-PullRequest.png
    ├── 6.5-Assignee.png
    └── 7-FinalOutput.png

/Assignment1.md:
--------------------------------------------------------------------------------

# Getting Started #

This assignment steps you through the process of running a simple computation over a data set using Map/Reduce via mrjob. The goal of the assignment is to have you walk through the process of using GitHub, Python, mrjob, and AWS, and to ensure you are set up with all the various tools and services.

## Recommended Readings ##

* [Getting started with Amazon AWS video tutorials](http://aws.amazon.com/getting-started/)
* [Introduction to AWS training](https://www.youtube.com/playlist?list=PLhr1KZpdzukcMmx04RbtWuQ0yYOp1vQi4)
* [A Comparison of Clouds: Amazon Web Services, Windows Azure, Google Cloud Platform, VMWare and Others](http://pages.cs.wisc.edu/~akella/CS838/F12/notes/Cloud_Providers_Comparison.pdf)
* [A Survey on Cloud Provider Security Measures](http://www.cs.ucsb.edu/~koc/ns/projects/12Reports/PucherDimopoulos.pdf)

## Tasks ##
Note: Keep track of the time necessary to run each process. On Linux/Mac, you can use the `time` command to measure this.

### Part 1 ###

1. Follow the instructions for running the program [locally](https://github.com/commoncrawl/cc-mrjob#running-locally) and measure the completion time.

### Part 2 ###

1. Follow the process for running the program on [Amazon Elastic MapReduce](https://github.com/commoncrawl/cc-mrjob#running-via-elastic-mapreduce) and measure the completion time.
2. Download the output from S3.

# Setup for running on AWS EMR #

## AWS Setup ##
You should create IAM users instead of using your root AWS credentials. If you do not have a user/group with access to EMR, follow the procedure below. (A scripted alternative is sketched after the three console procedures.)

First, you need to set up a user to run EMR:

1. Visit http://aws.amazon.com/ and sign up for an account.
2. Select "Identity and Access Management" (IAM) from your console, or visit https://console.aws.amazon.com/iam/home
3. Select "Users" from the list on the left.
4. Click on "Create New Users".
5. Enter a user name for yourself and create the user.
6. The next screen will give you an option to download the credentials for this user. Do so and store them in a safe place; you will not be able to retrieve them again.

Second, you need to create a group with the right roles:

1. Select "Groups" from the list on the left.
2. Click on "Create New Group".
3. Enter a name and click on "Next Step".
4. Scroll down to "Amazon Elastic MapReduce Full Access" and click on "Select".
5. Once the policy document is displayed, click on "Next Step".
6. Click on "Create Group" to create the group.

Third, you need to assign your user to the group:

1. Select the check box next to your group.
2. Click on the "Group Actions" drop-down menu and click on "Add Users to Group".
3. Select your user by clicking on the check box.
4. Click on "Add Users".
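If you prefer to script this setup rather than click through the console, the same three steps can be done with the AWS SDK for Python. This is only a minimal sketch, not part of the assignment: it assumes `boto3` is installed and an admin credential is already configured locally, and the user name `emr-student`, group name `emr-users`, and the managed policy ARN are assumptions you should verify in your own IAM console.

```python
import boto3

iam = boto3.client("iam")  # uses your locally configured admin credentials

# First: create the user and capture its credentials (shown only once).
iam.create_user(UserName="emr-student")  # hypothetical user name
key = iam.create_access_key(UserName="emr-student")["AccessKey"]
print("Store these in a safe place; the secret is not retrievable later:")
print(key["AccessKeyId"], key["SecretAccessKey"])

# Second: create a group and attach the EMR full-access managed policy.
# The policy ARN below is an assumption -- confirm it in the IAM policy list.
iam.create_group(GroupName="emr-users")
iam.attach_group_policy(
    GroupName="emr-users",
    PolicyArn="arn:aws:iam::aws:policy/AmazonElasticMapReduceFullAccess",
)

# Third: assign the user to the group.
iam.add_user_to_group(GroupName="emr-users", UserName="emr-student")
```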

## Configure mrjob with the new user credentials ##

You need to configure mrjob to access your AWS account:

1. Edit `mrjob.conf`.
2. Locate the `#aws_access_key_id:` and `#aws_secret_access_key:` lines.
3. Remove the hash (#) and add your AWS key and secret after the colon (:). You should have these from creating the user earlier.

## Setup an Output Bucket on S3 ##

You need to create an output bucket on S3 for the results of your computation:

1. Go to https://aws.amazon.com/ in your browser.
2. Click on the 'S3' service link.
3. Click on the 'Create Bucket' button.
4. Enter a name and hit create.

Keep in mind that bucket names are globally unique across all of Amazon S3. If you use a common name, it is likely to clash with other users' buckets. One suggestion is to use a common prefix (e.g., a domain name) for all your bucket names.

## What to Turn In ##

You must turn in a pull request containing the following:

1. A copy of the output directory for the tag counter running locally (name the directory 'out').
2. A copy of the output from S3 for the tag counter running on AWS (name the directory 'emr-out').
3. How long did it take to run the process in each case?
4. How many `address` tags are there in the input?
5. Do the local version and the EMR version give the same answer?

Please submit the answers to 3-5 in a text file called `answers.txt`

--------------------------------------------------------------------------------
/Assignment2.md:
--------------------------------------------------------------------------------
# Mining Social Media Data #

Twitter is commonly used for sentiment analysis. However, the first task is to gather and store the data about the topic of interest before examining the sentiment. The purpose of this assignment is to give you an opportunity to work on gathering and storing Twitter data.

## Data Acquisition Task ##

In this task, you will gather the data for these hashtags: #NBAFinals2015 and #Warriors.
As part of this task, you need to do the following (a sketch of one possible acquisition loop appears after this list):

1. Write an acquisition program to pull the tweets for each hashtag, as well as the tweets that contain both hashtags, within a week. You also need to chunk your data (using your own design decisions) and give yourself the ability to re-run the process reliably in case of failures (resiliency).
2. Organize the resulting raw data into a set of tweets and store these tweets in S3.
3. Analyze the acquired tweets by producing a histogram (a graph) of the words.
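The acquisition loop might look something like the sketch below. This is not a required design: it assumes the `tweepy` package (the v3-era `tweepy.API`/`tweepy.Cursor` interface) and `boto3` for S3, and the credential placeholders, bucket name, and chunk size are all hypothetical. Writing each fixed-size chunk to its own deterministic S3 key is one way to get the resiliency the task asks for, since a failed run can be restarted and will simply rewrite the chunks it already produced.

```python
import json

import boto3
import tweepy

# Hypothetical credentials -- substitute your own Twitter app keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

s3 = boto3.client("s3")
BUCKET = "my-w205-tweets"  # hypothetical bucket name


def acquire(query, label, chunk_size=100):
    """Pull tweets matching `query`, storing each chunk as its own S3 key."""
    chunk, n = [], 0
    for status in tweepy.Cursor(api.search, q=query).items():
        chunk.append(status._json)  # keep the full raw tweet metadata
        if len(chunk) == chunk_size:
            s3.put_object(Bucket=BUCKET,
                          Key="raw/%s/chunk-%05d.json" % (label, n),
                          Body=json.dumps(chunk).encode("utf-8"))
            chunk, n = [], n + 1
    if chunk:  # flush the final partial chunk
        s3.put_object(Bucket=BUCKET,
                      Key="raw/%s/chunk-%05d.json" % (label, n),
                      Body=json.dumps(chunk).encode("utf-8"))


acquire("#NBAFinals2015", "nbafinals2015")
acquire("#Warriors", "warriors")
acquire("#NBAFinals2015 #Warriors", "both")  # tweets containing both hashtags
```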

## What to Turn In ##

1. A link to your S3 bucket, documented in your README.md file. Make sure it is publicly accessible.

2. Your Twitter acquisition code.

3. The histogram.
--------------------------------------------------------------------------------
/Assignment3.md:
--------------------------------------------------------------------------------
# Storing, Retrieving, and Analyzing Social Media Data Using MongoDB #

## Background Info ##
This assignment is built on top of the previous assignment.

We'd like to utilize the Twitter data (raw formats) that you gathered in Assignment 2.

## Lexical diversity ##
Lexical diversity is a quantitative measure of the diversity of an individual's or group's vocabulary. It is calculated by dividing the number of unique tokens in the text by the total number of tokens in the text.

## Tasks ##

For those of you that didn't store all the *metadata* for tweets associated with the #NBAFinals2015 hashtag and the tweets associated with the #Warriors hashtag in Assignment 2, you may find very few tweets with those hashtags now. In that case, you can use two different *related* hashtags about a more current event to ensure you get a good amount of data, and use that corpus for the rest of the assignment. If you decide to do so, make sure you indicate these changes in your README file as part of your submission.

Note: When you start working on the data acquisition parts, make sure you look at part 2.3, which also requires data pulls, and plan accordingly.

## 1-Storing Tasks ##

1.1- Write a Python program to automatically retrieve and store the JSON files (associated with the tweets that include the #NBAFinals2015 hashtag and the tweets that include the #Warriors hashtag) returned by the Twitter REST API in a MongoDB database called db_restT.

1.2- Write a Python program to insert the chunked tweets associated with the #NBAFinals2015 hashtag and the tweets associated with the #Warriors hashtag that you gathered in Assignment 2 and stored on S3 into a MongoDB database called db_tweets. This program should pull the inputs automatically from your S3 buckets holding the chunked tweets and insert them into db_tweets.

## 2-Retrieving and Analyzing Tasks ##
2.1- Analyze the tweets stored in db_tweets by finding the top 30 retweets as well as their associated usernames (the users who authored them) and the locations of those users.

2.2- Compute the lexical diversity of the texts of the tweets for each of the users in db_restT and store the results back into MongoDB. To compute the lexical diversity of a user, you need to find all the tweets of a particular user (the user's tweet corpus), find the number of unique words in that corpus, and divide that number by the total number of words in it.

You need to create a collection with an appropriate structure for storing the results of your analysis. A sketch of this computation follows.
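Here is a minimal sketch of task 2.2. It assumes `pymongo` is installed, a MongoDB instance on the default local port, that the raw tweet JSON was stored as-is in a collection named `tweets` (so the author is at `user.screen_name` and the text at `text`), and that the result collection is called `lexical_diversity`; all of these names are your own design decisions, not requirements.

```python
from collections import defaultdict

from pymongo import MongoClient

client = MongoClient()  # assumes a local mongod on the default port
db = client.db_restT

# Gather each user's tweet corpus. The collection and field names here
# assume the raw tweet JSON was stored as-is; adjust to your own schema.
corpora = defaultdict(list)
for tweet in db.tweets.find({}, {"user.screen_name": 1, "text": 1}):
    corpora[tweet["user"]["screen_name"]].append(tweet["text"])

results = []
for user, texts in corpora.items():
    tokens = " ".join(texts).lower().split()
    diversity = len(set(tokens)) / float(len(tokens)) if tokens else 0.0
    results.append({"user": user,
                    "total_tokens": len(tokens),
                    "lexical_diversity": diversity})

# Store the analysis back into MongoDB in its own collection.
db.lexical_diversity.insert_many(results)
```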
2.3- Write a Python program to create a db called db_followers that stores all the followers of all the users that you find in task 2.1. Then, write a program to find the un-followed friends of the top 10 users (the users with the highest number of followers in task 2.1) one week after the time that you extracted the tweets. In other words, you need to look at the people following the top 10 users at time X (the time that you extracted the tweets) and then look at the people following the same top 10 users at a later time Y (one week after X) to see who stopped following them.

2.4- (Bonus task) Write a Python program that uses NLTK to classify the top 30 retweets of task 2.1 as positive or negative (sentiment analysis). This is the bonus part of the assignment.

## 3-Storing and Retrieving Task ##

3.1- Write a Python program to create and store backups of both db_tweets and db_restT on S3. It should also be capable of loading the backups if necessary.

## What to Turn In ##

1. A link to the S3 bucket that holds the backups, documented in your README.md file. Make sure it is publicly accessible.

2. Your Python code.

3. A plot of the lexical diversities from task 2.2 showing the lexical diversities of the top 30 users, and the result of the sentiment analysis in task 2.4 if you complete the bonus part.
--------------------------------------------------------------------------------
/Assignment4.md:
--------------------------------------------------------------------------------

In this assignment, you implement map-reduce jobs for various tasks:

#**Tasks**

**Data collection:**

Write an acquisition program that can acquire the tweets between June 6th and July 5th of 2015 for the official FIFA Women's World Cup hashtag ("#FIFAWWC"), as well as team code hashtags (e.g. "#USA" and "#GER"), and store them with appropriate structures in *WC2015.csv* on S3. You can find more information about the teams [here](http://www.fifa.com/womensworldcup/teams/) and the hashtags [here](https://twitter.com/fifawwc). *WC2015.csv* will be used in the following tasks.

There is no hard requirement for the number of tweets that you should gather. However, you should gather a reasonable number of tweets to be able to perform the analysis part. Note that you need to gather historical data, for which you need to design a strategy and use techniques such as web scraping for the specified time frame (June 6th to July 5th of 2015).

#**Analysis Programs**

In this part, you will write map-reduce programs to analyze the tweets stored in *WC2015.csv*.

1. Write a map-reduce program that counts the number of words with more than 10000 occurrences.
2. Write a map-reduce program to compute the tweet volume on an hourly basis (i.e., the number of tweets per hour). A sample output looks like:

| Date | Hour | Number of tweets |
| :------- | ----: | :---: |
| 8/15 | 00 | 1763 |
| 8/15 | 01 | 5432 |
| 8/15 | 02 | 3279 |

3. Write a map-reduce program to compute the top 20 URLs tweeted by the users.

4. Write a map-reduce program that, for each word in the tweets' text, counts the occurrences of every other word in the tweets where that word appears. (A sketch of such a job follows the PMI task below.) For example, assume that we have the following text as a tweet text:

> I have seen USA match today. I am so happy that they won :)

*With the word I,*

• have occurs once.

• seen occurs once.

• and so on.

*With the word have,*

• I occurs twice.

Here is an example of sample output:

– USA occurs 10000 times with Canada

– Germany occurs 500 times with England

5. (Bonus) Modifying the program in 4, write a map-reduce program to compute [pointwise mutual information](http://en.wikipedia.org/wiki/Pointwise_mutual_information), which is a function of two events x and y:

![](https://cloud.githubusercontent.com/assets/10637404/9157468/b026a744-3ecc-11e5-8683-2199edc62bf5.jpg)

The larger the PMI for x and y is, the more information can be gathered about the probability of seeing y having just seen x. Your program should compute the PMI of pairs of words that co-occur in 50 or more tweets in *WC2015.csv*.
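For reference, the formula in the image is pmi(x; y) = log [ p(x, y) / (p(x) p(y)) ]. Below is a minimal mrjob sketch of the task 4 co-occurrence counter, which the PMI bonus can build on. It is only one possible design: it assumes the tweet text sits in the last comma-separated field of each *WC2015.csv* line (your actual layout is a design decision) and applies the lower-casing and punctuation-stripping rules from the notes below.

```python
import re

from mrjob.job import MRJob


class MRCooccurrence(MRJob):
    """For each word, count how often every other word appears in the
    tweets that contain it (task 4). A PMI job can reuse these counts."""

    def mapper(self, _, line):
        # Assumes the tweet text is the last CSV field; adjust to your schema.
        text = line.rsplit(",", 1)[-1].lower()
        words = re.sub(r"[^a-z ]", " ", text).split()
        for w1 in set(words):      # each distinct word in the tweet...
            for w2 in words:       # ...paired with every other occurrence
                if w2 != w1:
                    yield (w1, w2), 1

    def reducer(self, pair, counts):
        yield pair, sum(counts)


if __name__ == "__main__":
    MRCooccurrence.run()
```

Run it locally with `python mr_cooccurrence.py WC2015.csv`, or add `-r emr` to run on Elastic MapReduce as in Assignment 1.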

**Questions:**

Using/modifying the above programs, answer the following questions:

1. What is the average length of tweets (in number of characters) in *WC2015.csv*?
2. Draw a table with all the team support hashtags you can find and the number of support messages. Which country apparently got the most support?
3. How many times does the word *USA* occur with the word *Japan*?
4. How many times does the word *champion* occur with the word *USA*?

> **Note:**

> - Convert every string into lower-case.
> - Only consider English tweets.
> - For simplicity, get rid of all punctuation, i.e., any character other than a to z and space.
> - The output of the mapper and reducer is part of your design decisions. However, you should be able to answer the above questions.
> - Use EMR for running your map-reduce tasks and include the configuration of your cluster in the architecture design document.

#**Twitter Archive Search**

You will use the *whoosh* API to index the tweets in the dataset so you can answer some standard queries, as described below.

###**Task**

- Write a Python program that uses whoosh to index the archive (*WC2015.csv*) based on various fields. The fields are part of your design decisions.
- Write a Python program that takes queries (you need to design the supported queries) and searches through the indexed archive using *whoosh*. A sample query to the program could be: *RT:yes, keywords*, which returns all the retweets that are related to the keywords. Your program should handle at least 4 queries (of your choice) similar to the sample query. A sketch of the index-and-search skeleton follows this list.
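Here is a minimal index-and-search skeleton, assuming the `whoosh` package is installed. The schema fields, the index directory name, the CSV layout (tweet id first, text last), and deriving a retweet flag from an "RT @" prefix are all hypothetical design choices, not requirements.

```python
import csv
import os

from whoosh import index
from whoosh.fields import BOOLEAN, ID, TEXT, Schema
from whoosh.qparser import QueryParser

# Hypothetical schema -- choose fields that support the queries you design.
schema = Schema(tweet_id=ID(stored=True, unique=True),
                text=TEXT(stored=True),
                is_retweet=BOOLEAN(stored=True))

if not os.path.exists("wc2015_index"):
    os.mkdir("wc2015_index")           # whoosh needs an existing directory
ix = index.create_in("wc2015_index", schema)

writer = ix.writer()
with open("WC2015.csv") as f:
    for row in csv.reader(f):
        # Assumes id in the first field, text in the last; adjust to your CSV.
        writer.add_document(tweet_id=row[0],
                            text=row[-1],
                            is_retweet=row[-1].startswith("RT @"))
writer.commit()

# Example query in the spirit of "RT:yes, keywords": retweets about a keyword.
with ix.searcher() as searcher:
    query = QueryParser("text", ix.schema).parse("is_retweet:true AND champion")
    for hit in searcher.search(query, limit=10):
        print(hit["text"])
```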

#**Deliverables**

1. A link to your collected tweets and the index directory created by whoosh on S3.
2. Your source code. Make sure you follow the assignment submission guidelines.
3. Answers to each of the questions in the architecture design file. You also need to explain how you used map-reduce to obtain the data you needed in each case, as well as how the overall index/search structure is designed, and describe the supported keyword search queries.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# You need to use GitHub to turn in your assignment #

## Prerequisites ##

You need to create a GitHub account, and for each assignment, you need to create a repository.
Note: your repositories will be public unless you [request a private one](https://education.github.com/discount_requests/new)

## The Process ##
0. Create a new repository. Make it private if you can: GitHub gives new student accounts a free Micro plan that can create 5 private repos.
![Create Repo](screens/0-CreateRepo.png?raw=true "Create Repo")

1. Add MIDS-W205 as a collaborator.
![Add Collaborator](screens/1_5-Collaborators.png?raw=true "Add Collaborator")

2. [Create a branch](https://help.github.com/articles/creating-and-deleting-branches-within-your-repository/) of your repository for the homework.
![Create a Branch](screens/1-CreateBranchOnGithub.png?raw=true "Create a branch")

3. Clone the repository to a local directory.
![Git Clone](screens/2-GitClone.png?raw=true "Git Clone")

4. Switch to the local branch with the same name as the branch you created on GitHub earlier.
![Git Local Branch](screens/3-LocalNewBranch.png?raw=true "Git Local Branch")

5. Create/add any homework files to the local branch/directory.

6. Commit your created/added files to the local branch.
![Commit to Local Branch](screens/4-AddCommitHW.png?raw=true "Commit to Local Branch")

7. Push your local branch back up to GitHub.
![Push to Github](screens/5-PushNewBranchToGithub.png?raw=true "Push to Github")

8. [Create a pull request](https://help.github.com/articles/creating-a-pull-request/) from your created branch to master for the code you'd like to turn in, by clicking the "Compare & pull request" button.
![Pull Request](screens/6-PullRequest.png?raw=true "Pull Request")

9. On the right-hand side of the "Open a Pull Request" screen, select the Assignee gear and assign MIDS-W205 to the issue.
![Assignee](screens/6.5-Assignee.png?raw=true "Assignee")

10. Your instructor can now view the pull request and grade the assignment. **Please do not merge the pull request.**
![Pull Request](screens/7-FinalOutput.png?raw=true "Pull Request")

11. Once your instructor has graded the assignment, your pull request is merged as a final notification.
12. You can now delete the branch, as the changes have been merged into master.
--------------------------------------------------------------------------------
/screens/0-CreateRepo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/neuralinfo/Assignments/10965ecdea5aded07e0b90aa4d142e5248a13dfa/screens/0-CreateRepo.png
--------------------------------------------------------------------------------
/screens/1-CreateBranchOnGithub.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/neuralinfo/Assignments/10965ecdea5aded07e0b90aa4d142e5248a13dfa/screens/1-CreateBranchOnGithub.png
--------------------------------------------------------------------------------
/screens/1_5-Collaborators.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/neuralinfo/Assignments/10965ecdea5aded07e0b90aa4d142e5248a13dfa/screens/1_5-Collaborators.png
--------------------------------------------------------------------------------
/screens/2-GitClone.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/neuralinfo/Assignments/10965ecdea5aded07e0b90aa4d142e5248a13dfa/screens/2-GitClone.png
--------------------------------------------------------------------------------
/screens/3-LocalNewBranch.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/neuralinfo/Assignments/10965ecdea5aded07e0b90aa4d142e5248a13dfa/screens/3-LocalNewBranch.png
--------------------------------------------------------------------------------
/screens/4-AddCommitHW.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/neuralinfo/Assignments/10965ecdea5aded07e0b90aa4d142e5248a13dfa/screens/4-AddCommitHW.png
--------------------------------------------------------------------------------
/screens/5-PushNewBranchToGithub.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/neuralinfo/Assignments/10965ecdea5aded07e0b90aa4d142e5248a13dfa/screens/5-PushNewBranchToGithub.png -------------------------------------------------------------------------------- /screens/6-PullRequest.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/neuralinfo/Assignments/10965ecdea5aded07e0b90aa4d142e5248a13dfa/screens/6-PullRequest.png -------------------------------------------------------------------------------- /screens/6.5-Assignee.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/neuralinfo/Assignments/10965ecdea5aded07e0b90aa4d142e5248a13dfa/screens/6.5-Assignee.png -------------------------------------------------------------------------------- /screens/7-FinalOutput.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/neuralinfo/Assignments/10965ecdea5aded07e0b90aa4d142e5248a13dfa/screens/7-FinalOutput.png --------------------------------------------------------------------------------