├── .gitignore
├── README.md
├── a0
│   ├── Log.txt
│   ├── README.md
│   ├── Setup.md
│   ├── ShortAnswer.txt
│   ├── a0.py
│   ├── candidates.txt
│   └── network.png
├── a1
│   ├── .gitignore
│   ├── Log.txt
│   ├── README.md
│   └── a1.py
├── a2
│   ├── .gitignore
│   ├── Log.txt
│   ├── README.md
│   ├── ShortAnswer.txt
│   ├── a2.py
│   └── accuracies.png
├── bonus
│   ├── README.md
│   └── bonus.py
├── project
│   └── README.md
└── update.sh
/.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__/ 2 | *.py[cod] 3 | 4 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | **under construction** 2 | 3 | Each student has their own private GitHub repository at: 4 | 5 | 6 | This is where you will submit all assignments. 7 | 8 | Your repository should already contain starter code for each assignment. This starter code has been pulled from the assignment repository at . 9 | 10 | Throughout the course, I may update the assignments to clarify questions or add content. To ensure you have the latest content, you can run the `update.sh` script, which will fetch and merge the content from the assignments repository. 11 | 12 | For each assignment, you should do the following: 13 | 14 | 1. Run `./update.sh` to get the latest starter code. 15 | 16 | 2. Do the homework, adding and modifying files in the assignment directory. **Commit often!** 17 | 18 | 3. Before the deadline, push all of your changes to GitHub. E.g.: 19 | ``` 20 | cd a0 21 | git add * 22 | git commit -m 'homework completed' 23 | git push 24 | ``` 25 | 26 | 4. Double-check that you don't have any outstanding changes to commit: 27 | ``` 28 | git status 29 | # On branch master 30 | nothing to commit, working directory clean 31 | ``` 32 | 33 | 5. Double-check that everything works by cloning your repository into a new directory and executing all tests. 34 | ``` 35 | cd 36 | mkdir tmp 37 | cd tmp 38 | git clone https://github.com/iit-cs579/[your_iit_id] 39 | cd [your_iit_id]/a0 40 | [...run any relevant scripts/tests] 41 | ``` 42 | 43 | 6. You can also view your code on GitHub with a web browser to make sure all your code has been submitted. 44 | 45 | 7. Assignments contain [doctests](https://docs.python.org/3/library/doctest.html). You can run these for a file `foo.py` using `python -m doctest foo.py`. If all tests pass, you'll see no output. To see output even for passing tests, add a `-v` flag to the command. 46 | 47 | 8. Typically, each assignment contains a number of methods for you to complete. I recommend tackling these one at a time, debugging and testing each one before moving on to the next method. Implementing everything and then running it all at the end will likely result in many errors that can be difficult to track down. To run the doctests for a single function, you can use [nose](https://github.com/nose-devs/nose). E.g., to run only the doctests for the `get_twitter` function in `a0.py`, you would call: 48 | - `nosetests --with-doctest a0.py:get_twitter` 49 | 50 | 9. For some assignments, I also include a `Log.txt` file which contains the expected output when running the assignment's main method (e.g., `python a0.py`). You should compare your output against it to make sure they match. Occasionally, some deviations are expected, particularly when sets (which are unordered) are used. 51 | 52 | 10. Feel free to open issues at to ask for clarifications, discuss problems, etc.
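A quick way to do the comparison in step 9 (assuming a Unix-like shell; `my_output.txt` is just an example filename) is to redirect your program's output to a file and diff it against the provided log:
```
cd a0
python a0.py > my_output.txt
diff my_output.txt Log.txt
```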
53 | -------------------------------------------------------------------------------- /a0/Log.txt: -------------------------------------------------------------------------------- 1 | Established Twitter connection. 2 | Read screen names: ['BernieSanders', 'JoeBiden', 'SenWarren', 'realDonaldTrump'] 3 | found 4 users with screen_names ['BernieSanders', 'JoeBiden', 'SenWarren', 'realDonaldTrump'] 4 | Friends per candidate: 5 | BernieSanders 1390 6 | JoeBiden 22 7 | SenWarren 493 8 | realDonaldTrump 47 9 | Most common friends: 10 | [(818910970567344128, 3), (822215673812119553, 3), (15764644, 2), (15808765, 2), (24195214, 2)] 11 | Friend Overlap: 12 | [('BernieSanders', 'SenWarren', 14), ('BernieSanders', 'JoeBiden', 3), ('JoeBiden', 'SenWarren', 3), ('JoeBiden', 'realDonaldTrump', 2), ('BernieSanders', 'realDonaldTrump', 1), ('SenWarren', 'realDonaldTrump', 1)] 13 | User followed by Bernie and Donald: 14 | graph has 24 nodes and 42 edges 15 | network drawn to network.png 16 | -------------------------------------------------------------------------------- /a0/README.md: -------------------------------------------------------------------------------- 1 | ## Assignment 0 2 | 3 | **50 points** 4 | 5 | 6 | 1. Get started with git and Python by following the instructions at [Setup.md](Setup.md). 7 | 8 | 2. Complete the data collection assignment, following the instructions in [a0.py](a0.py). 9 | 10 | 3. Complete the short answer questions in [ShortAnswer.txt](ShortAnswer.txt). 11 | 12 | 4. Push all of your code and supporting files (e.g., .png) to your **private** GitHub repo in the folder `a0/`. 13 | -------------------------------------------------------------------------------- /a0/Setup.md: -------------------------------------------------------------------------------- 1 | # Setup 2 | 3 | 1. Learn Python by completing this online tutorial: (3 hours) 4 | 2. Create a GitHub account at 5 | 3. Set up git by following (30 minutes) 6 | 4. Learn git by completing the [Introduction to GitHub](https://lab.github.com/githubtraining/introduction-to-github) tutorial, reading the [git handbook](https://guides.github.com/introduction/git-handbook/), then completing the [Managing merge conflicts](https://lab.github.com/githubtraining/managing-merge-conflicts) tutorial (1 hour). 7 | 5. Install the Python data science stack from . **We will use Python 3.** (30 minutes) 8 | 6. Complete the scikit-learn tutorial from (2 hours) 9 | 7. Understand how Python packages work by going through the [Python Packaging User Guide](https://packaging.python.org/tutorials/) (you can skip the "Creating Documentation" section). (1 hour) 10 | 8. After I have created all the project repositories, you can clone your private class repository: 11 | ``` 12 | git clone https://github.com/iit-cs579/[github-username].git 13 | ``` 14 | E.g., for me this would be: 15 | ``` 16 | git clone https://github.com/iit-cs579/aronwc.git 17 | ``` 18 | - You should have read/write (pull/push) access to your private repository. 19 | - This is where you will submit assignments. 20 | - **Note:** This step will not work until I have set up your private repository. This usually happens by the second week of the semester (and this is why I need you to complete the course survey). 21 | 22 | See for instructions on submitting assignments.
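As an optional sanity check once you have finished the steps above (a minimal sketch; exact version numbers will vary by machine):
```
python --version   # should report Python 3.x
git --version
git remote -v      # run inside your cloned repo; should list your private iit-cs579 repository URL
```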
23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | -------------------------------------------------------------------------------- /a0/ShortAnswer.txt: -------------------------------------------------------------------------------- 1 | Enter your responses inline below and push this file to your private GitHub 2 | repository. 3 | 4 | 5 | 1. Assume I plan to use the friend_overlap function above to quantify the 6 | similarity of two users. E.g., because 14 is larger than 1, I conclude that 7 | Bernie Sanders and Elizabeth Warren are more similar than Bernie Sanders and Donald 8 | Trump. 9 | 10 | How is this approach misleading? How might you fix it? 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 2. Looking at the output of your followed_by_bernie_and_donald function, why 22 | do you think this user is followed by both Bernie Sanders and Donald Trump, 23 | who are rivals? Do some web searches to see if you can find out more 24 | information. 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 3. There is a big difference in how many accounts each candidate follows (Bernie Sanders follows over 1.3K accounts, while Donald Trump follows fewer than 39 | 50). Why do you think this is? How might that affect our analysis? 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 4. The follower graph we've collected is incomplete. To expand it, we would 50 | have to also collect the list of accounts followed by each of the 51 | friends. That is, for each user X that Donald Trump follows, we would have to 52 | also collect all the users that X follows. Assuming we again use the API call 53 | https://dev.twitter.com/rest/reference/get/friends/ids, how many requests will 54 | we have to make? Given how Twitter does rate limiting 55 | (https://dev.twitter.com/rest/public/rate-limiting), approximately how many 56 | minutes will it take to collect this data? 57 | -------------------------------------------------------------------------------- /a0/a0.py: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | 3 | """ 4 | CS579: Assignment 0 5 | Collecting a political social network 6 | 7 | In this assignment, I've given you a list of Twitter accounts of 4 8 | U.S. presidential candidates from the previous election. 9 | 10 | The goal is to use the Twitter API to construct a social network of these 11 | accounts. We will then use the [networkx](http://networkx.github.io/) library 12 | to plot these links, as well as print some statistics of the resulting graph. 13 | 14 | 1. Create an account on [twitter.com](http://twitter.com). 15 | 2. Generate authentication tokens by following the instructions [here](https://developer.twitter.com/en/docs/basics/authentication/guides/access-tokens.html). 16 | 3. Add your tokens to the key/token variables below. (API Key == Consumer Key) 17 | 4. Be sure you've installed the Python modules 18 | [networkx](http://networkx.github.io/) and 19 | [TwitterAPI](https://github.com/geduldig/TwitterAPI). Assuming you've already 20 | installed [pip](http://pip.readthedocs.org/en/latest/installing.html), you can 21 | do this with `pip install networkx TwitterAPI`. 22 | 23 | OK, now you're ready to start collecting some data! 24 | 25 | I've provided a partial implementation below. Your job is to complete the 26 | code where indicated. You need to modify the 10 methods indicated by 27 | #TODO. 28 | 29 | Your output should match the sample provided in Log.txt. 30 | """ 31 | 32 | # Imports you'll need.
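# (If any of these imports fail, install the missing packages first, e.g. `pip install networkx TwitterAPI matplotlib`; see step 4 in the docstring above.)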
33 | from collections import Counter 34 | import matplotlib.pyplot as plt 35 | import networkx as nx 36 | import sys 37 | import time 38 | from TwitterAPI import TwitterAPI 39 | 40 | consumer_key = 'fixme' 41 | consumer_secret = 'fixme' 42 | access_token = 'fixme' 43 | access_token_secret = 'fixme' 44 | 45 | 46 | # This method is done for you. 47 | def get_twitter(): 48 | """ Construct an instance of TwitterAPI using the tokens you entered above. 49 | Returns: 50 | An instance of TwitterAPI. 51 | """ 52 | return TwitterAPI(consumer_key, consumer_secret, access_token, access_token_secret) 53 | 54 | 55 | def read_screen_names(filename): 56 | """ 57 | Read a text file containing Twitter screen_names, one per line. 58 | 59 | Params: 60 | filename....Name of the file to read. 61 | Returns: 62 | A list of strings, one per screen_name, sorted in ascending 63 | alphabetical order. 64 | 65 | Here's a doctest to confirm your implementation is correct. 66 | >>> read_screen_names('candidates.txt') 67 | ['BernieSanders', 'JoeBiden', 'SenWarren', 'realDonaldTrump'] 68 | """ 69 | ###TODO 70 | pass 71 | 72 | 73 | # I've provided the method below to handle Twitter's rate limiting. 74 | # You should call this method whenever you need to access the Twitter API. 75 | def robust_request(twitter, resource, params, max_tries=5): 76 | """ If a Twitter request fails, sleep for 15 minutes. 77 | Do this at most max_tries times before quitting. 78 | Args: 79 | twitter .... A TwitterAPI object. 80 | resource ... A resource string to request; e.g., "friends/ids" 81 | params ..... A parameter dict for the request, e.g., to specify 82 | parameters like screen_name or count. 83 | max_tries .. The maximum number of tries to attempt. 84 | Returns: 85 | A TwitterResponse object, or None if failed. 86 | """ 87 | for i in range(max_tries): 88 | request = twitter.request(resource, params) 89 | if request.status_code == 200: 90 | return request 91 | else: 92 | print('Got error %s \nsleeping for 15 minutes.' % request.text) 93 | sys.stderr.flush() 94 | time.sleep(61 * 15) 95 | 96 | 97 | def get_users(twitter, screen_names): 98 | """Retrieve the Twitter user objects for each screen_name. 99 | Params: 100 | twitter........The TwitterAPI object. 101 | screen_names...A list of strings, one per screen_name 102 | Returns: 103 | A list of dicts, one per user, containing all the user information 104 | (e.g., screen_name, id, location, etc) 105 | 106 | See the API documentation here: https://dev.twitter.com/rest/reference/get/users/lookup 107 | 108 | In this example, I test retrieving two users: twitterapi and twitter. 109 | 110 | >>> twitter = get_twitter() 111 | >>> users = get_users(twitter, ['twitterapi', 'twitter']) 112 | >>> [u['id'] for u in users] 113 | [6253282, 783214] 114 | """ 115 | ###TODO 116 | pass 117 | 118 | 119 | def get_friends(twitter, screen_name): 120 | """ Return a list of Twitter IDs for users that this person follows, up to 5000. 121 | See https://dev.twitter.com/rest/reference/get/friends/ids 122 | 123 | Note, because of rate limits, it's best to test this method for one candidate before trying 124 | on all candidates. 125 | 126 | Args: 127 | twitter.......The TwitterAPI object 128 | screen_name... a string of a Twitter screen name 129 | Returns: 130 | A list of ints, one per friend ID, sorted in ascending order. 131 | 132 | Note: If a user follows more than 5000 accounts, we will limit ourselves to 133 | the first 5000 accounts returned. 134 | 135 | In this test case, I return the first 5 accounts that I follow. 
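One possible approach (just a sketch; any correct implementation is fine): request the 'friends/ids' resource through the robust_request helper above, e.g. robust_request(twitter, 'friends/ids', {'screen_name': screen_name, 'count': 5000}), and return the resulting IDs in ascending order.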
136 | >>> twitter = get_twitter() 137 | >>> get_friends(twitter, 'aronwc')[:5] 138 | [695023, 1697081, 8381682, 10204352, 11669522] 139 | """ 140 | ###TODO 141 | pass 142 | 143 | 144 | def add_all_friends(twitter, users): 145 | """ Get the list of accounts each user follows. 146 | I.e., call the get_friends method for all 4 candidates. 147 | 148 | Store the result in each user's dict using a new key called 'friends'. 149 | 150 | Args: 151 | twitter...The TwitterAPI object. 152 | users.....The list of user dicts. 153 | Returns: 154 | Nothing 155 | 156 | >>> twitter = get_twitter() 157 | >>> users = [{'screen_name': 'aronwc'}] 158 | >>> add_all_friends(twitter, users) 159 | >>> users[0]['friends'][:5] 160 | [695023, 1697081, 8381682, 10204352, 11669522] 161 | """ 162 | ###TODO 163 | pass 164 | 165 | 166 | def print_num_friends(users): 167 | """Print the number of friends per candidate, sorted by candidate name. 168 | See Log.txt for an example. 169 | Args: 170 | users....The list of user dicts. 171 | Returns: 172 | Nothing 173 | """ 174 | ###TODO 175 | pass 176 | 177 | 178 | def count_friends(users): 179 | """ Count how often each friend is followed. 180 | Args: 181 | users: a list of user dicts 182 | Returns: 183 | a Counter object mapping each friend to the number of candidates who follow them. 184 | Counter documentation: https://docs.python.org/dev/library/collections.html#collections.Counter 185 | 186 | In this example, friend '2' is followed by three different users. 187 | >>> c = count_friends([{'friends': [1,2]}, {'friends': [2,3]}, {'friends': [2,3]}]) 188 | >>> c.most_common() 189 | [(2, 3), (3, 2), (1, 1)] 190 | """ 191 | ###TODO 192 | pass 193 | 194 | 195 | def friend_overlap(users): 196 | """ 197 | Compute the number of shared accounts followed by each pair of users. 198 | 199 | Args: 200 | users...The list of user dicts. 201 | 202 | Return: A list of tuples containing (user1, user2, N), where N is the 203 | number of accounts that both user1 and user2 follow. This list should 204 | be sorted in descending order of N. Ties are broken first by user1's 205 | screen_name, then by user2's screen_name (sorted in ascending 206 | alphabetical order). See Python's builtin sorted method. 207 | 208 | In this example, users 'a' and 'c' follow the same 3 accounts: 209 | >>> friend_overlap([ 210 | ... {'screen_name': 'a', 'friends': ['1', '2', '3']}, 211 | ... {'screen_name': 'b', 'friends': ['2', '3', '4']}, 212 | ... {'screen_name': 'c', 'friends': ['1', '2', '3']}, 213 | ... ]) 214 | [('a', 'c', 3), ('a', 'b', 2), ('b', 'c', 2)] 215 | """ 216 | ###TODO 217 | pass 218 | 219 | 220 | def followed_by_bernie_and_donald(users, twitter): 221 | """ 222 | Find and return the screen_names of the Twitter users followed by both Bernie 223 | Sanders and Donald Trump. You will need to use the TwitterAPI to convert 224 | the Twitter ID to a screen_name. See: 225 | https://dev.twitter.com/rest/reference/get/users/lookup 226 | 227 | Params: 228 | users.....The list of user dicts 229 | twitter...The Twitter API object 230 | Returns: 231 | A list of strings containing the Twitter screen_names of the users 232 | that are followed by both Bernie Sanders and Donald Trump. 233 | """ 234 | ###TODO 235 | pass 236 | 237 | 238 | def create_graph(users, friend_counts): 239 | """ Create a networkx undirected Graph, adding each candidate and friend 240 | as a node. Note: while all candidates should be added to the graph, 241 | only add friends to the graph if they are followed by more than one 242 | candidate. 
(This is to reduce clutter.) 243 | 244 | Each candidate in the Graph will be represented by their screen_name, 245 | while each friend will be represented by their user id. 246 | 247 | Args: 248 | users...........The list of user dicts. 249 | friend_counts...The Counter dict mapping each friend to the number of candidates that follow them. 250 | Returns: 251 | A networkx Graph 252 | """ 253 | ###TODO 254 | pass 255 | 256 | 257 | def draw_network(graph, users, filename): 258 | """ 259 | Draw the network to a file. Only label the candidate nodes; the friend 260 | nodes should have no labels (to reduce clutter). 261 | 262 | Methods you'll need include networkx.draw_networkx, plt.figure, and plt.savefig. 263 | 264 | Your figure does not have to look exactly the same as mine, but try to 265 | make it look presentable. 266 | """ 267 | ###TODO 268 | pass 269 | 270 | 271 | def main(): 272 | """ Main method. You should not modify this. """ 273 | twitter = get_twitter() 274 | screen_names = read_screen_names('candidates.txt') 275 | print('Established Twitter connection.') 276 | print('Read screen names: %s' % screen_names) 277 | users = sorted(get_users(twitter, screen_names), key=lambda x: x['screen_name']) 278 | print('found %d users with screen_names %s' % 279 | (len(users), str([u['screen_name'] for u in users]))) 280 | add_all_friends(twitter, users) 281 | print('Friends per candidate:') 282 | print_num_friends(users) 283 | friend_counts = count_friends(users) 284 | print('Most common friends:\n%s' % str(friend_counts.most_common(5))) 285 | print('Friend Overlap:\n%s' % str(friend_overlap(users))) 286 | print('User followed by Bernie and Donald: %s' % str(followed_by_bernie_and_donald(users, twitter))) 287 | 288 | graph = create_graph(users, friend_counts) 289 | print('graph has %s nodes and %s edges' % (len(graph.nodes()), len(graph.edges()))) 290 | draw_network(graph, users, 'network.png') 291 | print('network drawn to network.png') 292 | 293 | 294 | if __name__ == '__main__': 295 | main() 296 | 297 | # That's it for now! This should give you an introduction to some of the data we'll study in this course. 298 | -------------------------------------------------------------------------------- /a0/candidates.txt: -------------------------------------------------------------------------------- 1 | realDonaldTrump 2 | SenWarren 3 | BernieSanders 4 | JoeBiden 5 | -------------------------------------------------------------------------------- /a0/network.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iit-cs579/assignments/07b7e41763890df74d72bf9bbb30dbb1fd670ea8/a0/network.png -------------------------------------------------------------------------------- /a1/.gitignore: -------------------------------------------------------------------------------- 1 | edges.txt.gz 2 | -------------------------------------------------------------------------------- /a1/Log.txt: -------------------------------------------------------------------------------- 1 | full graph has 5062 nodes and 6060 edges 2 | subgraph has 712 nodes and 1710 edges 3 | 4 | 5 | computing norm_cut scores by max_depth... 6 | max_depth norm_cut_score 7 | 1 1.007 8 | 2 1.001 9 | 3 0.122 10 | 4 0.122 11 | 12 | 13 | getting result with max_depth=3 14 | 2 clusters 15 | first partition: cluster 1 has 701 nodes and cluster 2 has 11 nodes 16 | smaller cluster nodes: 17 | ['Arthur A. 
Levine Books', 'Clifford The Big Red Dog', 'READ 180', 'Scholastic', 'Scholastic Book Fairs', 'Scholastic Canada', 'Scholastic Parents', 'Scholastic Reading Club', 'Scholastic Teachers', 'The Hunger Games', 'WordGirl'] 18 | 19 | 20 | partitioning by eigenvector... 21 | cluster 1 has 86 nodes and cluster 2 has 626 nodes 22 | norm_cut score=0.389 23 | 10 nodes from smaller cluster: 24 | ['Aeon Magazine', 'American Museum of Natural History', 'Astronomy Picture of the Day (APOD)', 'Big History Project', 'Bradshaw Foundation', 'California Charter Schools Association', 'California Council for the Social Studies', 'California Geographic Alliance', 'Center for Civic Education', 'CityClub Seattle'] 25 | -------------------------------------------------------------------------------- /a1/README.md: -------------------------------------------------------------------------------- 1 | Assignment 1 2 | 3 | See `a1.py`. 4 | -------------------------------------------------------------------------------- /a1/a1.py: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | 3 | # # CS579: Assignment 1 4 | # 5 | # In this assignment, we'll implement community detection algorithms using Facebook "like" data. 6 | # 7 | # The file `edges.txt.gz` indicates like relationships between facebook users. This was collected using snowball sampling: beginning with the user "Bill Gates", I crawled all the people he "likes", then, for each newly discovered user, I crawled all the people they liked. 8 | # 9 | # We'll cluster the resulting graph into communities. 10 | # 11 | # Complete the methods below that are indicated by `TODO`. I've provided some sample output to help guide your implementation. 12 | 13 | 14 | # You should not use any imports not listed here: 15 | from collections import Counter, defaultdict, deque 16 | import copy 17 | from itertools import combinations 18 | import math 19 | import networkx as nx 20 | from numpy.linalg import eigh 21 | import numpy as np 22 | import urllib.request 23 | 24 | 25 | ## Community Detection 26 | 27 | def example_graph(): 28 | """ 29 | Create the example graph from class. Used for testing. 30 | Do not modify. 31 | """ 32 | g = nx.Graph() 33 | g.add_edges_from([('A', 'B'), ('A', 'C'), ('B', 'C'), ('B', 'D'), ('D', 'E'), ('D', 'F'), ('D', 'G'), ('E', 'F'), ('G', 'F')]) 34 | return g 35 | 36 | def bfs(graph, root, max_depth): 37 | """ 38 | Perform breadth-first search to compute the shortest paths from a root node to all 39 | other nodes in the graph. To reduce running time, the max_depth parameter ends 40 | the search after the specified depth. 41 | E.g., if max_depth=2, only paths of length 2 or less will be considered. 42 | This means that nodes greather than max_depth distance from the root will not 43 | appear in the result. 44 | 45 | You may use these two classes to help with this implementation: 46 | https://docs.python.org/3.5/library/collections.html#collections.defaultdict 47 | https://docs.python.org/3.5/library/collections.html#collections.deque 48 | 49 | Params: 50 | graph.......A networkx Graph 51 | root........The root node in the search graph (a string). We are computing 52 | shortest paths from this node to all others. 53 | max_depth...An integer representing the maximum depth to search. 54 | 55 | Returns: 56 | node2distances...dict from each node to the length of the shortest path from 57 | the root node 58 | node2num_paths...dict from each node to the number of shortest paths from the 59 | root node to this node. 
60 | node2parents.....dict from each node to the list of its parents in the search 61 | tree 62 | 63 | In the doctests below, we first try with max_depth=5, then max_depth=2. 64 | 65 | >>> node2distances, node2num_paths, node2parents = bfs(example_graph(), 'E', 5) 66 | >>> sorted(node2distances.items()) 67 | [('A', 3), ('B', 2), ('C', 3), ('D', 1), ('E', 0), ('F', 1), ('G', 2)] 68 | >>> sorted(node2num_paths.items()) 69 | [('A', 1), ('B', 1), ('C', 1), ('D', 1), ('E', 1), ('F', 1), ('G', 2)] 70 | >>> sorted((node, sorted(parents)) for node, parents in node2parents.items()) 71 | [('A', ['B']), ('B', ['D']), ('C', ['B']), ('D', ['E']), ('F', ['E']), ('G', ['D', 'F'])] 72 | >>> node2distances, node2num_paths, node2parents = bfs(example_graph(), 'E', 2) 73 | >>> sorted(node2distances.items()) 74 | [('B', 2), ('D', 1), ('E', 0), ('F', 1), ('G', 2)] 75 | >>> sorted(node2num_paths.items()) 76 | [('B', 1), ('D', 1), ('E', 1), ('F', 1), ('G', 2)] 77 | >>> sorted((node, sorted(parents)) for node, parents in node2parents.items()) 78 | [('B', ['D']), ('D', ['E']), ('F', ['E']), ('G', ['D', 'F'])] 79 | """ 80 | ###TODO 81 | pass 82 | 83 | 84 | def complexity_of_bfs(V, E, K): 85 | """ 86 | If V is the number of vertices in a graph, E is the number of 87 | edges, and K is the max_depth of our approximate breadth-first 88 | search algorithm, then what is the *worst-case* run-time of 89 | this algorithm? As usual in complexity analysis, you can ignore 90 | any constant factors. E.g., if you think the answer is 2V * E + 3log(K), 91 | you would return V * E + math.log(K) 92 | >>> v = complexity_of_bfs(13, 23, 7) 93 | >>> type(v) == int or type(v) == float 94 | True 95 | """ 96 | ###TODO 97 | pass 98 | 99 | 100 | def bottom_up(root, node2distances, node2num_paths, node2parents): 101 | """ 102 | Compute the final step of the Girvan-Newman algorithm. 103 | See p 352 From your text: 104 | https://github.com/iit-cs579/main/blob/master/read/lru-10.pdf 105 | The third and final step is to calculate for each edge e the sum 106 | over all nodes Y of the fraction of shortest paths from the root 107 | X to Y that go through e. This calculation involves computing this 108 | sum for both nodes and edges, from the bottom. Each node other 109 | than the root is given a credit of 1, representing the shortest 110 | path to that node. This credit may be divided among nodes and 111 | edges above, since there could be several different shortest paths 112 | to the node. The rules for the calculation are as follows: ... 113 | 114 | Params: 115 | root.............The root node in the search graph (a string). We are computing 116 | shortest paths from this node to all others. 117 | node2distances...dict from each node to the length of the shortest path from 118 | the root node 119 | node2num_paths...dict from each node to the number of shortest paths from the 120 | root node that pass through this node. 121 | node2parents.....dict from each node to the list of its parents in the search 122 | tree 123 | Returns: 124 | A dict mapping edges to credit value. Each key is a tuple of two strings 125 | representing an edge (e.g., ('A', 'B')). Make sure each of these tuples 126 | are sorted alphabetically (so, it's ('A', 'B'), not ('B', 'A')). 127 | 128 | Any edges excluded from the results in bfs should also be exluded here. 
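As a concrete example of these rules: in the doctest below, node G is reached from the root E by two equally short paths (one through D, one through F), so G's credit of 1 is split evenly, contributing 0.5 to edge ('D', 'G') and 0.5 to edge ('F', 'G').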
129 | 130 | >>> node2distances, node2num_paths, node2parents = bfs(example_graph(), 'E', 5) 131 | >>> result = bottom_up('E', node2distances, node2num_paths, node2parents) 132 | >>> sorted(result.items()) 133 | [(('A', 'B'), 1.0), (('B', 'C'), 1.0), (('B', 'D'), 3.0), (('D', 'E'), 4.5), (('D', 'G'), 0.5), (('E', 'F'), 1.5), (('F', 'G'), 0.5)] 134 | """ 135 | ###TODO 136 | pass 137 | 138 | 139 | def approximate_betweenness(graph, max_depth): 140 | """ 141 | Compute the approximate betweenness of each edge, using max_depth to reduce 142 | computation time in breadth-first search. 143 | 144 | You should call the bfs and bottom_up functions defined above for each node 145 | in the graph, and sum together the results. Be sure to divide by 2 at the 146 | end to get the final betweenness. 147 | 148 | Params: 149 | graph.......A networkx Graph 150 | max_depth...An integer representing the maximum depth to search. 151 | 152 | Returns: 153 | A dict mapping edges to betweenness. Each key is a tuple of two strings 154 | representing an edge (e.g., ('A', 'B')). Make sure each of these tuples 155 | are sorted alphabetically (so, it's ('A', 'B'), not ('B', 'A')). 156 | 157 | >>> sorted(approximate_betweenness(example_graph(), 2).items()) 158 | [(('A', 'B'), 2.0), (('A', 'C'), 1.0), (('B', 'C'), 2.0), (('B', 'D'), 6.0), (('D', 'E'), 2.5), (('D', 'F'), 2.0), (('D', 'G'), 2.5), (('E', 'F'), 1.5), (('F', 'G'), 1.5)] 159 | """ 160 | ###TODO 161 | pass 162 | 163 | 164 | def get_components(graph): 165 | """ 166 | A helper function you may use below. 167 | Returns the list of all connected components in the given graph. 168 | """ 169 | return [graph.subgraph(c).copy() for c in nx.connected_components(graph)] 170 | 171 | def partition_girvan_newman(graph, max_depth): 172 | """ 173 | Use your approximate_betweenness implementation to partition a graph. 174 | Unlike in class, here you will not implement this recursively. Instead, 175 | just remove edges until more than one component is created, then return 176 | those components. 177 | That is, compute the approximate betweenness of all edges, and remove 178 | them until multiple components are created. 179 | 180 | You only need to compute the betweenness once. 181 | If there are ties in edge betweenness, break by edge name (e.g., 182 | (('A', 'B'), 1.0) comes before (('B', 'C'), 1.0)). 183 | 184 | Note: the original graph variable should not be modified. Instead, 185 | make a copy of the original graph prior to removing edges. 186 | See the Graph.copy method https://networkx.github.io/documentation/stable/reference/classes/generated/networkx.Graph.copy.html 187 | Params: 188 | graph.......A networkx Graph 189 | max_depth...An integer representing the maximum depth to search. 190 | 191 | Returns: 192 | A list of networkx Graph objects, one per partition. 193 | 194 | >>> components = partition_girvan_newman(example_graph(), 5) 195 | >>> components = sorted(components, key=lambda x: sorted(x.nodes())[0]) 196 | >>> sorted(components[0].nodes()) 197 | ['A', 'B', 'C'] 198 | >>> sorted(components[1].nodes()) 199 | ['D', 'E', 'F', 'G'] 200 | """ 201 | ###TODO 202 | pass 203 | 204 | def get_subgraph(graph, min_degree): 205 | """Return a subgraph containing nodes whose degree is 206 | greater than or equal to min_degree. 207 | We'll use this in the main method to prune the original graph. 208 | 209 | Params: 210 | graph........a networkx graph 211 | min_degree...degree threshold 212 | Returns: 213 | a networkx graph, filtered as defined above. 
214 | 215 | >>> subgraph = get_subgraph(example_graph(), 3) 216 | >>> sorted(subgraph.nodes()) 217 | ['B', 'D', 'F'] 218 | >>> len(subgraph.edges()) 219 | 2 220 | """ 221 | ###TODO 222 | pass 223 | 224 | 225 | """" 226 | Compute the normalized cut for each discovered cluster. 227 | I've broken this down into the three next methods. 228 | """ 229 | 230 | def volume(nodes, graph): 231 | """ 232 | Compute the volume for a list of nodes, which 233 | is the number of edges in `graph` with at least one end in 234 | nodes. 235 | Params: 236 | nodes...a list of strings for the nodes to compute the volume of. 237 | graph...a networkx graph 238 | 239 | >>> volume(['A', 'B', 'C'], example_graph()) 240 | 4 241 | """ 242 | ###TODO 243 | pass 244 | 245 | 246 | def cut(S, T, graph): 247 | """ 248 | Compute the cut-set of the cut (S,T), which is 249 | the set of edges that have one endpoint in S and 250 | the other in T. 251 | Params: 252 | S.......set of nodes in first subset 253 | T.......set of nodes in second subset 254 | graph...networkx graph 255 | Returns: 256 | An int representing the cut-set. 257 | 258 | >>> cut(['A', 'B', 'C'], ['D', 'E', 'F', 'G'], example_graph()) 259 | 1 260 | """ 261 | ###TODO 262 | pass 263 | 264 | 265 | def norm_cut(S, T, graph): 266 | """ 267 | The normalized cut value for the cut S/T. (See lec06.) 268 | Params: 269 | S.......set of nodes in first subset 270 | T.......set of nodes in second subset 271 | graph...networkx graph 272 | Returns: 273 | An float representing the normalized cut value 274 | 275 | """ 276 | ###TODO 277 | pass 278 | 279 | def score_max_depths(graph, max_depths): 280 | """ 281 | In order to assess the quality of the approximate partitioning method 282 | we've developed, we will run it with different values for max_depth 283 | and see how it affects the norm_cut score of the resulting partitions. 284 | Recall that smaller norm_cut scores correspond to better partitions. 285 | 286 | Params: 287 | graph........a networkx Graph 288 | max_depths...a list of ints for the max_depth values to be passed 289 | to calls to partition_girvan_newman 290 | 291 | Returns: 292 | A list of (int, float) tuples representing the max_depth and the 293 | norm_cut value obtained by the partitions returned by 294 | partition_girvan_newman. See Log.txt for an example. 295 | """ 296 | ###TODO 297 | pass 298 | 299 | """ 300 | Next, use eigenvalue decomposition to partition a graph. 301 | """ 302 | 303 | def get_second_eigenvector(graph): 304 | """ 305 | 1. Create the Laplacian matrix. 306 | 2. Obtain its eigenvector matrix using the eigh function. 307 | 3. Return the second column eigenvector 308 | 309 | Returns: 310 | a 1d numpy array containing the second eigenvector 311 | 312 | >>> np.round(get_second_eigenvector(example_graph()), 2) 313 | array([ 0.49, 0.3 , 0.49, -0.21, -0.36, -0.36, -0.36]) 314 | """ 315 | ###TODO 316 | pass 317 | 318 | def partition_by_eigenvector(graph): 319 | """ 320 | Using the get_second_eigenvector function above, partition the graph into 321 | two components using a splitting threshold of 0. That is, nodes 322 | whose corresponding value in the second eigenvector is >= 0 are in one cluster, 323 | and the rest are in the other cluster. 324 | 325 | Returns: 326 | A list of two networkx Graph objects, one per partition. 327 | Sort these in ascending order of partition size. 
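One possible sketch (assuming your get_second_eigenvector above is correct): pair each entry of the second eigenvector with the corresponding node in graph.nodes() (the default node order used when building the Laplacian), put nodes with value >= 0 in one group and the remaining nodes in the other, then return the two induced subgraphs (graph.subgraph(...)), ordered by number of nodes.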
328 | 329 | >>> graph = example_graph() 330 | >>> result = partition_by_eigenvector(graph) 331 | >>> sorted(result[0].nodes()) 332 | ['A', 'B', 'C'] 333 | >>> sorted(result[1].nodes()) 334 | ['D', 'E', 'F', 'G'] 335 | >>> round(norm_cut(result[0].nodes(), result[1].nodes(), graph), 2) 336 | 0.42 337 | """ 338 | ###TODO 339 | pass 340 | 341 | """ 342 | Next, we'll download a real dataset to see how our algorithm performs. 343 | """ 344 | def download_data(): 345 | """ 346 | Download the data. Done for you. 347 | """ 348 | urllib.request.urlretrieve('http://cs.iit.edu/~culotta/cs579/a1/edges.txt.gz', 'edges.txt.gz') 349 | 350 | 351 | def read_graph(): 352 | """ Read 'edges.txt.gz' into a networkx **undirected** graph. 353 | Done for you. 354 | Returns: 355 | A networkx undirected graph. 356 | """ 357 | return nx.read_edgelist('edges.txt.gz', delimiter='\t') 358 | 359 | def main(): 360 | """ 361 | FYI: This takes ~10-15 seconds to run on my laptop. 362 | """ 363 | download_data() 364 | graph = read_graph() 365 | print('full graph has %d nodes and %d edges' % 366 | (graph.order(), graph.number_of_edges())) 367 | subgraph = get_subgraph(graph, 2) 368 | print('subgraph has %d nodes and %d edges' % 369 | (subgraph.order(), subgraph.number_of_edges())) 370 | print('\n\ncomputing norm_cut scores by max_depth...\nmax_depth\tnorm_cut_score') 371 | for max_depth, score in score_max_depths(subgraph, range(1,5)): 372 | print('%d\t\t%.3f' % (max_depth, score)) 373 | print('\n\ngetting result with max_depth=3') 374 | clusters = partition_girvan_newman(subgraph, 3) 375 | print('%d clusters' % len(clusters)) 376 | print('first partition: cluster 1 has %d nodes and cluster 2 has %d nodes' % 377 | (clusters[0].order(), clusters[1].order())) 378 | print('smaller cluster nodes:') 379 | print(sorted(sorted(clusters, key=lambda x: x.order())[0].nodes())) 380 | 381 | print('\n\npartitioning by eigenvector...') 382 | clusters2 = partition_by_eigenvector(subgraph) 383 | print('cluster 1 has %d nodes and cluster 2 has %d nodes' % 384 | (clusters2[0].order(), clusters2[1].order())) 385 | print('norm_cut score=%.3f' % norm_cut(clusters2[0].nodes(), 386 | clusters2[1].nodes(), 387 | subgraph)) 388 | print('10 nodes from smaller cluster:') 389 | print(sorted(clusters2[0].nodes())[:10]) 390 | 391 | 392 | 393 | if __name__ == '__main__': 394 | main() 395 | -------------------------------------------------------------------------------- /a2/.gitignore: -------------------------------------------------------------------------------- 1 | data/ 2 | imdg.tgz 3 | -------------------------------------------------------------------------------- /a2/Log.txt: -------------------------------------------------------------------------------- 1 | best cross-validation result: 2 | {'punct': True, 'features': (, ), 'min_freq': 2, 'accuracy': 0.7700000000000001} 3 | worst cross-validation result: 4 | {'punct': True, 'features': (,), 'min_freq': 2, 'accuracy': 0.6475} 5 | 6 | Mean Accuracies per Setting: 7 | features=token_pair_features lexicon_features: 0.75125 8 | features=token_features token_pair_features lexicon_features: 0.74583 9 | features=token_features token_pair_features: 0.73542 10 | features=token_pair_features: 0.72875 11 | min_freq=2: 0.72250 12 | punct=False: 0.72024 13 | min_freq=5: 0.71857 14 | punct=True: 0.70810 15 | min_freq=10: 0.70143 16 | features=token_features lexicon_features: 0.69667 17 | features=token_features: 0.69000 18 | features=lexicon_features: 0.65125 19 | 20 | TOP COEFFICIENTS PER CLASS: 21 | negative 
words: 22 | neg_words: 0.66113 23 | token_pair=the__worst: 0.37465 24 | token_pair=is__so: 0.31499 25 | token_pair=about__the: 0.30307 26 | token_pair=like__a: 0.27059 27 | 28 | positive words: 29 | pos_words: 0.52554 30 | token_pair=it__is: 0.24468 31 | token_pair=a__lot: 0.21013 32 | token_pair=to__find: 0.20849 33 | token_pair=the__and: 0.20232 34 | testing accuracy=0.730000 35 | 36 | TOP MISCLASSIFIED TEST DOCUMENTS: 37 | 38 | truth=0 predicted=1 proba=0.993731 39 | I absolutely despise this film. I wanted to love it - I really wanted to. But man, oh man - they were SO off with Sara. And the father living was pretty cheesy. That's straight out of the Shirley Temple film.

I highly recommend THE BOOK. It is amazing. In the book, Sara is honorable and decent and she does the right thing... BECAUSE IT IS RIGHT. She doesn't have a spiteful bone in her body.

In the film, she is mean-spirited and spiteful. She does little things to get back at Miss Minchin. In the book, Sara is above such things. She DOES stand up to Miss Minchin. She tells the truth and is not cowed by her. But she does not do the stupid, spiteful things that the Sara in the film does.

It's really rather unsettling to me that so many here say they loved the book and they love the movie. I can't help but wonder... did we read the same book? The whole point of the book was personal responsibility, behaving with honor and integrity, ALWAYS telling the truth and facing adversity with calm and integrity.

Sara has a happy ending in the book - not the ridiculous survival of her father, but the joining with his partner who has been searching for her. In the book, she is taken in by this new father figure who loves and cares for her and Becky. And Miss Minchin is NOT a chimney sweep - that part of the film really was stupid.

To see all this praise for this wretched film is disturbing to me. We are praising a film that glorifies petty, spiteful behavior with a few tips of the hat to kindness? Sara in the book was kind to the bone and full of integrity. I don't even recognize her in the film... she's not in it.

Good thing Mrs. Burnett isn't alive to see this horrid thing. It's ghastly and undeserving to bear the title of her book. 40 | 41 | truth=0 predicted=1 proba=0.991584 42 | When I attended college in the early 70s, it was a simpler time. Except for a brief occurrence in 1994, I've been totally free of the influence of illegal substances ever since and I've never regretted it...until now. DB:TBTE has got to be, hands-down, the best movie to watch when stoned. The odd, dreamlike state it creates is very strange when you're not smoking anything, but I'm sure that it would seem completely normal after a big doobie. (Not that I'm recommending this, you understand.) The soothing narration, provided, as it usually is in quality cinema, by a TB victim trapped in a painting, would be ideal to help the stoned viewer to follow along as things get complicated. Plus, everything in the film is pretty organic...from old-fashioned natural breasts to the bucket of fried chicken.

Now, there's also no question that the young man with the (ahem) "hand problem" is absolutely sailing away in the film. At one point, you just KNOW that he's going to say, "Hey! When I move my hand, it leaves trails!!" Trust me...you'll know when you get to that point.

The only other thing we have to address is this: How good can a film be when at least half the budget was spent on moving a huge bed frame around for interior and exterior shots?

Definitely a must-see for horror aficionados, but suitable for the general audiences under the right conditions (if you know what I mean, and I think that you do). It only earns four stars because I can't actually say that it took any talent to make. 43 | 44 | truth=1 predicted=0 proba=0.990791 45 | In defense of this movie I must repeat what I had stated previously. The movie is called Arachina, it has a no name cast and I do not mean no name as in actors who play in little seen art house films. I mean no name as in your local high school decided to make a film no name and it might have a 2 dollar budget. So what does one expect? Hitchcock?

I felt the movie never took itself seriously which automatically takes it out of the worst movie list. That list is only for big budget all star cast movies that takes itself way too seriously. THe movie The Oscar comes to mind, most of Sylvester Stallone's movies. THe two leads were not Hepburn and Tracy but they did their jobs well enough for this movie. The woman kicked butt and the guy was not a blithering idiot. The actor who played the old man was actually very good. The man who played anal retentive professor was no Clifton Webb but he did a god job. And the Bimbo's for lack of a better were played by two competent actors. I laughed at the 50 cent special effects. But that was part of the charm of the movie. It played like a hybrid Tremors meets Night of the Living Dead. The premise of the movie is just like all Giant Bug movies of the 50's. A Meteor or radiation stir up the ecosystem and before you know it we have Giant Ants, Lobsters, rocks or Lizards terrorizing the locals. A meteor was the cause of the problems this time. I was was very entertained. I didn't expect much and I go a lot more then I bargained for. 46 | 47 | truth=1 predicted=0 proba=0.990739 48 | Being a freshman in college, this movie reminded me of my relationship with my mom. Of course, my situation doesn't parrallel with Natalie Portman and Surandon's situation; but my mom and I have grown up with the typical mother and daughter fights. There is always the mother telling you what to do, or not being the kind of mother you want to be. I was balling my eyes at the end of this movie. Surandon's reaction of her daughter going to the East coast, miles away, after all they've been through reminded me of how I felt, being from a small city in the West coast, going to New York.

The movie is meant for women who have children that are now all grown up. It is very touching, I was moved by the movie. Every feeling out of the characters in this movie was utterly real, you didn't get any phony sentimentality. I was sitting through the credits at the screening of this movie, alone, wishing my mother was sitting next to me so I could hug her and thank her for everything. This movie is a bit corny of course, but everything is trully momentous. Its all about what a mom can learn from her child; and what a child learns from her mother. 8/10 49 | 50 | truth=1 predicted=0 proba=0.974652 51 | Ah, classic comedy. At the point in the movie where brains get messed together, a two minute scene with Bruce Campbell beating himself up partially, reminds me of how simplistic movies and ideas can grab you and wrap you into a whole movie.

For years and years, Bruce Campbell knows what kind of movies we want out of him. We want to see weird movies like Bubba Ho Tep. We want to see cameo roles in Sam Raimi movies, and we want to see 'Man with the Screaming Brain'. With the title alone, one knows that it's going to border that completely silly type of movie, like Army of Darkness, only with more silly and less monsters.

The idea of the movie is simple. Bruce sees doctor. Doctor has new idea. Bruce gets bad things happen to him on way to see doctor. Coincidentally, it's the thing the doctor wanted to show him that saves him. Hilarity ensues.

With the addition of Ted Raimi as a weird Russian guy, and journeyman Stacy Keach as Dr. Ivan Ivanovich Ivanov, it's funny, that does this movie. Complete funny. Never a point of scary.

If you like the silly Bruce Campbell, you'll like this. Then again, why would you be watching this if you didn't like Bruce Campbell? 52 | -------------------------------------------------------------------------------- /a2/README.md: -------------------------------------------------------------------------------- 1 | In this assignment, you will use sklearn to classify movie reviews as positive or negative. You will implement a number of features and compare accuracy. 2 | 3 | Finally, read ShortAnswer.md, which requires you to come up with new features to improve the classifier. 4 | 5 | -------------------------------------------------------------------------------- /a2/ShortAnswer.txt: -------------------------------------------------------------------------------- 1 | 1. Looking at the top errors printed by get_top_misclassified, name two ways you would modify your classifier to improve accuracy (it could be features, tokenization, or something else.) 2 | 3 | 4 | 5 | 6 | 7 | 8 | 2. Implement one of the above methods. How did it affect the results? -------------------------------------------------------------------------------- /a2/a2.py: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | 3 | """ 4 | CS579: Assignment 2 5 | 6 | In this assignment, you will build a text classifier to determine whether a 7 | movie review is expressing positive or negative sentiment. The data come from 8 | the website IMDB.com. 9 | 10 | You'll write code to preprocess the data in different ways (creating different 11 | features), then compare the cross-validation accuracy of each approach. Then, 12 | you'll compute accuracy on a test set and do some analysis of the errors. 13 | 14 | The main method takes about 40 seconds for me to run on my laptop. Places to 15 | check for inefficiency include the vectorize function and the 16 | eval_all_combinations function. 17 | 18 | Complete the 14 methods below, indicated by TODO. 19 | 20 | As usual, completing one method at a time, and debugging with doctests, should 21 | help. 22 | """ 23 | 24 | # No imports allowed besides these. 25 | from collections import Counter, defaultdict 26 | from itertools import chain, combinations 27 | import glob 28 | import matplotlib.pyplot as plt 29 | import numpy as np 30 | import os 31 | import re 32 | from scipy.sparse import csr_matrix 33 | from sklearn.model_selection import KFold 34 | from sklearn.linear_model import LogisticRegression 35 | import string 36 | import tarfile 37 | import urllib.request 38 | 39 | 40 | def download_data(): 41 | """ Download and unzip data. 42 | DONE ALREADY. 43 | """ 44 | url = 'https://www.dropbox.com/s/8oehplrobcgi9cq/imdb.tgz?dl=1' 45 | urllib.request.urlretrieve(url, 'imdb.tgz') 46 | tar = tarfile.open("imdb.tgz") 47 | tar.extractall() 48 | tar.close() 49 | 50 | 51 | def read_data(path): 52 | """ 53 | Walks all subdirectories of this path and reads all 54 | the text files and labels. 55 | DONE ALREADY. 56 | 57 | Params: 58 | path....path to files 59 | Returns: 60 | docs.....list of strings, one per document 61 | labels...list of ints, 1=positive, 0=negative label. 
62 | Inferred from file path (i.e., if it contains 63 | 'pos', it is 1, else 0) 64 | """ 65 | fnames = sorted([f for f in glob.glob(os.path.join(path, 'pos', '*.txt'))]) 66 | data = [(1, open(f).readlines()[0]) for f in sorted(fnames)] 67 | fnames = sorted([f for f in glob.glob(os.path.join(path, 'neg', '*.txt'))]) 68 | data += [(0, open(f).readlines()[0]) for f in sorted(fnames)] 69 | data = sorted(data, key=lambda x: x[1]) 70 | return np.array([d[1] for d in data]), np.array([d[0] for d in data]) 71 | 72 | 73 | def tokenize(doc, keep_internal_punct=False): 74 | """ 75 | Tokenize a string. 76 | The string should be converted to lowercase. 77 | If keep_internal_punct is False, then return only the alphanumerics (letters, numbers and underscore). 78 | If keep_internal_punct is True, then also retain punctuation that 79 | is inside of a word. E.g., in the example below, the token "isn't" 80 | is maintained when keep_internal_punct=True; otherwise, it is 81 | split into "isn" and "t" tokens. 82 | 83 | Params: 84 | doc....a string. 85 | keep_internal_punct...see above 86 | Returns: 87 | a numpy array containing the resulting tokens. 88 | 89 | >>> tokenize(" Hi there! Isn't this fun?", keep_internal_punct=False) 90 | array(['hi', 'there', 'isn', 't', 'this', 'fun'], dtype='>> tokenize("Hi there! Isn't this fun? ", keep_internal_punct=True) 92 | array(['hi', 'there', "isn't", 'this', 'fun'], dtype='>> feats = defaultdict(lambda: 0) 142 | >>> token_pair_features(np.array(['a', 'b', 'c', 'd']), feats) 143 | >>> sorted(feats.items()) 144 | [('token_pair=a__b', 1), ('token_pair=a__c', 1), ('token_pair=b__c', 2), ('token_pair=b__d', 1), ('token_pair=c__d', 1)] 145 | """ 146 | ###TODO 147 | pass 148 | 149 | 150 | neg_words = set(['bad', 'hate', 'horrible', 'worst', 'boring']) 151 | pos_words = set(['awesome', 'amazing', 'best', 'good', 'great', 'love', 'wonderful']) 152 | 153 | def lexicon_features(tokens, feats): 154 | """ 155 | Add features indicating how many time a token appears that matches either 156 | the neg_words or pos_words (defined above). The matching should ignore 157 | case. 158 | 159 | Params: 160 | tokens...array of token strings from a document. 161 | feats....dict from feature name to frequency 162 | Returns: 163 | nothing; feats is modified in place. 164 | 165 | In this example, 'LOVE' and 'great' match the pos_words, 166 | and 'boring' matches the neg_words list. 167 | >>> feats = defaultdict(lambda: 0) 168 | >>> lexicon_features(np.array(['i', 'LOVE', 'this', 'great', 'boring', 'movie']), feats) 169 | >>> sorted(feats.items()) 170 | [('neg_words', 1), ('pos_words', 2)] 171 | """ 172 | ###TODO 173 | pass 174 | 175 | 176 | def featurize(tokens, feature_fns): 177 | """ 178 | Compute all features for a list of tokens from 179 | a single document. 180 | 181 | Params: 182 | tokens........array of token strings from a document. 183 | feature_fns...a list of functions, one per feature 184 | Returns: 185 | list of (feature, value) tuples, SORTED alphabetically 186 | by the feature name. 
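One possible approach (optional): create a single feats = defaultdict(lambda: 0), call each function in feature_fns with (tokens, feats) so they all accumulate into the same dict, and return sorted(feats.items()).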
187 | 188 | >>> feats = featurize(np.array(['i', 'LOVE', 'this', 'great', 'movie']), [token_features, lexicon_features]) 189 | >>> feats 190 | [('neg_words', 0), ('pos_words', 2), ('token=LOVE', 1), ('token=great', 1), ('token=i', 1), ('token=movie', 1), ('token=this', 1)] 191 | """ 192 | ###TODO 193 | pass 194 | 195 | 196 | def vectorize(tokens_list, feature_fns, min_freq, vocab=None): 197 | """ 198 | Given the tokens for a set of documents, create a sparse 199 | feature matrix, where each row represents a document, and 200 | each column represents a feature. 201 | 202 | Params: 203 | tokens_list...a list of lists; each sublist is an 204 | array of token strings from a document. 205 | feature_fns...a list of functions, one per feature 206 | min_freq......Remove features that do not appear in 207 | at least min_freq different documents. 208 | Returns: 209 | - a csr_matrix: See https://goo.gl/f5TiF1 for documentation. 210 | This is a sparse matrix (zero values are not stored). 211 | - vocab: a dict from feature name to column index. NOTE 212 | that the columns are sorted alphabetically (so, the feature 213 | "token=great" is column 0 and "token=horrible" is column 1 214 | because "great" < "horrible" alphabetically), 215 | 216 | When vocab is None, we build a new vocabulary from the given data. 217 | when vocab is not None, we do not build a new vocab, and we do not 218 | add any new terms to the vocabulary. This setting is to be used 219 | at test time. 220 | 221 | >>> docs = ["Isn't this movie great?", "Horrible, horrible movie"] 222 | >>> tokens_list = [tokenize(d) for d in docs] 223 | >>> feature_fns = [token_features] 224 | >>> X, vocab = vectorize(tokens_list, feature_fns, min_freq=1) 225 | >>> type(X) 226 | 227 | >>> X.toarray() 228 | array([[1, 0, 1, 1, 1, 1], 229 | [0, 2, 0, 1, 0, 0]], dtype=int64) 230 | >>> sorted(vocab.items(), key=lambda x: x[1]) 231 | [('token=great', 0), ('token=horrible', 1), ('token=isn', 2), ('token=movie', 3), ('token=t', 4), ('token=this', 5)] 232 | """ 233 | ###TODO 234 | pass 235 | 236 | 237 | def accuracy_score(truth, predicted): 238 | """ Compute accuracy of predictions. 239 | DONE ALREADY 240 | Params: 241 | truth.......array of true labels (0 or 1) 242 | predicted...array of predicted labels (0 or 1) 243 | """ 244 | return len(np.where(truth==predicted)[0]) / len(truth) 245 | 246 | 247 | def cross_validation_accuracy(clf, X, labels, k): 248 | """ 249 | Compute the average testing accuracy over k folds of cross-validation. You 250 | can use sklearn's KFold class here (no random seed, and no shuffling 251 | needed). 252 | 253 | Params: 254 | clf......A LogisticRegression classifier. 255 | X........A csr_matrix of features. 256 | labels...The true labels for each instance in X 257 | k........The number of cross-validation folds. 258 | 259 | Returns: 260 | The average testing accuracy of the classifier 261 | over each fold of cross-validation. 262 | """ 263 | ###TODO 264 | pass 265 | 266 | 267 | def eval_all_combinations(docs, labels, punct_vals, 268 | feature_fns, min_freqs): 269 | """ 270 | Enumerate all possible classifier settings and compute the 271 | cross validation accuracy for each setting. We will use this 272 | to determine which setting has the best accuracy. 273 | 274 | For each setting, construct a LogisticRegression classifier 275 | and compute its cross-validation accuracy for that setting. 
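(For the values used in main() below, that is two punct settings, three min_freq values, and the seven feature-function combinations described next, i.e. 2 * 3 * 7 = 42 settings in total.)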
276 | 277 | In addition to looping over possible assignments to 278 | keep_internal_punct and min_freqs, we will enumerate all 279 | possible combinations of feature functions. So, if 280 | feature_fns = [token_features, token_pair_features, lexicon_features], 281 | then we will consider all 7 combinations of features (see Log.txt 282 | for more examples). 283 | 284 | Params: 285 | docs..........The list of original training documents. 286 | labels........The true labels for each training document (0 or 1) 287 | punct_vals....List of possible assignments to 288 | keep_internal_punct (e.g., [True, False]) 289 | feature_fns...List of possible feature functions to use 290 | min_freqs.....List of possible min_freq values to use 291 | (e.g., [2,5,10]) 292 | 293 | Returns: 294 | A list of dicts, one per combination. Each dict has 295 | four keys: 296 | 'punct': True or False, the setting of keep_internal_punct 297 | 'features': The list of functions used to compute features. 298 | 'min_freq': The setting of the min_freq parameter. 299 | 'accuracy': The average cross_validation accuracy for this setting, using 5 folds. 300 | 301 | This list should be SORTED in descending order of accuracy. 302 | 303 | This function will take a bit longer to run (~20s for me). 304 | """ 305 | ###TODO 306 | pass 307 | 308 | 309 | def plot_sorted_accuracies(results): 310 | """ 311 | Plot all accuracies from the result of eval_all_combinations 312 | in ascending order of accuracy. 313 | Save to "accuracies.png". 314 | """ 315 | ###TODO 316 | pass 317 | 318 | 319 | def mean_accuracy_per_setting(results): 320 | """ 321 | To determine how important each model setting is to overall accuracy, 322 | we'll compute the mean accuracy of all combinations with a particular 323 | setting. For example, compute the mean accuracy of all runs with 324 | min_freq=2. 325 | 326 | Params: 327 | results...The output of eval_all_combinations 328 | Returns: 329 | A list of (accuracy, setting) tuples, SORTED in 330 | descending order of accuracy. 331 | """ 332 | ###TODO 333 | pass 334 | 335 | 336 | def fit_best_classifier(docs, labels, best_result): 337 | """ 338 | Using the best setting from eval_all_combinations, 339 | re-vectorize all the training data and fit a 340 | LogisticRegression classifier to all training data. 341 | (i.e., no cross-validation done here) 342 | 343 | Params: 344 | docs..........List of training document strings. 345 | labels........The true labels for each training document (0 or 1) 346 | best_result...Element of eval_all_combinations 347 | with highest accuracy 348 | Returns: 349 | clf.....A LogisticRegression classifier fit to all 350 | training data. 351 | vocab...The dict from feature name to column index. 352 | """ 353 | ###TODO 354 | pass 355 | 356 | 357 | def top_coefs(clf, label, n, vocab): 358 | """ 359 | Find the n features with the highest coefficients in 360 | this classifier for this label. 361 | See the .coef_ attribute of LogisticRegression. 362 | 363 | Params: 364 | clf.....LogisticRegression classifier 365 | label...1 or 0; if 1, return the top coefficients 366 | for the positive class; else for negative. 367 | n.......The number of coefficients to return. 368 | vocab...Dict from feature name to column index. 369 | Returns: 370 | List of (feature_name, coefficient) tuples, SORTED 371 | in descending order of the coefficient for the 372 | given class label. 
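Hint (one possible approach): for a binary problem, sklearn stores a single row of weights in clf.coef_[0]; the largest positive values point to the positive class (label 1) and the largest negative values to the negative class (label 0), so for label 0 you can rank features by the negated coefficients.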
373 | """ 374 | ###TODO 375 | pass 376 | 377 | 378 | def parse_test_data(best_result, vocab): 379 | """ 380 | Using the vocabulary fit to the training data, read 381 | and vectorize the testing data. Note that vocab should 382 | be passed to the vectorize function to ensure the feature 383 | mapping is consistent from training to testing. 384 | 385 | Note: use read_data function defined above to read the 386 | test data. 387 | 388 | Params: 389 | best_result...Element of eval_all_combinations 390 | with highest accuracy 391 | vocab.........dict from feature name to column index, 392 | built from the training data. 393 | Returns: 394 | test_docs.....List of strings, one per testing document, 395 | containing the raw text. 396 | test_labels...List of ints, one per testing document, 397 | 1 for positive, 0 for negative. 398 | X_test........A csr_matrix representing the features 399 | in the test data. Each row is a document, 400 | each column is a feature. 401 | """ 402 | ###TODO 403 | pass 404 | 405 | 406 | def print_top_misclassified(test_docs, test_labels, X_test, clf, n): 407 | """ 408 | Print the n testing documents that are misclassified by the 409 | largest margin. By using the .predict_proba function of 410 | LogisticRegression, we can get the 411 | predicted probabilities of each class for each instance. 412 | We will first identify all incorrectly classified documents, 413 | then sort them in descending order of the predicted probability 414 | for the incorrect class. 415 | E.g., if document i is misclassified as positive, we will 416 | consider the probability of the positive class when sorting. 417 | 418 | Params: 419 | test_docs.....List of strings, one per test document 420 | test_labels...Array of true testing labels 421 | X_test........csr_matrix for test data 422 | clf...........LogisticRegression classifier fit on all training 423 | data. 424 | n.............The number of documents to print. 425 | 426 | Returns: 427 | Nothing; see Log.txt for example printed output. 428 | """ 429 | ###TODO 430 | pass 431 | 432 | 433 | def main(): 434 | """ 435 | Put it all together. 436 | ALREADY DONE. 437 | """ 438 | feature_fns = [token_features, token_pair_features, lexicon_features] 439 | # Download and read data. 440 | download_data() 441 | docs, labels = read_data(os.path.join('data', 'train')) 442 | # Evaluate accuracy of many combinations 443 | # of tokenization/featurization. 444 | results = eval_all_combinations(docs, labels, 445 | [True, False], 446 | feature_fns, 447 | [2,5,10]) 448 | # Print information about these results. 449 | best_result = results[0] 450 | worst_result = results[-1] 451 | print('best cross-validation result:\n%s' % str(best_result)) 452 | print('worst cross-validation result:\n%s' % str(worst_result)) 453 | plot_sorted_accuracies(results) 454 | print('\nMean Accuracies per Setting:') 455 | print('\n'.join(['%s: %.5f' % (s,v) for v,s in mean_accuracy_per_setting(results)])) 456 | 457 | # Fit best classifier. 458 | clf, vocab = fit_best_classifier(docs, labels, results[0]) 459 | 460 | # Print top coefficients per class. 461 | print('\nTOP COEFFICIENTS PER CLASS:') 462 | print('negative words:') 463 | print('\n'.join(['%s: %.5f' % (t,v) for t,v in top_coefs(clf, 0, 5, vocab)])) 464 | print('\npositive words:') 465 | print('\n'.join(['%s: %.5f' % (t,v) for t,v in top_coefs(clf, 1, 5, vocab)])) 466 | 467 | # Parse test data 468 | test_docs, test_labels, X_test = parse_test_data(best_result, vocab) 469 | 470 | # Evaluate on test set.
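# Note: clf.predict returns hard 0/1 labels, which accuracy_score compares
# against test_labels below. print_top_misclassified instead needs the soft
# scores from clf.predict_proba(X_test), an array of shape (n_docs, 2) whose
# columns follow clf.classes_; for 0/1 labels, column 1 is the probability of
# the positive class.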
471 | predictions = clf.predict(X_test) 472 | print('testing accuracy=%f' % 473 | accuracy_score(test_labels, predictions)) 474 | 475 | print('\nTOP MISCLASSIFIED TEST DOCUMENTS:') 476 | print_top_misclassified(test_docs, test_labels, X_test, clf, 5) 477 | 478 | 479 | if __name__ == '__main__': 480 | main() 481 | -------------------------------------------------------------------------------- /a2/accuracies.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iit-cs579/assignments/07b7e41763890df74d72bf9bbb30dbb1fd670ea8/a2/accuracies.png -------------------------------------------------------------------------------- /bonus/README.md: -------------------------------------------------------------------------------- 1 | Worth 15 bonus points. 2 | 3 | See bonus.py 4 | -------------------------------------------------------------------------------- /bonus/bonus.py: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | 3 | # Bonus: Recommendation systems 4 | # 5 | # Here we'll implement a content-based recommendation algorithm. 6 | # It will use the list of genres for a movie as the content. 7 | # The data come from the MovieLens project: http://grouplens.org/datasets/movielens/ 8 | # Note that I have not provided many doctests for this one. I strongly 9 | # recommend that you write your own for each function to ensure your 10 | # implementation is correct. 11 | 12 | # Please only use these imports. 13 | from collections import Counter, defaultdict 14 | import math 15 | import numpy as np 16 | import os 17 | import pandas as pd 18 | import re 19 | from scipy.sparse import csr_matrix 20 | import urllib.request 21 | import zipfile 22 | 23 | def download_data(): 24 | """ DONE. Download and unzip data. 25 | """ 26 | url = 'https://www.dropbox.com/s/p9wmkvbqt1xr6lc/ml-latest-small.zip?dl=1' 27 | urllib.request.urlretrieve(url, 'ml-latest-small.zip') 28 | zfile = zipfile.ZipFile('ml-latest-small.zip') 29 | zfile.extractall() 30 | zfile.close() 31 | 32 | 33 | def tokenize_string(my_string): 34 | """ DONE. You should use this in your tokenize function. 35 | """ 36 | return re.findall('[\w\-]+', my_string.lower()) 37 | 38 | 39 | def tokenize(movies): 40 | """ 41 | Append a new column to the movies DataFrame with header 'tokens'. 42 | This will contain a list of strings, one per token, extracted 43 | from the 'genre' field of each movie. Use the tokenize_string method above. 44 | 45 | Note: you may modify the movies parameter directly; no need to make 46 | a new copy. 47 | Params: 48 | movies...The movies DataFrame 49 | Returns: 50 | The movies DataFrame, augmented to include a new column called 'tokens'. 51 | 52 | >>> movies = pd.DataFrame([[123, 'Horror|Romance'], [456, 'Sci-Fi']], columns=['movieId', 'genres']) 53 | >>> movies = tokenize(movies) 54 | >>> movies['tokens'].tolist() 55 | [['horror', 'romance'], ['sci-fi']] 56 | """ 57 | ###TODO 58 | pass 59 | 60 | 61 | def featurize(movies): 62 | """ 63 | Append a new column to the movies DataFrame with header 'features'. 64 | Each row will contain a csr_matrix of shape (1, num_features). 
Each 65 | entry in this matrix will contain the tf-idf value of the term, as 66 | defined in class: 67 | tfidf(i, d) := ( tf(i, d) / max_k tf(k, d) ) * log10(N / df(i)) 68 | where: 69 | i is a term 70 | d is a document (movie) 71 | tf(i, d) is the frequency of term i in document d 72 | max_k tf(k, d) is the maximum frequency of any term in document d 73 | N is the number of documents (movies) 74 | df(i) is the number of unique documents containing term i 75 | 76 | Params: 77 | movies...The movies DataFrame 78 | Returns: 79 | A tuple containing: 80 | - The movies DataFrame, which has been modified to include a column named 'features'. 81 | - The vocab, a dict from term to int. Make sure the vocab is sorted alphabetically as in a2 (e.g., {'aardvark': 0, 'boy': 1, ...}) 82 | """ 83 | ###TODO 84 | pass 85 | 86 | 87 | def train_test_split(ratings): 88 | """DONE. 89 | Returns a deterministic split of the ratings matrix into a training and testing set (every 1000th rating goes to the test set). 90 | """ 91 | test = set(range(len(ratings))[::1000]) 92 | train = sorted(set(range(len(ratings))) - test) 93 | test = sorted(test) 94 | return ratings.iloc[train], ratings.iloc[test] 95 | 96 | 97 | def cosine_sim(a, b): 98 | """ 99 | Compute the cosine similarity between two 1-d csr_matrices. 100 | Each matrix represents the tf-idf feature vector of a movie. 101 | Params: 102 | a...A csr_matrix with shape (1, number_features) 103 | b...A csr_matrix with shape (1, number_features) 104 | Returns: 105 | A float. The cosine similarity, defined as: dot(a, b) / (||a|| * ||b||) 106 | where ||a|| indicates the Euclidean norm (aka L2 norm) of vector a. 107 | """ 108 | ###TODO 109 | pass 110 | 111 | 112 | def make_predictions(movies, ratings_train, ratings_test): 113 | """ 114 | Using the ratings in ratings_train, predict the ratings for each 115 | row in ratings_test. 116 | 117 | To predict the rating of user u for movie i: Compute the weighted average 118 | rating for every other movie that u has rated. Restrict this weighted 119 | average to movies that have a positive cosine similarity with movie 120 | i. The weight for movie m corresponds to the cosine similarity between m 121 | and i. 122 | 123 | If there are no other movies with positive cosine similarity to use in the 124 | prediction, use the mean rating of the target user in ratings_train as the 125 | prediction. 126 | 127 | Params: 128 | movies..........The movies DataFrame. 129 | ratings_train...The subset of ratings used for making predictions. These are the "historical" data. 130 | ratings_test....The subset of ratings that need to be predicted. These are the "future" data. 131 | Returns: 132 | A numpy array containing one predicted rating for each element of ratings_test. 133 | """ 134 | ###TODO 135 | pass 136 | 137 | 138 | def mean_absolute_error(predictions, ratings_test): 139 | """DONE. 140 | Return the mean absolute error of the predictions.
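As a minimal sketch of the cosine similarity above, computed directly on the sparse rows (illustrative only, with an invented helper name; it assumes a and b are 1-row csr_matrices):

```
import numpy as np

def cosine_sim_sketch(a, b):
    dot = a.multiply(b).sum()              # dot(a, b) on sparse rows
    norm_a = np.sqrt(a.multiply(a).sum())  # ||a||, the Euclidean norm
    norm_b = np.sqrt(b.multiply(b).sum())  # ||b||
    return dot / (norm_a * norm_b)
```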
141 | """ 142 | return np.abs(predictions - np.array(ratings_test.rating)).mean() 143 | 144 | 145 | def main(): 146 | download_data() 147 | path = 'ml-latest-small' 148 | ratings = pd.read_csv(path + os.path.sep + 'ratings.csv') 149 | movies = pd.read_csv(path + os.path.sep + 'movies.csv') 150 | movies = tokenize(movies) 151 | movies, vocab = featurize(movies) 152 | print('vocab:') 153 | print(sorted(vocab.items())[:10]) 154 | ratings_train, ratings_test = train_test_split(ratings) 155 | print('%d training ratings; %d testing ratings' % (len(ratings_train), len(ratings_test))) 156 | predictions = make_predictions(movies, ratings_train, ratings_test) 157 | print('error=%f' % mean_absolute_error(predictions, ratings_test)) 158 | print(predictions[:10]) 159 | 160 | 161 | if __name__ == '__main__': 162 | main() 163 | -------------------------------------------------------------------------------- /project/README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | ## Project 4 | 5 | The project is an open-ended investigation into an OSNA problem. 6 | 7 | Project guidelines: 8 | 9 | - The data used should be as raw as possible. E.g., you may not simply download a pre-processed dataset from the UCI repository. 10 | Instead, you should collect data directly from an online social networking source (e.g., Twitter, Facebook, Instagram, Reddit, etc.). 11 | - Groups may have up to 3 members. 12 | - Sample projects can be found [here](http://snap.stanford.edu/class/cs224w-2016/projects.html) or by reading recent papers published in ICWSM, a leading OSNA conference: 13 | - You may use existing libraries (e.g., nltk, TensorFlow, Theano), but your project should be your own. 14 | - After the Proposal survey is submitted (see below), a new private repository will be created for your group. 15 | - I've given you starter code that contains a command-line interface. You will implement each of the commands. 16 | - There's no need to store the raw data in GitHub, but it should be clear how to collect it by reading your report and code. 17 | - See an [example project repo](https://github.com/iit-cs579/sample-project). 18 | 19 | ### Proposal 20 | Complete this survey, **one per team**, to submit your proposal: 21 | 22 | 23 | 24 | You will specify the following: 25 | 26 | 1. Problem Overview: Describe the problem you are solving; what makes it interesting? 27 | 2. Data: Which data will you use? How will you collect it? What problems do you anticipate? 28 | 3. Method: What method or algorithm will you use? Will you use an existing library to do so? Do you plan to modify the code at all? 29 | 4. Related Work: List at least 5 references (with links) to research papers that are related to your project (use Google Scholar to search). 30 | 5. Evaluation: How will you evaluate your results? What baseline method will you compare against? What are the key plots or tables you will produce? What performance metrics will you use? What descriptive evaluation will you do (e.g., look at specific predictions made by your system; visualizations)? 31 | 32 | Once you've submitted the form, I will create GitHub repositories for each team, with the appropriate access. 33 | 34 | ### Milestone 35 | 36 | Your project milestone report will be 2-3 pages using the provided template. The following is a suggested structure for your report: 37 | 38 | 1. Title, Author(s) 39 | 2. Problem Overview: Describe the problem you are solving. State it as precisely as you can. 40 | 3.
Data: Which data are you using; how did you collect it? 41 | 4. Method: What method or algorithm are you using? Are you using an existing library to do so? Did you introduce any new variations to these methods? How will you evaluate the results? Which baselines will you compare against? 42 | 5. Intermediate/Preliminary Experiments & Results: State and evaluate your results up to the milestone. 43 | 6. Related work: Summarize at least five research papers related to your project. How is your project similar/different? 44 | 7. Who does what: State which group member is responsible for which aspects of the project. 45 | 8. Timeline: What are the remaining steps you plan to complete, and when do you plan to complete them? 46 | 9. References: list of references cited in your report. 47 | 48 | 49 | Submit the report as a **PDF** file in the root folder of your project repository under **milestone.pdf**. 50 | 51 | ### Report 52 | 53 | A 6-8 page summary of your project. Examples are [here](http://nlp.stanford.edu/courses/cs224n/). 54 | 55 | 1. Title, Author(s) 56 | 2. Abstract: It should not be more than 300 words. What did you do and what was the main conclusion? 57 | 3. Introduction: Describe the problem precisely and why it is important. 58 | 4. Background/Related Work: Summarize at least five research papers related to your project. How is your project similar/different? 59 | 5. Approach: What method or algorithm are you using? Are you using an existing library to do so? Did you introduce any new variations to these methods? This section details the framework of your project. Be specific, which means you might want to include equations, figures, plots, etc. 60 | 6. Experiment: What kind of experiments did you do? Which dataset(s) are you using? What baseline method are you comparing against, and how will you evaluate your results? Report the results of your experiments in detail, including both quantitative evaluations (show numbers, figures, tables, etc.) as well as qualitative evaluations (show images, example results, example errors, etc.). 61 | 7. Conclusion: What have you learned? Suggest future ideas. 62 | 8. References: list of references cited in your report. 63 | 64 | Submit the report as a **PDF** file in the root folder of your project repository under **report.pdf**. 65 | 66 | ### Presentation 67 | 68 | **The presentation.pdf file should be uploaded the night before the presentation.** 69 | 70 | A **maximum** eight-minute presentation summarizing your project, following a template similar to the report. 71 | 72 | Upload your slides in the root of your project folder as **presentation.pdf**. 73 | 74 | #### If your entire team consists of remote students: 75 | - You will deliver your presentation as a screencast, e.g. 76 | - Using QuickTime: http://www.abeautifulsite.net/recording-a-screencast-with-quicktime/ 77 | - Using screencast-o-matic: http://www.screencast-o-matic.com/ 78 | - Record and save the screencast and upload it to the root of your project folder, using the name **presentation.mp4**.
79 | 80 | ### Grading 81 | 82 | The project is worth 100 points total, consisting of: 83 | - Proposal (10%) 84 | - Milestone (15%) 85 | - Report: 86 | - Clarity, related work, discussion (15%) 87 | - Technical correctness and depth (15%) 88 | - Evaluation and results (15%) 89 | - Project presentation (15%) 90 | - Code quality and thoroughness (15%) 91 | 92 | 93 | -------------------------------------------------------------------------------- /update.sh: -------------------------------------------------------------------------------- 1 | git remote add template https://github.com/iit-cs579/assignments 2 | git fetch template 3 | git merge template/master 4 | --------------------------------------------------------------------------------