├── .gitignore
├── README.md
├── a0
│   ├── Log.txt
│   ├── README.md
│   ├── Setup.md
│   ├── ShortAnswer.txt
│   ├── a0.py
│   ├── candidates.txt
│   └── network.png
├── a1
│   ├── .gitignore
│   ├── Log.txt
│   ├── README.md
│   └── a1.py
├── a2
│   ├── .gitignore
│   ├── Log.txt
│   ├── README.md
│   ├── ShortAnswer.txt
│   ├── a2.py
│   └── accuracies.png
├── bonus
│   ├── README.md
│   └── bonus.py
├── project
│   └── README.md
└── update.sh
/.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__/ 2 | *.py[cod] 3 | 4 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | **under construction** 2 | 3 | Each student has their own private GitHub repository at: 4 | 5 | 6 | This is where you will submit all assignments. 7 | 8 | Your repository should already contain starter code for each assignment. This starter code has been pulled from the assignment repository at . 9 | 10 | Throughout the course, I may update the assignments to clarify questions or add content. To ensure you have the latest content, you can run the `update.sh` script, which will fetch and merge the content from the assignments repository. 11 | 12 | For each assignment, you should do the following: 13 | 14 | 1. Run `./update.sh` to get the latest starter code. 15 | 16 | 2. Do the homework, adding and modifying files in the assignment directory. **Commit often!** 17 | 18 | 3. Before the deadline, push all of your changes to GitHub. E.g.: 19 | ``` 20 | cd a0 21 | git add * 22 | git commit -m 'homework completed' 23 | git push 24 | ``` 25 | 26 | 4. Double-check that you don't have any outstanding changes to commit: 27 | ``` 28 | git status 29 | # On branch master 30 | nothing to commit, working directory clean 31 | ``` 32 | 33 | 5. Double-check that everything works by cloning your repository into a new directory and executing all tests. 34 | ``` 35 | cd 36 | mkdir tmp 37 | cd tmp 38 | git clone https://github.com/iit-cs579/[your_iit_id] 39 | cd [your_iit_id]/a0 40 | [...run any relevant scripts/tests] 41 | ``` 42 | 43 | 6. You can also view your code on GitHub with a web browser to make sure all your code has been submitted. 44 | 45 | 7. Assignments contain [doctests](https://docs.python.org/3/library/doctest.html). You can run these for a file `foo.py` using `python -m doctest foo.py`. If all tests pass, you'll see no output. To see output even for passing tests, add a `-v` flag to the command. 46 | 47 | 8. Typically, each assignment contains a number of methods for you to complete. I recommend tackling these one at a time, debugging and testing each one before moving on to the next method. Implementing everything and then running it all at the end will likely result in many errors that can be difficult to track down. To run the doctests for a single function, you can use [nose](https://github.com/nose-devs/nose). E.g., to run only the doctests for the `get_twitter` function in `a0.py`, you would call: 48 | - `nosetests --with-doctest a0.py:get_twitter` 49 | 50 | 9. For some assignments, I also include a `Log.txt` file which contains the expected output when running the assignment's main method (e.g., `python a0.py`). You should compare your output against it to make sure they match. Occasionally, some deviations are expected, particularly when sets (which are unordered) are used. 51 | 52 | 10. Feel free to open issues at to ask for clarifications, discuss problems, etc.
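A quick way to do the comparison in step 9 (assuming a Unix-like shell; `my_output.txt` is just an example filename) is to redirect your program's output to a file and diff it against the provided log:
```
cd a0
python a0.py > my_output.txt
diff my_output.txt Log.txt
```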
53 | -------------------------------------------------------------------------------- /a0/Log.txt: -------------------------------------------------------------------------------- 1 | Established Twitter connection. 2 | Read screen names: ['BernieSanders', 'JoeBiden', 'SenWarren', 'realDonaldTrump'] 3 | found 4 users with screen_names ['BernieSanders', 'JoeBiden', 'SenWarren', 'realDonaldTrump'] 4 | Friends per candidate: 5 | BernieSanders 1390 6 | JoeBiden 22 7 | SenWarren 493 8 | realDonaldTrump 47 9 | Most common friends: 10 | [(818910970567344128, 3), (822215673812119553, 3), (15764644, 2), (15808765, 2), (24195214, 2)] 11 | Friend Overlap: 12 | [('BernieSanders', 'SenWarren', 14), ('BernieSanders', 'JoeBiden', 3), ('JoeBiden', 'SenWarren', 3), ('JoeBiden', 'realDonaldTrump', 2), ('BernieSanders', 'realDonaldTrump', 1), ('SenWarren', 'realDonaldTrump', 1)] 13 | User followed by Bernie and Donald: 14 | graph has 24 nodes and 42 edges 15 | network drawn to network.png 16 | -------------------------------------------------------------------------------- /a0/README.md: -------------------------------------------------------------------------------- 1 | ## Assignment 0 2 | 3 | **50 points** 4 | 5 | 6 | 1. Get started with git and Python by following the instructions at [Setup.md](Setup.md). 7 | 8 | 2. Complete the data collection assignment, following the instructions in [a0.py](a0.py). 9 | 10 | 3. Complete the short answer questions in [ShortAnswer.txt](ShortAnswer.txt). 11 | 12 | 4. Push all of your code and supporting files (e.g., .png) to your **private** GitHub repo in the folder `a0/`. 13 | -------------------------------------------------------------------------------- /a0/Setup.md: -------------------------------------------------------------------------------- 1 | # Setup 2 | 3 | 1. Learn Python by completing this online tutorial: (3 hours) 4 | 2. Create a GitHub account at 5 | 3. Set up git by following (30 minutes) 6 | 4. Learn git by completing the [Introduction to GitHub](https://lab.github.com/githubtraining/introduction-to-github) tutorial, reading the [git handbook](https://guides.github.com/introduction/git-handbook/), then completing the [Managing merge conflicts](https://lab.github.com/githubtraining/managing-merge-conflicts) tutorial (1 hour). 7 | 5. Install the Python data science stack from . **We will use Python 3.** (30 minutes) 8 | 6. Complete the scikit-learn tutorial from (2 hours) 9 | 7. Understand how Python packages work by going through the [Python Packaging User Guide](https://packaging.python.org/tutorials/) (you can skip the "Creating Documentation" section). (1 hour) 10 | 8. After I have created all the project repositories, you can clone your private class repository: 11 | ``` 12 | git clone https://github.com/iit-cs579/[github-username].git 13 | ``` 14 | E.g., for me this would be: 15 | ``` 16 | git clone https://github.com/iit-cs579/aronwc.git 17 | ``` 18 | - You should have read/write (pull/push) access to your private repository. 19 | - This is where you will submit assignments. 20 | - **Note:** This step will not work until I have set up your private repository. This usually happens by the second week of the semester (and this is why I need you to complete the course survey). 21 | 22 | See for instructions on submitting assignments.
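As an optional sanity check once you have finished the steps above (a minimal sketch; exact version numbers will vary by machine):
```
python --version   # should report Python 3.x
git --version
git remote -v      # run inside your cloned repo; should list your private iit-cs579 repository URL
```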
23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | -------------------------------------------------------------------------------- /a0/ShortAnswer.txt: -------------------------------------------------------------------------------- 1 | Enter your responses inline below and push this file to your private GitHub 2 | repository. 3 | 4 | 5 | 1. Assume I plan to use the friend_overlap function above to quantify the 6 | similarity of two users. E.g., because 14 is larger than 1, I conclude that 7 | Bernie Sanders and Elizabeth Warren are more similar than Bernie Sanders and Donald 8 | Trump. 9 | 10 | How is this approach misleading? How might you fix it? 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 2. Looking at the output of your followed_by_bernie_and_donald function, why 22 | do you think this user is followed by both Bernie Sanders and Donald Trump, 23 | who are rivals? Do some web searches to see if you can find out more 24 | information. 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 3. There is a big difference in how many accounts each candidate follows (Bernie Sanders follows over 1.3K accounts, while Donald Trump follows fewer than 39 | 50). Why do you think this is? How might that affect our analysis? 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 4. The follower graph we've collected is incomplete. To expand it, we would 50 | have to also collect the list of accounts followed by each of the 51 | friends. That is, for each user X that Donald Trump follows, we would have to 52 | also collect all the users that X follows. Assuming we again use the API call 53 | https://dev.twitter.com/rest/reference/get/friends/ids, how many requests will 54 | we have to make? Given how Twitter does rate limiting 55 | (https://dev.twitter.com/rest/public/rate-limiting), approximately how many 56 | minutes will it take to collect this data? 57 | -------------------------------------------------------------------------------- /a0/a0.py: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | 3 | """ 4 | CS579: Assignment 0 5 | Collecting a political social network 6 | 7 | In this assignment, I've given you a list of Twitter accounts of 4 8 | U.S. presidential candidates from the previous election. 9 | 10 | The goal is to use the Twitter API to construct a social network of these 11 | accounts. We will then use the [networkx](http://networkx.github.io/) library 12 | to plot these links, as well as print some statistics of the resulting graph. 13 | 14 | 1. Create an account on [twitter.com](http://twitter.com). 15 | 2. Generate authentication tokens by following the instructions [here](https://developer.twitter.com/en/docs/basics/authentication/guides/access-tokens.html). 16 | 3. Add your tokens to the key/token variables below. (API Key == Consumer Key) 17 | 4. Be sure you've installed the Python modules 18 | [networkx](http://networkx.github.io/) and 19 | [TwitterAPI](https://github.com/geduldig/TwitterAPI). Assuming you've already 20 | installed [pip](http://pip.readthedocs.org/en/latest/installing.html), you can 21 | do this with `pip install networkx TwitterAPI`. 22 | 23 | OK, now you're ready to start collecting some data! 24 | 25 | I've provided a partial implementation below. Your job is to complete the 26 | code where indicated. You need to modify the 10 methods indicated by 27 | #TODO. 28 | 29 | Your output should match the sample provided in Log.txt. 30 | """ 31 | 32 | # Imports you'll need.
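# (If any of these imports fail, install the missing packages first, e.g. `pip install networkx TwitterAPI matplotlib`; see step 4 in the docstring above.)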
33 | from collections import Counter 34 | import matplotlib.pyplot as plt 35 | import networkx as nx 36 | import sys 37 | import time 38 | from TwitterAPI import TwitterAPI 39 | 40 | consumer_key = 'fixme' 41 | consumer_secret = 'fixme' 42 | access_token = 'fixme' 43 | access_token_secret = 'fixme' 44 | 45 | 46 | # This method is done for you. 47 | def get_twitter(): 48 | """ Construct an instance of TwitterAPI using the tokens you entered above. 49 | Returns: 50 | An instance of TwitterAPI. 51 | """ 52 | return TwitterAPI(consumer_key, consumer_secret, access_token, access_token_secret) 53 | 54 | 55 | def read_screen_names(filename): 56 | """ 57 | Read a text file containing Twitter screen_names, one per line. 58 | 59 | Params: 60 | filename....Name of the file to read. 61 | Returns: 62 | A list of strings, one per screen_name, sorted in ascending 63 | alphabetical order. 64 | 65 | Here's a doctest to confirm your implementation is correct. 66 | >>> read_screen_names('candidates.txt') 67 | ['BernieSanders', 'JoeBiden', 'SenWarren', 'realDonaldTrump'] 68 | """ 69 | ###TODO 70 | pass 71 | 72 | 73 | # I've provided the method below to handle Twitter's rate limiting. 74 | # You should call this method whenever you need to access the Twitter API. 75 | def robust_request(twitter, resource, params, max_tries=5): 76 | """ If a Twitter request fails, sleep for 15 minutes. 77 | Do this at most max_tries times before quitting. 78 | Args: 79 | twitter .... A TwitterAPI object. 80 | resource ... A resource string to request; e.g., "friends/ids" 81 | params ..... A parameter dict for the request, e.g., to specify 82 | parameters like screen_name or count. 83 | max_tries .. The maximum number of tries to attempt. 84 | Returns: 85 | A TwitterResponse object, or None if failed. 86 | """ 87 | for i in range(max_tries): 88 | request = twitter.request(resource, params) 89 | if request.status_code == 200: 90 | return request 91 | else: 92 | print('Got error %s \nsleeping for 15 minutes.' % request.text) 93 | sys.stderr.flush() 94 | time.sleep(61 * 15) 95 | 96 | 97 | def get_users(twitter, screen_names): 98 | """Retrieve the Twitter user objects for each screen_name. 99 | Params: 100 | twitter........The TwitterAPI object. 101 | screen_names...A list of strings, one per screen_name 102 | Returns: 103 | A list of dicts, one per user, containing all the user information 104 | (e.g., screen_name, id, location, etc) 105 | 106 | See the API documentation here: https://dev.twitter.com/rest/reference/get/users/lookup 107 | 108 | In this example, I test retrieving two users: twitterapi and twitter. 109 | 110 | >>> twitter = get_twitter() 111 | >>> users = get_users(twitter, ['twitterapi', 'twitter']) 112 | >>> [u['id'] for u in users] 113 | [6253282, 783214] 114 | """ 115 | ###TODO 116 | pass 117 | 118 | 119 | def get_friends(twitter, screen_name): 120 | """ Return a list of Twitter IDs for users that this person follows, up to 5000. 121 | See https://dev.twitter.com/rest/reference/get/friends/ids 122 | 123 | Note, because of rate limits, it's best to test this method for one candidate before trying 124 | on all candidates. 125 | 126 | Args: 127 | twitter.......The TwitterAPI object 128 | screen_name... a string of a Twitter screen name 129 | Returns: 130 | A list of ints, one per friend ID, sorted in ascending order. 131 | 132 | Note: If a user follows more than 5000 accounts, we will limit ourselves to 133 | the first 5000 accounts returned. 134 | 135 | In this test case, I return the first 5 accounts that I follow. 
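One possible approach (just a sketch; any correct implementation is fine): request the 'friends/ids' resource through the robust_request helper above, e.g. robust_request(twitter, 'friends/ids', {'screen_name': screen_name, 'count': 5000}), and return the resulting IDs in ascending order.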
136 | >>> twitter = get_twitter() 137 | >>> get_friends(twitter, 'aronwc')[:5] 138 | [695023, 1697081, 8381682, 10204352, 11669522] 139 | """ 140 | ###TODO 141 | pass 142 | 143 | 144 | def add_all_friends(twitter, users): 145 | """ Get the list of accounts each user follows. 146 | I.e., call the get_friends method for all 4 candidates. 147 | 148 | Store the result in each user's dict using a new key called 'friends'. 149 | 150 | Args: 151 | twitter...The TwitterAPI object. 152 | users.....The list of user dicts. 153 | Returns: 154 | Nothing 155 | 156 | >>> twitter = get_twitter() 157 | >>> users = [{'screen_name': 'aronwc'}] 158 | >>> add_all_friends(twitter, users) 159 | >>> users[0]['friends'][:5] 160 | [695023, 1697081, 8381682, 10204352, 11669522] 161 | """ 162 | ###TODO 163 | pass 164 | 165 | 166 | def print_num_friends(users): 167 | """Print the number of friends per candidate, sorted by candidate name. 168 | See Log.txt for an example. 169 | Args: 170 | users....The list of user dicts. 171 | Returns: 172 | Nothing 173 | """ 174 | ###TODO 175 | pass 176 | 177 | 178 | def count_friends(users): 179 | """ Count how often each friend is followed. 180 | Args: 181 | users: a list of user dicts 182 | Returns: 183 | a Counter object mapping each friend to the number of candidates who follow them. 184 | Counter documentation: https://docs.python.org/dev/library/collections.html#collections.Counter 185 | 186 | In this example, friend '2' is followed by three different users. 187 | >>> c = count_friends([{'friends': [1,2]}, {'friends': [2,3]}, {'friends': [2,3]}]) 188 | >>> c.most_common() 189 | [(2, 3), (3, 2), (1, 1)] 190 | """ 191 | ###TODO 192 | pass 193 | 194 | 195 | def friend_overlap(users): 196 | """ 197 | Compute the number of shared accounts followed by each pair of users. 198 | 199 | Args: 200 | users...The list of user dicts. 201 | 202 | Return: A list of tuples containing (user1, user2, N), where N is the 203 | number of accounts that both user1 and user2 follow. This list should 204 | be sorted in descending order of N. Ties are broken first by user1's 205 | screen_name, then by user2's screen_name (sorted in ascending 206 | alphabetical order). See Python's builtin sorted method. 207 | 208 | In this example, users 'a' and 'c' follow the same 3 accounts: 209 | >>> friend_overlap([ 210 | ... {'screen_name': 'a', 'friends': ['1', '2', '3']}, 211 | ... {'screen_name': 'b', 'friends': ['2', '3', '4']}, 212 | ... {'screen_name': 'c', 'friends': ['1', '2', '3']}, 213 | ... ]) 214 | [('a', 'c', 3), ('a', 'b', 2), ('b', 'c', 2)] 215 | """ 216 | ###TODO 217 | pass 218 | 219 | 220 | def followed_by_bernie_and_donald(users, twitter): 221 | """ 222 | Find and return the screen_names of the Twitter users followed by both Bernie 223 | Sanders and Donald Trump. You will need to use the TwitterAPI to convert 224 | the Twitter ID to a screen_name. See: 225 | https://dev.twitter.com/rest/reference/get/users/lookup 226 | 227 | Params: 228 | users.....The list of user dicts 229 | twitter...The Twitter API object 230 | Returns: 231 | A list of strings containing the Twitter screen_names of the users 232 | that are followed by both Bernie Sanders and Donald Trump. 233 | """ 234 | ###TODO 235 | pass 236 | 237 | 238 | def create_graph(users, friend_counts): 239 | """ Create a networkx undirected Graph, adding each candidate and friend 240 | as a node. Note: while all candidates should be added to the graph, 241 | only add friends to the graph if they are followed by more than one 242 | candidate. 
(This is to reduce clutter.) 243 | 244 | Each candidate in the Graph will be represented by their screen_name, 245 | while each friend will be represented by their user id. 246 | 247 | Args: 248 | users...........The list of user dicts. 249 | friend_counts...The Counter dict mapping each friend to the number of candidates that follow them. 250 | Returns: 251 | A networkx Graph 252 | """ 253 | ###TODO 254 | pass 255 | 256 | 257 | def draw_network(graph, users, filename): 258 | """ 259 | Draw the network to a file. Only label the candidate nodes; the friend 260 | nodes should have no labels (to reduce clutter). 261 | 262 | Methods you'll need include networkx.draw_networkx, plt.figure, and plt.savefig. 263 | 264 | Your figure does not have to look exactly the same as mine, but try to 265 | make it look presentable. 266 | """ 267 | ###TODO 268 | pass 269 | 270 | 271 | def main(): 272 | """ Main method. You should not modify this. """ 273 | twitter = get_twitter() 274 | screen_names = read_screen_names('candidates.txt') 275 | print('Established Twitter connection.') 276 | print('Read screen names: %s' % screen_names) 277 | users = sorted(get_users(twitter, screen_names), key=lambda x: x['screen_name']) 278 | print('found %d users with screen_names %s' % 279 | (len(users), str([u['screen_name'] for u in users]))) 280 | add_all_friends(twitter, users) 281 | print('Friends per candidate:') 282 | print_num_friends(users) 283 | friend_counts = count_friends(users) 284 | print('Most common friends:\n%s' % str(friend_counts.most_common(5))) 285 | print('Friend Overlap:\n%s' % str(friend_overlap(users))) 286 | print('User followed by Bernie and Donald: %s' % str(followed_by_bernie_and_donald(users, twitter))) 287 | 288 | graph = create_graph(users, friend_counts) 289 | print('graph has %s nodes and %s edges' % (len(graph.nodes()), len(graph.edges()))) 290 | draw_network(graph, users, 'network.png') 291 | print('network drawn to network.png') 292 | 293 | 294 | if __name__ == '__main__': 295 | main() 296 | 297 | # That's it for now! This should give you an introduction to some of the data we'll study in this course. 298 | -------------------------------------------------------------------------------- /a0/candidates.txt: -------------------------------------------------------------------------------- 1 | realDonaldTrump 2 | SenWarren 3 | BernieSanders 4 | JoeBiden 5 | -------------------------------------------------------------------------------- /a0/network.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iit-cs579/assignments/07b7e41763890df74d72bf9bbb30dbb1fd670ea8/a0/network.png -------------------------------------------------------------------------------- /a1/.gitignore: -------------------------------------------------------------------------------- 1 | edges.txt.gz 2 | -------------------------------------------------------------------------------- /a1/Log.txt: -------------------------------------------------------------------------------- 1 | full graph has 5062 nodes and 6060 edges 2 | subgraph has 712 nodes and 1710 edges 3 | 4 | 5 | computing norm_cut scores by max_depth... 6 | max_depth norm_cut_score 7 | 1 1.007 8 | 2 1.001 9 | 3 0.122 10 | 4 0.122 11 | 12 | 13 | getting result with max_depth=3 14 | 2 clusters 15 | first partition: cluster 1 has 701 nodes and cluster 2 has 11 nodes 16 | smaller cluster nodes: 17 | ['Arthur A. 
Levine Books', 'Clifford The Big Red Dog', 'READ 180', 'Scholastic', 'Scholastic Book Fairs', 'Scholastic Canada', 'Scholastic Parents', 'Scholastic Reading Club', 'Scholastic Teachers', 'The Hunger Games', 'WordGirl'] 18 | 19 | 20 | partitioning by eigenvector... 21 | cluster 1 has 86 nodes and cluster 2 has 626 nodes 22 | norm_cut score=0.389 23 | 10 nodes from smaller cluster: 24 | ['Aeon Magazine', 'American Museum of Natural History', 'Astronomy Picture of the Day (APOD)', 'Big History Project', 'Bradshaw Foundation', 'California Charter Schools Association', 'California Council for the Social Studies', 'California Geographic Alliance', 'Center for Civic Education', 'CityClub Seattle'] 25 | -------------------------------------------------------------------------------- /a1/README.md: -------------------------------------------------------------------------------- 1 | Assignment 1 2 | 3 | See `a1.py`. 4 | -------------------------------------------------------------------------------- /a1/a1.py: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | 3 | # # CS579: Assignment 1 4 | # 5 | # In this assignment, we'll implement community detection algorithms using Facebook "like" data. 6 | # 7 | # The file `edges.txt.gz` indicates like relationships between facebook users. This was collected using snowball sampling: beginning with the user "Bill Gates", I crawled all the people he "likes", then, for each newly discovered user, I crawled all the people they liked. 8 | # 9 | # We'll cluster the resulting graph into communities. 10 | # 11 | # Complete the methods below that are indicated by `TODO`. I've provided some sample output to help guide your implementation. 12 | 13 | 14 | # You should not use any imports not listed here: 15 | from collections import Counter, defaultdict, deque 16 | import copy 17 | from itertools import combinations 18 | import math 19 | import networkx as nx 20 | from numpy.linalg import eigh 21 | import numpy as np 22 | import urllib.request 23 | 24 | 25 | ## Community Detection 26 | 27 | def example_graph(): 28 | """ 29 | Create the example graph from class. Used for testing. 30 | Do not modify. 31 | """ 32 | g = nx.Graph() 33 | g.add_edges_from([('A', 'B'), ('A', 'C'), ('B', 'C'), ('B', 'D'), ('D', 'E'), ('D', 'F'), ('D', 'G'), ('E', 'F'), ('G', 'F')]) 34 | return g 35 | 36 | def bfs(graph, root, max_depth): 37 | """ 38 | Perform breadth-first search to compute the shortest paths from a root node to all 39 | other nodes in the graph. To reduce running time, the max_depth parameter ends 40 | the search after the specified depth. 41 | E.g., if max_depth=2, only paths of length 2 or less will be considered. 42 | This means that nodes greather than max_depth distance from the root will not 43 | appear in the result. 44 | 45 | You may use these two classes to help with this implementation: 46 | https://docs.python.org/3.5/library/collections.html#collections.defaultdict 47 | https://docs.python.org/3.5/library/collections.html#collections.deque 48 | 49 | Params: 50 | graph.......A networkx Graph 51 | root........The root node in the search graph (a string). We are computing 52 | shortest paths from this node to all others. 53 | max_depth...An integer representing the maximum depth to search. 54 | 55 | Returns: 56 | node2distances...dict from each node to the length of the shortest path from 57 | the root node 58 | node2num_paths...dict from each node to the number of shortest paths from the 59 | root node to this node. 
60 | node2parents.....dict from each node to the list of its parents in the search 61 | tree 62 | 63 | In the doctests below, we first try with max_depth=5, then max_depth=2. 64 | 65 | >>> node2distances, node2num_paths, node2parents = bfs(example_graph(), 'E', 5) 66 | >>> sorted(node2distances.items()) 67 | [('A', 3), ('B', 2), ('C', 3), ('D', 1), ('E', 0), ('F', 1), ('G', 2)] 68 | >>> sorted(node2num_paths.items()) 69 | [('A', 1), ('B', 1), ('C', 1), ('D', 1), ('E', 1), ('F', 1), ('G', 2)] 70 | >>> sorted((node, sorted(parents)) for node, parents in node2parents.items()) 71 | [('A', ['B']), ('B', ['D']), ('C', ['B']), ('D', ['E']), ('F', ['E']), ('G', ['D', 'F'])] 72 | >>> node2distances, node2num_paths, node2parents = bfs(example_graph(), 'E', 2) 73 | >>> sorted(node2distances.items()) 74 | [('B', 2), ('D', 1), ('E', 0), ('F', 1), ('G', 2)] 75 | >>> sorted(node2num_paths.items()) 76 | [('B', 1), ('D', 1), ('E', 1), ('F', 1), ('G', 2)] 77 | >>> sorted((node, sorted(parents)) for node, parents in node2parents.items()) 78 | [('B', ['D']), ('D', ['E']), ('F', ['E']), ('G', ['D', 'F'])] 79 | """ 80 | ###TODO 81 | pass 82 | 83 | 84 | def complexity_of_bfs(V, E, K): 85 | """ 86 | If V is the number of vertices in a graph, E is the number of 87 | edges, and K is the max_depth of our approximate breadth-first 88 | search algorithm, then what is the *worst-case* run-time of 89 | this algorithm? As usual in complexity analysis, you can ignore 90 | any constant factors. E.g., if you think the answer is 2V * E + 3log(K), 91 | you would return V * E + math.log(K) 92 | >>> v = complexity_of_bfs(13, 23, 7) 93 | >>> type(v) == int or type(v) == float 94 | True 95 | """ 96 | ###TODO 97 | pass 98 | 99 | 100 | def bottom_up(root, node2distances, node2num_paths, node2parents): 101 | """ 102 | Compute the final step of the Girvan-Newman algorithm. 103 | See p 352 From your text: 104 | https://github.com/iit-cs579/main/blob/master/read/lru-10.pdf 105 | The third and final step is to calculate for each edge e the sum 106 | over all nodes Y of the fraction of shortest paths from the root 107 | X to Y that go through e. This calculation involves computing this 108 | sum for both nodes and edges, from the bottom. Each node other 109 | than the root is given a credit of 1, representing the shortest 110 | path to that node. This credit may be divided among nodes and 111 | edges above, since there could be several different shortest paths 112 | to the node. The rules for the calculation are as follows: ... 113 | 114 | Params: 115 | root.............The root node in the search graph (a string). We are computing 116 | shortest paths from this node to all others. 117 | node2distances...dict from each node to the length of the shortest path from 118 | the root node 119 | node2num_paths...dict from each node to the number of shortest paths from the 120 | root node that pass through this node. 121 | node2parents.....dict from each node to the list of its parents in the search 122 | tree 123 | Returns: 124 | A dict mapping edges to credit value. Each key is a tuple of two strings 125 | representing an edge (e.g., ('A', 'B')). Make sure each of these tuples 126 | are sorted alphabetically (so, it's ('A', 'B'), not ('B', 'A')). 127 | 128 | Any edges excluded from the results in bfs should also be exluded here. 
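As a concrete example of these rules: in the doctest below, node G is reached from the root E by two equally short paths (one through D, one through F), so G's credit of 1 is split evenly, contributing 0.5 to edge ('D', 'G') and 0.5 to edge ('F', 'G').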
129 | 130 | >>> node2distances, node2num_paths, node2parents = bfs(example_graph(), 'E', 5) 131 | >>> result = bottom_up('E', node2distances, node2num_paths, node2parents) 132 | >>> sorted(result.items()) 133 | [(('A', 'B'), 1.0), (('B', 'C'), 1.0), (('B', 'D'), 3.0), (('D', 'E'), 4.5), (('D', 'G'), 0.5), (('E', 'F'), 1.5), (('F', 'G'), 0.5)] 134 | """ 135 | ###TODO 136 | pass 137 | 138 | 139 | def approximate_betweenness(graph, max_depth): 140 | """ 141 | Compute the approximate betweenness of each edge, using max_depth to reduce 142 | computation time in breadth-first search. 143 | 144 | You should call the bfs and bottom_up functions defined above for each node 145 | in the graph, and sum together the results. Be sure to divide by 2 at the 146 | end to get the final betweenness. 147 | 148 | Params: 149 | graph.......A networkx Graph 150 | max_depth...An integer representing the maximum depth to search. 151 | 152 | Returns: 153 | A dict mapping edges to betweenness. Each key is a tuple of two strings 154 | representing an edge (e.g., ('A', 'B')). Make sure each of these tuples 155 | are sorted alphabetically (so, it's ('A', 'B'), not ('B', 'A')). 156 | 157 | >>> sorted(approximate_betweenness(example_graph(), 2).items()) 158 | [(('A', 'B'), 2.0), (('A', 'C'), 1.0), (('B', 'C'), 2.0), (('B', 'D'), 6.0), (('D', 'E'), 2.5), (('D', 'F'), 2.0), (('D', 'G'), 2.5), (('E', 'F'), 1.5), (('F', 'G'), 1.5)] 159 | """ 160 | ###TODO 161 | pass 162 | 163 | 164 | def get_components(graph): 165 | """ 166 | A helper function you may use below. 167 | Returns the list of all connected components in the given graph. 168 | """ 169 | return [graph.subgraph(c).copy() for c in nx.connected_components(graph)] 170 | 171 | def partition_girvan_newman(graph, max_depth): 172 | """ 173 | Use your approximate_betweenness implementation to partition a graph. 174 | Unlike in class, here you will not implement this recursively. Instead, 175 | just remove edges until more than one component is created, then return 176 | those components. 177 | That is, compute the approximate betweenness of all edges, and remove 178 | them until multiple components are created. 179 | 180 | You only need to compute the betweenness once. 181 | If there are ties in edge betweenness, break by edge name (e.g., 182 | (('A', 'B'), 1.0) comes before (('B', 'C'), 1.0)). 183 | 184 | Note: the original graph variable should not be modified. Instead, 185 | make a copy of the original graph prior to removing edges. 186 | See the Graph.copy method https://networkx.github.io/documentation/stable/reference/classes/generated/networkx.Graph.copy.html 187 | Params: 188 | graph.......A networkx Graph 189 | max_depth...An integer representing the maximum depth to search. 190 | 191 | Returns: 192 | A list of networkx Graph objects, one per partition. 193 | 194 | >>> components = partition_girvan_newman(example_graph(), 5) 195 | >>> components = sorted(components, key=lambda x: sorted(x.nodes())[0]) 196 | >>> sorted(components[0].nodes()) 197 | ['A', 'B', 'C'] 198 | >>> sorted(components[1].nodes()) 199 | ['D', 'E', 'F', 'G'] 200 | """ 201 | ###TODO 202 | pass 203 | 204 | def get_subgraph(graph, min_degree): 205 | """Return a subgraph containing nodes whose degree is 206 | greater than or equal to min_degree. 207 | We'll use this in the main method to prune the original graph. 208 | 209 | Params: 210 | graph........a networkx graph 211 | min_degree...degree threshold 212 | Returns: 213 | a networkx graph, filtered as defined above. 
214 | 215 | >>> subgraph = get_subgraph(example_graph(), 3) 216 | >>> sorted(subgraph.nodes()) 217 | ['B', 'D', 'F'] 218 | >>> len(subgraph.edges()) 219 | 2 220 | """ 221 | ###TODO 222 | pass 223 | 224 | 225 | """" 226 | Compute the normalized cut for each discovered cluster. 227 | I've broken this down into the three next methods. 228 | """ 229 | 230 | def volume(nodes, graph): 231 | """ 232 | Compute the volume for a list of nodes, which 233 | is the number of edges in `graph` with at least one end in 234 | nodes. 235 | Params: 236 | nodes...a list of strings for the nodes to compute the volume of. 237 | graph...a networkx graph 238 | 239 | >>> volume(['A', 'B', 'C'], example_graph()) 240 | 4 241 | """ 242 | ###TODO 243 | pass 244 | 245 | 246 | def cut(S, T, graph): 247 | """ 248 | Compute the cut-set of the cut (S,T), which is 249 | the set of edges that have one endpoint in S and 250 | the other in T. 251 | Params: 252 | S.......set of nodes in first subset 253 | T.......set of nodes in second subset 254 | graph...networkx graph 255 | Returns: 256 | An int representing the cut-set. 257 | 258 | >>> cut(['A', 'B', 'C'], ['D', 'E', 'F', 'G'], example_graph()) 259 | 1 260 | """ 261 | ###TODO 262 | pass 263 | 264 | 265 | def norm_cut(S, T, graph): 266 | """ 267 | The normalized cut value for the cut S/T. (See lec06.) 268 | Params: 269 | S.......set of nodes in first subset 270 | T.......set of nodes in second subset 271 | graph...networkx graph 272 | Returns: 273 | An float representing the normalized cut value 274 | 275 | """ 276 | ###TODO 277 | pass 278 | 279 | def score_max_depths(graph, max_depths): 280 | """ 281 | In order to assess the quality of the approximate partitioning method 282 | we've developed, we will run it with different values for max_depth 283 | and see how it affects the norm_cut score of the resulting partitions. 284 | Recall that smaller norm_cut scores correspond to better partitions. 285 | 286 | Params: 287 | graph........a networkx Graph 288 | max_depths...a list of ints for the max_depth values to be passed 289 | to calls to partition_girvan_newman 290 | 291 | Returns: 292 | A list of (int, float) tuples representing the max_depth and the 293 | norm_cut value obtained by the partitions returned by 294 | partition_girvan_newman. See Log.txt for an example. 295 | """ 296 | ###TODO 297 | pass 298 | 299 | """ 300 | Next, use eigenvalue decomposition to partition a graph. 301 | """ 302 | 303 | def get_second_eigenvector(graph): 304 | """ 305 | 1. Create the Laplacian matrix. 306 | 2. Obtain its eigenvector matrix using the eigh function. 307 | 3. Return the second column eigenvector 308 | 309 | Returns: 310 | a 1d numpy array containing the second eigenvector 311 | 312 | >>> np.round(get_second_eigenvector(example_graph()), 2) 313 | array([ 0.49, 0.3 , 0.49, -0.21, -0.36, -0.36, -0.36]) 314 | """ 315 | ###TODO 316 | pass 317 | 318 | def partition_by_eigenvector(graph): 319 | """ 320 | Using the get_second_eigenvector function above, partition the graph into 321 | two components using a splitting threshold of 0. That is, nodes 322 | whose corresponding value in the second eigenvector is >= 0 are in one cluster, 323 | and the rest are in the other cluster. 324 | 325 | Returns: 326 | A list of two networkx Graph objects, one per partition. 327 | Sort these in ascending order of partition size. 
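One possible sketch (assuming your get_second_eigenvector above is correct): pair each entry of the second eigenvector with the corresponding node in graph.nodes() (the default node order used when building the Laplacian), put nodes with value >= 0 in one group and the remaining nodes in the other, then return the two induced subgraphs (graph.subgraph(...)), ordered by number of nodes.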
328 | 329 | >>> graph = example_graph() 330 | >>> result = partition_by_eigenvector(graph) 331 | >>> sorted(result[0].nodes()) 332 | ['A', 'B', 'C'] 333 | >>> sorted(result[1].nodes()) 334 | ['D', 'E', 'F', 'G'] 335 | >>> round(norm_cut(result[0].nodes(), result[1].nodes(), graph), 2) 336 | 0.42 337 | """ 338 | ###TODO 339 | pass 340 | 341 | """ 342 | Next, we'll download a real dataset to see how our algorithm performs. 343 | """ 344 | def download_data(): 345 | """ 346 | Download the data. Done for you. 347 | """ 348 | urllib.request.urlretrieve('http://cs.iit.edu/~culotta/cs579/a1/edges.txt.gz', 'edges.txt.gz') 349 | 350 | 351 | def read_graph(): 352 | """ Read 'edges.txt.gz' into a networkx **undirected** graph. 353 | Done for you. 354 | Returns: 355 | A networkx undirected graph. 356 | """ 357 | return nx.read_edgelist('edges.txt.gz', delimiter='\t') 358 | 359 | def main(): 360 | """ 361 | FYI: This takes ~10-15 seconds to run on my laptop. 362 | """ 363 | download_data() 364 | graph = read_graph() 365 | print('full graph has %d nodes and %d edges' % 366 | (graph.order(), graph.number_of_edges())) 367 | subgraph = get_subgraph(graph, 2) 368 | print('subgraph has %d nodes and %d edges' % 369 | (subgraph.order(), subgraph.number_of_edges())) 370 | print('\n\ncomputing norm_cut scores by max_depth...\nmax_depth\tnorm_cut_score') 371 | for max_depth, score in score_max_depths(subgraph, range(1,5)): 372 | print('%d\t\t%.3f' % (max_depth, score)) 373 | print('\n\ngetting result with max_depth=3') 374 | clusters = partition_girvan_newman(subgraph, 3) 375 | print('%d clusters' % len(clusters)) 376 | print('first partition: cluster 1 has %d nodes and cluster 2 has %d nodes' % 377 | (clusters[0].order(), clusters[1].order())) 378 | print('smaller cluster nodes:') 379 | print(sorted(sorted(clusters, key=lambda x: x.order())[0].nodes())) 380 | 381 | print('\n\npartitioning by eigenvector...') 382 | clusters2 = partition_by_eigenvector(subgraph) 383 | print('cluster 1 has %d nodes and cluster 2 has %d nodes' % 384 | (clusters2[0].order(), clusters2[1].order())) 385 | print('norm_cut score=%.3f' % norm_cut(clusters2[0].nodes(), 386 | clusters2[1].nodes(), 387 | subgraph)) 388 | print('10 nodes from smaller cluster:') 389 | print(sorted(clusters2[0].nodes())[:10]) 390 | 391 | 392 | 393 | if __name__ == '__main__': 394 | main() 395 | -------------------------------------------------------------------------------- /a2/.gitignore: -------------------------------------------------------------------------------- 1 | data/ 2 | imdg.tgz 3 | -------------------------------------------------------------------------------- /a2/Log.txt: -------------------------------------------------------------------------------- 1 | best cross-validation result: 2 | {'punct': True, 'features': (, ), 'min_freq': 2, 'accuracy': 0.7700000000000001} 3 | worst cross-validation result: 4 | {'punct': True, 'features': (,), 'min_freq': 2, 'accuracy': 0.6475} 5 | 6 | Mean Accuracies per Setting: 7 | features=token_pair_features lexicon_features: 0.75125 8 | features=token_features token_pair_features lexicon_features: 0.74583 9 | features=token_features token_pair_features: 0.73542 10 | features=token_pair_features: 0.72875 11 | min_freq=2: 0.72250 12 | punct=False: 0.72024 13 | min_freq=5: 0.71857 14 | punct=True: 0.70810 15 | min_freq=10: 0.70143 16 | features=token_features lexicon_features: 0.69667 17 | features=token_features: 0.69000 18 | features=lexicon_features: 0.65125 19 | 20 | TOP COEFFICIENTS PER CLASS: 21 | negative 
words: 22 | neg_words: 0.66113 23 | token_pair=the__worst: 0.37465 24 | token_pair=is__so: 0.31499 25 | token_pair=about__the: 0.30307 26 | token_pair=like__a: 0.27059 27 | 28 | positive words: 29 | pos_words: 0.52554 30 | token_pair=it__is: 0.24468 31 | token_pair=a__lot: 0.21013 32 | token_pair=to__find: 0.20849 33 | token_pair=the__and: 0.20232 34 | testing accuracy=0.730000 35 | 36 | TOP MISCLASSIFIED TEST DOCUMENTS: 37 | 38 | truth=0 predicted=1 proba=0.993731 39 | I absolutely despise this film. I wanted to love it - I really wanted to. But man, oh man - they were SO off with Sara. And the father living was pretty cheesy. That's straight out of the Shirley Temple film.

I highly recommend THE BOOK. It is amazing. In the book, Sara is honorable and decent and she does the right thing... BECAUSE IT IS RIGHT. She doesn't have a spiteful bone in her body.

In the film, she is mean-spirited and spiteful. She does little things to get back at Miss Minchin. In the book, Sara is above such things. She DOES stand up to Miss Minchin. She tells the truth and is not cowed by her. But she does not do the stupid, spiteful things that the Sara in the film does.

It's really rather unsettling to me that so many here say they loved the book and they love the movie. I can't help but wonder... did we read the same book? The whole point of the book was personal responsibility, behaving with honor and integrity, ALWAYS telling the truth and facing adversity with calm and integrity.

Sara has a happy ending in the book - not the ridiculous survival of her father, but the joining with his partner who has been searching for her. In the book, she is taken in by this new father figure who loves and cares for her and Becky. And Miss Minchin is NOT a chimney sweep - that part of the film really was stupid.

To see all this praise for this wretched film is disturbing to me. We are praising a film that glorifies petty, spiteful behavior with a few tips of the hat to kindness? Sara in the book was kind to the bone and full of integrity. I don't even recognize her in the film... she's not in it.

Good thing Mrs. Burnett isn't alive to see this horrid thing. It's ghastly and undeserving to bear the title of her book. 40 | 41 | truth=0 predicted=1 proba=0.991584 42 | When I attended college in the early 70s, it was a simpler time. Except for a brief occurrence in 1994, I've been totally free of the influence of illegal substances ever since and I've never regretted it...until now. DB:TBTE has got to be, hands-down, the best movie to watch when stoned. The odd, dreamlike state it creates is very strange when you're not smoking anything, but I'm sure that it would seem completely normal after a big doobie. (Not that I'm recommending this, you understand.) The soothing narration, provided, as it usually is in quality cinema, by a TB victim trapped in a painting, would be ideal to help the stoned viewer to follow along as things get complicated. Plus, everything in the film is pretty organic...from old-fashioned natural breasts to the bucket of fried chicken.

Now, there's also no question that the young man with the (ahem) "hand problem" is absolutely sailing away in the film. At one point, you just KNOW that he's going to say, "Hey! When I move my hand, it leaves trails!!" Trust me...you'll know when you get to that point.

The only other thing we have to address is this: How good can a film be when at least half the budget was spent on moving a huge bed frame around for interior and exterior shots?

Definitely a must-see for horror aficionados, but suitable for the general audiences under the right conditions (if you know what I mean, and I think that you do). It only earns four stars because I can't actually say that it took any talent to make. 43 | 44 | truth=1 predicted=0 proba=0.990791 45 | In defense of this movie I must repeat what I had stated previously. The movie is called Arachina, it has a no name cast and I do not mean no name as in actors who play in little seen art house films. I mean no name as in your local high school decided to make a film no name and it might have a 2 dollar budget. So what does one expect? Hitchcock?

I felt the movie never took itself seriously which automatically takes it out of the worst movie list. That list is only for big budget all star cast movies that takes itself way too seriously. THe movie The Oscar comes to mind, most of Sylvester Stallone's movies. THe two leads were not Hepburn and Tracy but they did their jobs well enough for this movie. The woman kicked butt and the guy was not a blithering idiot. The actor who played the old man was actually very good. The man who played anal retentive professor was no Clifton Webb but he did a god job. And the Bimbo's for lack of a better were played by two competent actors. I laughed at the 50 cent special effects. But that was part of the charm of the movie. It played like a hybrid Tremors meets Night of the Living Dead. The premise of the movie is just like all Giant Bug movies of the 50's. A Meteor or radiation stir up the ecosystem and before you know it we have Giant Ants, Lobsters, rocks or Lizards terrorizing the locals. A meteor was the cause of the problems this time. I was was very entertained. I didn't expect much and I go a lot more then I bargained for. 46 | 47 | truth=1 predicted=0 proba=0.990739 48 | Being a freshman in college, this movie reminded me of my relationship with my mom. Of course, my situation doesn't parrallel with Natalie Portman and Surandon's situation; but my mom and I have grown up with the typical mother and daughter fights. There is always the mother telling you what to do, or not being the kind of mother you want to be. I was balling my eyes at the end of this movie. Surandon's reaction of her daughter going to the East coast, miles away, after all they've been through reminded me of how I felt, being from a small city in the West coast, going to New York.

The movie is meant for women who have children that are now all grown up. It is very touching, I was moved by the movie. Every feeling out of the characters in this movie was utterly real, you didn't get any phony sentimentality. I was sitting through the credits at the screening of this movie, alone, wishing my mother was sitting next to me so I could hug her and thank her for everything. This movie is a bit corny of course, but everything is trully momentous. Its all about what a mom can learn from her child; and what a child learns from her mother. 8/10 49 | 50 | truth=1 predicted=0 proba=0.974652 51 | Ah, classic comedy. At the point in the movie where brains get messed together, a two minute scene with Bruce Campbell beating himself up partially, reminds me of how simplistic movies and ideas can grab you and wrap you into a whole movie.

For years and years, Bruce Campbell knows what kind of movies we want out of him. We want to see weird movies like Bubba Ho Tep. We want to see cameo roles in Sam Raimi movies, and we want to see 'Man with the Screaming Brain'. With the title alone, one knows that it's going to border that completely silly type of movie, like Army of Darkness, only with more silly and less monsters.

The idea of the movie is simple. Bruce sees doctor. Doctor has new idea. Bruce gets bad things happen to him on way to see doctor. Coincidentally, it's the thing the doctor wanted to show him that saves him. Hilarity ensues.

With the addition of Ted Raimi as a weird Russian guy, and journeyman Stacy Keach as Dr. Ivan Ivanovich Ivanov, it's funny, that does this movie. Complete funny. Never a point of scary.

If you like the silly Bruce Campbell, you'll like this. Then again, why would you be watching this if you didn't like Bruce Campbell? 52 | -------------------------------------------------------------------------------- /a2/README.md: -------------------------------------------------------------------------------- 1 | In this assignment, you will use sklearn to classify movie reviews as positive or negative. You will implement a number of features and compare accuracy. 2 | 3 | Finally, read ShortAnswer.md, which requires you to come up with new features to improve the classifier. 4 | 5 | -------------------------------------------------------------------------------- /a2/ShortAnswer.txt: -------------------------------------------------------------------------------- 1 | 1. Looking at the top errors printed by get_top_misclassified, name two ways you would modify your classifier to improve accuracy (it could be features, tokenization, or something else.) 2 | 3 | 4 | 5 | 6 | 7 | 8 | 2. Implement one of the above methods. How did it affect the results? -------------------------------------------------------------------------------- /a2/a2.py: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | 3 | """ 4 | CS579: Assignment 2 5 | 6 | In this assignment, you will build a text classifier to determine whether a 7 | movie review is expressing positive or negative sentiment. The data come from 8 | the website IMDB.com. 9 | 10 | You'll write code to preprocess the data in different ways (creating different 11 | features), then compare the cross-validation accuracy of each approach. Then, 12 | you'll compute accuracy on a test set and do some analysis of the errors. 13 | 14 | The main method takes about 40 seconds for me to run on my laptop. Places to 15 | check for inefficiency include the vectorize function and the 16 | eval_all_combinations function. 17 | 18 | Complete the 14 methods below, indicated by TODO. 19 | 20 | As usual, completing one method at a time, and debugging with doctests, should 21 | help. 22 | """ 23 | 24 | # No imports allowed besides these. 25 | from collections import Counter, defaultdict 26 | from itertools import chain, combinations 27 | import glob 28 | import matplotlib.pyplot as plt 29 | import numpy as np 30 | import os 31 | import re 32 | from scipy.sparse import csr_matrix 33 | from sklearn.model_selection import KFold 34 | from sklearn.linear_model import LogisticRegression 35 | import string 36 | import tarfile 37 | import urllib.request 38 | 39 | 40 | def download_data(): 41 | """ Download and unzip data. 42 | DONE ALREADY. 43 | """ 44 | url = 'https://www.dropbox.com/s/8oehplrobcgi9cq/imdb.tgz?dl=1' 45 | urllib.request.urlretrieve(url, 'imdb.tgz') 46 | tar = tarfile.open("imdb.tgz") 47 | tar.extractall() 48 | tar.close() 49 | 50 | 51 | def read_data(path): 52 | """ 53 | Walks all subdirectories of this path and reads all 54 | the text files and labels. 55 | DONE ALREADY. 56 | 57 | Params: 58 | path....path to files 59 | Returns: 60 | docs.....list of strings, one per document 61 | labels...list of ints, 1=positive, 0=negative label. 
62 | Inferred from file path (i.e., if it contains 63 | 'pos', it is 1, else 0) 64 | """ 65 | fnames = sorted([f for f in glob.glob(os.path.join(path, 'pos', '*.txt'))]) 66 | data = [(1, open(f).readlines()[0]) for f in sorted(fnames)] 67 | fnames = sorted([f for f in glob.glob(os.path.join(path, 'neg', '*.txt'))]) 68 | data += [(0, open(f).readlines()[0]) for f in sorted(fnames)] 69 | data = sorted(data, key=lambda x: x[1]) 70 | return np.array([d[1] for d in data]), np.array([d[0] for d in data]) 71 | 72 | 73 | def tokenize(doc, keep_internal_punct=False): 74 | """ 75 | Tokenize a string. 76 | The string should be converted to lowercase. 77 | If keep_internal_punct is False, then return only the alphanumerics (letters, numbers and underscore). 78 | If keep_internal_punct is True, then also retain punctuation that 79 | is inside of a word. E.g., in the example below, the token "isn't" 80 | is maintained when keep_internal_punct=True; otherwise, it is 81 | split into "isn" and "t" tokens. 82 | 83 | Params: 84 | doc....a string. 85 | keep_internal_punct...see above 86 | Returns: 87 | a numpy array containing the resulting tokens. 88 | 89 | >>> tokenize(" Hi there! Isn't this fun?", keep_internal_punct=False) 90 | array(['hi', 'there', 'isn', 't', 'this', 'fun'], dtype='>> tokenize("Hi there! Isn't this fun? ", keep_internal_punct=True) 92 | array(['hi', 'there', "isn't", 'this', 'fun'], dtype='>> feats = defaultdict(lambda: 0) 142 | >>> token_pair_features(np.array(['a', 'b', 'c', 'd']), feats) 143 | >>> sorted(feats.items()) 144 | [('token_pair=a__b', 1), ('token_pair=a__c', 1), ('token_pair=b__c', 2), ('token_pair=b__d', 1), ('token_pair=c__d', 1)] 145 | """ 146 | ###TODO 147 | pass 148 | 149 | 150 | neg_words = set(['bad', 'hate', 'horrible', 'worst', 'boring']) 151 | pos_words = set(['awesome', 'amazing', 'best', 'good', 'great', 'love', 'wonderful']) 152 | 153 | def lexicon_features(tokens, feats): 154 | """ 155 | Add features indicating how many time a token appears that matches either 156 | the neg_words or pos_words (defined above). The matching should ignore 157 | case. 158 | 159 | Params: 160 | tokens...array of token strings from a document. 161 | feats....dict from feature name to frequency 162 | Returns: 163 | nothing; feats is modified in place. 164 | 165 | In this example, 'LOVE' and 'great' match the pos_words, 166 | and 'boring' matches the neg_words list. 167 | >>> feats = defaultdict(lambda: 0) 168 | >>> lexicon_features(np.array(['i', 'LOVE', 'this', 'great', 'boring', 'movie']), feats) 169 | >>> sorted(feats.items()) 170 | [('neg_words', 1), ('pos_words', 2)] 171 | """ 172 | ###TODO 173 | pass 174 | 175 | 176 | def featurize(tokens, feature_fns): 177 | """ 178 | Compute all features for a list of tokens from 179 | a single document. 180 | 181 | Params: 182 | tokens........array of token strings from a document. 183 | feature_fns...a list of functions, one per feature 184 | Returns: 185 | list of (feature, value) tuples, SORTED alphabetically 186 | by the feature name. 
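One possible approach (optional): create a single feats = defaultdict(lambda: 0), call each function in feature_fns with (tokens, feats) so they all accumulate into the same dict, and return sorted(feats.items()).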
187 | 188 | >>> feats = featurize(np.array(['i', 'LOVE', 'this', 'great', 'movie']), [token_features, lexicon_features]) 189 | >>> feats 190 | [('neg_words', 0), ('pos_words', 2), ('token=LOVE', 1), ('token=great', 1), ('token=i', 1), ('token=movie', 1), ('token=this', 1)] 191 | """ 192 | ###TODO 193 | pass 194 | 195 | 196 | def vectorize(tokens_list, feature_fns, min_freq, vocab=None): 197 | """ 198 | Given the tokens for a set of documents, create a sparse 199 | feature matrix, where each row represents a document, and 200 | each column represents a feature. 201 | 202 | Params: 203 | tokens_list...a list of lists; each sublist is an 204 | array of token strings from a document. 205 | feature_fns...a list of functions, one per feature 206 | min_freq......Remove features that do not appear in 207 | at least min_freq different documents. 208 | Returns: 209 | - a csr_matrix: See https://goo.gl/f5TiF1 for documentation. 210 | This is a sparse matrix (zero values are not stored). 211 | - vocab: a dict from feature name to column index. NOTE 212 | that the columns are sorted alphabetically (so, the feature 213 | "token=great" is column 0 and "token=horrible" is column 1 214 | because "great" < "horrible" alphabetically), 215 | 216 | When vocab is None, we build a new vocabulary from the given data. 217 | when vocab is not None, we do not build a new vocab, and we do not 218 | add any new terms to the vocabulary. This setting is to be used 219 | at test time. 220 | 221 | >>> docs = ["Isn't this movie great?", "Horrible, horrible movie"] 222 | >>> tokens_list = [tokenize(d) for d in docs] 223 | >>> feature_fns = [token_features] 224 | >>> X, vocab = vectorize(tokens_list, feature_fns, min_freq=1) 225 | >>> type(X) 226 | 227 | >>> X.toarray() 228 | array([[1, 0, 1, 1, 1, 1], 229 | [0, 2, 0, 1, 0, 0]], dtype=int64) 230 | >>> sorted(vocab.items(), key=lambda x: x[1]) 231 | [('token=great', 0), ('token=horrible', 1), ('token=isn', 2), ('token=movie', 3), ('token=t', 4), ('token=this', 5)] 232 | """ 233 | ###TODO 234 | pass 235 | 236 | 237 | def accuracy_score(truth, predicted): 238 | """ Compute accuracy of predictions. 239 | DONE ALREADY 240 | Params: 241 | truth.......array of true labels (0 or 1) 242 | predicted...array of predicted labels (0 or 1) 243 | """ 244 | return len(np.where(truth==predicted)[0]) / len(truth) 245 | 246 | 247 | def cross_validation_accuracy(clf, X, labels, k): 248 | """ 249 | Compute the average testing accuracy over k folds of cross-validation. You 250 | can use sklearn's KFold class here (no random seed, and no shuffling 251 | needed). 252 | 253 | Params: 254 | clf......A LogisticRegression classifier. 255 | X........A csr_matrix of features. 256 | labels...The true labels for each instance in X 257 | k........The number of cross-validation folds. 258 | 259 | Returns: 260 | The average testing accuracy of the classifier 261 | over each fold of cross-validation. 262 | """ 263 | ###TODO 264 | pass 265 | 266 | 267 | def eval_all_combinations(docs, labels, punct_vals, 268 | feature_fns, min_freqs): 269 | """ 270 | Enumerate all possible classifier settings and compute the 271 | cross validation accuracy for each setting. We will use this 272 | to determine which setting has the best accuracy. 273 | 274 | For each setting, construct a LogisticRegression classifier 275 | and compute its cross-validation accuracy for that setting. 
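(For the values used in main() below, that is two punct settings, three min_freq values, and the seven feature-function combinations described next, i.e. 2 * 3 * 7 = 42 settings in total.)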
276 | 277 | In addition to looping over possible assignments to 278 | keep_internal_punct and min_freqs, we will enumerate all 279 | possible combinations of feature functions. So, if 280 | feature_fns = [token_features, token_pair_features, lexicon_features], 281 | then we will consider all 7 combinations of features (see Log.txt 282 | for more examples). 283 | 284 | Params: 285 | docs..........The list of original training documents. 286 | labels........The true labels for each training document (0 or 1) 287 | punct_vals....List of possible assignments to 288 | keep_internal_punct (e.g., [True, False]) 289 | feature_fns...List of possible feature functions to use 290 | min_freqs.....List of possible min_freq values to use 291 | (e.g., [2,5,10]) 292 | 293 | Returns: 294 | A list of dicts, one per combination. Each dict has 295 | four keys: 296 | 'punct': True or False, the setting of keep_internal_punct 297 | 'features': The list of functions used to compute features. 298 | 'min_freq': The setting of the min_freq parameter. 299 | 'accuracy': The average cross_validation accuracy for this setting, using 5 folds. 300 | 301 | This list should be SORTED in descending order of accuracy. 302 | 303 | This function will take a bit longer to run (~20s for me). 304 | """ 305 | ###TODO 306 | pass 307 | 308 | 309 | def plot_sorted_accuracies(results): 310 | """ 311 | Plot all accuracies from the result of eval_all_combinations 312 | in ascending order of accuracy. 313 | Save to "accuracies.png". 314 | """ 315 | ###TODO 316 | pass 317 | 318 | 319 | def mean_accuracy_per_setting(results): 320 | """ 321 | To determine how important each model setting is to overall accuracy, 322 | we'll compute the mean accuracy of all combinations with a particular 323 | setting. For example, compute the mean accuracy of all runs with 324 | min_freq=2. 325 | 326 | Params: 327 | results...The output of eval_all_combinations 328 | Returns: 329 | A list of (accuracy, setting) tuples, SORTED in 330 | descending order of accuracy. 331 | """ 332 | ###TODO 333 | pass 334 | 335 | 336 | def fit_best_classifier(docs, labels, best_result): 337 | """ 338 | Using the best setting from eval_all_combinations, 339 | re-vectorize all the training data and fit a 340 | LogisticRegression classifier to all training data. 341 | (i.e., no cross-validation done here) 342 | 343 | Params: 344 | docs..........List of training document strings. 345 | labels........The true labels for each training document (0 or 1) 346 | best_result...Element of eval_all_combinations 347 | with highest accuracy 348 | Returns: 349 | clf.....A LogisticRegression classifier fit to all 350 | training data. 351 | vocab...The dict from feature name to column index. 352 | """ 353 | ###TODO 354 | pass 355 | 356 | 357 | def top_coefs(clf, label, n, vocab): 358 | """ 359 | Find the n features with the highest coefficients in 360 | this classifier for this label. 361 | See the .coef_ attribute of LogisticRegression. 362 | 363 | Params: 364 | clf.....LogisticRegression classifier 365 | label...1 or 0; if 1, return the top coefficients 366 | for the positive class; else for negative. 367 | n.......The number of coefficients to return. 368 | vocab...Dict from feature name to column index. 369 | Returns: 370 | List of (feature_name, coefficient) tuples, SORTED 371 | in descending order of the coefficient for the 372 | given class label. 
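Hint (one possible approach): for a binary problem, sklearn stores a single row of weights in clf.coef_[0]; the largest positive values point to the positive class (label 1) and the largest negative values to the negative class (label 0), so for label 0 you can rank features by the negated coefficients.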
373 | """ 374 | ###TODO 375 | pass 376 | 377 | 378 | def parse_test_data(best_result, vocab): 379 | """ 380 | Using the vocabulary fit to the training data, read 381 | and vectorize the testing data. Note that vocab should 382 | be passed to the vectorize function to ensure the feature 383 | mapping is consistent from training to testing. 384 | 385 | Note: use read_data function defined above to read the 386 | test data. 387 | 388 | Params: 389 | best_result...Element of eval_all_combinations 390 | with highest accuracy 391 | vocab.........dict from feature name to column index, 392 | built from the training data. 393 | Returns: 394 | test_docs.....List of strings, one per testing document, 395 | containing the raw text. 396 | test_labels...List of ints, one per testing document, 397 | 1 for positive, 0 for negative. 398 | X_test........A csr_matrix representing the features 399 | in the test data. Each row is a document, 400 | each column is a feature. 401 | """ 402 | ###TODO 403 | pass 404 | 405 | 406 | def print_top_misclassified(test_docs, test_labels, X_test, clf, n): 407 | """ 408 | Print the n testing documents that are misclassified by the 409 | largest margin. By using the .predict_proba function of 410 | LogisticRegression, we can get the 411 | predicted probabilities of each class for each instance. 412 | We will first identify all incorrectly classified documents, 413 | then sort them in descending order of the predicted probability 414 | for the incorrect class. 415 | E.g., if document i is misclassified as positive, we will 416 | consider the probability of the positive class when sorting. 417 | 418 | Params: 419 | test_docs.....List of strings, one per test document 420 | test_labels...Array of true testing labels 421 | X_test........csr_matrix for test data 422 | clf...........LogisticRegression classifier fit on all training 423 | data. 424 | n.............The number of documents to print. 425 | 426 | Returns: 427 | Nothing; see Log.txt for example printed output. 428 | """ 429 | ###TODO 430 | pass 431 | 432 | 433 | def main(): 434 | """ 435 | Put it all together. 436 | ALREADY DONE. 437 | """ 438 | feature_fns = [token_features, token_pair_features, lexicon_features] 439 | # Download and read data. 440 | download_data() 441 | docs, labels = read_data(os.path.join('data', 'train')) 442 | # Evaluate accuracy of many combinations 443 | # of tokenization/featurization. 444 | results = eval_all_combinations(docs, labels, 445 | [True, False], 446 | feature_fns, 447 | [2,5,10]) 448 | # Print information about these results. 449 | best_result = results[0] 450 | worst_result = results[-1] 451 | print('best cross-validation result:\n%s' % str(best_result)) 452 | print('worst cross-validation result:\n%s' % str(worst_result)) 453 | plot_sorted_accuracies(results) 454 | print('\nMean Accuracies per Setting:') 455 | print('\n'.join(['%s: %.5f' % (s,v) for v,s in mean_accuracy_per_setting(results)])) 456 | 457 | # Fit best classifier. 458 | clf, vocab = fit_best_classifier(docs, labels, results[0]) 459 | 460 | # Print top coefficients per class. 461 | print('\nTOP COEFFICIENTS PER CLASS:') 462 | print('negative words:') 463 | print('\n'.join(['%s: %.5f' % (t,v) for t,v in top_coefs(clf, 0, 5, vocab)])) 464 | print('\npositive words:') 465 | print('\n'.join(['%s: %.5f' % (t,v) for t,v in top_coefs(clf, 1, 5, vocab)])) 466 | 467 | # Parse test data 468 | test_docs, test_labels, X_test = parse_test_data(best_result, vocab) 469 | 470 | # Evaluate on test set.
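# Note: clf.predict returns hard 0/1 labels, which accuracy_score compares
# against test_labels below. print_top_misclassified instead needs the soft
# scores from clf.predict_proba(X_test), an array of shape (n_docs, 2) whose
# columns follow clf.classes_; for 0/1 labels, column 1 is the probability of
# the positive class.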
471 | predictions = clf.predict(X_test) 472 | print('testing accuracy=%f' % 473 | accuracy_score(test_labels, predictions)) 474 | 475 | print('\nTOP MISCLASSIFIED TEST DOCUMENTS:') 476 | print_top_misclassified(test_docs, test_labels, X_test, clf, 5) 477 | 478 | 479 | if __name__ == '__main__': 480 | main() 481 | -------------------------------------------------------------------------------- /a2/accuracies.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iit-cs579/assignments/07b7e41763890df74d72bf9bbb30dbb1fd670ea8/a2/accuracies.png -------------------------------------------------------------------------------- /bonus/README.md: -------------------------------------------------------------------------------- 1 | Worth 15 bonus points. 2 | 3 | See bonus.py 4 | -------------------------------------------------------------------------------- /bonus/bonus.py: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | 3 | # Bonus: Recommendation systems 4 | # 5 | # Here we'll implement a content-based recommendation algorithm. 6 | # It will use the list of genres for a movie as the content. 7 | # The data come from the MovieLens project: http://grouplens.org/datasets/movielens/ 8 | # Note that I have not provided many doctests for this one. I strongly 9 | # recommend that you write your own for each function to ensure your 10 | # implementation is correct. 11 | 12 | # Please only use these imports. 13 | from collections import Counter, defaultdict 14 | import math 15 | import numpy as np 16 | import os 17 | import pandas as pd 18 | import re 19 | from scipy.sparse import csr_matrix 20 | import urllib.request 21 | import zipfile 22 | 23 | def download_data(): 24 | """ DONE. Download and unzip data. 25 | """ 26 | url = 'https://www.dropbox.com/s/p9wmkvbqt1xr6lc/ml-latest-small.zip?dl=1' 27 | urllib.request.urlretrieve(url, 'ml-latest-small.zip') 28 | zfile = zipfile.ZipFile('ml-latest-small.zip') 29 | zfile.extractall() 30 | zfile.close() 31 | 32 | 33 | def tokenize_string(my_string): 34 | """ DONE. You should use this in your tokenize function. 35 | """ 36 | return re.findall('[\w\-]+', my_string.lower()) 37 | 38 | 39 | def tokenize(movies): 40 | """ 41 | Append a new column to the movies DataFrame with header 'tokens'. 42 | This will contain a list of strings, one per token, extracted 43 | from the 'genre' field of each movie. Use the tokenize_string method above. 44 | 45 | Note: you may modify the movies parameter directly; no need to make 46 | a new copy. 47 | Params: 48 | movies...The movies DataFrame 49 | Returns: 50 | The movies DataFrame, augmented to include a new column called 'tokens'. 51 | 52 | >>> movies = pd.DataFrame([[123, 'Horror|Romance'], [456, 'Sci-Fi']], columns=['movieId', 'genres']) 53 | >>> movies = tokenize(movies) 54 | >>> movies['tokens'].tolist() 55 | [['horror', 'romance'], ['sci-fi']] 56 | """ 57 | ###TODO 58 | pass 59 | 60 | 61 | def featurize(movies): 62 | """ 63 | Append a new column to the movies DataFrame with header 'features'. 64 | Each row will contain a csr_matrix of shape (1, num_features). 
Each 65 | entry in this matrix will contain the tf-idf value of the term, as 66 | defined in class: 67 | tfidf(i, d) := ( tf(i, d) / max_k tf(k, d) ) * log10(N / df(i)) 68 | where: 69 | i is a term 70 | d is a document (movie) 71 | tf(i, d) is the frequency of term i in document d 72 | max_k tf(k, d) is the maximum frequency of any term in document d 73 | N is the number of documents (movies) 74 | df(i) is the number of unique documents containing term i 75 | 76 | Params: 77 | movies...The movies DataFrame 78 | Returns: 79 | A tuple containing: 80 | - The movies DataFrame, which has been modified to include a column named 'features'. 81 | - The vocab, a dict from term to int. Make sure the vocab is sorted alphabetically as in a2 (e.g., {'aardvark': 0, 'boy': 1, ...}) 82 | """ 83 | ###TODO 84 | pass 85 | 86 | 87 | def train_test_split(ratings): 88 | """DONE. 89 | Returns a deterministic split of the ratings matrix into a training and testing set (every 1000th rating goes to the test set). 90 | """ 91 | test = set(range(len(ratings))[::1000]) 92 | train = sorted(set(range(len(ratings))) - test) 93 | test = sorted(test) 94 | return ratings.iloc[train], ratings.iloc[test] 95 | 96 | 97 | def cosine_sim(a, b): 98 | """ 99 | Compute the cosine similarity between two 1-d csr_matrices. 100 | Each matrix represents the tf-idf feature vector of a movie. 101 | Params: 102 | a...A csr_matrix with shape (1, number_features) 103 | b...A csr_matrix with shape (1, number_features) 104 | Returns: 105 | A float. The cosine similarity, defined as: dot(a, b) / (||a|| * ||b||) 106 | where ||a|| indicates the Euclidean norm (aka L2 norm) of vector a. 107 | """ 108 | ###TODO 109 | pass 110 | 111 | 112 | def make_predictions(movies, ratings_train, ratings_test): 113 | """ 114 | Using the ratings in ratings_train, predict the ratings for each 115 | row in ratings_test. 116 | 117 | To predict the rating of user u for movie i: Compute the weighted average 118 | rating for every other movie that u has rated. Restrict this weighted 119 | average to movies that have a positive cosine similarity with movie 120 | i. The weight for movie m corresponds to the cosine similarity between m 121 | and i. 122 | 123 | If there are no other movies with positive cosine similarity to use in the 124 | prediction, use the mean rating of the target user in ratings_train as the 125 | prediction. 126 | 127 | Params: 128 | movies..........The movies DataFrame. 129 | ratings_train...The subset of ratings used for making predictions. These are the "historical" data. 130 | ratings_test....The subset of ratings that need to be predicted. These are the "future" data. 131 | Returns: 132 | A numpy array containing one predicted rating for each element of ratings_test. 133 | """ 134 | ###TODO 135 | pass 136 | 137 | 138 | def mean_absolute_error(predictions, ratings_test): 139 | """DONE. 140 | Return the mean absolute error of the predictions.
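As a minimal sketch of the cosine similarity above, computed directly on the sparse rows (illustrative only, with an invented helper name; it assumes a and b are 1-row csr_matrices):

```
import numpy as np

def cosine_sim_sketch(a, b):
    dot = a.multiply(b).sum()              # dot(a, b) on sparse rows
    norm_a = np.sqrt(a.multiply(a).sum())  # ||a||, the Euclidean norm
    norm_b = np.sqrt(b.multiply(b).sum())  # ||b||
    return dot / (norm_a * norm_b)
```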
141 | """ 142 | return np.abs(predictions - np.array(ratings_test.rating)).mean() 143 | 144 | 145 | def main(): 146 | download_data() 147 | path = 'ml-latest-small' 148 | ratings = pd.read_csv(path + os.path.sep + 'ratings.csv') 149 | movies = pd.read_csv(path + os.path.sep + 'movies.csv') 150 | movies = tokenize(movies) 151 | movies, vocab = featurize(movies) 152 | print('vocab:') 153 | print(sorted(vocab.items())[:10]) 154 | ratings_train, ratings_test = train_test_split(ratings) 155 | print('%d training ratings; %d testing ratings' % (len(ratings_train), len(ratings_test))) 156 | predictions = make_predictions(movies, ratings_train, ratings_test) 157 | print('error=%f' % mean_absolute_error(predictions, ratings_test)) 158 | print(predictions[:10]) 159 | 160 | 161 | if __name__ == '__main__': 162 | main() 163 | -------------------------------------------------------------------------------- /project/README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | ## Project 4 | 5 | The project is an open-ended investigation into an OSNA problem. 6 | 7 | Project guidelines: 8 | 9 | - The data used should be as raw as possible. E.g., you may not simply download a pre-processed dataset from the UCI repository. 10 | Instead, you should collect data directly from an online social networking source (e.g., Twitter, Facebook, Instagram, Reddit, etc.). 11 | - Groups may have up to 3 members. 12 | - Sample projects can be found [here](http://snap.stanford.edu/class/cs224w-2016/projects.html) or by reading recent papers published in ICWSM, a leading OSNA conference: 13 | - You may use existing libraries (e.g., nltk, TensorFlow, Theano), but your project should be your own. 14 | - After the Proposal survey is submitted (see below), a new private repository will be created for your group. 15 | - I've given you starter code that contains a command-line interface. You will implement each of the commands. 16 | - There's no need to store the raw data in GitHub, but it should be clear how to collect it by reading your report and code. 17 | - See an [example project repo](https://github.com/iit-cs579/sample-project). 18 | 19 | ### Proposal 20 | Complete this survey, **one per team**, to submit your proposal: 21 | 22 | 23 | 24 | You will specify the following: 25 | 26 | 1. Problem Overview: Describe the problem you are solving; what makes it interesting? 27 | 2. Data: Which data will you use? How will you collect it? What problems do you anticipate? 28 | 3. Method: What method or algorithm will you use? Will you use an existing library to do so? Do you plan to modify the code at all? 29 | 4. Related Work: List at least 5 references (with links) to research papers that are related to your project (use Google Scholar to search). 30 | 5. Evaluation: How will you evaluate your results? What baseline method will you compare against? What are the key plots or tables you will produce? What performance metrics will you use? What descriptive evaluation will you do (e.g., look at specific predictions made by your system; visualizations)? 31 | 32 | Once you've submitted the form, I will create GitHub repositories for each team, with the appropriate access. 33 | 34 | ### Milestone 35 | 36 | Your project milestone report will be 2-3 pages using the provided template. The following is a suggested structure for your report: 37 | 38 | 1. Title, Author(s) 39 | 2. Problem Overview: Describe the problem you are solving. State it as precisely as you can. 40 | 3.
Data: Which data are you using; how did you collect it? 41 | 4. Method: What method or algorithm are you using? Are you using an existing library to do so? Did you introduce any new variations to these methods? How will you evaluate the results? Which baselines will you compare against? 42 | 5. Intermediate/Preliminary Experiments & Results: State and evaluate your results up to the milestone. 43 | 6. Related work: Summarize at least five research papers related to your project. How is your project similar/different? 44 | 7. Who does what: State which group member is responsible for which aspects of the project. 45 | 8. Timeline: What are the remaining steps you plan to complete, and when do you plan to complete them? 46 | 9. References: list of references cited in your report. 47 | 48 | 49 | Submit the report as a **PDF** file in the root folder of your project repository under **milestone.pdf**. 50 | 51 | ### Report 52 | 53 | A 6-8 page summary of your project. Examples are [here](http://nlp.stanford.edu/courses/cs224n/). 54 | 55 | 1. Title, Author(s) 56 | 2. Abstract: It should not be more than 300 words. What did you do and what was the main conclusion? 57 | 3. Introduction: Describe the problem precisely and why it is important. 58 | 4. Background/Related Work: Summarize at least five research papers related to your project. How is your project similar/different? 59 | 5. Approach: What method or algorithm are you using? Are you using an existing library to do so? Did you introduce any new variations to these methods? This section details the framework of your project. Be specific, which means you might want to include equations, figures, plots, etc. 60 | 6. Experiment: What kind of experiments did you do? Which dataset(s) are you using? What baseline method are you comparing against, and how will you evaluate your results? Report the results of your experiments in detail, including both quantitative evaluations (show numbers, figures, tables, etc.) as well as qualitative evaluations (show images, example results, example errors, etc.). 61 | 7. Conclusion: What have you learned? Suggest future ideas. 62 | 8. References: list of references cited in your report. 63 | 64 | Submit the report as a **PDF** file in the root folder of your project repository under **report.pdf**. 65 | 66 | ### Presentation 67 | 68 | **The presentation.pdf file should be uploaded the night before the presentation.** 69 | 70 | A **maximum** eight-minute presentation summarizing your project, following a template similar to the report. 71 | 72 | Upload your slides in the root of your project folder as **presentation.pdf**. 73 | 74 | #### If your entire team consists of remote students: 75 | - You will deliver your presentation as a screencast, e.g. 76 | - Using QuickTime: http://www.abeautifulsite.net/recording-a-screencast-with-quicktime/ 77 | - Using screencast-o-matic: http://www.screencast-o-matic.com/ 78 | - Record and save the screencast and upload it to the root of your project folder, using the name **presentation.mp4**.
79 | 80 | ### Grading 81 | 82 | The project is worth 100 points total, consisting of: 83 | - Proposal (10%) 84 | - Milestone (15%) 85 | - Report: 86 | - Clarity, related work, discussion (15%) 87 | - Technical correctness and depth (15%) 88 | - Evaluation and results (15%) 89 | - Project presentation (15%) 90 | - Code quality and thoroughness (15%) 91 | 92 | 93 | -------------------------------------------------------------------------------- /update.sh: -------------------------------------------------------------------------------- 1 | git remote add template https://github.com/iit-cs579/assignments 2 | git fetch template 3 | git merge template/master 4 | --------------------------------------------------------------------------------