├── README.md
├── data
│   ├── adj_list
│   ├── dense_link_matrix
│   ├── map
│   ├── sparse_link_matrix
│   └── users
├── docs
│   ├── authority_scores.png
│   ├── auths.png
│   ├── dataset_fetcher.html
│   ├── hits.html
│   ├── hubbiness_scores.png
│   ├── hubs.png
│   └── stats.png
├── requirements.txt
└── src
    ├── dataset_fetcher.py
    └── hits.py

/README.md:
--------------------------------------------------------------------------------
# HITS-Algorithm-implementation

The HITS algorithm is applied to the Twitter follower network to find important hubs and authorities: good hubs are users who follow good authorities, and good authorities are users who are followed by good hubs. In this real-life scenario, a good authority could be a popular music artist, and a good hub could be a music lover who follows many accomplished artists.

Dataset
-------

The dataset can be viewed as a directed graph. Each node in the graph represents a Twitter user, and an edge from user A to user B implies that A is a "follower" of B and B is a "friend" of A.

The graph consists of 500 nodes, with an edge between two nodes if one is a follower/friend of the other. The graph is stored as an adjacency list the first time it is prepared, but is then immediately converted to an adjacency matrix (which requires storing a map from matrix index to user id) for repeated use with the HITS algorithm.
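The adjacency-list-to-matrix conversion described above can be sketched as follows. This is a minimal sketch, not the repository's exact code (`src/dataset_fetcher.py` contains the real `ListToMatrixConverter`): the function name `list_to_matrix` and the `toy_list` example are illustrative, but the adjacency-list shape (`{user_id: {'friends': [...], 'followers': [...]}}`) and the edge direction (row follows column) match what the fetcher produces.

```python
import numpy as np

def list_to_matrix(adj_list):
    """Build a link matrix M (M[i, j] = 1 iff user i follows user j)
    and a map from matrix index back to user id."""
    ids = list(adj_list)
    id_to_index = {uid: i for i, uid in enumerate(ids)}
    n = len(ids)
    link_matrix = np.zeros((n, n), dtype=int)
    for uid, nbrs in adj_list.items():
        for friend_id in nbrs['friends']:        # uid follows friend_id
            link_matrix[id_to_index[uid], id_to_index[friend_id]] = 1
        for follower_id in nbrs['followers']:    # follower_id follows uid
            link_matrix[id_to_index[follower_id], id_to_index[uid]] = 1
    index_to_id = {i: uid for uid, i in id_to_index.items()}
    return link_matrix, index_to_id

# Hypothetical two-user network: user 1 follows user 2
toy_list = {
    1: {'friends': [2], 'followers': []},
    2: {'friends': [], 'followers': [1]},
}
matrix, index_map = list_to_matrix(toy_list)
print(matrix)
# [[0 1]
#  [0 0]]
```

Note that each edge is recorded from both endpoints (once via `friends`, once via `followers`), so a relationship reported by either side still ends up in the matrix.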
File Structure
---------------
[src](src) (directory) – Contains Python source files

/hits.py – Implements the HITS algorithm

/dataset_fetcher.py – Fetches the dataset using the Twitter API


[data](data) (directory) – Contains the structures saved after obtaining the dataset

/adj_list – Adjacency list representing the fetched dataset

/dense_link_matrix – Link matrix using a dense (non-sparse) representation

/sparse_link_matrix – Link matrix using a sparse representation

/map – Map from matrix index to user id

/users – User information


[docs](docs) (directory) – Contains the documentation for the various components

/dataset_fetcher.html – Documentation for dataset_fetcher.py

/hits.html – Documentation for hits.py

/requirements.txt – Lists the requirements for running the Python code

Usage:
------
- Download/clone this repository:
```bash
git clone https://github.com/nikhil-iyer-97/HITS-Algorithm-implementation.git
```
- Change the working directory to where the repository is located:
```bash
cd HITS-Algorithm-implementation
```
- Install the dependencies:
```bash
pip install -r requirements.txt
```
- Change the working directory to `src`:
```bash
cd src
```
Now enter `python3 hits.py` to run the program and display the outputs.
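The iteration that the program performs can be sketched as follows. This is a minimal sketch of the HITS power iteration, not the repository's exact implementation (the real code lives in `src/hits.py`); the function name `hits_scores`, the max-normalization, and the `star` example are illustrative assumptions, and the sketch assumes the graph has at least one edge.

```python
import numpy as np

def hits_scores(link_matrix, epsilon=1e-4, max_iter=100):
    """Power iteration for HITS on a link matrix where M[i, j] = 1
    iff user i follows user j. Stops once both score vectors change
    by less than epsilon between iterations."""
    n = link_matrix.shape[0]
    hubs = np.ones(n)
    auths = np.ones(n)
    for _ in range(max_iter):
        new_auths = link_matrix.T @ hubs          # followed by good hubs
        new_auths = new_auths / new_auths.max()   # scale so the top score is 1
        new_hubs = link_matrix @ new_auths        # follow good authorities
        new_hubs = new_hubs / new_hubs.max()
        converged = (np.abs(new_hubs - hubs).max() < epsilon
                     and np.abs(new_auths - auths).max() < epsilon)
        hubs, auths = new_hubs, new_auths
        if converged:
            break
    return hubs, auths

# Star graph: users 1..3 all follow user 0, so user 0 is the top authority
star = np.array([[0, 0, 0, 0],
                 [1, 0, 0, 0],
                 [1, 0, 0, 0],
                 [1, 0, 0, 0]])
hubs, auths = hits_scores(star)
print(auths.argmax())  # 0
```

The two updates mirror the definitions in the introduction: an authority score is accumulated from the hub scores of its followers, and a hub score from the authority scores of the users it follows.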
Example:
--------
Some example outputs of the hubbiness and authority scores for the first 30 nodes in the graph are shown below:
![Hubbiness Scores](/docs/hubbiness_scores.png)
![Authority Scores](/docs/authority_scores.png)

The change in hub score and authority score across iterations was also measured for a few selected entities:
![Change in Hubbiness Score vs Iterations](/docs/hubs.png)
![Change in Authority Score vs Iterations](/docs/auths.png)

And finally, the algorithm was benchmarked on the sparse matrix vs. dense matrix implementations for various values of :

![Time taken to run HITS algorithm](/docs/stats.png)
--------------------------------------------------------------------------------
/data/adj_list:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nikhil-iyer-97/HITS-Algorithm-implementation/dc96f56f927abf760d8ac3fa81d54cd72c9b4468/data/adj_list
--------------------------------------------------------------------------------
/data/dense_link_matrix:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nikhil-iyer-97/HITS-Algorithm-implementation/dc96f56f927abf760d8ac3fa81d54cd72c9b4468/data/dense_link_matrix
--------------------------------------------------------------------------------
/data/map:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nikhil-iyer-97/HITS-Algorithm-implementation/dc96f56f927abf760d8ac3fa81d54cd72c9b4468/data/map
--------------------------------------------------------------------------------
/data/sparse_link_matrix:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nikhil-iyer-97/HITS-Algorithm-implementation/dc96f56f927abf760d8ac3fa81d54cd72c9b4468/data/sparse_link_matrix
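For context on the sparse-vs-dense benchmark shown in stats.png, a timing comparison of the core HITS operation can be sketched roughly as follows. This is an illustrative sketch, not the repository's benchmark code: the function name `time_matvec` and the `density` value are assumptions, while the graph size of 500 matches the dataset.

```python
import time
import numpy as np
import scipy.sparse as sparse

def time_matvec(n=500, density=0.02, trials=50):
    """Time the matrix-vector product (the core HITS operation) on a
    dense ndarray and on the equivalent CSR sparse matrix."""
    rng = np.random.default_rng(0)
    dense = (rng.random((n, n)) < density).astype(float)
    csr = sparse.csr_matrix(dense)
    v = np.ones(n)

    start = time.perf_counter()
    for _ in range(trials):
        dense_result = dense @ v
    dense_time = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(trials):
        sparse_result = csr @ v
    sparse_time = time.perf_counter() - start

    # Both representations must agree on the product itself
    assert np.allclose(dense_result, sparse_result)
    return dense_time, sparse_time

dense_time, sparse_time = time_matvec()
```

For a follower graph this sparse (a few hundred edges per user out of 500×500 possible), the CSR product touches only the nonzero entries, which is what the benchmark in stats.png measures at larger scales.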
-------------------------------------------------------------------------------- /data/users: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nikhil-iyer-97/HITS-Algorithm-implementation/dc96f56f927abf760d8ac3fa81d54cd72c9b4468/data/users -------------------------------------------------------------------------------- /docs/authority_scores.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nikhil-iyer-97/HITS-Algorithm-implementation/dc96f56f927abf760d8ac3fa81d54cd72c9b4468/docs/authority_scores.png -------------------------------------------------------------------------------- /docs/auths.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nikhil-iyer-97/HITS-Algorithm-implementation/dc96f56f927abf760d8ac3fa81d54cd72c9b4468/docs/auths.png -------------------------------------------------------------------------------- /docs/dataset_fetcher.html: -------------------------------------------------------------------------------- 1 | 2 | Python: module dataset_fetcher 3 | 4 | 5 | 6 | 7 | 8 |
 
9 |  
dataset_fetcher
index
/home/ubuntu/Documents/studies/3_1/IR_CS_F469/assn2/IR2/src/dataset_fetcher.py
12 |

13 |

14 | 15 | 16 | 18 | 19 | 20 |
 
17 | Modules
       
numpy
21 | pickle
22 |
queue
23 | scipy.sparse
24 |
sys
25 | time
26 |
tweepy
27 |

28 | 29 | 30 | 32 | 33 | 34 |
 
31 | Classes
       
35 |
builtins.object 36 |
37 |
38 |
DatasetFetcher 39 |
ListToMatrixConverter 40 |
Logger 41 |
42 |
43 |
44 |

45 | 46 | 47 | 49 | 50 | 51 | 53 | 54 |
 
48 | class DatasetFetcher(builtins.object)
   An instance of DatasetFetcher is used to obtain the dataset from
52 | the internet
 
 Methods defined here:
55 |
__init__(self, key, secret, logger)
Initializes an instance of DatasetFetcher
56 |  
57 | Args:
58 |         key: key to be used for authentication
59 |         secret: secret to be used for authentication
60 |         logger: An instance of Logger to be used for logging purposes by public
61 |         member functions
62 | 63 |
get_dataset(self, seed_user, friends_limit, followers_limit, limit, live_save, users_path, adj_list_path)
Obtain the dataset
64 |  
65 | Args:
66 | seed_user: id/screen_name/name of the user to start the BFS with
67 | friends_limit: Maximum number of friends to consider for each user
68 | followers_limit: Maximum number of followers to consider for each user
69 | limit: Maximum number of users to find friends and followers of
70 | live_save: Whether to save computed data frequently
71 | users_path: Path to the file where the users info will be stored
72 |  
73 | adj_list_path:
74 | 75 |
save_dataset(self, users_path, adj_list_path)
Save the dataset obtained by get_dataset
76 |  
77 | Args:
78 |         users_path: Path to the file where users info will be stored
79 |         adj_list_path: Path to the file where the adjacency list will be stored
80 | 81 |
82 | Data descriptors defined here:
83 |
__dict__
84 |
dictionary for instance variables (if defined)
85 |
86 |
__weakref__
87 |
list of weak references to the object (if defined)
88 |
89 |

90 | 91 | 92 | 94 | 95 | 96 | 99 | 100 |
 
93 | class ListToMatrixConverter(builtins.object)
   An instance of ListToMatrixConverter is used to convert the data obtained
97 | by the dataset fetcher from adjacency list form to a matrix form (and an
98 | index-to-userid map)
 
 Methods defined here:
101 |
__init__(self, adj_list_path)
Initializes an instance of ListToMatrixConverter
102 |  
103 | Args:
104 |         adj_list_path: Path to the file where the adjacency list is stored
105 | 106 |
convert(self)
Use the adjacency list to create the link matrix and a dictionary that
107 | maps the index in the link matrix to a user id
108 | 109 |
save(self, map_path, link_matrix_path, use_sparse=False)
Saves the map and link matrix created using the convert function
110 |  
111 | Args:
112 |         map_path: Path to the file where the map from link matrix index to
113 |         user id is to be stored
114 |         link_matrix_path: Path to the file where the link matrix is to be stored
115 |         use_sparse: True if the link matrix is to be stored as a sparse matrix
116 | 117 |
118 | Data descriptors defined here:
119 |
__dict__
120 |
dictionary for instance variables (if defined)
121 |
122 |
__weakref__
123 |
list of weak references to the object (if defined)
124 |
125 |

126 | 127 | 128 | 130 | 131 | 132 | 134 | 135 |
 
129 | class Logger(builtins.object)
   An instance of Logger can be used as a simple and intuitive interface
133 | for logging
 
 Methods defined here:
136 |
__del__(self)
Close the log file when no references to the instance remain
137 | 138 |
__init__(self, log_path, print_stdout=True, sep=' ', end='\n')
Initializes an instance of Logger
139 |  
140 | Args:
141 |         log_path: Path to the file to write the logs to
142 |         print_stdout: True if the logs must be written to stdout
143 |         sep: string used to separate the printed arguments
144 |         end: string printed after the last argument
145 | 146 |
log(self, *args)
Logs whatever is present in args with current date and time
147 |  
148 | Uses instance variables self._sep for separating elements of args and
149 | self._end after the last element of args. Writes to the log file
150 | self._log_file. If self._print_stdout is True, logs are also written to
151 | stdout
152 |  
153 | Args:
154 |         args: List of elements to be logged
155 | 156 |
157 | Data descriptors defined here:
158 |
__dict__
159 |
dictionary for instance variables (if defined)
160 |
161 |
__weakref__
162 |
list of weak references to the object (if defined)
163 |
164 |

165 | 166 | 167 | 169 | 170 | 171 |
 
168 | Functions
       
main()
172 |
173 | -------------------------------------------------------------------------------- /docs/hits.html: -------------------------------------------------------------------------------- 1 | 2 | Python: module hits 3 | 4 | 5 | 6 | 7 | 8 |
 
9 |  
hits
index
/home/ubuntu/Documents/studies/3_1/IR_CS_F469/assn2/IR2/src/hits.py
12 |

13 |

14 | 15 | 16 | 18 | 19 | 20 |
 
17 | Modules
       
igraph.clustering
21 | igraph.compat
22 | igraph.configuration
23 | igraph.cut
24 | igraph.datatypes
25 | igraph.drawing
26 |
igraph.formula
27 | gzip
28 | igraph.layout
29 | igraph.matching
30 | math
31 | matplotlib.patches
32 |
numpy
33 | operator
34 | os
35 | pickle
36 | matplotlib.pyplot
37 | igraph.remote
38 |
scipy.sparse
39 | igraph.statistics
40 | sys
41 | time
42 | igraph.utils
43 | igraph.vendor
44 |

45 | 46 | 47 | 49 | 50 | 51 |
 
48 | Classes
       
52 |
builtins.object 53 |
54 |
55 |
DatasetReader 56 |
HITS 57 |
58 |
59 |
60 |

61 | 62 | 63 | 65 | 66 | 67 | 69 | 70 |
 
64 | class DatasetReader(builtins.object)
   An instance of DatasetReader is used to read different files from the
68 | dataset
 
 Methods defined here:
71 |
__init__(self)
Initializes an instance of DatasetReader
72 | 73 |
read_link_matrix(self, link_matrix_path, is_sparse=False)
Returns the array (stored in a file) that represents the link matrix
74 |  
75 | Args:
76 |         link_matrix_path: Path to the file where the link matrix is stored
77 |         is_sparse: True if the link matrix is stored as a sparse matrix
78 | 79 |
read_map(self, map_path)
Returns the dictionary (stored in a file) that represents a map
80 | from the link matrix index to user id
81 |  
82 | Args:
83 |         map_path: Path to the file where the map is stored
84 | 85 |
read_users(self, users_path)
Returns the dictionary (stored in a file) containing details of
86 | all users
87 |  
88 | Args:
89 |         users_path: Path to the file where info of all users is stored
90 | 91 |
92 | Data descriptors defined here:
93 |
__dict__
94 |
dictionary for instance variables (if defined)
95 |
96 |
__weakref__
97 |
list of weak references to the object (if defined)
98 |
99 |

100 | 101 | 102 | 104 | 105 | 106 | 108 | 109 |
 
103 | class HITS(builtins.object)
   An instance of HITS is used to model the idea of hubs and authorities
107 | and execute the corresponding algorithm
 
 Methods defined here:
110 |
__init__(self, link_matrix, users, index_id_map, is_sparse=False)
Initializes an instance of HITS
111 |  
112 | Args:
113 |         link_matrix: The link matrix
114 |         users: Details of all users
115 |         index_id_map: Dictionary representing a map from link matrix index
116 |         to user id
117 |         is_sparse: True if the links matrix is a sparse matrix
118 | 119 |
calc_scores(self, epsilon=0.0001)
Calculates hubbiness and authority
120 | 121 |
get_all_auths(self)
Returns the authority score for each user for each iteration
122 | 123 |
get_all_hubs(self)
Returns the hubbiness score for each user for each iteration
124 | 125 |
get_auths(self)
Returns the authority for each node (user)
126 | 127 |
get_hubs(self)
Returns the hubbiness for each node (user)
128 | 129 |
get_names(self)
Returns the screen name of each user
130 | 131 |
plot_graph(self, x, names, c)
Plots the graph
132 | 133 |
plot_stats(self)
134 | 135 |
136 | Data descriptors defined here:
137 |
__dict__
138 |
dictionary for instance variables (if defined)
139 |
140 |
__weakref__
141 |
list of weak references to the object (if defined)
142 |
143 |

144 | 145 | 146 | 148 | 149 | 150 |
 
147 | Functions
       
community_to_membership(...)
community_to_membership(merges, nodes, steps, return_csize=False)
151 |
convex_hull(...)
convex_hull(vs, coords=False)
152 |  
153 | Calculates the convex hull of a given point set.
154 |  
155 | @param vs: the point set as a list of lists
156 | @param coords: if C{True}, the function returns the
157 |   coordinates of the corners of the convex hull polygon,
158 |   otherwise returns the corner indices.
159 | @return: either the hull's corner coordinates or the point
160 |   indices corresponding to them, depending on the C{coords}
161 |   parameter.
162 |
cos(...)
cos(x)
163 |  
164 | Return the cosine of x (measured in radians).
165 |
is_degree_sequence(...)
is_degree_sequence(out_deg, in_deg=None)
166 |  
167 | Returns whether a list of degrees can be a degree sequence of some graph.
168 |  
169 | Note that it is not required for the graph to be simple; in other words,
170 | this function may return C{True} for degree sequences that can be realized
171 | using one or more multiple or loop edges only.
172 |  
173 | In particular, this function checks whether
174 |  
175 |   - all the degrees are non-negative
176 |   - for undirected graphs, the sum of degrees are even
177 |   - for directed graphs, the two degree sequences are of the same length and
178 |     equal sums
179 |  
180 | @param out_deg: the list of degrees. For directed graphs, this list must
181 |   contain the out-degrees of the vertices.
182 | @param in_deg: the list of in-degrees for directed graphs. This parameter
183 |   must be C{None} for undirected graphs.
184 | @return: C{True} if there exists some graph that can realize the given degree
185 |   sequence, C{False} otherwise.@see: L{is_graphical_degree_sequence()} if you do not want to allow multiple
186 |   or loop edges.
187 |
is_graphical_degree_sequence(...)
is_graphical_degree_sequence(out_deg, in_deg=None)
188 |  
189 | Returns whether a list of degrees can be a degree sequence of some simple graph.
190 |  
191 | Note that it is required for the graph to be simple; in other words,
192 | this function will return C{False} for degree sequences that cannot be realized
193 | without using one or more multiple or loop edges.
194 |  
195 | @param out_deg: the list of degrees. For directed graphs, this list must
196 |   contain the out-degrees of the vertices.
197 | @param in_deg: the list of in-degrees for directed graphs. This parameter
198 |   must be C{None} for undirected graphs.
199 | @return: C{True} if there exists some simple graph that can realize the given
200 |   degree sequence, C{False} otherwise.
201 | @see: L{is_degree_sequence()} if you want to allow multiple or loop edges.
202 |
main()
203 |
set_progress_handler(...)
set_progress_handler(handler)
204 |  
205 | Sets the handler to be called when igraph is performing a long operation.
206 | @param handler: the progress handler function. It must accept two
207 |   arguments, the first is the message informing the user about
208 |   what igraph is doing right now, the second is the actual
209 |   progress information (a percentage).
210 |
set_random_number_generator(...)
set_random_number_generator(generator)
211 |  
212 | Sets the random number generator used by igraph.
213 | @param generator: the generator to be used. It must be a Python object
214 |   with at least three attributes: C{random}, C{randint} and C{gauss}.
215 |   Each of them must be callable and their signature and behaviour
216 |   must be identical to C{random.random}, C{random.randint} and
217 |   C{random.gauss}. By default, igraph uses the C{random} module for
218 |   random number generation, but you can supply your alternative
219 |   implementation here. If the given generator is C{None}, igraph
220 |   reverts to the default Mersenne twister generator implemented in the
221 |   C layer, which might be slightly faster than calling back to Python
222 |   for random numbers, but you cannot set its seed or save its state.
223 |
set_status_handler(...)
set_status_handler(handler)
224 |  
225 | Sets the handler to be called when igraph tries to display a status
226 | message.
227 |  
228 | This is used to communicate the progress of some calculations where
229 | no reasonable progress percentage can be given (so it is not possible
230 | to use the progress handler).
231 |  
232 | @param handler: the status handler function. It must accept a single
233 |   argument, the message that informs the user about what igraph is
234 |   doing right now.
235 |
sin(...)
sin(x)
236 |  
237 | Return the sine of x (measured in radians).
238 |
warn(...)
Issue a warning, or maybe ignore it or raise an exception.
239 |

240 | 241 | 242 | 244 | 245 | 246 |
 
243 | Data
       ADJ_DIRECTED = 0
247 | ADJ_LOWER = 3
248 | ADJ_MAX = 1
249 | ADJ_MIN = 4
250 | ADJ_PLUS = 5
251 | ADJ_UNDIRECTED = 1
252 | ADJ_UPPER = 2
253 | ALL = 3
254 | BLISS_F = 0
255 | BLISS_FL = 1
256 | BLISS_FLM = 4
257 | BLISS_FM = 3
258 | BLISS_FS = 2
259 | BLISS_FSM = 5
260 | GET_ADJACENCY_BOTH = 2
261 | GET_ADJACENCY_LOWER = 1
262 | GET_ADJACENCY_UPPER = 0
263 | IN = 2
264 | Nexus = <igraph.remote.nexus.NexusConnection object>
265 | OUT = 1
266 | REWIRING_SIMPLE = 0
267 | REWIRING_SIMPLE_LOOPS = 1
268 | STAR_IN = 1
269 | STAR_MUTUAL = 3
270 | STAR_OUT = 0
271 | STAR_UNDIRECTED = 2
272 | STRONG = 2
273 | TRANSITIVITY_NAN = 0
274 | TRANSITIVITY_ZERO = 1
275 | TREE_IN = 1
276 | TREE_OUT = 0
277 | TREE_UNDIRECTED = 2
278 | WEAK = 1
279 | arpack_options = <igraph.ARPACKOptions object>
280 | config = <igraph.configuration.Configuration object>
281 | dbl_epsilon = 2.220446049250313e-16
282 | debug = False
283 | known_colors = {'alice blue': (0.9411764705882353, 0.9725490196078431, 1.0, 1.0), 'aliceblue': (0.9411764705882353, 0.9725490196078431, 1.0, 1.0), 'antique white': (0.9803921568627451, 0.9215686274509803, 0.8431372549019608, 1.0), 'antiquewhite': (0.9803921568627451, 0.9215686274509803, 0.8431372549019608, 1.0), 'antiquewhite1': (1.0, 0.9372549019607843, 0.8588235294117647, 1.0), 'antiquewhite2': (0.9333333333333333, 0.8745098039215686, 0.8, 1.0), 'antiquewhite3': (0.803921568627451, 0.7529411764705882, 0.6901960784313725, 1.0), 'antiquewhite4': (0.5450980392156862, 0.5137254901960784, 0.47058823529411764, 1.0), 'aqua': (0.0, 1.0, 1.0, 1.0), 'aquamarine': (0.4980392156862745, 1.0, 0.8313725490196079, 1.0), ...}
284 | name = 'write_svg'
285 | palettes = {'gray': <GradientPalette with 256 colors>, 'heat': <AdvancedGradientPalette with 256 colors>, 'rainbow': <RainbowPalette with 256 colors>, 'red-black-green': <AdvancedGradientPalette with 256 colors>, 'red-blue': <GradientPalette with 256 colors>, 'red-green': <GradientPalette with 256 colors>, 'red-purple-blue': <AdvancedGradientPalette with 256 colors>, 'red-yellow-green': <AdvancedGradientPalette with 256 colors>, 'terrain': <AdvancedGradientPalette with 256 colors>}
286 | pi = 3.141592653589793
287 | -------------------------------------------------------------------------------- /docs/hubbiness_scores.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nikhil-iyer-97/HITS-Algorithm-implementation/dc96f56f927abf760d8ac3fa81d54cd72c9b4468/docs/hubbiness_scores.png -------------------------------------------------------------------------------- /docs/hubs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nikhil-iyer-97/HITS-Algorithm-implementation/dc96f56f927abf760d8ac3fa81d54cd72c9b4468/docs/hubs.png -------------------------------------------------------------------------------- /docs/stats.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nikhil-iyer-97/HITS-Algorithm-implementation/dc96f56f927abf760d8ac3fa81d54cd72c9b4468/docs/stats.png -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | cairocffi==0.8.0 2 | certifi==2017.7.27.1 3 | cffi==1.11.2 4 | chardet==3.0.4 5 | cycler==0.10.0 6 | idna==2.6 7 | matplotlib==2.1.0 8 | numpy==1.13.3 9 | oauthlib==2.0.6 10 | pycparser==2.18 11 | pyparsing==2.2.0 12 | python-dateutil==2.6.1 13 | python-igraph==0.7.1.post6 14 | pytz==2017.3 15 | requests==2.20.4 16 | requests-oauthlib==0.8.0 17 | scipy==1.0.0 18 | six==1.11.0 19 | tweepy==3.5.0 20 | urllib3>=1.23 21 | -------------------------------------------------------------------------------- /src/dataset_fetcher.py: -------------------------------------------------------------------------------- 1 | import tweepy 2 | import queue 3 | import time 4 | import pickle 5 | import numpy as np 6 | import scipy.sparse as sparse 7 | from datetime import datetime as dt 8 | import sys 9 | 10 | class Logger(): 11 | """An instance of Logger 
can be used as a simple and intuitive interface 12 | for logging 13 | """ 14 | 15 | def __init__(self, log_path, print_stdout=True, sep=' ', end='\n'): 16 | """Initializes an instance of Logger 17 | 18 | Args: 19 | log_path: Path to the file to write the logs to 20 | print_stdout: True if the logs must be written to stdout 21 | sep: string to be used to separate arguments of printing 22 | end: string to be after the last argument of printing 23 | """ 24 | self._log_file = open(log_path, 'w') 25 | self._print_stdout = print_stdout 26 | self._sep = sep 27 | self._end = end 28 | 29 | def log(self, *args): 30 | """Logs whatever is present in args with current date and time 31 | 32 | Uses instance variables self._sep for separating elements of args and 33 | self._end after the last element of args. Writes to the log file 34 | self._log_file. If self._print_stdout is True, logs are also written to 35 | stdout 36 | 37 | Args: 38 | args: List of elements to be logged 39 | """ 40 | to_print = str(dt.now()) + ': ' 41 | for i in args: 42 | to_print += self._sep + str(i) 43 | self._log_file.write(to_print + self._end) 44 | self._log_file.flush() 45 | if self._print_stdout: 46 | print(to_print, end=self._end) 47 | sys.stdout.flush() 48 | 49 | def __del__(self): 50 | """Close the log file when no references to the instance remain 51 | """ 52 | self._log_file.close() 53 | 54 | class DatasetFetcher(): 55 | """An instance of DatasetFetcher is used to obtain the dataset from 56 | the internet 57 | """ 58 | 59 | def __init__(self, key, secret, logger): 60 | """Initializes an instance of DatasetFetcher 61 | 62 | Args: 63 | key: key to be used for authentication 64 | secret: secret to be used for authentication 65 | logger: An instance of Logger to be used for logging purposed by public 66 | member functions 67 | """ 68 | auth = tweepy.AppAuthHandler(key, secret) 69 | self._api = tweepy.API(auth, retry_count=5) 70 | self._visited = None 71 | self._graph = None 72 | self._logger = 
logger 73 | 74 | def _print_api_rem(self): 75 | """Print remaining quota for friends listing and followers listing endpoints 76 | """ 77 | try: 78 | temp = self._api.rate_limit_status() 79 | except tweepy.RateLimitError: 80 | self._logger.log('Rate limit API limit reached') 81 | except Exception as e: 82 | self._logger.log('API limit exception: ', repr(e)) 83 | else: 84 | self._logger.log('Friends endpoint remaining: ', 85 | temp['resources']['friends']['/friends/list']['remaining']) 86 | self._logger.log('Followers endpoint remaining: ', 87 | temp['resources']['followers']['/followers/list']['remaining']) 88 | 89 | def _handle_limit(self, cursor, friends_or_followers): 90 | """Handles rate limits given a cursor 91 | """ 92 | while True: 93 | try: 94 | yield cursor.next() 95 | except tweepy.RateLimitError: 96 | try: 97 | reset_time = self._api.rate_limit_status()['resources'][friends_or_followers]['/' + friends_or_followers + '/list']['reset'] 98 | except tweepy.RateLimitError: 99 | self._logger.log('Sleeping for', 15 * 60, 'seconds') 100 | time.sleep(15 * 60) 101 | except Exception as e: 102 | self._logger.log('Unexpected exception thrown: ', repr(e)) 103 | self._logger.log('Sleeping for', 15 * 60, 'seconds') 104 | time.sleep(15 * 60) 105 | else: 106 | self._logger.log('Sleeping for', max(reset_time - time.time() + 1, 1), 'seconds') 107 | time.sleep(max(reset_time - time.time() + 1, 1)) 108 | except tweepy.TweepError as e: 109 | self._logger.log('tweepy.TweepError: code:', repr(e)) 110 | break 111 | 112 | 113 | def get_dataset( 114 | self, seed_user, friends_limit, followers_limit, limit, live_save, 115 | users_path, adj_list_path): 116 | """Obtain the dataset 117 | 118 | Args: 119 | seed_user: id/screen_name/name of the user to start the bfs with 120 | friends_limit: Maximum number of friends to consider for each user 121 | followers_limit: Maximum number of followers to consider for each user 122 | limit: Maximum number of users to find friends and followers of 
123 | live_save: Whether to save computed data frequently 124 | users_path: Path to the file where the users info will be stored 125 | 126 | adj_list_path: 127 | """ 128 | 129 | # Each node has three possible states - 130 | # unvisited, visited but not explored, explored 131 | 132 | # each key-value pair is of the form 133 | # id: {'name': '', 'screen_name': ''} 134 | # serves two purposes - 135 | # ids in this are those that are visited 136 | # stores user info corresponding to each id 137 | self._visited = {} 138 | 139 | # each key-value pair is of the form 140 | # id: {'friends': [], 'followers': []} 141 | # set of ids in graph equal to set of ids in visited 142 | self._graph = {} 143 | 144 | # ids that have been visited (and hence their info is in visited dict) 145 | # but not yet explored 146 | boundary = queue.Queue() 147 | 148 | # Initialise 149 | seed_user = self._api.get_user(seed_user) 150 | self._visited[seed_user.id] = { 151 | 'name': seed_user.name, 152 | 'screen_name': seed_user.screen_name 153 | } 154 | self._graph[seed_user.id] = { 155 | 'friends': [], 156 | 'followers': [] 157 | } 158 | boundary.put(seed_user.id) 159 | 160 | # Explore users as long as the total number of visited users is less than 161 | # limit 162 | should_break = False 163 | live_save_suffix = 0 164 | while True: 165 | self._logger.log('') 166 | self._print_api_rem() 167 | user_id = boundary.get() 168 | self._logger.log('Selected:', self._visited[user_id]['screen_name'], 169 | ',', self._visited[user_id]['name'], ',', user_id) 170 | 171 | # Find friends 172 | self._logger.log('Finding friends..') 173 | cnt = 0 174 | for friend in self._handle_limit( 175 | tweepy.Cursor(self._api.friends, user_id=user_id).items(friends_limit), 'friends'): 176 | 177 | cnt += 1 178 | self._graph[user_id]['friends'].append(friend.id) 179 | if friend.id not in self._visited: 180 | self._visited[friend.id] = { 181 | 'name': friend.name, 182 | 'screen_name': friend.screen_name 183 | } 184 | 
self._graph[friend.id] = { 185 | 'friends': [], 186 | 'followers': [] 187 | } 188 | boundary.put(friend.id) 189 | if len(self._visited) >= limit: 190 | should_break = True 191 | break 192 | self._logger.log('Found', cnt, 'friends') 193 | 194 | if should_break: 195 | break 196 | 197 | # Find followers 198 | self._logger.log('Finding followers..') 199 | cnt = 0 200 | for follower in self._handle_limit( 201 | tweepy.Cursor(self._api.followers, user_id=user_id).items(followers_limit), 'followers'): 202 | 203 | cnt += 1 204 | self._graph[user_id]['followers'].append(follower.id) 205 | if follower.id not in self._visited: 206 | self._visited[follower.id] = { 207 | 'name': follower.name, 208 | 'screen_name': follower.screen_name 209 | } 210 | self._graph[follower.id] = { 211 | 'friends': [], 212 | 'followers': [] 213 | } 214 | boundary.put(follower.id) 215 | if len(self._visited) >= limit: 216 | should_break = True 217 | break 218 | self._logger.log('Found', cnt, 'followers') 219 | 220 | self._logger.log('Latest save suffix: ', live_save_suffix % 2) 221 | if live_save: 222 | self.save_dataset(users_path + str(live_save_suffix % 2), adj_list_path + str(live_save_suffix % 2)) 223 | live_save_suffix += 1 224 | 225 | if should_break: 226 | break 227 | 228 | # Number of visited users is now equal to limit. Now find friends and 229 | # followers of visited but unexplored users. 
Among these, consider only 230 | # those that have already been visited, thus not increasing the number 231 | # of users visited 232 | self._logger.log('') 233 | self._logger.log('Boundary..') 234 | while not boundary.empty(): 235 | self._logger.log('') 236 | self._print_api_rem() 237 | user_id = boundary.get() 238 | self._logger.log('Selected:', self._visited[user_id]['screen_name'], 239 | ',', self._visited[user_id]['name'], ',', user_id) 240 | 241 | # Find friends 242 | self._logger.log('Finding friends..') 243 | cnt = 0 244 | cnt2 = 0 245 | for friend in self._handle_limit( 246 | tweepy.Cursor(self._api.friends, user_id=user_id).items(friends_limit), 'friends'): 247 | 248 | cnt += 1 249 | if friend.id in self._visited: 250 | cnt2 += 1 251 | self._graph[user_id]['friends'].append(friend.id) 252 | self._logger.log('Found', cnt, 'friends') 253 | self._logger.log('Used', cnt2, 'friends') 254 | 255 | # Find followers 256 | self._logger.log('Finding followers..') 257 | cnt = 0 258 | cnt2 = 0 259 | for follower in self._handle_limit( 260 | tweepy.Cursor(self._api.followers, user_id=user_id).items(followers_limit), 'followers'): 261 | 262 | cnt += 1 263 | if follower.id in self._visited: 264 | cnt2 += 1 265 | self._graph[user_id]['followers'].append(follower.id) 266 | self._logger.log('Found', cnt, 'followers') 267 | self._logger.log('Used', cnt2, 'followers') 268 | 269 | self._logger.log('Latest save suffix: ', live_save_suffix % 2) 270 | if live_save: 271 | self.save_dataset(users_path + str(live_save_suffix % 2), 272 | adj_list_path + str(live_save_suffix % 2)) 273 | live_save_suffix += 1 274 | 275 | self._logger.log('Queue size:', boundary.qsize()) 276 | 277 | def save_dataset(self, users_path, adj_list_path): 278 | """Save the dataset obtained by get_dataset 279 | 280 | Args: 281 | users_path: Path to the file where users info will be stored 282 | adj_list_path: Path to the file where the adjacency list will be stored 283 | """ 284 | if users_path != '': 285 | 
with open(users_path, mode='wb') as f: 286 | try: 287 | pickle.dump(self._visited, f) 288 | except Exception as e: 289 | self._logger.log('Exception while saving users:', repr(e)) 290 | 291 | if adj_list_path != '': 292 | with open(adj_list_path, mode='wb') as f: 293 | try: 294 | pickle.dump(self._graph, f) 295 | except Exception as e: 296 | self._logger.log('Exception while saving adjacency list:', repr(e)) 297 | 298 | class ListToMatrixConverter(): 299 | """An instance of ListToMatrixConverter is used to convert the data obtained 300 | by the dataset fetcher from adjacency list form to a matrix form (and an 301 | index-to-userid map) 302 | """ 303 | 304 | def __init__(self, adj_list_path): 305 | """Initializes an instance of ListToMatrixConverter 306 | 307 | Args: 308 | adj_list_path: Path to the file where the adjacency list is stored 309 | """ 310 | with open(adj_list_path, 'rb') as f: 311 | self._adj_list = pickle.load(f) 312 | self._link_matrix = None 313 | self._index_id_map = None 314 | 315 | def convert(self): 316 | """Use the adjacency list to create the link matrix and a dictionary that 317 | maps the index in the link matrix to a user id 318 | """ 319 | 320 | # Put contents of self._adj_list in a matrix 321 | size = len(self._adj_list) 322 | self._link_matrix = np.zeros((size, size), dtype=int) 323 | 324 | # Map each user id to its matrix index for O(1) lookups 325 | id_index_map = {} 326 | index = 0 327 | for user_id in self._adj_list: 328 | id_index_map[user_id] = index 329 | index += 1 330 | 331 | for user_id in self._adj_list: 332 | for friend_id in self._adj_list[user_id]['friends']: 333 | self._link_matrix[id_index_map[user_id], id_index_map[friend_id]] = 1 334 | for follower_id in self._adj_list[user_id]['followers']: 335 | self._link_matrix[id_index_map[follower_id], id_index_map[user_id]] = 1 336 | 337 | self._index_id_map = {} 338 | for i in id_index_map: 339 | self._index_id_map[id_index_map[i]] = i 340 | 341 | def save(self, map_path, link_matrix_path, use_sparse=False): 342 | """Saves the map and
link matrix created using the convert function 343 | 344 | Args: 345 | map_path: Path to the file where the map from link matrix index to 346 | user id is to be stored 347 | link_matrix_path: Path to the file where the link matrix is to be stored 348 | use_sparse: True if the link matrix is to be stored as a sparse matrix 349 | """ 350 | if map_path != '': 351 | with open(map_path, 'wb') as f: 352 | try: 353 | pickle.dump(self._index_id_map, f) 354 | except Exception as e: 355 | print('Exception:', repr(e)) # ListToMatrixConverter has no self._logger, so log to stdout 356 | 357 | if link_matrix_path != '': 358 | with open(link_matrix_path, mode='wb') as f: 359 | if use_sparse: 360 | try: 361 | sparse.save_npz(f, sparse.csr_matrix(self._link_matrix)) 362 | except Exception as e: 363 | print('Exception:', repr(e)) 364 | else: 365 | try: 366 | np.save(f, self._link_matrix) 367 | except Exception as e: 368 | print('Exception:', repr(e)) 369 | 370 | 371 | def main(): 372 | 373 | key = 'j5idDIRvUfwI1213Nr14Drh33' 374 | secret = 'jOw1Dgt8dJlu4rPh3GeoGofnIV5VKLkZ8fOQqYk1zUsaSMJnVl' 375 | seed_user = 'Genius1238' 376 | 377 | log_path = 'logs.txt' 378 | 379 | users_path = '../data/users' 380 | adj_list_path = '../data/adj_list' 381 | map_path = '../data/map' 382 | dense_link_matrix_path = '../data/dense_link_matrix' 383 | sparse_link_matrix_path = '../data/sparse_link_matrix' 384 | 385 | users_temp_path = '../data/temp/users_' 386 | adj_list_temp_path = '../data/temp/adj_list_' 387 | 388 | friends_limit = 200 389 | followers_limit = 200 390 | limit = 500 391 | 392 | logger = Logger(log_path) 393 | 394 | # Fetch the dataset, store info of all users and store the adjacency list 395 | app = DatasetFetcher(key, secret, logger) 396 | logger.log('Obtaining dataset..') 397 | app.get_dataset( 398 | seed_user, friends_limit, followers_limit, limit, True, users_temp_path, 399 | adj_list_temp_path) 400 | logger.log('Dataset obtained') 401 | app.save_dataset(users_path, adj_list_path) 402 | 403 | # Create the link
matrix and map using the adjacency list created 404 | # previously and save them 405 | c = ListToMatrixConverter(adj_list_path) 406 | c.convert() 407 | c.save(map_path, dense_link_matrix_path, use_sparse=False) 408 | 409 | # The converted matrix can be reused; no need to reload and 410 | # reconvert the adjacency list for the sparse copy 411 | c.save(map_path, sparse_link_matrix_path, use_sparse=True) 412 | logger.log('Dataset Saved') 413 | 414 | if __name__ == '__main__': 415 | main() 416 | -------------------------------------------------------------------------------- /src/hits.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import scipy.sparse as sparse 3 | import time 4 | import pickle 5 | from igraph import Graph, plot 6 | from dataset_fetcher import ListToMatrixConverter 7 | import matplotlib.pyplot as plt 8 | import matplotlib.patches as mp 9 | 10 | 11 | debug = False 12 | 13 | class HITS(): 14 | """An instance of HITS is used to model the idea of hubs and authorities 15 | and execute the corresponding algorithm 16 | """ 17 | 18 | def __init__(self, link_matrix, users, index_id_map, is_sparse=False): 19 | """ 20 | Initializes an instance of HITS 21 | 22 | Args: 23 | link_matrix: The link matrix 24 | users: Details of all users 25 | index_id_map: Dictionary representing a map from link matrix index 26 | to user id 27 | is_sparse: True if the link matrix is a sparse matrix 28 | """ 29 | self.__is_sparse = is_sparse 30 | self.__link_matrix = link_matrix 31 | self.__link_matrix_tr = link_matrix.transpose() 32 | self.__n = self.__link_matrix.shape[0] 33 | self.__hubs = np.ones(self.__n) 34 | self.__auths = np.ones(self.__n) 35 | self.__size = 30 36 | self.__names = [users[index_id_map[i]]['screen_name'] for i in range(0,self.__size)] 37 | self.__index_id_map = index_id_map 38 | self.__users = users 39 | self.all_hubs = [] 40 | self.all_auths = [] 41 | 42 | def calc_scores(self, epsilon=1e-4): 43 | """Calculates hubbiness and authority 44 | """ 45 |
epsilon_matrix = epsilon * np.ones(self.__n) 46 | if self.__is_sparse: 47 | while True: 48 | hubs_old = self.__hubs 49 | auths_old = self.__auths 50 | 51 | self.__auths = self.__link_matrix_tr * hubs_old 52 | max_score = self.__auths.max(axis=0) 53 | if max_score != 0: 54 | self.__auths = self.__auths / max_score 55 | self.all_auths.append(self.__auths) 56 | 57 | self.__hubs = self.__link_matrix * self.__auths 58 | max_score = self.__hubs.max(axis=0) 59 | if max_score != 0: 60 | self.__hubs = self.__hubs / max_score 61 | self.all_hubs.append(self.__hubs) 62 | 63 | if (((abs(self.__hubs - hubs_old)) < epsilon_matrix).all()) and (((abs(self.__auths - auths_old)) < epsilon_matrix).all()): 64 | break 65 | 66 | else: 67 | while True: 68 | hubs_old = self.__hubs 69 | auths_old = self.__auths 70 | 71 | self.__auths = np.dot(self.__link_matrix_tr, hubs_old) 72 | max_score = self.__auths.max(axis=0) 73 | if max_score != 0: 74 | self.__auths = self.__auths / max_score 75 | self.all_auths.append(self.__auths) 76 | 77 | self.__hubs = np.dot(self.__link_matrix, self.__auths) 78 | max_score = self.__hubs.max(axis=0) 79 | if max_score != 0: 80 | self.__hubs = self.__hubs / max_score 81 | self.all_hubs.append(self.__hubs) 82 | 83 | if (((abs(self.__hubs - hubs_old)) < epsilon_matrix).all()) and (((abs(self.__auths - auths_old)) < epsilon_matrix).all()): 84 | break 85 | 86 | def get_all_hubs(self): 87 | """Returns the hubbiness score for each user for each iteration 88 | """ 89 | return self.all_hubs 90 | 91 | def get_all_auths(self): 92 | """Returns the authority score for each user for each iteration 93 | """ 94 | return self.all_auths 95 | 96 | def get_hubs(self): 97 | """Returns the hubbiness for each node (user) 98 | """ 99 | return self.__hubs 100 | 101 | def get_auths(self): 102 | """Returns the authority for each node (user) 103 | """ 104 | return self.__auths 105 | 106 | def get_names(self): 107 | """Returns the screen name of each user 108 | """ 109 | return self.__names 
110 | 111 | def plot_graph(self, x, names, c): 112 | """Plots the graph 113 | """ 114 | if self.__is_sparse: 115 | g = Graph.Adjacency((self.__link_matrix[0:self.__size, 0:self.__size]).toarray().tolist()) 116 | else: 117 | g = Graph.Adjacency((self.__link_matrix[0:self.__size, 0:self.__size]).tolist()) 118 | g.vs["name"] = names 119 | g.vs["attr"] = ["%.3f" % k for k in x] 120 | 121 | array_min = 0 122 | if x.min(axis=0) < 0.001: 123 | array_min = 0.001 124 | else: 125 | array_min = x.min(axis=0) 126 | 127 | ###layout### 128 | layout = g.layout("kk") 129 | visual_style = {} 130 | visual_style["vertex_size"] = [(x[i]/array_min)*0.3 if x[i]>=0.001 else 10 for i in range(0,min(self.__size,len(x)))] 131 | visual_style["vertex_label"] = [(g.vs["name"][i],float(g.vs["attr"][i])) for i in range(0,min(self.__size,len(x)))] 132 | color_dict = {"0":"red" , "1":"yellow"} 133 | g.vs["color"] = color_dict[str(c)] 134 | visual_style["edge_arrow_size"]=2 135 | visual_style["vertex_label_size"]=35 136 | visual_style["layout"] = layout 137 | visual_style["bbox"] = (3200, 2200) 138 | visual_style["margin"] = 250 139 | visual_style["edge_width"] = 4 140 | plot(g, **visual_style) 141 | 142 | def plot_stats(self): 143 | screen_name_index_map = {} 144 | for key in self.__index_id_map: 145 | screen_name_index_map[self.__users[self.__index_id_map[key]]['screen_name']] = key 146 | 147 | cands = ['austinnotduncan', 'str_mape', 'LeoDiCaprio', 'aidanf123', 'MKBHD'] 148 | colors = ['green', 'cyan', 'magenta', 'blue', 'brown'] 149 | all_hubs = np.array(self.all_hubs) 150 | all_auths = np.array(self.all_auths) 151 | 152 | plt.figure(1, figsize=(12, 7)) 153 | ax = plt.gca() 154 | ax.set_xlabel("Iterations") 155 | ax.set_ylabel("Hubbiness Score") 156 | legend_handles = [] 157 | for i in range(len(cands)): 158 | legend_handles.append(mp.Patch(label=cands[i], color=colors[i])) 159 | ax.plot(np.arange(1, all_hubs.shape[0] + 1), all_hubs[:, screen_name_index_map[cands[i]]], color=colors[i]) 160 | 
ax.legend(handles=legend_handles) 161 | ax.set_title("Change in hubbiness score with increasing iterations") 162 | plt.show() 163 | 164 | plt.figure(2, figsize=(12, 7)) 165 | ax = plt.gca() 166 | ax.set_xlabel("Iterations") 167 | ax.set_ylabel("Authority Score") 168 | legend_handles = [] 169 | for i in range(len(cands)): 170 | legend_handles.append(mp.Patch(label=cands[i], color=colors[i])) 171 | ax.plot(np.arange(1, all_auths.shape[0] + 1), all_auths[:, screen_name_index_map[cands[i]]], color=colors[i]) 172 | ax.legend(handles=legend_handles) 173 | ax.set_title("Change in authority score with increasing iterations") 174 | plt.show() 175 | 176 | class DatasetReader(): 177 | """An instance of DatasetReader is used to read different files from the 178 | dataset 179 | """ 180 | 181 | def __init__(self): 182 | """Initializes an instance of DatasetReader 183 | """ 184 | pass 185 | 186 | def read_users(self, users_path): 187 | """Returns the dictionary (stored in a file) containing details of 188 | all users 189 | 190 | Args: 191 | users_path: Path to the file where info of all users is stored 192 | """ 193 | with open(users_path, mode='rb') as f: 194 | users = pickle.load(f) 195 | return users 196 | 197 | def read_map(self, map_path): 198 | """Returns the dictionary (stored in a file) that represents a map 199 | from the link matrix index to user id 200 | 201 | Args: 202 | map_path: Path to the file where the map is stored 203 | """ 204 | with open(map_path, mode='rb') as f: 205 | index_id_map = pickle.load(f) 206 | return index_id_map 207 | 208 | def read_link_matrix(self, link_matrix_path, is_sparse=False): 209 | """Returns the array (stored in a file) that represents the link matrix 210 | 211 | Args: 212 | link_matrix_path: Path to the file where the link matrix is stored 213 | is_sparse: True if the link matrix is stored as a sparse matrix 214 | """ 215 | with open(link_matrix_path, mode='rb') as f: 216 | if is_sparse: 217 | link_matrix = 
sparse.load_npz(f) 218 | else: 219 | link_matrix = np.load(f) 220 | return link_matrix 221 | 222 | 223 | def main(): 224 | use_sparse = True # "sparse" would shadow the scipy.sparse import 225 | epsilon = 1e-10 226 | show_iters = False 227 | 228 | users_path = '../data/users' 229 | map_path = '../data/map' 230 | sparse_link_matrix_path = '../data/sparse_link_matrix' 231 | dense_link_matrix_path = '../data/dense_link_matrix' 232 | if use_sparse: 233 | link_matrix_path = sparse_link_matrix_path 234 | else: 235 | link_matrix_path = dense_link_matrix_path 236 | 237 | # Load the stored data into objects 238 | r = DatasetReader() 239 | users = r.read_users(users_path) 240 | index_id_map = r.read_map(map_path) 241 | link_matrix = r.read_link_matrix(link_matrix_path, is_sparse=use_sparse) 242 | 243 | # Run the algorithm 244 | h = HITS(link_matrix, users, index_id_map, is_sparse=use_sparse) 245 | h.calc_scores(epsilon=epsilon) 246 | 247 | if show_iters: 248 | x = h.get_all_hubs() 249 | for i in x: 250 | h.plot_graph(i, h.get_names(), 0) 251 | 252 | y = h.get_all_auths() 253 | for i in y: 254 | h.plot_graph(i, h.get_names(), 1) 255 | else: 256 | h.plot_graph(h.get_hubs(), h.get_names(), 0) 257 | h.plot_graph(h.get_auths(), h.get_names(), 1) 258 | 259 | # Plot how the scores change across iterations 260 | h.plot_stats() 261 | 262 | if __name__ == '__main__': 263 | main() --------------------------------------------------------------------------------
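The two branches of `HITS.calc_scores` above implement the same normalized power iteration: multiply by the transpose of the link matrix to get authority scores, multiply by the link matrix to get hub scores, rescale each vector by its maximum, and stop once successive vectors differ by less than epsilon in every component. Below is a minimal self-contained sketch of that update on a toy three-node graph; the function name `hits_scores` and the toy matrix `L` are illustrative, not part of this repository:

```python
import numpy as np

def hits_scores(link_matrix, epsilon=1e-8, max_iters=1000):
    """Hub/authority power iteration in the style of HITS.calc_scores:
    auths = L^T @ hubs, hubs = L @ auths, each rescaled by its maximum,
    stopping when both vectors change by less than epsilon everywhere."""
    n = link_matrix.shape[0]
    hubs = np.ones(n)
    auths = np.ones(n)
    for _ in range(max_iters):
        hubs_old, auths_old = hubs, auths
        auths = link_matrix.T @ hubs_old
        if auths.max() != 0:
            auths = auths / auths.max()
        hubs = link_matrix @ auths
        if hubs.max() != 0:
            hubs = hubs / hubs.max()
        if (np.abs(hubs - hubs_old) < epsilon).all() and \
           (np.abs(auths - auths_old) < epsilon).all():
            break
    return hubs, auths

# Toy 3-node graph: user 0 follows users 1 and 2, user 1 follows user 2.
# User 0 is then the best hub and user 2 the best authority.
L = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]])
hubs, auths = hits_scores(L)
```

With max-normalization the scores converge to a fixed point rather than a unit-norm eigenvector: here the top hub and top authority get score 1, and the middle user gets roughly 0.618 on both scales.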