├── src
│   ├── README.md
│   ├── expomf.py
│   ├── rec_eval.py
│   ├── expomf_cov.py
│   ├── Location_ExpoMF_Gowalla.ipynb
│   ├── processGowalla.ipynb
│   └── processTasteProfile.ipynb
├── .gitignore
└── README.md
/src/README.md:
--------------------------------------------------------------------------------
 1 | # Experimental Results
 2 | By getting the data and running the following notebooks, you should be able to reproduce the exact results in the paper.
 3 | 
 4 | ## Taste Profile Subset
 5 | - [processTasteProfile.ipynb](./processTasteProfile.ipynb): pre-process the data and create the train/test/validation splits.
 6 | - [ExpoMF_tasteProfileSub.ipynb](./ExpoMF_tasteProfileSub.ipynb): train the model and evaluate on the heldout test set.
 7 | 
 8 | ## Gowalla (with location exposure covariates)
 9 | - [processGowalla.ipynb](./processGowalla.ipynb): pre-process the data and create the train/test/validation splits.
10 | - [Location_ExpoMF_Gowalla.ipynb](./Location_ExpoMF_Gowalla.ipynb): train the model with location exposure covariates and evaluate on the heldout test set.
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
 1 | # Byte-compiled / optimized / DLL files
 2 | __pycache__/
 3 | *.py[cod]
 4 | 
 5 | # VIM and Mac stuff
 6 | *.swp
 7 | .DS_Store
 8 | 
 9 | # C extensions
10 | *.so
11 | 
12 | # Distribution / packaging
13 | .Python
14 | env/
15 | build/
16 | develop-eggs/
17 | dist/
18 | downloads/
19 | eggs/
20 | .eggs/
21 | lib/
22 | lib64/
23 | parts/
24 | sdist/
25 | var/
26 | *.egg-info/
27 | .installed.cfg
28 | *.egg
29 | 
30 | # PyInstaller
31 | # Usually these files are written by a python script from a template
32 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
33 | *.manifest
34 | *.spec
35 | 
36 | # Installer logs
37 | pip-log.txt
38 | pip-delete-this-directory.txt
39 | 
40 | # Unit test / coverage reports
41 | htmlcov/
42 | .tox/
43 | .coverage
44 | .coverage.*
45 | .cache
46 | nosetests.xml
47 | coverage.xml
48 | *.cover
49 | 
50 | # Translations
51 | *.mo
52 | *.pot
53 | 
54 | # Django stuff:
55 | *.log
56 | 
57 | # Sphinx documentation
58 | docs/_build/
59 | 
60 | # PyBuilder
61 | target/
62 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # ExpoMF
 2 | This repository contains the source code to reproduce all the experimental results described in the paper ["Modeling User Exposure in Recommendation"](http://arxiv.org/abs/1510.07025) (WWW'16).
 3 | 
 4 | ## Dependencies
 5 | The Python module dependencies are:
 6 | - numpy/scipy
 7 | - scikit-learn
 8 | - joblib
 9 | - bottleneck
10 | - pandas (needed to run the example for data preprocessing)
11 | 
12 | **Note**: The code is written for Python 2.7. It is still usable under Python 3.x with minor modifications (e.g. replacing `xrange` with `range`). If you run into any problem with Python 3.x, feel free to contact me and I will try to get back to you with a helpful solution.
13 | 
14 | ## Datasets
15 | - [Taste Profile Subset](http://labrosa.ee.columbia.edu/millionsong/tasteprofile)
16 | - [Gowalla](https://snap.stanford.edu/data/loc-gowalla.html): the pre-processed data that we used in the paper can be downloaded [here](http://dawenl.github.io/data/gowalla_pro.zip).
17 | 
18 | We also used the arXiv and Mendeley datasets in the paper; however, these are not publicly available. With the Taste Profile Subset and Gowalla, we can still cover all the variations of the model presented in the paper.
19 | 
20 | We used the weighted matrix factorization (WMF) implementation in the [content_wmf](https://github.com/dawenl/content_wmf) repository.
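The core computation in `src/expomf.py` is the E-step posterior over the exposure variables: for an unclicked entry, p(a_ui = 1 | y_ui = 0) is proportional to mu_i times the Gaussian likelihood N(0 | theta_u' beta_i, 1/lam_y). Here is a self-contained numpy sketch of that step on toy data — the variable names mirror the source, but the snippet is illustrative only (it omits the EPS smoothing used in `a_row_batch`):

```python
import numpy as np

np.random.seed(0)
n_users, n_items, n_components = 4, 6, 3

lam_y = 1.0                      # inverse variance of the observation model
mu = np.full(n_items, 0.01)      # per-item exposure prior mu_i
theta = 0.01 * np.random.randn(n_users, n_components)  # user factors
beta = 0.01 * np.random.randn(n_items, n_components)   # item factors

Y = np.zeros((n_users, n_items))  # toy click matrix
Y[0, 1] = Y[2, 3] = 1.

# likelihood of y_ui = 0 under N(theta_u' beta_i, 1 / lam_y)
pEX = np.sqrt(lam_y / (2 * np.pi)) * \
    np.exp(-lam_y * theta.dot(beta.T) ** 2 / 2)
# posterior probability that user u was exposed to item i
A = pEX / (pEX + (1 - mu) / mu)
A[Y.nonzero()] = 1.               # a click implies exposure with certainty
```

With a sparse prior (mu = 0.01) and near-zero predicted scores, every unclicked entry gets a small exposure probability (about 0.004 here), which is exactly the mechanism that down-weights unclicked items in the ALS updates.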
21 | 22 | ## Examples 23 | See example notebooks in `src/`. 24 | -------------------------------------------------------------------------------- /src/expomf.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | Exposure Matrix Factorization for collaborative filtering 4 | 5 | CREATED: 2015-05-28 01:16:56 by Dawen Liang 6 | 7 | """ 8 | 9 | 10 | import os 11 | import sys 12 | import time 13 | import numpy as np 14 | from numpy import linalg as LA 15 | 16 | from joblib import Parallel, delayed 17 | from math import sqrt 18 | from sklearn.base import BaseEstimator, TransformerMixin 19 | 20 | import rec_eval 21 | 22 | floatX = np.float32 23 | EPS = 1e-8 24 | 25 | 26 | class ExpoMF(BaseEstimator, TransformerMixin): 27 | def __init__(self, n_components=100, max_iter=10, batch_size=1000, 28 | init_std=0.01, n_jobs=8, random_state=None, save_params=False, 29 | save_dir='.', early_stopping=False, verbose=False, **kwargs): 30 | ''' 31 | Exposure matrix factorization 32 | 33 | Parameters 34 | --------- 35 | n_components : int 36 | Number of latent factors 37 | max_iter : int 38 | Maximal number of iterations to perform 39 | batch_size: int 40 | Batch size to perform parallel update 41 | init_std: float 42 | The latent factors will be initialized as Normal(0, init_std**2) 43 | n_jobs: int 44 | Number of parallel jobs to update latent factors 45 | random_state : int or RandomState 46 | Pseudo random number generator used for sampling 47 | save_params: bool 48 | Whether to save parameters after each iteration 49 | save_dir: str 50 | The directory to save the parameters 51 | early_stopping: bool 52 | Whether to early stop the training by monitoring performance on 53 | validation set 54 | verbose : bool 55 | Whether to show progress during model fitting 56 | **kwargs: dict 57 | Model hyperparameters 58 | ''' 59 | self.n_components = n_components 60 | self.max_iter = max_iter 61 | self.batch_size = batch_size 62 | self.init_std = 
init_std
 63 |         self.n_jobs = n_jobs
 64 |         self.random_state = random_state
 65 |         self.save_params = save_params
 66 |         self.save_dir = save_dir
 67 |         self.early_stopping = early_stopping
 68 |         self.verbose = verbose
 69 | 
 70 |         if type(self.random_state) is int:
 71 |             np.random.seed(self.random_state)
 72 |         elif self.random_state is not None:
 73 |             np.random.set_state(self.random_state)
 74 | 
 75 |         self._parse_kwargs(**kwargs)
 76 | 
 77 |     def _parse_kwargs(self, **kwargs):
 78 |         ''' Model hyperparameters
 79 | 
 80 |         Parameters
 81 |         ---------
 82 |         lambda_theta, lambda_beta: float
 83 |             Regularization parameters for user (lambda_theta) and item
 84 |             (lambda_beta) factors. Default value 1e-5. Since in implicit
 85 |             feedback all the n_users-by-n_items data points are used for
 86 |             training, overfitting is almost never an issue.
 87 |         lam_y: float
 88 |             Inverse variance of the observation model. Default value 1.0.
 89 |         init_mu: float
 90 |             All the \mu_{i} will be initialized as init_mu. Default value is
 91 |             0.01. This number should change according to the sparsity of the
 92 |             data (sparser data with smaller init_mu).
In the experiments, we
 93 |             selected this value on the validation set.
 94 |         a, b: float
 95 |             Prior on \mu_{i} ~ Beta(a, b). Default value 1.0 for both.
 96 |         '''
 97 |         self.lam_theta = float(kwargs.get('lambda_theta', 1e-5))
 98 |         self.lam_beta = float(kwargs.get('lambda_beta', 1e-5))
 99 |         self.lam_y = float(kwargs.get('lam_y', 1.0))
100 |         self.init_mu = float(kwargs.get('init_mu', 0.01))
101 |         self.a = float(kwargs.get('a', 1.0))
102 |         self.b = float(kwargs.get('b', 1.0))
103 | 
104 |     def _init_params(self, n_users, n_items):
105 |         ''' Initialize all the latent factors '''
106 |         self.theta = self.init_std * \
107 |             np.random.randn(n_users, self.n_components).astype(floatX)
108 |         self.beta = self.init_std * \
109 |             np.random.randn(n_items, self.n_components).astype(floatX)
110 |         self.mu = self.init_mu * np.ones(n_items, dtype=floatX)
111 | 
112 |     def fit(self, X, vad_data=None, **kwargs):
113 |         '''Fit the model to the data in X.
114 | 
115 |         Parameters
116 |         ----------
117 |         X : scipy.sparse.csr_matrix, shape (n_users, n_items)
118 |             Training data.
119 | 
120 |         vad_data: scipy.sparse.csr_matrix, shape (n_users, n_items)
121 |             Validation data.
122 | 
123 |         **kwargs: dict
124 |             Additional keywords to the evaluation function called on validation data
125 | 
126 |         Returns
127 |         -------
128 |         self: object
129 |             Returns the instance itself.
130 | ''' 131 | n_users, n_items = X.shape 132 | self._init_params(n_users, n_items) 133 | self._update(X, vad_data, **kwargs) 134 | return self 135 | 136 | def transform(self, X): 137 | pass 138 | 139 | def _update(self, X, vad_data, **kwargs): 140 | '''Model training and evaluation on validation set''' 141 | n_users = X.shape[0] 142 | XT = X.T.tocsr() # pre-compute this 143 | self.vad_ndcg = -np.inf 144 | for i in xrange(self.max_iter): 145 | if self.verbose: 146 | print('ITERATION #%d' % i) 147 | self._update_factors(X, XT) 148 | self._update_expo(X, n_users) 149 | if vad_data is not None: 150 | vad_ndcg = self._validate(X, vad_data, **kwargs) 151 | if self.early_stopping and self.vad_ndcg > vad_ndcg: 152 | break # we will not save the parameter for this iteration 153 | self.vad_ndcg = vad_ndcg 154 | 155 | if self.save_params: 156 | self._save_params(i) 157 | pass 158 | 159 | def _update_factors(self, X, XT): 160 | '''Update user and item collaborative factors with ALS''' 161 | if self.verbose: 162 | start_t = _writeline_and_time('\tUpdating user factors...') 163 | self.theta = recompute_factors(self.beta, self.theta, X, 164 | self.lam_theta / self.lam_y, 165 | self.lam_y, 166 | self.mu, 167 | self.n_jobs, 168 | batch_size=self.batch_size) 169 | if self.verbose: 170 | print('\r\tUpdating user factors: time=%.2f' 171 | % (time.time() - start_t)) 172 | start_t = _writeline_and_time('\tUpdating item factors...') 173 | self.beta = recompute_factors(self.theta, self.beta, XT, 174 | self.lam_beta / self.lam_y, 175 | self.lam_y, 176 | self.mu, 177 | self.n_jobs, 178 | batch_size=self.batch_size) 179 | if self.verbose: 180 | print('\r\tUpdating item factors: time=%.2f' 181 | % (time.time() - start_t)) 182 | sys.stdout.flush() 183 | pass 184 | 185 | def _update_expo(self, X, n_users): 186 | '''Update exposure prior''' 187 | if self.verbose: 188 | start_t = _writeline_and_time('\tUpdating exposure prior...') 189 | 190 | start_idx = range(0, n_users, self.batch_size) 191 | 
end_idx = start_idx[1:] + [n_users]
192 | 
193 |         A_sum = np.zeros_like(self.mu)
194 |         for lo, hi in zip(start_idx, end_idx):
195 |             A_sum += a_row_batch(X[lo:hi], self.theta[lo:hi], self.beta,
196 |                                  self.lam_y, self.mu).sum(axis=0)
197 |         self.mu = (self.a + A_sum - 1) / (self.a + self.b + n_users - 2)
198 |         if self.verbose:
199 |             print('\r\tUpdating exposure prior: time=%.2f'
200 |                   % (time.time() - start_t))
201 |             sys.stdout.flush()
202 |         pass
203 | 
204 |     def _validate(self, X, vad_data, **kwargs):
205 |         '''Compute validation metric (NDCG@k)'''
206 |         vad_ndcg = rec_eval.normalized_dcg_at_k(X, vad_data,
207 |                                                 self.theta,
208 |                                                 self.beta,
209 |                                                 **kwargs)
210 |         if self.verbose:
211 |             print('\tValidation NDCG@k: %.4f' % vad_ndcg)
212 |             sys.stdout.flush()
213 |         return vad_ndcg
214 | 
215 |     def _save_params(self, iter):
216 |         '''Save the parameters'''
217 |         if not os.path.exists(self.save_dir):
218 |             os.makedirs(self.save_dir)
219 |         filename = 'ExpoMF_K%d_mu%.1e_iter%d.npz' % (self.n_components,
220 |                                                      self.init_mu, iter)
221 |         np.savez(os.path.join(self.save_dir, filename), U=self.theta,
222 |                  V=self.beta, mu=self.mu)
223 | 
224 | 
225 | # Utility functions #
226 | 
227 | def _writeline_and_time(s):
228 |     sys.stdout.write(s)
229 |     sys.stdout.flush()
230 |     return time.time()
231 | 
232 | 
233 | def get_row(Y, i):
234 |     '''Given a scipy.sparse.csr_matrix Y, get the values and indices of the
235 |     non-zero values in the i-th row'''
236 |     lo, hi = Y.indptr[i], Y.indptr[i + 1]
237 |     return Y.data[lo:hi], Y.indices[lo:hi]
238 | 
239 | 
240 | def a_row_batch(Y_batch, theta_batch, beta, lam_y, mu):
241 |     '''Compute the posterior of the exposure latent variables A by batch'''
242 |     pEX = sqrt(lam_y / (2 * np.pi)) * \
243 |         np.exp(-lam_y * theta_batch.dot(beta.T)**2 / 2)
244 |     A = (pEX + EPS) / (pEX + EPS + (1 - mu) / mu)
245 |     A[Y_batch.nonzero()] = 1.
246 |     return A
247 | 
248 | 
249 | def _solve(k, A_k, X, Y, f, lam, lam_y, mu):
250 |     '''Update one single factor'''
251 |     s_u, i_u = get_row(Y, k)
252 |     a = np.dot(s_u * A_k[i_u], X[i_u])
253 |     B = X.T.dot(A_k[:, np.newaxis] * X) + lam * np.eye(f)
254 |     return LA.solve(B, a)
255 | 
256 | 
257 | def _solve_batch(lo, hi, X, X_old_batch, Y, m, f, lam, lam_y, mu):
258 |     '''Update factors by batch, will eventually call _solve() on each factor to
259 |     keep the parallel process busy'''
260 |     assert X_old_batch.shape[0] == hi - lo
261 | 
262 |     if mu.size == X.shape[0]:  # update users
263 |         A_batch = a_row_batch(Y[lo:hi], X_old_batch, X, lam_y, mu)
264 |     else:  # update items
265 |         A_batch = a_row_batch(Y[lo:hi], X_old_batch, X, lam_y, mu[lo:hi,
266 |                                                                   np.newaxis])
267 | 
268 |     X_batch = np.empty_like(X_old_batch, dtype=X_old_batch.dtype)
269 |     for ib, k in enumerate(xrange(lo, hi)):
270 |         X_batch[ib] = _solve(k, A_batch[ib], X, Y, f, lam, lam_y, mu)
271 |     return X_batch
272 | 
273 | 
274 | def recompute_factors(X, X_old, Y, lam, lam_y, mu, n_jobs, batch_size=1000):
275 |     '''Regress X to Y with exposure matrix (computed on-the-fly with X_old) and
276 |     ridge term lam, in an embarrassingly parallel fashion.
All the comments below
277 |     are written from the perspective of computing user factors'''
278 |     m, n = Y.shape  # m = number of users, n = number of items
279 |     assert X.shape[0] == n
280 |     assert X_old.shape[0] == m
281 |     f = X.shape[1]  # f = number of factors
282 | 
283 |     start_idx = range(0, m, batch_size)
284 |     end_idx = start_idx[1:] + [m]
285 |     res = Parallel(n_jobs=n_jobs)(delayed(_solve_batch)(
286 |         lo, hi, X, X_old[lo:hi], Y, m, f, lam, lam_y, mu)
287 |         for lo, hi in zip(start_idx, end_idx))
288 |     X_new = np.vstack(res)
289 |     return X_new
290 | 
--------------------------------------------------------------------------------
/src/rec_eval.py:
--------------------------------------------------------------------------------
 1 | import bottleneck as bn
 2 | import numpy as np
 3 | 
 4 | from scipy import sparse
 5 | 
 6 | 
 7 | """
 8 | All the data should be in the shape of (n_users, n_items)
 9 | All the latent factors should be in the shape of (n_users/n_items, n_components)
10 | 
11 | 1. train_data refers to the data that was used to train the model
12 | 2. heldout_data refers to the data that was used for evaluation (could be test
13 | set or validation set)
14 | 3.
vad_data refers to the validation data, which should be excluded from the
15 | candidate items; it should only be passed in when calculating test scores
16 | 
17 | """
18 | 
19 | 
20 | def prec_at_k(train_data, heldout_data, U, V, batch_users=5000, k=20,
21 |               mu=None, vad_data=None, agg=np.nanmean):
22 |     n_users = train_data.shape[0]
23 |     res = list()
24 |     for user_idx in user_idx_generator(n_users, batch_users):
25 |         res.append(precision_at_k_batch(train_data, heldout_data,
26 |                                         U, V.T, user_idx, k=k,
27 |                                         mu=mu, vad_data=vad_data))
28 |     mn_prec = np.hstack(res)
29 |     if callable(agg):
30 |         return agg(mn_prec)
31 |     return mn_prec
32 | 
33 | 
34 | def recall_at_k(train_data, heldout_data, U, V, batch_users=5000, k=20,
35 |                 mu=None, vad_data=None, agg=np.nanmean):
36 |     n_users = train_data.shape[0]
37 |     res = list()
38 |     for user_idx in user_idx_generator(n_users, batch_users):
39 |         res.append(recall_at_k_batch(train_data, heldout_data,
40 |                                      U, V.T, user_idx, k=k,
41 |                                      mu=mu, vad_data=vad_data))
42 |     mn_recall = np.hstack(res)
43 |     if callable(agg):
44 |         return agg(mn_recall)
45 |     return mn_recall
46 | 
47 | 
48 | def ric_rank_at_k(train_data, heldout_data, U, V, batch_users=5000, k=5,
49 |                   mu=None, vad_data=None):
50 |     n_users = train_data.shape[0]
51 |     res = list()
52 |     for user_idx in user_idx_generator(n_users, batch_users):
53 |         res.append(mean_rrank_at_k_batch(train_data, heldout_data,
54 |                                          U, V.T, user_idx, k=k,
55 |                                          mu=mu, vad_data=vad_data))
56 |     mrrank = np.hstack(res)
57 |     return mrrank[mrrank > 0].mean()
58 | 
59 | 
60 | def mean_perc_rank(train_data, heldout_data, U, V, batch_users=5000,
61 |                    mu=None, vad_data=None):
62 |     n_users = train_data.shape[0]
63 |     mpr = 0
64 |     for user_idx in user_idx_generator(n_users, batch_users):
65 |         mpr += mean_perc_rank_batch(train_data, heldout_data, U, V.T, user_idx,
66 |                                     mu=mu, vad_data=vad_data)
67 |     mpr /= heldout_data.sum()
68 |     return mpr
69 | 
70 | 
71 | def normalized_dcg(train_data, heldout_data, U, V, batch_users=5000,
72 |                    mu=None,
vad_data=None, agg=np.nanmean): 73 | n_users = train_data.shape[0] 74 | res = list() 75 | for user_idx in user_idx_generator(n_users, batch_users): 76 | res.append(NDCG_binary_batch(train_data, heldout_data, U, V.T, 77 | user_idx, mu=mu, vad_data=vad_data)) 78 | ndcg = np.hstack(res) 79 | if callable(agg): 80 | return agg(ndcg) 81 | return ndcg 82 | 83 | 84 | def normalized_dcg_at_k(train_data, heldout_data, U, V, batch_users=5000, 85 | k=100, mu=None, vad_data=None, agg=np.nanmean): 86 | 87 | n_users = train_data.shape[0] 88 | res = list() 89 | for user_idx in user_idx_generator(n_users, batch_users): 90 | res.append(NDCG_binary_at_k_batch(train_data, heldout_data, U, V.T, 91 | user_idx, k=k, mu=mu, 92 | vad_data=vad_data)) 93 | ndcg = np.hstack(res) 94 | if callable(agg): 95 | return agg(ndcg) 96 | return ndcg 97 | 98 | 99 | def map_at_k(train_data, heldout_data, U, V, batch_users=5000, k=100, mu=None, 100 | vad_data=None, agg=np.nanmean): 101 | 102 | n_users = train_data.shape[0] 103 | res = list() 104 | for user_idx in user_idx_generator(n_users, batch_users): 105 | res.append(MAP_at_k_batch(train_data, heldout_data, U, V.T, user_idx, 106 | k=k, mu=mu, vad_data=vad_data)) 107 | map = np.hstack(res) 108 | if callable(agg): 109 | return agg(map) 110 | return map 111 | 112 | 113 | # helper functions # 114 | 115 | def user_idx_generator(n_users, batch_users): 116 | ''' helper function to generate the user index to loop through the dataset 117 | ''' 118 | for start in xrange(0, n_users, batch_users): 119 | end = min(n_users, start + batch_users) 120 | yield slice(start, end) 121 | 122 | 123 | def _make_prediction(train_data, Et, Eb, user_idx, batch_users, mu=None, 124 | vad_data=None): 125 | n_songs = train_data.shape[1] 126 | # exclude examples from training and validation (if any) 127 | item_idx = np.zeros((batch_users, n_songs), dtype=bool) 128 | item_idx[train_data[user_idx].nonzero()] = True 129 | if vad_data is not None: 130 | 
item_idx[vad_data[user_idx].nonzero()] = True 131 | X_pred = Et[user_idx].dot(Eb) 132 | if mu is not None: 133 | if isinstance(mu, np.ndarray): 134 | assert mu.size == n_songs # mu_i 135 | X_pred *= mu 136 | elif isinstance(mu, dict): # func(mu_ui) 137 | params, func = mu['params'], mu['func'] 138 | args = [params[0][user_idx], params[1]] 139 | if len(params) > 2: # for bias term in document or length-scale 140 | args += [params[2][user_idx]] 141 | if not callable(func): 142 | raise TypeError("expecting a callable function") 143 | X_pred *= func(*args) 144 | else: 145 | raise ValueError("unsupported mu type") 146 | X_pred[item_idx] = -np.inf 147 | return X_pred 148 | 149 | 150 | def precision_at_k_batch(train_data, heldout_data, Et, Eb, user_idx, 151 | k=20, normalize=True, mu=None, vad_data=None): 152 | batch_users = user_idx.stop - user_idx.start 153 | 154 | X_pred = _make_prediction(train_data, Et, Eb, user_idx, 155 | batch_users, mu=mu, vad_data=vad_data) 156 | idx = bn.argpartsort(-X_pred, k, axis=1) 157 | X_pred_binary = np.zeros_like(X_pred, dtype=bool) 158 | X_pred_binary[np.arange(batch_users)[:, np.newaxis], idx[:, :k]] = True 159 | 160 | X_true_binary = (heldout_data[user_idx] > 0).toarray() 161 | tmp = (np.logical_and(X_true_binary, X_pred_binary).sum(axis=1)).astype( 162 | np.float32) 163 | 164 | if normalize: 165 | precision = tmp / np.minimum(k, X_true_binary.sum(axis=1)) 166 | else: 167 | precision = tmp / k 168 | return precision 169 | 170 | 171 | def recall_at_k_batch(train_data, heldout_data, Et, Eb, user_idx, 172 | k=20, normalize=True, mu=None, vad_data=None): 173 | batch_users = user_idx.stop - user_idx.start 174 | 175 | X_pred = _make_prediction(train_data, Et, Eb, user_idx, 176 | batch_users, mu=mu, vad_data=vad_data) 177 | idx = bn.argpartsort(-X_pred, k, axis=1) 178 | X_pred_binary = np.zeros_like(X_pred, dtype=bool) 179 | X_pred_binary[np.arange(batch_users)[:, np.newaxis], idx[:, :k]] = True 180 | 181 | X_true_binary = 
(heldout_data[user_idx] > 0).toarray() 182 | tmp = (np.logical_and(X_true_binary, X_pred_binary).sum(axis=1)).astype( 183 | np.float32) 184 | recall = tmp / np.minimum(k, X_true_binary.sum(axis=1)) 185 | return recall 186 | 187 | 188 | def mean_rrank_at_k_batch(train_data, heldout_data, Et, Eb, 189 | user_idx, k=5, mu=None, vad_data=None): 190 | ''' 191 | mean reciprocal rank@k: For each user, make predictions and rank for 192 | all the items. Then calculate the mean reciprocal rank for the top K that 193 | are in the held-out set. 194 | ''' 195 | batch_users = user_idx.stop - user_idx.start 196 | 197 | X_pred = _make_prediction(train_data, Et, Eb, user_idx, 198 | batch_users, mu=mu, vad_data=vad_data) 199 | all_rrank = 1. / (np.argsort(np.argsort(-X_pred, axis=1), axis=1) + 1) 200 | X_true_binary = (heldout_data[user_idx] > 0).toarray() 201 | 202 | heldout_rrank = X_true_binary * all_rrank 203 | top_k = bn.partsort(-heldout_rrank, k, axis=1) 204 | return -top_k[:, :k].mean(axis=1) 205 | 206 | 207 | def NDCG_binary_batch(train_data, heldout_data, Et, Eb, user_idx, 208 | mu=None, vad_data=None): 209 | ''' 210 | normalized discounted cumulative gain for binary relevance 211 | ''' 212 | batch_users = user_idx.stop - user_idx.start 213 | n_items = train_data.shape[1] 214 | 215 | X_pred = _make_prediction(train_data, Et, Eb, user_idx, 216 | batch_users, mu=mu, vad_data=vad_data) 217 | all_rank = np.argsort(np.argsort(-X_pred, axis=1), axis=1) 218 | # build the discount template 219 | tp = 1. 
/ np.log2(np.arange(2, n_items + 2)) 220 | all_disc = tp[all_rank] 221 | 222 | X_true_binary = (heldout_data[user_idx] > 0).tocoo() 223 | disc = sparse.csr_matrix((all_disc[X_true_binary.row, X_true_binary.col], 224 | (X_true_binary.row, X_true_binary.col)), 225 | shape=all_disc.shape) 226 | DCG = np.array(disc.sum(axis=1)).ravel() 227 | IDCG = np.array([tp[:n].sum() 228 | for n in heldout_data[user_idx].getnnz(axis=1)]) 229 | return DCG / IDCG 230 | 231 | 232 | def NDCG_binary_at_k_batch(train_data, heldout_data, Et, Eb, user_idx, 233 | mu=None, k=100, vad_data=None): 234 | ''' 235 | normalized discounted cumulative gain@k for binary relevance 236 | ASSUMPTIONS: all the 0's in heldout_data indicate 0 relevance 237 | ''' 238 | batch_users = user_idx.stop - user_idx.start 239 | 240 | X_pred = _make_prediction(train_data, Et, Eb, user_idx, 241 | batch_users, mu=mu, vad_data=vad_data) 242 | idx_topk_part = bn.argpartsort(-X_pred, k, axis=1) 243 | topk_part = X_pred[np.arange(batch_users)[:, np.newaxis], 244 | idx_topk_part[:, :k]] 245 | idx_part = np.argsort(-topk_part, axis=1) 246 | # X_pred[np.arange(batch_users)[:, np.newaxis], idx_topk] is the sorted 247 | # topk predicted score 248 | idx_topk = idx_topk_part[np.arange(batch_users)[:, np.newaxis], idx_part] 249 | # build the discount template 250 | tp = 1. 
/ np.log2(np.arange(2, k + 2)) 251 | 252 | heldout_batch = heldout_data[user_idx] 253 | DCG = (heldout_batch[np.arange(batch_users)[:, np.newaxis], 254 | idx_topk].toarray() * tp).sum(axis=1) 255 | IDCG = np.array([(tp[:min(n, k)]).sum() 256 | for n in heldout_batch.getnnz(axis=1)]) 257 | return DCG / IDCG 258 | 259 | 260 | def MAP_at_k_batch(train_data, heldout_data, Et, Eb, user_idx, mu=None, k=100, 261 | vad_data=None): 262 | ''' 263 | mean average precision@k 264 | ''' 265 | batch_users = user_idx.stop - user_idx.start 266 | 267 | X_pred = _make_prediction(train_data, Et, Eb, user_idx, batch_users, mu=mu, 268 | vad_data=vad_data) 269 | idx_topk_part = bn.argpartsort(-X_pred, k, axis=1) 270 | topk_part = X_pred[np.arange(batch_users)[:, np.newaxis], 271 | idx_topk_part[:, :k]] 272 | idx_part = np.argsort(-topk_part, axis=1) 273 | # X_pred[np.arange(batch_users)[:, np.newaxis], idx_topk] is the sorted 274 | # topk predicted score 275 | idx_topk = idx_topk_part[np.arange(batch_users)[:, np.newaxis], idx_part] 276 | 277 | aps = np.zeros(batch_users) 278 | for i, idx in enumerate(xrange(user_idx.start, user_idx.stop)): 279 | actual = heldout_data[idx].nonzero()[1] 280 | if len(actual) > 0: 281 | predicted = idx_topk[i] 282 | aps[i] = apk(actual, predicted, k=k) 283 | else: 284 | aps[i] = np.nan 285 | return aps 286 | 287 | 288 | def mean_perc_rank_batch(train_data, heldout_data, Et, Eb, user_idx, 289 | mu=None, vad_data=None): 290 | ''' 291 | mean percentile rank for a batch of users 292 | MPR of the full set is the sum of batch MPR's divided by the sum of all the 293 | feedbacks. (Eq. 8 in Hu et al.) 
294 |     This metric does not necessarily require the data to be binary
295 |     '''
296 |     batch_users = user_idx.stop - user_idx.start
297 | 
298 |     X_pred = _make_prediction(train_data, Et, Eb, user_idx, batch_users,
299 |                               mu=mu, vad_data=vad_data)
300 |     all_perc = np.argsort(np.argsort(-X_pred, axis=1), axis=1) / \
301 |         np.isfinite(X_pred).sum(axis=1, keepdims=True).astype(np.float32)
302 |     perc_batch = (all_perc[heldout_data[user_idx].nonzero()] *
303 |                   heldout_data[user_idx].data).sum()
304 |     return perc_batch
305 | 
306 | 
307 | ## adapted from https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py
308 | def apk(actual, predicted, k=100):
309 |     """
310 |     Computes the average precision at k.
311 |     This function computes the average precision at k between two lists of
312 |     items.
313 |     Parameters
314 |     ----------
315 |     actual : list
316 |         A list of elements that are to be predicted (order doesn't matter)
317 |     predicted : list
318 |         A list of predicted elements (order does matter)
319 |     k : int, optional
320 |         The maximum number of predicted elements
321 |     Returns
322 |     -------
323 |     score : double
324 |         The average precision at k over the input lists
325 |     """
326 |     if len(predicted) > k:
327 |         predicted = predicted[:k]
328 | 
329 |     score = 0.0
330 |     num_hits = 0.0
331 | 
332 |     for i, p in enumerate(predicted):
333 |         if p in actual:  # no duplicate check needed since we will not make duplicated recs
334 |             num_hits += 1.0
335 |             score += num_hits / (i + 1.0)
336 | 
337 |     # the empty `actual` case is handled before making the function call
338 |     # if not actual:
339 |     #     return np.nan
340 | 
341 |     return score / min(len(actual), k)
342 | 
--------------------------------------------------------------------------------
/src/expomf_cov.py:
--------------------------------------------------------------------------------
 1 | """
 2 | 
 3 | Exposure Matrix Factorization with exposure covariates (e.g.
topics, or 4 | locations) for collaborative filtering 5 | 6 | This file is largely adapted from expomf.py 7 | 8 | """ 9 | 10 | 11 | import os 12 | import sys 13 | import time 14 | import numpy as np 15 | from numpy import linalg as LA 16 | 17 | from joblib import Parallel, delayed 18 | from math import sqrt 19 | from sklearn.base import BaseEstimator, TransformerMixin 20 | 21 | import rec_eval 22 | 23 | floatX = np.float32 24 | EPS = 1e-8 25 | 26 | 27 | class ExpoMF(BaseEstimator, TransformerMixin): 28 | def __init__(self, n_components=100, max_iter=10, batch_size=1000, 29 | batch_sgd=10, max_epoch=10, init_std=0.01, n_jobs=8, 30 | random_state=None, save_params=False, save_dir='.', 31 | early_stopping=False, verbose=False, **kwargs): 32 | ''' 33 | Exposure matrix factorization 34 | 35 | Parameters 36 | --------- 37 | n_components : int 38 | Number of latent factors 39 | max_iter : int 40 | Maximal number of iterations to perform 41 | batch_size: int 42 | Batch size to perform parallel update 43 | batch_sgd: int 44 | Batch size for SGD when updating exposure factors 45 | max_epoch: int 46 | Number of epochs for SGD 47 | init_std: float 48 | The latent factors will be initialized as Normal(0, init_std**2) 49 | n_jobs: int 50 | Number of parallel jobs to update latent factors 51 | random_state : int or RandomState 52 | Pseudo random number generator used for sampling 53 | save_params: bool 54 | Whether to save parameters after each iteration 55 | save_dir: str 56 | The directory to save the parameters 57 | early_stopping: bool 58 | Whether to early stop the training by monitoring performance on 59 | validation set 60 | verbose : bool 61 | Whether to show progress during model fitting 62 | **kwargs: dict 63 | Model hyperparameters 64 | ''' 65 | self.n_components = n_components 66 | self.max_iter = max_iter 67 | self.batch_size = batch_size 68 | self.batch_sgd = batch_sgd 69 | self.max_epoch = max_epoch 70 | self.init_std = init_std 71 | self.n_jobs = n_jobs 72 | 
self.random_state = random_state
 73 |         self.save_params = save_params
 74 |         self.save_dir = save_dir
 75 |         self.early_stopping = early_stopping
 76 |         self.verbose = verbose
 77 | 
 78 |         if type(self.random_state) is int:
 79 |             np.random.seed(self.random_state)
 80 |         elif self.random_state is not None:
 81 |             np.random.set_state(self.random_state)
 82 | 
 83 |         self._parse_kwargs(**kwargs)
 84 | 
 85 |     def _parse_kwargs(self, **kwargs):
 86 |         ''' Model hyperparameters
 87 | 
 88 |         Parameters
 89 |         ---------
 90 |         lambda_theta, lambda_beta, lambda_nu: float
 91 |             Regularization parameters for user CF factors (lambda_theta),
 92 |             item CF factors (lambda_beta) and user exposure factors
 93 |             (lambda_nu). Default value 1e-5. Since in implicit feedback all
 94 |             the n_users-by-n_items data points are used for training, overfitting is almost never an issue.
 95 |         lam_y: float
 96 |             Inverse variance of the observation model. Default value 1.0.
 97 |         learning_rate: float
 98 |             Learning rate for SGD. Default value 0.1. Since for each user we
 99 |             are effectively doing logistic regression, a constant learning
100 |             rate should suffice.
101 |         init_mu: float
102 |             init_mu is used to initialize the user exposure bias (alpha) such
103 |             that all the \mu_{ui} = inv_logit(nu_u * pi_i + alpha_u) are roughly
104 |             init_mu. Default value is 0.01. This number should change according
105 |             to the sparsity of the data (sparser data with smaller init_mu).
106 |             In the experiments, we selected this value on the validation set.
107 |         '''
108 |         self.lam_theta = float(kwargs.get('lambda_theta', 1e-5))
109 |         self.lam_beta = float(kwargs.get('lambda_beta', 1e-5))
110 |         self.lam_nu = float(kwargs.get('lambda_nu', 1e-5))
111 |         self.lam_y = float(kwargs.get('lam_y', 1.0))
112 |         self.learning_rate = float(kwargs.get('learning_rate', 0.1))
113 |         self.init_mu = float(kwargs.get('init_mu', 0.01))
114 | 
115 |     def _init_params(self, n_users, n_items):
116 |         ''' Initialize all the latent factors '''
117 |         # user CF factors
118 |         self.theta = self.init_std * \
119 |             np.random.randn(n_users, self.n_components).astype(floatX)
120 |         # item CF factors
121 |         self.beta = self.init_std * \
122 |             np.random.randn(n_items, self.n_components).astype(floatX)
123 |         # user exposure factors
124 |         self.nu = self.init_std * \
125 |             np.random.randn(n_users, self.n_components).astype(floatX)
126 |         # user exposure bias
127 |         self.alpha = np.log(self.init_mu / (1 - self.init_mu)) * \
128 |             np.ones((n_users, 1), dtype=floatX)
129 | 
130 |     def fit(self, X, pi, vad_data=None, **kwargs):
131 |         '''Fit the model to the data in X.
132 | 
133 |         Parameters
134 |         ----------
135 |         X : scipy.sparse.csr_matrix, shape (n_users, n_items)
136 |             Training data.
137 | 
138 |         pi : array-like, shape (n_items, n_components)
139 |             Item content representations (e.g. topics, locations)
140 | 
141 |         vad_data: scipy.sparse.csr_matrix, shape (n_users, n_items)
142 |             Validation data.
143 | 
144 |         **kwargs: dict
145 |             Additional keywords to the evaluation function called on validation data
146 | 
147 |         Returns
148 |         -------
149 |         self: object
150 |             Returns the instance itself.
151 | ''' 152 | n_users, n_items = X.shape 153 | assert pi.shape[0] == n_items 154 | self._init_params(n_users, n_items) 155 | self._update(X, pi, vad_data, **kwargs) 156 | return self 157 | 158 | def transform(self, X): 159 | pass 160 | 161 | def _update(self, X, pi, vad_data, **kwargs): 162 | '''Model training and evaluation on validation set''' 163 | XT = X.T.tocsr() # pre-compute this 164 | self.vad_ndcg = -np.inf 165 | 166 | for i in xrange(self.max_iter): 167 | if self.verbose: 168 | print('ITERATION #%d' % i) 169 | self._update_factors(X, XT, pi) 170 | self._update_expo(XT, pi) 171 | if vad_data is not None: 172 | vad_ndcg = self._validate(X, pi, vad_data, **kwargs) 173 | if self.early_stopping and self.vad_ndcg > vad_ndcg: 174 | break # we will not save the parameter for this iteration 175 | self.vad_ndcg = vad_ndcg 176 | if self.save_params: 177 | self._save_params(i) 178 | pass 179 | 180 | def _update_factors(self, X, XT, pi): 181 | '''Update user and item collaborative factors with ALS''' 182 | if self.verbose: 183 | start_t = _writeline_and_time('\tUpdating user factors...') 184 | self.theta = recompute_factors(self.beta, self.theta, pi, 185 | self.nu, self.alpha, X, 186 | self.lam_theta / self.lam_y, 187 | self.lam_y, 188 | self.n_jobs, 189 | batch_size=self.batch_size) 190 | if self.verbose: 191 | print('\r\tUpdating user factors: time=%.2f' 192 | % (time.time() - start_t)) 193 | start_t = _writeline_and_time('\tUpdating item factors...') 194 | self.beta = recompute_factors(self.theta, self.beta, self.nu, pi, 195 | self.alpha, XT, 196 | self.lam_beta / self.lam_y, 197 | self.lam_y, 198 | self.n_jobs, 199 | batch_size=self.batch_size) 200 | if self.verbose: 201 | print('\r\tUpdating item factors: time=%.2f' 202 | % (time.time() - start_t)) 203 | pass 204 | 205 | def _update_expo(self, XT, pi): 206 | '''Update user exposure factors and bias with mini-batch SGD''' 207 | if self.verbose: 208 | start_t = _writeline_and_time( 209 | '\tUpdating user exposure 
factors...\n') 210 | nu_old = self.nu.copy() 211 | alpha_old = self.alpha.copy() 212 | 213 | n_items = XT.shape[0] 214 | start_idx = range(0, n_items, self.batch_sgd) 215 | end_idx = start_idx[1:] + [n_items] 216 | 217 | # run SGD to learn nu and alpha 218 | for epoch in xrange(self.max_epoch): 219 | idx = np.random.permutation(n_items) 220 | 221 | # take the last batch as validation 222 | lo, hi = start_idx[-1], end_idx[-1] 223 | A_vad = a_row_batch(XT[idx[lo:hi]], self.beta[idx[lo:hi]], 224 | self.theta, pi[idx[lo:hi]], nu_old, 225 | alpha_old.T, self.lam_y).T 226 | pred_vad = get_mu(self.nu, pi[idx[lo:hi]], self.alpha) 227 | init_loss = -np.sum(A_vad * np.log(pred_vad) + (1 - A_vad) * 228 | np.log(1 - pred_vad)) / self.batch_sgd + \ 229 | self.lam_nu / 2 * np.sum(self.nu**2) 230 | if self.verbose: 231 | print('\t\tEpoch #%d: initial validation loss = %.3f' % 232 | (epoch, init_loss)) 233 | sys.stdout.flush() 234 | 235 | for lo, hi in zip(start_idx[:-1], end_idx[:-1]): 236 | A_batch = a_row_batch(XT[idx[lo:hi]], self.beta[idx[lo:hi]], 237 | self.theta, pi[idx[lo:hi]], nu_old, 238 | alpha_old.T, self.lam_y).T 239 | pred_batch = get_mu(self.nu, pi[idx[lo:hi]], self.alpha) 240 | diff = A_batch - pred_batch 241 | grad_nu = 1. / self.batch_sgd * diff.dot(pi[idx[lo:hi]])\ 242 | - self.lam_nu * self.nu 243 | self.nu += self.learning_rate * grad_nu 244 | self.alpha += self.learning_rate * diff.mean(axis=1, 245 | keepdims=True) 246 | 247 | lo, hi = start_idx[-1], end_idx[-1] 248 | pred_vad = get_mu(self.nu, pi[idx[lo:hi]], self.alpha) 249 | loss = -np.sum(A_vad * np.log(pred_vad) + (1 - A_vad) * 250 | np.log(1 - pred_vad)) / self.batch_sgd + \ 251 | self.lam_nu / 2 * np.sum(self.nu**2) 252 | if self.verbose: 253 | print('\t\tEpoch #%d: validation loss = %.3f' % (epoch, loss)) 254 | sys.stdout.flush() 255 | # It seems that after a few epochs the validation loss will 256 | # not decrease. 
However, we empirically found that it is still 257 |                 # better to train for more epochs than to stop the SGD early 258 |         if self.verbose: 259 |             print('\tUpdating user exposure factors: time=%.2f' 260 |                   % (time.time() - start_t)) 261 |             sys.stdout.flush() 262 |         pass 263 | 264 |     def _validate(self, X, pi, vad_data, **kwargs): 265 |         '''Compute validation metric (NDCG@k)''' 266 |         mu = dict(params=[self.nu, pi, self.alpha], func=get_mu) 267 |         vad_ndcg = rec_eval.normalized_dcg_at_k(X, vad_data, 268 |                                                 self.theta, 269 |                                                 self.beta, 270 |                                                 mu=mu, 271 |                                                 **kwargs) 272 |         if self.verbose: 273 |             print('\tValidation NDCG@k: %.4f' % vad_ndcg) 274 |             sys.stdout.flush() 275 |         return vad_ndcg 276 | 277 |     def _save_params(self, iter): 278 |         '''Save the parameters''' 279 |         if not os.path.exists(self.save_dir): 280 |             os.makedirs(self.save_dir) 281 |         filename = 'ExpoMF_cov_K%d_mu%.1e_iter%d.npz' % (self.n_components, 282 |                                                          self.init_mu, iter) 283 |         np.savez(os.path.join(self.save_dir, filename), U=self.theta, 284 |                  V=self.beta, nu=self.nu, alpha=self.alpha) 285 | 286 | 287 | # Utility functions # 288 | 289 | def _writeline_and_time(s): 290 |     sys.stdout.write(s) 291 |     sys.stdout.flush() 292 |     return time.time() 293 | 294 | 295 | def get_row(Y, i): 296 |     '''Given a scipy.sparse.csr_matrix Y, get the values and indices of the 297 |     non-zero values in the i-th row''' 298 |     lo, hi = Y.indptr[i], Y.indptr[i + 1] 299 |     return Y.data[lo:hi], Y.indices[lo:hi] 300 | 301 | 302 | def inv_logit(x): 303 |     return 1.
/ (1 + np.exp(-x)) 304 | 305 | 306 | def get_mu(nu, pi, alpha): 307 |     ''' \mu_{ui} = inv_logit(nu_u * pi_i + alpha_u)''' 308 |     return inv_logit(nu.dot(pi.T) + alpha) 309 | 310 | 311 | def a_row_batch(Y_batch, theta_batch, beta, nu_batch, pi, alpha_batch, lam_y): 312 |     '''Compute the posterior of exposure latent variables A by batch 313 | 314 |     When updating users: 315 |         Y_batch: (batch_users, n_items) 316 |         theta_batch: (batch_users, n_components) 317 |         beta: (n_items, n_components) 318 |         nu_batch: (batch_users, n_components) 319 |         pi: (n_items, n_components) 320 |         alpha_batch: (batch_users, 1) 321 | 322 |     When updating items: 323 |         Y_batch: (batch_items, n_users) 324 |         theta_batch: (batch_items, n_components) 325 |         beta: (n_users, n_components) 326 |         nu_batch: (batch_items, n_components) 327 |         pi: (n_users, n_components) 328 |         alpha_batch: (1, n_users) 329 |     ''' 330 |     # Gaussian density N(y=0 | theta_u^T beta_i, lam_y^{-1}): note the 2 * np.pi must be parenthesized 331 |     pEX = sqrt(lam_y / (2 * np.pi)) * \ 332 |         np.exp(-lam_y * theta_batch.dot(beta.T)**2 / 2) 333 |     mu = get_mu(nu_batch, pi, alpha_batch) 334 |     A = (pEX + EPS) / (pEX + EPS + (1 - mu) / mu) 335 |     A[Y_batch.nonzero()] = 1.
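    # Annotation (ours, not in the original code): a worked example of the
    # posterior computed above. For an unobserved entry with mu_ui = 0.01,
    # lam_y = 1 and theta_u.dot(beta_i) = 0, the Gaussian term is
    #     pEX = sqrt(1 / (2 * pi)) ~= 0.3989,
    # and since (1 - mu) / mu = 0.99 / 0.01 = 99, the exposure posterior is
    #     A = 0.3989 / (0.3989 + 99) ~= 0.004,
    # i.e. a non-click on a low-propensity item is attributed almost entirely
    # to non-exposure. Observed entries are clamped to A = 1 on the line
    # above because any click y_ui > 0 implies exposure a_ui = 1.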
335 |     return A 336 | 337 | 338 | def _solve(k, A_k, X, Y, f, lam, lam_y): 339 |     '''Update a single factor''' 340 |     s_u, i_u = get_row(Y, k) 341 |     a = np.dot(s_u * A_k[i_u], X[i_u]) 342 |     B = X.T.dot(A_k[:, np.newaxis] * X) + lam * np.eye(f) 343 |     return LA.solve(B, a) 344 | 345 | 346 | def _solve_batch(lo, hi, X, X_old_batch, S, T_batch, alpha, Y, m, f, lam, 347 |                  lam_y): 348 |     '''Update factors in batches; eventually calls _solve() on each factor to 349 |     keep the parallel processes busy''' 350 |     assert X_old_batch.shape[0] == hi - lo 351 |     assert T_batch.shape[0] == hi - lo 352 |     X_batch = np.empty_like(X_old_batch, dtype=X_old_batch.dtype) 353 | 354 |     if X.shape[0] == alpha.shape[0]:  # update item 355 |         A_batch = a_row_batch(Y[lo:hi], X_old_batch, X, T_batch, S, alpha.T, 356 |                               lam_y) 357 |     else:  # update user 358 |         A_batch = a_row_batch(Y[lo:hi], X_old_batch, X, T_batch, S, 359 |                               alpha[lo:hi], lam_y) 360 | 361 |     for ib, k in enumerate(xrange(lo, hi)): 362 |         X_batch[ib] = _solve(k, A_batch[ib], X, Y, f, lam, lam_y) 363 |     return X_batch 364 | 365 | 366 | def recompute_factors(X, X_old, S, T, alpha, Y, lam, lam_y, n_jobs, 367 |                       batch_size=1000): 368 |     '''Regress X to Y with the exposure matrix (computed on-the-fly from X_old) and 369 |     ridge term lam, in an embarrassingly parallel fashion.
All the comments below 370 |     are written from the perspective of computing user factors 371 | 372 |     When updating users: 373 |         X: item CF factors (beta) 374 |         X_old: user CF factors (theta) 375 |         S: content topic proportions (pi) 376 |         T: user consideration factors (nu) 377 |         alpha: user consideration bias (alpha) 378 | 379 |     When updating items: 380 |         X: user CF factors (theta) 381 |         X_old: item CF factors (beta) 382 |         S: user consideration factors (nu) 383 |         T: content topic proportions (pi) 384 |         alpha: user consideration bias (alpha) 385 |     ''' 386 |     m, n = Y.shape  # m = number of users, n = number of items 387 |     assert X.shape[0] == n 388 |     assert X_old.shape[0] == m 389 |     assert X.shape == S.shape 390 |     assert X_old.shape == T.shape 391 |     f = X.shape[1]  # f = number of factors 392 | 393 |     start_idx = range(0, m, batch_size) 394 |     end_idx = start_idx[1:] + [m] 395 |     res = Parallel(n_jobs=n_jobs)(delayed(_solve_batch)( 396 |         lo, hi, X, X_old[lo:hi], S, T[lo:hi], alpha, Y, m, f, lam, lam_y) 397 |         for lo, hi in zip(start_idx, end_idx)) 398 |     X_new = np.vstack(res) 399 |     return X_new 400 | -------------------------------------------------------------------------------- /src/Location_ExpoMF_Gowalla.ipynb: -------------------------------------------------------------------------------- 1 | { 2 |  "cells": [ 3 |   { 4 |    "cell_type": "markdown", 5 |    "metadata": {}, 6 |    "source": [ 7 |     "# Fit Exposure MF with exposure covariates (Location ExpoMF) to the Gowalla dataset" 8 |    ] 9 |   }, 10 |   { 11 |    "cell_type": "code", 12 |    "execution_count": 1, 13 |    "metadata": { 14 |     "collapsed": false 15 |    }, 16 |    "outputs": [], 17 |    "source": [ 18 |     "import glob\n", 19 |     "import os\n", 20 |     "# if you are using OpenBLAS, you might want to turn this option on.
Otherwise, joblib might get stuck\n", 21 | "# os.environ['OPENBLAS_NUM_THREADS'] = '1'\n", 22 | "\n", 23 | "import numpy as np\n", 24 | "import matplotlib\n", 25 | "matplotlib.use('Agg')\n", 26 | "import matplotlib.pyplot as plt\n", 27 | "%matplotlib inline\n", 28 | "import scipy.sparse\n", 29 | "import pandas as pd" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 2, 35 | "metadata": { 36 | "collapsed": false 37 | }, 38 | "outputs": [], 39 | "source": [ 40 | "import expomf_cov\n", 41 | "import rec_eval" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "Change this to wherever you saved the processed data from [processGowalla.ipynb](processGowalla.ipynb)" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 3, 54 | "metadata": { 55 | "collapsed": true 56 | }, 57 | "outputs": [], 58 | "source": [ 59 | "DATA_ROOT = '/home/waldorf/dawen.liang/gowalla_pro/'" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": 4, 65 | "metadata": { 66 | "collapsed": false 67 | }, 68 | "outputs": [], 69 | "source": [ 70 | "unique_uid = list()\n", 71 | "with open(os.path.join(DATA_ROOT, 'unique_uid.txt'), 'r') as f:\n", 72 | " for line in f:\n", 73 | " unique_uid.append(line.strip())\n", 74 | " \n", 75 | "unique_sid = list()\n", 76 | "with open(os.path.join(DATA_ROOT, 'unique_sid.txt'), 'r') as f:\n", 77 | " for line in f:\n", 78 | " unique_sid.append(line.strip())" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 5, 84 | "metadata": { 85 | "collapsed": false 86 | }, 87 | "outputs": [], 88 | "source": [ 89 | "n_songs = len(unique_sid)\n", 90 | "n_users = len(unique_uid)" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "### Load the data and train the model" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 6, 103 | "metadata": { 104 | "collapsed": false 105 | }, 106 | "outputs": [], 107 | 
"source": [ 108 | "def load_data(csv_file, shape=(n_users, n_songs)):\n", 109 | " tp = pd.read_csv(csv_file)\n", 110 | " rows, cols = np.array(tp['uid'], dtype=np.int32), np.array(tp['sid'], dtype=np.int32)\n", 111 | " count = tp['rating']\n", 112 | " return scipy.sparse.csr_matrix((count,(rows, cols)), dtype=np.int16, shape=shape), rows, cols" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": 7, 118 | "metadata": { 119 | "collapsed": false 120 | }, 121 | "outputs": [], 122 | "source": [ 123 | "train_data, rows, cols = load_data(os.path.join(DATA_ROOT, 'train.num.csv'))\n", 124 | "# binarize the data\n", 125 | "train_data.data = np.ones_like(train_data.data)" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 8, 131 | "metadata": { 132 | "collapsed": false 133 | }, 134 | "outputs": [ 135 | { 136 | "name": "stdout", 137 | "output_type": "stream", 138 | "text": [ 139 | "(57629, 47198)\n", 140 | "(804262,)\n" 141 | ] 142 | } 143 | ], 144 | "source": [ 145 | "print train_data.shape\n", 146 | "print train_data.data.shape" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 9, 152 | "metadata": { 153 | "collapsed": false 154 | }, 155 | "outputs": [], 156 | "source": [ 157 | "vad_data, rows_vad, cols_vad = load_data(os.path.join(DATA_ROOT, 'vad.num.csv'))\n", 158 | "# binarize the data\n", 159 | "vad_data.data = np.ones_like(vad_data.data)" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": 10, 165 | "metadata": { 166 | "collapsed": false 167 | }, 168 | "outputs": [], 169 | "source": [ 170 | "test_data, rows_test, cols_test = load_data(os.path.join(DATA_ROOT, 'test.num.csv'))\n", 171 | "# binarize the data\n", 172 | "test_data.data = np.ones_like(test_data.data)" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "`feat_venue_locs.tsv` contains the location features (part of the [pre-processed 
data](http://dawenl.github.io/data/gowalla_pro.zip)), which are generated in the following way: \n", 180 |     "- Run GMM (from [scikit.learn](http://scikit-learn.org/)) on all the venue locations.\n", 181 |     "- For each venue, take the expected cluster assignment as location features `pi`." 182 |    ] 183 |   }, 184 |   { 185 |    "cell_type": "code", 186 |    "execution_count": 11, 187 |    "metadata": { 188 |     "collapsed": false 189 |    }, 190 |    "outputs": [], 191 |    "source": [ 192 |     "pi = np.loadtxt(os.path.join(DATA_ROOT, 'feat_venue_locs.tsv'), dtype='float32')" 193 |    ] 194 |   }, 195 |   { 196 |    "cell_type": "code", 197 |    "execution_count": 12, 198 |    "metadata": { 199 |     "collapsed": false 200 |    }, 201 |    "outputs": [], 202 |    "source": [ 203 |     "# sanity check to make sure every venue has its corresponding feature \n", 204 |     "for i, s in enumerate(unique_sid):\n", 205 |     "    assert s == \"%d\" % pi[i, 0]" 206 |    ] 207 |   }, 208 |   { 209 |    "cell_type": "code", 210 |    "execution_count": 13, 211 |    "metadata": { 212 |     "collapsed": false 213 |    }, 214 |    "outputs": [], 215 |    "source": [ 216 |     "# the first column is the ID; we don't need it\n", 217 |     "pi = pi[:, 1:]" 218 |    ] 219 |   }, 220 |   { 221 |    "cell_type": "code", 222 |    "execution_count": 15, 223 |    "metadata": { 224 |     "collapsed": false 225 |    }, 226 |    "outputs": [], 227 |    "source": [ 228 |     "n_components = 100\n", 229 |     "max_iter = 20\n", 230 |     "n_jobs = 20\n", 231 |     "lam = 1e-5\n", 232 |     "# here we use the best-performing init_mu from the per-item \\mu_i experiment\n", 233 |     "init_mu = 0.01\n", 234 |     "max_epoch = 10\n", 235 |     "\n", 236 |     "save_dir=\"Gowalla_Location_ExpoMF_params_K%d_lam%1.0E_initmu%1.0E_maxepoch%d\" % (n_components, lam, init_mu, max_epoch)\n", 237 |     "\n", 238 |     "coder = expomf_cov.ExpoMF(n_components=n_components, max_iter=max_iter, batch_size=1000, \n", 239 |     "                          batch_sgd=10, max_epoch=max_epoch, init_std=0.01,\n", 240 |     "                          n_jobs=n_jobs, random_state=98765, save_params=True, save_dir=save_dir, \n", 241 |     "                          early_stopping=True,
verbose=True, \n", 242 | " lam_y=1., lam_theta=lam, lam_beta=lam, lam_nu=lam, init_mu=init_mu, learning_rate=.5)" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": 16, 248 | "metadata": { 249 | "collapsed": false 250 | }, 251 | "outputs": [ 252 | { 253 | "name": "stdout", 254 | "output_type": "stream", 255 | "text": [ 256 | "ITERATION #0\n", 257 | "\tUpdating user factors: time=178.12\n", 258 | "\tUpdating item factors: time=190.42\n", 259 | "\tUpdating user consideration factors...\n", 260 | "\t\tEpoch #0: initial validation loss = 3154.786\n", 261 | "\t\tEpoch #0: validation loss = 3133.727\n", 262 | "\t\tEpoch #1: initial validation loss = 3129.680\n", 263 | "\t\tEpoch #1: validation loss = 3129.385\n", 264 | "\t\tEpoch #2: initial validation loss = 3172.628\n", 265 | "\t\tEpoch #2: validation loss = 3173.474\n", 266 | "\t\tEpoch #3: initial validation loss = 3143.275\n", 267 | "\t\tEpoch #3: validation loss = 3143.484\n", 268 | "\t\tEpoch #4: initial validation loss = 3158.697\n", 269 | "\t\tEpoch #4: validation loss = 3159.602\n", 270 | "\t\tEpoch #5: initial validation loss = 3130.088\n", 271 | "\t\tEpoch #5: validation loss = 3130.258\n", 272 | "\t\tEpoch #6: initial validation loss = 3163.065\n", 273 | "\t\tEpoch #6: validation loss = 3163.948\n", 274 | "\t\tEpoch #7: initial validation loss = 3221.945\n", 275 | "\t\tEpoch #7: validation loss = 3222.674\n", 276 | "\t\tEpoch #8: initial validation loss = 3119.196\n", 277 | "\t\tEpoch #8: validation loss = 3119.460\n", 278 | "\t\tEpoch #9: initial validation loss = 3138.604\n", 279 | "\t\tEpoch #9: validation loss = 3138.755\n", 280 | "\tUpdating user consideration factors: time=7653.19\n", 281 | "\tValidation NDCG@k: 0.0301\n", 282 | "ITERATION #1\n", 283 | "\tUpdating user factors: time=182.01\n", 284 | "\tUpdating item factors: time=156.22\n", 285 | "\tUpdating user consideration factors...\n", 286 | "\t\tEpoch #0: initial validation loss = 3860.363\n", 287 | "\t\tEpoch #0: 
validation loss = 3834.004\n", 288 | "\t\tEpoch #1: initial validation loss = 3788.326\n", 289 | "\t\tEpoch #1: validation loss = 3788.219\n", 290 | "\t\tEpoch #2: initial validation loss = 3783.807\n", 291 | "\t\tEpoch #2: validation loss = 3783.946\n", 292 | "\t\tEpoch #3: initial validation loss = 3804.415\n", 293 | "\t\tEpoch #3: validation loss = 3804.517\n", 294 | "\t\tEpoch #4: initial validation loss = 3780.067\n", 295 | "\t\tEpoch #4: validation loss = 3780.079\n", 296 | "\t\tEpoch #5: initial validation loss = 3846.554\n", 297 | "\t\tEpoch #5: validation loss = 3847.170\n", 298 | "\t\tEpoch #6: initial validation loss = 3775.838\n", 299 | "\t\tEpoch #6: validation loss = 3776.161\n", 300 | "\t\tEpoch #7: initial validation loss = 3826.873\n", 301 | "\t\tEpoch #7: validation loss = 3827.148\n", 302 | "\t\tEpoch #8: initial validation loss = 3806.607\n", 303 | "\t\tEpoch #8: validation loss = 3806.548\n", 304 | "\t\tEpoch #9: initial validation loss = 3780.639\n", 305 | "\t\tEpoch #9: validation loss = 3780.768\n", 306 | "\tUpdating user consideration factors: time=7975.81\n", 307 | "\tValidation NDCG@k: 0.0845\n", 308 | "ITERATION #2\n", 309 | "\tUpdating user factors: time=183.73\n", 310 | "\tUpdating item factors: time=160.88\n", 311 | "\tUpdating user consideration factors...\n", 312 | "\t\tEpoch #0: initial validation loss = 4603.338\n", 313 | "\t\tEpoch #0: validation loss = 4576.209\n", 314 | "\t\tEpoch #1: initial validation loss = 4586.514\n", 315 | "\t\tEpoch #1: validation loss = 4586.777\n", 316 | "\t\tEpoch #2: initial validation loss = 4573.730\n", 317 | "\t\tEpoch #2: validation loss = 4573.795\n", 318 | "\t\tEpoch #3: initial validation loss = 4560.591\n", 319 | "\t\tEpoch #3: validation loss = 4560.743\n", 320 | "\t\tEpoch #4: initial validation loss = 4555.287\n", 321 | "\t\tEpoch #4: validation loss = 4555.128\n", 322 | "\t\tEpoch #5: initial validation loss = 4541.676\n", 323 | "\t\tEpoch #5: validation loss = 4541.635\n", 324 | 
"\t\tEpoch #6: initial validation loss = 4645.228\n", 325 | "\t\tEpoch #6: validation loss = 4645.335\n", 326 | "\t\tEpoch #7: initial validation loss = 4557.719\n", 327 | "\t\tEpoch #7: validation loss = 4557.995\n", 328 | "\t\tEpoch #8: initial validation loss = 4553.940\n", 329 | "\t\tEpoch #8: validation loss = 4554.181\n", 330 | "\t\tEpoch #9: initial validation loss = 4535.900\n", 331 | "\t\tEpoch #9: validation loss = 4535.999\n", 332 | "\tUpdating user consideration factors: time=7937.84\n", 333 | "\tValidation NDCG@k: 0.0924\n", 334 | "ITERATION #3\n", 335 | "\tUpdating user factors: time=160.01\n", 336 | "\tUpdating item factors: time=186.43\n", 337 | "\tUpdating user consideration factors...\n", 338 | "\t\tEpoch #0: initial validation loss = 5470.817\n", 339 | "\t\tEpoch #0: validation loss = 5441.984\n", 340 | "\t\tEpoch #1: initial validation loss = 5454.386\n", 341 | "\t\tEpoch #1: validation loss = 5454.170\n", 342 | "\t\tEpoch #2: initial validation loss = 5493.884\n", 343 | "\t\tEpoch #2: validation loss = 5494.150\n", 344 | "\t\tEpoch #3: initial validation loss = 5451.011\n", 345 | "\t\tEpoch #3: validation loss = 5451.135\n", 346 | "\t\tEpoch #4: initial validation loss = 5470.153\n", 347 | "\t\tEpoch #4: validation loss = 5470.271\n", 348 | "\t\tEpoch #5: initial validation loss = 5459.202\n", 349 | "\t\tEpoch #5: validation loss = 5459.523\n", 350 | "\t\tEpoch #6: initial validation loss = 5509.305\n", 351 | "\t\tEpoch #6: validation loss = 5509.488\n", 352 | "\t\tEpoch #7: initial validation loss = 5475.179\n", 353 | "\t\tEpoch #7: validation loss = 5475.445\n", 354 | "\t\tEpoch #8: initial validation loss = 5482.983\n", 355 | "\t\tEpoch #8: validation loss = 5482.837\n", 356 | "\t\tEpoch #9: initial validation loss = 5494.159\n", 357 | "\t\tEpoch #9: validation loss = 5494.225\n", 358 | "\tUpdating user consideration factors: time=7680.83\n", 359 | "\tValidation NDCG@k: 0.0949\n", 360 | "ITERATION #4\n", 361 | "\tUpdating user factors: 
time=175.42\n", 362 | "\tUpdating item factors: time=164.18\n", 363 | "\tUpdating user consideration factors...\n", 364 | "\t\tEpoch #0: initial validation loss = 6506.894\n", 365 | "\t\tEpoch #0: validation loss = 6470.763\n", 366 | "\t\tEpoch #1: initial validation loss = 6459.655\n", 367 | "\t\tEpoch #1: validation loss = 6459.755\n", 368 | "\t\tEpoch #2: initial validation loss = 6477.054\n", 369 | "\t\tEpoch #2: validation loss = 6477.101\n", 370 | "\t\tEpoch #3: initial validation loss = 6479.118\n", 371 | "\t\tEpoch #3: validation loss = 6479.124\n", 372 | "\t\tEpoch #4: initial validation loss = 6533.415\n", 373 | "\t\tEpoch #4: validation loss = 6533.217\n", 374 | "\t\tEpoch #5: initial validation loss = 6467.309\n", 375 | "\t\tEpoch #5: validation loss = 6467.406\n", 376 | "\t\tEpoch #6: initial validation loss = 6490.388\n", 377 | "\t\tEpoch #6: validation loss = 6490.435\n", 378 | "\t\tEpoch #7: initial validation loss = 6469.849\n", 379 | "\t\tEpoch #7: validation loss = 6469.922\n", 380 | "\t\tEpoch #8: initial validation loss = 6551.146\n", 381 | "\t\tEpoch #8: validation loss = 6551.190\n", 382 | "\t\tEpoch #9: initial validation loss = 6463.776\n", 383 | "\t\tEpoch #9: validation loss = 6463.873\n", 384 | "\tUpdating user consideration factors: time=7736.51\n", 385 | "\tValidation NDCG@k: 0.0959\n", 386 | "ITERATION #5\n", 387 | "\tUpdating user factors: time=157.26\n", 388 | "\tUpdating item factors: time=165.75\n", 389 | "\tUpdating user consideration factors...\n", 390 | "\t\tEpoch #0: initial validation loss = 7672.230\n", 391 | "\t\tEpoch #0: validation loss = 7628.359\n", 392 | "\t\tEpoch #1: initial validation loss = 7616.646\n", 393 | "\t\tEpoch #1: validation loss = 7616.676\n", 394 | "\t\tEpoch #2: initial validation loss = 7679.058\n", 395 | "\t\tEpoch #2: validation loss = 7679.083\n", 396 | "\t\tEpoch #3: initial validation loss = 7656.112\n", 397 | "\t\tEpoch #3: validation loss = 7656.398\n", 398 | "\t\tEpoch #4: initial validation 
loss = 7652.572\n", 399 | "\t\tEpoch #4: validation loss = 7652.493\n", 400 | "\t\tEpoch #5: initial validation loss = 7744.950\n", 401 | "\t\tEpoch #5: validation loss = 7745.014\n", 402 | "\t\tEpoch #6: initial validation loss = 7652.415\n", 403 | "\t\tEpoch #6: validation loss = 7652.378\n", 404 | "\t\tEpoch #7: initial validation loss = 7726.620\n", 405 | "\t\tEpoch #7: validation loss = 7726.597\n", 406 | "\t\tEpoch #8: initial validation loss = 7659.457\n", 407 | "\t\tEpoch #8: validation loss = 7659.554\n", 408 | "\t\tEpoch #9: initial validation loss = 7680.335\n", 409 | "\t\tEpoch #9: validation loss = 7680.456\n", 410 | "\tUpdating user consideration factors: time=7795.65\n", 411 | "\tValidation NDCG@k: 0.0964\n", 412 | "ITERATION #6\n", 413 | "\tUpdating user factors: time=165.60\n", 414 | "\tUpdating item factors: time=164.97\n", 415 | "\tUpdating user consideration factors...\n", 416 | "\t\tEpoch #0: initial validation loss = 9083.521\n", 417 | "\t\tEpoch #0: validation loss = 9033.204\n", 418 | "\t\tEpoch #1: initial validation loss = 9006.595\n", 419 | "\t\tEpoch #1: validation loss = 9006.622\n", 420 | "\t\tEpoch #2: initial validation loss = 9084.035\n", 421 | "\t\tEpoch #2: validation loss = 9084.182\n", 422 | "\t\tEpoch #3: initial validation loss = 8987.997\n", 423 | "\t\tEpoch #3: validation loss = 8988.022\n", 424 | "\t\tEpoch #4: initial validation loss = 9074.346\n", 425 | "\t\tEpoch #4: validation loss = 9074.305\n", 426 | "\t\tEpoch #5: initial validation loss = 9003.383\n", 427 | "\t\tEpoch #5: validation loss = 9003.478\n", 428 | "\t\tEpoch #6: initial validation loss = 9061.818\n", 429 | "\t\tEpoch #6: validation loss = 9061.753\n", 430 | "\t\tEpoch #7: initial validation loss = 8983.690\n", 431 | "\t\tEpoch #7: validation loss = 8983.734\n", 432 | "\t\tEpoch #8: initial validation loss = 9030.014\n", 433 | "\t\tEpoch #8: validation loss = 9030.077\n", 434 | "\t\tEpoch #9: initial validation loss = 9023.638\n", 435 | "\t\tEpoch #9: 
validation loss = 9023.628\n", 436 | "\tUpdating user consideration factors: time=8014.82\n", 437 | "\tValidation NDCG@k: 0.0959\n" 438 | ] 439 | }, 440 | { 441 | "data": { 442 | "text/plain": [ 443 | "ExpoMF(batch_sgd=10, batch_size=1000, early_stopping=True, init_std=0.01,\n", 444 | " max_epoch=10, max_iter=20, n_components=100, n_jobs=20,\n", 445 | " random_state=98765,\n", 446 | " save_dir='Gowalla_Location_ExpoMF_params_K100_lam1E-05_initmu1E-02_maxepoch10',\n", 447 | " save_params=True, verbose=True)" 448 | ] 449 | }, 450 | "execution_count": 16, 451 | "metadata": {}, 452 | "output_type": "execute_result" 453 | } 454 | ], 455 | "source": [ 456 | "coder.fit(train_data, pi, vad_data=vad_data, batch_users=5000, k=100)" 457 | ] 458 | }, 459 | { 460 | "cell_type": "markdown", 461 | "metadata": {}, 462 | "source": [ 463 | "It seems that after a few epochs the validation loss will not decrease. However, we empirically found that it is still better to train for more epochs, instead of stop the SGD" 464 | ] 465 | }, 466 | { 467 | "cell_type": "markdown", 468 | "metadata": {}, 469 | "source": [ 470 | "## Evaluate the performance on heldout testset" 471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "execution_count": 17, 476 | "metadata": { 477 | "collapsed": false 478 | }, 479 | "outputs": [], 480 | "source": [ 481 | "n_params = len(glob.glob(os.path.join(save_dir, '*.npz')))\n", 482 | "\n", 483 | "params = np.load(os.path.join(save_dir, 'ExpoMF_cov_K%d_mu%.1e_iter%d.npz' % (n_components, init_mu, n_params - 1)))\n", 484 | "U, V, nu, alpha = params['U'], params['V'], params['nu'], params['alpha']" 485 | ] 486 | }, 487 | { 488 | "cell_type": "markdown", 489 | "metadata": {}, 490 | "source": [ 491 | "### Rank by $\\mathbb{E}[y_{ui}] = \\mu_{ui}\\theta_u^\\top\\beta_i$" 492 | ] 493 | }, 494 | { 495 | "cell_type": "code", 496 | "execution_count": 18, 497 | "metadata": { 498 | "collapsed": false 499 | }, 500 | "outputs": [ 501 | { 502 | "name": "stdout", 503 | 
"output_type": "stream", 504 | "text": [ 505 | "Test Recall@20: 0.1292\n", 506 | "Test Recall@50: 0.1992\n", 507 | "Test NDCG@100: 0.1252\n", 508 | "Test MAP@100: 0.0478\n" 509 | ] 510 | } 511 | ], 512 | "source": [ 513 | "mu = {'params': [nu, pi, alpha], 'func': expomf_cov.get_mu}\n", 514 | "\n", 515 | "print 'Test Recall@20: %.4f' % rec_eval.recall_at_k(train_data, test_data, U, V, k=20, mu=mu, vad_data=vad_data)\n", 516 | "print 'Test Recall@50: %.4f' % rec_eval.recall_at_k(train_data, test_data, U, V, k=50, mu=mu, vad_data=vad_data)\n", 517 | "print 'Test NDCG@100: %.4f' % rec_eval.normalized_dcg_at_k(train_data, test_data, U, V, k=100, mu=mu, vad_data=vad_data)\n", 518 | "print 'Test MAP@100: %.4f' % rec_eval.map_at_k(train_data, test_data, U, V, k=100, mu=mu, vad_data=vad_data)" 519 | ] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "execution_count": null, 524 | "metadata": { 525 | "collapsed": false 526 | }, 527 | "outputs": [], 528 | "source": [] 529 | } 530 | ], 531 | "metadata": { 532 | "kernelspec": { 533 | "display_name": "Python 2", 534 | "language": "python", 535 | "name": "python2" 536 | }, 537 | "language_info": { 538 | "codemirror_mode": { 539 | "name": "ipython", 540 | "version": 2 541 | }, 542 | "file_extension": ".py", 543 | "mimetype": "text/x-python", 544 | "name": "python", 545 | "nbconvert_exporter": "python", 546 | "pygments_lexer": "ipython2", 547 | "version": "2.7.6" 548 | } 549 | }, 550 | "nbformat": 4, 551 | "nbformat_minor": 0 552 | } 553 | -------------------------------------------------------------------------------- /src/processGowalla.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Gowalla Dataset" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "[Gowalla dataset](https://snap.stanford.edu/data/loc-gowalla.html) contains user-venue checkins. 
This is the script that pre-processes the full dataset and splits it into non-overlapping training, validation, and test sets. The data is used in the paper: [\"Modeling User Exposure in Recommendation\"](http://arxiv.org/abs/1510.07025)." 15 |    ] 16 |   }, 17 |   { 18 |    "cell_type": "code", 19 |    "execution_count": 1, 20 |    "metadata": { 21 |     "collapsed": false 22 |    }, 23 |    "outputs": [], 24 |    "source": [ 25 |     "import json\n", 26 |     "import os\n", 27 |     "\n", 28 |     "import numpy as np\n", 29 |     "import pandas as pd" 30 |    ] 31 |   }, 32 |   { 33 |    "cell_type": "markdown", 34 |    "metadata": {}, 35 |    "source": [ 36 |     "Change this to wherever you keep the [processed data](http://dawenl.github.io/data/gowalla_pro.zip)" 37 |    ] 38 |   }, 39 |   { 40 |    "cell_type": "code", 41 |    "execution_count": 2, 42 |    "metadata": { 43 |     "collapsed": false 44 |    }, 45 |    "outputs": [], 46 |    "source": [ 47 |     "DATA_DIR = '/home/waldorf/dawen.liang/gowalla_pro'" 48 |    ] 49 |   }, 50 |   { 51 |    "cell_type": "code", 52 |    "execution_count": 3, 53 |    "metadata": { 54 |     "collapsed": false 55 |    }, 56 |    "outputs": [], 57 |    "source": [ 58 |     "df = pd.read_table(os.path.join(DATA_DIR, 'gwl_checkins.tsv'), header=None, sep='\\t', names=['uid', 'sid', 'rating'])" 59 |    ] 60 |   }, 61 |   { 62 |    "cell_type": "code", 63 |    "execution_count": 4, 64 |    "metadata": { 65 |     "collapsed": false 66 |    }, 67 |    "outputs": [ 68 |     { 69 |      "data": { 70 |       "text/html": [ 71 |        "
uidsidrating
0 0 22847 1
1 0 420315 1
2 0 316637 1
3 0 16516 1
4 0 5535878 1
5 0 15372 1
6 0 21714 1
7 0 420315 1
8 0 153505 1
9 0 420315 1
10 0 23261 1
11 0 16907 1
12 0 12973 1
13 0 341255 1
14 0 260957 1
15 0 1933724 1
16 0 105068 1
17 0 34817 1
18 0 27836 1
19 0 15079 1
20 0 15079 1
21 0 22806 1
22 0 1365909 1
23 0 11844 1
24 0 11742 1
25 0 19822 1
26 0 15169 1
27 0 11794 1
28 0 1567837 1
29 0 35513 1
............
6120414 196561 1109654 1
6120415 196561 1176867 1
6120416 196561 2312134 1
6120417 196561 449112 1
6120418 196561 2312134 1
6120419 196578 1341442 1
6120420 196578 1174322 1
6120421 196578 1160482 1
6120422 196578 594064 1
6120423 196578 627093 1
6120424 196578 899939 1
6120425 196578 467635 1
6120426 196578 1250178 1
6120427 196578 797460 1
6120428 196578 496521 1
6120429 196578 1072999 1
6120430 196578 1151842 1
6120431 196578 1151847 1
6120432 196578 635712 1
6120433 196578 697962 1
6120434 196578 594064 1
6120435 196578 627093 1
6120436 196578 899939 1
6120437 196578 964995 1
6120438 196578 797488 1
6120439 196578 616571 1
6120440 196578 965051 1
6120441 196578 906885 1
6120442 196578 965121 1
6120443 196578 1174322 1
\n", 450 | "

6120444 rows × 3 columns

\n", 451 | "
" 452 | ], 453 | "text/plain": [ 454 | " uid sid rating\n", 455 | "0 0 22847 1\n", 456 | "1 0 420315 1\n", 457 | "2 0 316637 1\n", 458 | "3 0 16516 1\n", 459 | "4 0 5535878 1\n", 460 | "5 0 15372 1\n", 461 | "6 0 21714 1\n", 462 | "7 0 420315 1\n", 463 | "8 0 153505 1\n", 464 | "9 0 420315 1\n", 465 | "10 0 23261 1\n", 466 | "11 0 16907 1\n", 467 | "12 0 12973 1\n", 468 | "13 0 341255 1\n", 469 | "14 0 260957 1\n", 470 | "15 0 1933724 1\n", 471 | "16 0 105068 1\n", 472 | "17 0 34817 1\n", 473 | "18 0 27836 1\n", 474 | "19 0 15079 1\n", 475 | "20 0 15079 1\n", 476 | "21 0 22806 1\n", 477 | "22 0 1365909 1\n", 478 | "23 0 11844 1\n", 479 | "24 0 11742 1\n", 480 | "25 0 19822 1\n", 481 | "26 0 15169 1\n", 482 | "27 0 11794 1\n", 483 | "28 0 1567837 1\n", 484 | "29 0 35513 1\n", 485 | "... ... ... ...\n", 486 | "6120414 196561 1109654 1\n", 487 | "6120415 196561 1176867 1\n", 488 | "6120416 196561 2312134 1\n", 489 | "6120417 196561 449112 1\n", 490 | "6120418 196561 2312134 1\n", 491 | "6120419 196578 1341442 1\n", 492 | "6120420 196578 1174322 1\n", 493 | "6120421 196578 1160482 1\n", 494 | "6120422 196578 594064 1\n", 495 | "6120423 196578 627093 1\n", 496 | "6120424 196578 899939 1\n", 497 | "6120425 196578 467635 1\n", 498 | "6120426 196578 1250178 1\n", 499 | "6120427 196578 797460 1\n", 500 | "6120428 196578 496521 1\n", 501 | "6120429 196578 1072999 1\n", 502 | "6120430 196578 1151842 1\n", 503 | "6120431 196578 1151847 1\n", 504 | "6120432 196578 635712 1\n", 505 | "6120433 196578 697962 1\n", 506 | "6120434 196578 594064 1\n", 507 | "6120435 196578 627093 1\n", 508 | "6120436 196578 899939 1\n", 509 | "6120437 196578 964995 1\n", 510 | "6120438 196578 797488 1\n", 511 | "6120439 196578 616571 1\n", 512 | "6120440 196578 965051 1\n", 513 | "6120441 196578 906885 1\n", 514 | "6120442 196578 965121 1\n", 515 | "6120443 196578 1174322 1\n", 516 | "\n", 517 | "[6120444 rows x 3 columns]" 518 | ] 519 | }, 520 | "execution_count": 4, 521 | "metadata": {}, 522 | 
"output_type": "execute_result" 523 | } 524 | ], 525 | "source": [ 526 | "df" 527 | ] 528 | }, 529 | { 530 | "cell_type": "code", 531 | "execution_count": 5, 532 | "metadata": { 533 | "collapsed": false 534 | }, 535 | "outputs": [], 536 | "source": [ 537 | "def get_count(df, id):\n", 538 | " playcount_groupbyid = df[[id, 'rating']].groupby(id, as_index=False)\n", 539 | " count = playcount_groupbyid.size()\n", 540 | " return count\n", 541 | "\n", 542 | "def filter_triplets(df, min_sc=20):\n", 543 | " # Only keep the triplets for songs which were listened to by at least min_sc users. \n", 544 | " songcount = get_count(df, 'sid')\n", 545 | " df = df[df['sid'].isin(songcount.index[songcount >= min_sc])]\n", 546 | " \n", 547 | " # Update both usercount and songcount after filtering\n", 548 | " usercount, songcount = get_count(df, 'uid'), get_count(df, 'sid') \n", 549 | " return df, usercount, songcount" 550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "execution_count": 6, 555 | "metadata": { 556 | "collapsed": true 557 | }, 558 | "outputs": [], 559 | "source": [ 560 | "df, usercount, songcount = filter_triplets(df)" 561 | ] 562 | }, 563 | { 564 | "cell_type": "code", 565 | "execution_count": 7, 566 | "metadata": { 567 | "collapsed": false 568 | }, 569 | "outputs": [ 570 | { 571 | "name": "stdout", 572 | "output_type": "stream", 573 | "text": [ 574 | "After filtering, there are 2318616 triplets from 57629 users and 47198 venues (sparsity level 0.085%)\n" 575 | ] 576 | } 577 | ], 578 | "source": [ 579 | "sparsity_level = float(df.shape[0]) / (usercount.shape[0] * songcount.shape[0])\n", 580 | "print \"After filtering, there are %d triplets from %d users and %d venues (sparsity level %.3f%%)\" % (df.shape[0], \n", 581 | " usercount.shape[0], \n", 582 | " songcount.shape[0], \n", 583 | " sparsity_level * 100)" 584 | ] 585 | }, 586 | { 587 | "cell_type": "code", 588 | "execution_count": 8, 589 | "metadata": { 590 | "collapsed": true 591 | }, 592 | "outputs": [], 
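The `get_count`/`filter_triplets` cells above are written for Python 2 and an older pandas. A minimal Python 3 sketch of the same venue-count filter and sparsity computation — the toy `df` below is made up for illustration, not the real Gowalla checkin data — might look like:

```python
import pandas as pd

def get_count(df, col):
    # Number of triplets per unique value of `col` (e.g. checkins per venue).
    return df.groupby(col).size()

def filter_triplets(df, min_sc=2):
    # Keep only venues (sid) with at least min_sc checkins, then recount.
    songcount = get_count(df, 'sid')
    df = df[df['sid'].isin(songcount.index[songcount >= min_sc])]
    return df, get_count(df, 'uid'), get_count(df, 'sid')

# Toy data standing in for the real checkin triplets.
df = pd.DataFrame({'uid': [0, 0, 1, 1, 2],
                   'sid': [10, 11, 10, 12, 10],
                   'rating': [1, 1, 1, 1, 1]})
df, usercount, songcount = filter_triplets(df)
sparsity_level = float(df.shape[0]) / (usercount.shape[0] * songcount.shape[0])
```

With `min_sc=2` only venue 10 survives, so the three remaining triplets give a fully dense toy matrix; on the real data the same computation yields the 0.085% sparsity printed above.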
593 | "source": [ 594 | "unique_uid = sorted(pd.unique(df['uid']))\n", 595 | "unique_sid = sorted(pd.unique(df['sid']))" 596 | ] 597 | }, 598 | { 599 | "cell_type": "code", 600 | "execution_count": 9, 601 | "metadata": { 602 | "collapsed": false 603 | }, 604 | "outputs": [], 605 | "source": [ 606 | "uid2idx = dict((uid, idx) for (idx, uid) in enumerate(unique_uid))\n", 607 | "sid2idx = dict((sid, idx) for (idx, sid) in enumerate(unique_sid))" 608 | ] 609 | }, 610 | { 611 | "cell_type": "code", 612 | "execution_count": 10, 613 | "metadata": { 614 | "collapsed": false 615 | }, 616 | "outputs": [], 617 | "source": [ 618 | "with open(os.path.join(DATA_DIR, 'sid2idx.json'), 'w') as f:\n", 619 | " json.dump(sid2idx, f)" 620 | ] 621 | }, 622 | { 623 | "cell_type": "code", 624 | "execution_count": 11, 625 | "metadata": { 626 | "collapsed": false 627 | }, 628 | "outputs": [], 629 | "source": [ 630 | "with open(os.path.join(DATA_DIR, 'uid2idx.json'), 'w') as f:\n", 631 | " json.dump(uid2idx, f)" 632 | ] 633 | }, 634 | { 635 | "cell_type": "code", 636 | "execution_count": 12, 637 | "metadata": { 638 | "collapsed": true 639 | }, 640 | "outputs": [], 641 | "source": [ 642 | "with open(os.path.join(DATA_DIR, 'unique_uid.txt'), 'w') as f:\n", 643 | " for uid in unique_uid:\n", 644 | " f.write('%s\\n' % uid)" 645 | ] 646 | }, 647 | { 648 | "cell_type": "code", 649 | "execution_count": 13, 650 | "metadata": { 651 | "collapsed": false 652 | }, 653 | "outputs": [], 654 | "source": [ 655 | "with open(os.path.join(DATA_DIR, 'unique_sid.txt'), 'w') as f:\n", 656 | " for sid in unique_sid:\n", 657 | " f.write('%s\\n' % sid)" 658 | ] 659 | }, 660 | { 661 | "cell_type": "markdown", 662 | "metadata": {}, 663 | "source": [ 664 | "## Generate train/test/vad sets" 665 | ] 666 | }, 667 | { 668 | "cell_type": "markdown", 669 | "metadata": {}, 670 | "source": [ 671 | "Pick out 20% of the checkins for heldout test" 672 | ] 673 | }, 674 | { 675 | "cell_type": "code", 676 | "execution_count": 14, 
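In Python 3, `map` returns a lazy iterator, so the later `map(lambda x: uid2idx[x], ...)` cells are more naturally written with `Series.map`. A small sketch of the id-to-index mapping with made-up raw ids (the real notebook reads them from the checkin data):

```python
import pandas as pd

# Made-up raw user ids; duplicates collapse via pd.unique, then sorting
# makes the index assignment deterministic.
unique_uid = sorted(pd.unique(pd.Series([42, 7, 42, 19])))
uid2idx = dict((uid, idx) for (idx, uid) in enumerate(unique_uid))

train_df = pd.DataFrame({'uid': [7, 19, 42]})
# Series.map with a dict replaces each raw id by its contiguous index.
train_df['uid'] = train_df['uid'].map(uid2idx)
```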
677 | "metadata": { 678 | "collapsed": true 679 | }, 680 | "outputs": [], 681 | "source": [ 682 | "np.random.seed(12345)\n", 683 | "n_ratings = df.shape[0]\n", 684 | "test = np.random.choice(n_ratings, size=int(0.20 * n_ratings), replace=False)" 685 | ] 686 | }, 687 | { 688 | "cell_type": "code", 689 | "execution_count": 15, 690 | "metadata": { 691 | "collapsed": false 692 | }, 693 | "outputs": [], 694 | "source": [ 695 | "test_idx = np.zeros(n_ratings, dtype=bool)\n", 696 | "test_idx[test] = True\n", 697 | "\n", 698 | "test_df = df[test_idx]\n", 699 | "train_df = df[~test_idx]" 700 | ] 701 | }, 702 | { 703 | "cell_type": "markdown", 704 | "metadata": {}, 705 | "source": [ 706 | "Make sure there is no empty row/column in the training data" 707 | ] 708 | }, 709 | { 710 | "cell_type": "code", 711 | "execution_count": 17, 712 | "metadata": { 713 | "collapsed": false 714 | }, 715 | "outputs": [ 716 | { 717 | "name": "stdout", 718 | "output_type": "stream", 719 | "text": [ 720 | "There are total of 57095 unique users in the training set and 57629 unique users in the entire dataset\n" 721 | ] 722 | } 723 | ], 724 | "source": [ 725 | "print \"There are total of %d unique users in the training set and %d unique users in the entire dataset\" % \\\n", 726 | "(len(pd.unique(train_df['uid'])), len(pd.unique(df['uid'])))" 727 | ] 728 | }, 729 | { 730 | "cell_type": "code", 731 | "execution_count": 18, 732 | "metadata": { 733 | "collapsed": false 734 | }, 735 | "outputs": [ 736 | { 737 | "name": "stdout", 738 | "output_type": "stream", 739 | "text": [ 740 | "There are total of 47198 unique items in the training set and 47198 unique items in the entire dataset\n" 741 | ] 742 | } 743 | ], 744 | "source": [ 745 | "print \"There are total of %d unique items in the training set and %d unique items in the entire dataset\" % \\\n", 746 | "(len(pd.unique(train_df['sid'])), len(pd.unique(df['sid'])))" 747 | ] 748 | }, 749 | { 750 | "cell_type": "markdown", 751 | "metadata": {}, 752 | 
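The 80/20 split above works by drawing 20% of the row positions and turning them into a boolean mask. The same logic, shrunk to a self-contained sketch (the size `n_ratings = 100` is made up, standing in for `df.shape[0]`):

```python
import numpy as np

np.random.seed(12345)
n_ratings = 100  # stand-in for df.shape[0]
# Draw 20% of the row positions, without replacement, as the heldout test set.
test = np.random.choice(n_ratings, size=int(0.20 * n_ratings), replace=False)

test_idx = np.zeros(n_ratings, dtype=bool)
test_idx[test] = True  # True rows go to test; the rest stay in training.
```

Fixing the seed makes the split reproducible, which is what lets these notebooks recreate the exact paper splits.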
"source": [ 753 | "We can see that some of the users do not have any checkins in the training set, so we move those users' checkins from the test set back into the training set" 754 | ] 755 | }, 756 | { 757 | "cell_type": "code", 758 | "execution_count": 19, 759 | "metadata": { 760 | "collapsed": true 761 | }, 762 | "outputs": [], 763 | "source": [ 764 | "train_uid = set(pd.unique(train_df['uid']))" 765 | ] 766 | }, 767 | { 768 | "cell_type": "code", 769 | "execution_count": 20, 770 | "metadata": { 771 | "collapsed": true 772 | }, 773 | "outputs": [], 774 | "source": [ 775 | "left_uid = list()\n", 776 | "for i, uid in enumerate(pd.unique(df['uid'])):\n", 777 | " if uid not in train_uid:\n", 778 | " left_uid.append(uid)" 779 | ] 780 | }, 781 | { 782 | "cell_type": "code", 783 | "execution_count": 21, 784 | "metadata": { 785 | "collapsed": true 786 | }, 787 | "outputs": [], 788 | "source": [ 789 | "move_idx = test_df['uid'].isin(left_uid)" 790 | ] 791 | }, 792 | { 793 | "cell_type": "code", 794 | "execution_count": 22, 795 | "metadata": { 796 | "collapsed": false 797 | }, 798 | "outputs": [], 799 | "source": [ 800 | "train_df = train_df.append(test_df[move_idx])\n", 801 | "test_df = test_df[~move_idx]" 802 | ] 803 | }, 804 | { 805 | "cell_type": "code", 806 | "execution_count": 23, 807 | "metadata": { 808 | "collapsed": false 809 | }, 810 | "outputs": [ 811 | { 812 | "name": "stdout", 813 | "output_type": "stream", 814 | "text": [ 815 | "There are total of 57629 unique users in the training set and 57629 unique users in the entire dataset\n" 816 | ] 817 | } 818 | ], 819 | "source": [ 820 | "# make sure we are good\n", 821 | "print \"There are total of %d unique users in the training set and %d unique users in the entire dataset\" % \\\n", 822 | "(len(pd.unique(train_df['uid'])), len(pd.unique(df['uid'])))" 823 | ] 824 | }, 825 | { 826 | "cell_type": "markdown", 827 | "metadata": {}, 828 | "source": [ 829 | "Pick out 10% of the training ratings as the validation set" 830 | ] 831 | }, 832 | { 833 | "cell_type":
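The loop over `pd.unique(df['uid'])` builds the set of users with no training checkins so their test rows can be moved back. An equivalent Python 3 sketch using `isin` directly — the toy frames are made up, and `pd.concat` replaces `DataFrame.append`, which was removed in pandas 2.0:

```python
import pandas as pd

# Toy train/test frames; user 2 appears only in the test set.
train_df = pd.DataFrame({'uid': [0, 0, 1], 'sid': [10, 11, 12]})
test_df = pd.DataFrame({'uid': [1, 2], 'sid': [10, 11]})

train_uid = set(pd.unique(train_df['uid']))
# Rows whose user has no training checkins get moved back into training.
move_idx = ~test_df['uid'].isin(train_uid)

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
train_df = pd.concat([train_df, test_df[move_idx]], ignore_index=True)
test_df = test_df[~move_idx]
```

After the move, every user in the full dataset has at least one training row, which the sanity-check print below confirms on the real data (57629 users in both).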
"code", 834 | "execution_count": 24, 835 | "metadata": { 836 | "collapsed": true 837 | }, 838 | "outputs": [], 839 | "source": [ 840 | "np.random.seed(13579)\n", 841 | "n_ratings = train_df.shape[0]\n", 842 | "vad = np.random.choice(n_ratings, size=int(0.10 * n_ratings), replace=False)" 843 | ] 844 | }, 845 | { 846 | "cell_type": "code", 847 | "execution_count": 25, 848 | "metadata": { 849 | "collapsed": true 850 | }, 851 | "outputs": [], 852 | "source": [ 853 | "vad_idx = np.zeros(n_ratings, dtype=bool)\n", 854 | "vad_idx[vad] = True\n", 855 | "\n", 856 | "vad_df = train_df[vad_idx]\n", 857 | "train_df = train_df[~vad_idx]" 858 | ] 859 | }, 860 | { 861 | "cell_type": "markdown", 862 | "metadata": {}, 863 | "source": [ 864 | "Again make sure there is no empty row/column in the training data" 865 | ] 866 | }, 867 | { 868 | "cell_type": "code", 869 | "execution_count": 26, 870 | "metadata": { 871 | "collapsed": false 872 | }, 873 | "outputs": [ 874 | { 875 | "name": "stdout", 876 | "output_type": "stream", 877 | "text": [ 878 | "There are total of 57294 unique users in the training set and 57629 unique users in the entire dataset\n" 879 | ] 880 | } 881 | ], 882 | "source": [ 883 | "print \"There are total of %d unique users in the training set and %d unique users in the entire dataset\" % \\\n", 884 | "(len(pd.unique(train_df['uid'])), len(pd.unique(df['uid'])))" 885 | ] 886 | }, 887 | { 888 | "cell_type": "code", 889 | "execution_count": 27, 890 | "metadata": { 891 | "collapsed": false 892 | }, 893 | "outputs": [ 894 | { 895 | "name": "stdout", 896 | "output_type": "stream", 897 | "text": [ 898 | "There are total of 47198 unique items in the training set and 47198 unique items in the entire dataset\n" 899 | ] 900 | } 901 | ], 902 | "source": [ 903 | "print \"There are total of %d unique items in the training set and %d unique items in the entire dataset\" % \\\n", 904 | "(len(pd.unique(train_df['sid'])), len(pd.unique(df['sid'])))" 905 | ] 906 | }, 907 | { 908 | 
"cell_type": "markdown", 909 | "metadata": {}, 910 | "source": [ 911 | "We can see that some of the users do not have any checkins in the training set, so we move those users' checkins from the validation set back into the training set" 912 | ] 913 | }, 914 | { 915 | "cell_type": "code", 916 | "execution_count": 28, 917 | "metadata": { 918 | "collapsed": false 919 | }, 920 | "outputs": [], 921 | "source": [ 922 | "train_uid = set(pd.unique(train_df['uid']))" 923 | ] 924 | }, 925 | { 926 | "cell_type": "code", 927 | "execution_count": 29, 928 | "metadata": { 929 | "collapsed": true 930 | }, 931 | "outputs": [], 932 | "source": [ 933 | "left_uid = list()\n", 934 | "for i, uid in enumerate(pd.unique(df['uid'])):\n", 935 | " if uid not in train_uid:\n", 936 | " left_uid.append(uid)" 937 | ] 938 | }, 939 | { 940 | "cell_type": "code", 941 | "execution_count": 30, 942 | "metadata": { 943 | "collapsed": true 944 | }, 945 | "outputs": [], 946 | "source": [ 947 | "move_idx = vad_df['uid'].isin(left_uid)" 948 | ] 949 | }, 950 | { 951 | "cell_type": "code", 952 | "execution_count": 31, 953 | "metadata": { 954 | "collapsed": true 955 | }, 956 | "outputs": [], 957 | "source": [ 958 | "train_df = train_df.append(vad_df[move_idx])\n", 959 | "vad_df = vad_df[~move_idx]" 960 | ] 961 | }, 962 | { 963 | "cell_type": "code", 964 | "execution_count": 32, 965 | "metadata": { 966 | "collapsed": false 967 | }, 968 | "outputs": [ 969 | { 970 | "name": "stdout", 971 | "output_type": "stream", 972 | "text": [ 973 | "(1670363, 3) (185185, 3)\n" 974 | ] 975 | } 976 | ], 977 | "source": [ 978 | "print train_df.shape, vad_df.shape" 979 | ] 980 | }, 981 | { 982 | "cell_type": "code", 983 | "execution_count": 33, 984 | "metadata": { 985 | "collapsed": false 986 | }, 987 | "outputs": [ 988 | { 989 | "name": "stdout", 990 | "output_type": "stream", 991 | "text": [ 992 | "There are total of 57629 unique users in the training set and 57629 unique users in the entire dataset\n" 993 | ] 994 | } 995 | ], 996 | "source": [ 997 | "# make sure we
are good\n", 998 | "print \"There are total of %d unique users in the training set and %d unique users in the entire dataset\" % \\\n", 999 | "(len(pd.unique(train_df['uid'])), len(pd.unique(df['uid'])))" 1000 | ] 1001 | }, 1002 | { 1003 | "cell_type": "markdown", 1004 | "metadata": {}, 1005 | "source": [ 1006 | "## Numerize the data into (user_index, item_index, count) format" 1007 | ] 1008 | }, 1009 | { 1010 | "cell_type": "code", 1011 | "execution_count": 34, 1012 | "metadata": { 1013 | "collapsed": true 1014 | }, 1015 | "outputs": [], 1016 | "source": [ 1017 | "uid = map(lambda x: uid2idx[x], train_df['uid'])\n", 1018 | "sid = map(lambda x: sid2idx[x], train_df['sid'])" 1019 | ] 1020 | }, 1021 | { 1022 | "cell_type": "code", 1023 | "execution_count": 35, 1024 | "metadata": { 1025 | "collapsed": true 1026 | }, 1027 | "outputs": [], 1028 | "source": [ 1029 | "train_df['uid'] = uid\n", 1030 | "train_df['sid'] = sid" 1031 | ] 1032 | }, 1033 | { 1034 | "cell_type": "code", 1035 | "execution_count": 36, 1036 | "metadata": { 1037 | "collapsed": true 1038 | }, 1039 | "outputs": [], 1040 | "source": [ 1041 | "train_df.to_csv(os.path.join(DATA_DIR, 'train.num.csv'), index=False)" 1042 | ] 1043 | }, 1044 | { 1045 | "cell_type": "code", 1046 | "execution_count": 37, 1047 | "metadata": { 1048 | "collapsed": true 1049 | }, 1050 | "outputs": [], 1051 | "source": [ 1052 | "uid = map(lambda x: uid2idx[x], test_df['uid'])\n", 1053 | "sid = map(lambda x: sid2idx[x], test_df['sid'])" 1054 | ] 1055 | }, 1056 | { 1057 | "cell_type": "code", 1058 | "execution_count": 38, 1059 | "metadata": { 1060 | "collapsed": false 1061 | }, 1062 | "outputs": [], 1063 | "source": [ 1064 | "test_df['uid'] = uid\n", 1065 | "test_df['sid'] = sid" 1066 | ] 1067 | }, 1068 | { 1069 | "cell_type": "code", 1070 | "execution_count": 39, 1071 | "metadata": { 1072 | "collapsed": false 1073 | }, 1074 | "outputs": [], 1075 | "source": [ 1076 | "test_df.to_csv(os.path.join(DATA_DIR, 'test.num.csv'), 
index=False)" 1077 | ] 1078 | }, 1079 | { 1080 | "cell_type": "code", 1081 | "execution_count": 40, 1082 | "metadata": { 1083 | "collapsed": true 1084 | }, 1085 | "outputs": [], 1086 | "source": [ 1087 | "uid = map(lambda x: uid2idx[x], vad_df['uid'])\n", 1088 | "sid = map(lambda x: sid2idx[x], vad_df['sid'])" 1089 | ] 1090 | }, 1091 | { 1092 | "cell_type": "code", 1093 | "execution_count": 41, 1094 | "metadata": { 1095 | "collapsed": true 1096 | }, 1097 | "outputs": [], 1098 | "source": [ 1099 | "vad_df['uid'] = uid\n", 1100 | "vad_df['sid'] = sid" 1101 | ] 1102 | }, 1103 | { 1104 | "cell_type": "code", 1105 | "execution_count": 42, 1106 | "metadata": { 1107 | "collapsed": true 1108 | }, 1109 | "outputs": [], 1110 | "source": [ 1111 | "vad_df.to_csv(os.path.join(DATA_DIR, 'vad.num.csv'), index=False)" 1112 | ] 1113 | } 1114 | ], 1115 | "metadata": { 1116 | "kernelspec": { 1117 | "display_name": "Python 2", 1118 | "language": "python", 1119 | "name": "python2" 1120 | }, 1121 | "language_info": { 1122 | "codemirror_mode": { 1123 | "name": "ipython", 1124 | "version": 2 1125 | }, 1126 | "file_extension": ".py", 1127 | "mimetype": "text/x-python", 1128 | "name": "python", 1129 | "nbconvert_exporter": "python", 1130 | "pygments_lexer": "ipython2", 1131 | "version": "2.7.6" 1132 | } 1133 | }, 1134 | "nbformat": 4, 1135 | "nbformat_minor": 0 1136 | } 1137 | -------------------------------------------------------------------------------- /src/processTasteProfile.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Million Song Dataset Taste Profile" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "[Taste profile dataset](http://labrosa.ee.columbia.edu/millionsong/tasteprofile) contains real user - play counts from undisclosed users, with following statistics:\n", 15 | "\n", 16 | "* 1,019,318 
unique users\n", 17 | "* 384,546 unique MSD songs\n", 18 | "* 48,373,586 user - song - play count triplets\n", 19 | "\n", 20 | "This is the script that subsamples the full dataset and splits it into non-overlapping training, validation, test sets. This subset is used in the paper: [\"modeling user exposure in recommendation\"](http://arxiv.org/abs/1510.07025)." 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 1, 26 | "metadata": { 27 | "collapsed": false 28 | }, 29 | "outputs": [], 30 | "source": [ 31 | "import json\n", 32 | "import os\n", 33 | "import sqlite3\n", 34 | "\n", 35 | "import numpy as np\n", 36 | "import pandas as pd" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 2, 42 | "metadata": { 43 | "collapsed": false 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "# Change this to wherever you keep the data\n", 48 | "TPS_DIR = '/home/waldorf/dawen.liang/data/tasteprofile/'\n", 49 | "\n", 50 | "# The dataset can be obtained here:\n", 51 | "# http://labrosa.ee.columbia.edu/millionsong/sites/default/files/challenge/train_triplets.txt.zip\n", 52 | "TP_file = os.path.join(TPS_DIR, 'train_triplets.txt')\n", 53 | "\n", 54 | "# track_metadata.db contains all the metadata, which is not required to subsample the data, but only used when \n", 55 | "# referring to the actual information about particular pieces (e.g. 
artist, song name, etc.)\n", 56 | "# Available here: http://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/track_metadata.db\n", 57 | "md_dbfile = os.path.join(TPS_DIR, 'track_metadata.db')" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 3, 63 | "metadata": { 64 | "collapsed": false 65 | }, 66 | "outputs": [], 67 | "source": [ 68 | "tp = pd.read_table(TP_file, header=None, names=['uid', 'sid', 'count'])" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "### Get the user-playcount" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": 4, 81 | "metadata": { 82 | "collapsed": false 83 | }, 84 | "outputs": [], 85 | "source": [ 86 | "# We only keep songs that are listened to by at least MIN_SONG_COUNT users and users who have listened \n", 87 | "# to at least MIN_USER_COUNT songs\n", 88 | "\n", 89 | "MIN_USER_COUNT = 20\n", 90 | "MIN_SONG_COUNT = 50" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 5, 96 | "metadata": { 97 | "collapsed": false 98 | }, 99 | "outputs": [], 100 | "source": [ 101 | "def get_count(tp, id):\n", 102 | " playcount_groupbyid = tp[[id, 'count']].groupby(id, as_index=False)\n", 103 | " count = playcount_groupbyid.size()\n", 104 | " return count\n", 105 | "\n", 106 | "def filter_triplets(tp, min_uc=MIN_USER_COUNT, min_sc=MIN_SONG_COUNT):\n", 107 | " # Only keep the triplets for songs which were listened to by at least min_sc users. 
\n", 108 | " songcount = get_count(tp, 'sid')\n", 109 | " tp = tp[tp['sid'].isin(songcount.index[songcount >= min_sc])]\n", 110 | " \n", 111 | " # Only keep the triplets for users who listened to at least min_uc songs\n", 112 | " # After doing this, some of the songs will have less than min_uc users, but should only be a small proportion\n", 113 | " usercount = get_count(tp, 'uid')\n", 114 | " tp = tp[tp['uid'].isin(usercount.index[usercount >= min_uc])]\n", 115 | " \n", 116 | " # Update both usercount and songcount after filtering\n", 117 | " usercount, songcount = get_count(tp, 'uid'), get_count(tp, 'sid') \n", 118 | " return tp, usercount, songcount" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 6, 124 | "metadata": { 125 | "collapsed": false 126 | }, 127 | "outputs": [], 128 | "source": [ 129 | "tp, usercount, songcount = filter_triplets(tp)" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": 7, 135 | "metadata": { 136 | "collapsed": false 137 | }, 138 | "outputs": [ 139 | { 140 | "name": "stdout", 141 | "output_type": "stream", 142 | "text": [ 143 | "After filtering, there are 39730795 triplets from 629112 users and 98485 songs (sparsity level 0.064%)\n" 144 | ] 145 | } 146 | ], 147 | "source": [ 148 | "sparsity_level = float(tp.shape[0]) / (usercount.shape[0] * songcount.shape[0])\n", 149 | "print \"After filtering, there are %d triplets from %d users and %d songs (sparsity level %.3f%%)\" % (tp.shape[0], \n", 150 | " usercount.shape[0], \n", 151 | " songcount.shape[0], \n", 152 | " sparsity_level * 100)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 8, 158 | "metadata": { 159 | "collapsed": true 160 | }, 161 | "outputs": [], 162 | "source": [ 163 | "import matplotlib\n", 164 | "matplotlib.use('Agg')\n", 165 | "import matplotlib.pyplot as plt\n", 166 | "%matplotlib inline" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 9, 172 | "metadata": { 
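Unlike the Gowalla notebook, this `filter_triplets` prunes on both sides: rare songs first, then inactive users. A compact Python 3 sketch of the two-pass filter, with toy thresholds and made-up data in place of the 48M-triplet Taste Profile file:

```python
import pandas as pd

def get_count(tp, col):
    # Number of triplets per unique value of `col`.
    return tp.groupby(col).size()

def filter_triplets(tp, min_uc=2, min_sc=2):
    # First drop songs with fewer than min_sc listeners ...
    songcount = get_count(tp, 'sid')
    tp = tp[tp['sid'].isin(songcount.index[songcount >= min_sc])]
    # ... then drop users with fewer than min_uc remaining songs. A few songs
    # may fall back under min_sc afterwards, but only a small proportion.
    usercount = get_count(tp, 'uid')
    tp = tp[tp['uid'].isin(usercount.index[usercount >= min_uc])]
    return tp, get_count(tp, 'uid'), get_count(tp, 'sid')

# Toy triplets: user 2 and song 12 are too rare and get filtered out.
tp = pd.DataFrame({'uid': [0, 0, 0, 1, 1, 2],
                   'sid': [10, 11, 10, 10, 11, 12],
                   'count': [3, 1, 2, 5, 1, 4]})
tp, usercount, songcount = filter_triplets(tp)
```

The order of the two passes matters only slightly: filtering songs first keeps the user counts from being inflated by songs that are about to be dropped.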
173 | "collapsed": false 174 | }, 175 | "outputs": [ 176 | { 177 | "data": { 178 | "text/plain": [ 179 | "" 180 | ] 181 | }, 182 | "execution_count": 9, 183 | "metadata": {}, 184 | "output_type": "execute_result" 185 | }, 186 | { 187 | "data": { 188 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAnAAAAEMCAYAAABEG4uEAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XtcVHX+P/D3DOBlF7Xkps1M4bp54TCDgxekUjElbpuS\nVJBCC7ibuvnDdXf9musml7bU0tz82pqVkDp4S7y0DYyyBel6IZQZGdBqK5A5g8rFzDG5DXN+f9jZ\nLxGguPKB07yej8d5PDofzsz5zGtO07tz3nNGJggCAQAAAIB0yHt7AgAAAADQPSjgAAAAACQGBRwA\nAACAxKCAAwAAAJAYFHAAAAAAEoMCDgAAAEBiuizgGhsbB0ycOLFYq9UaR40a9cXSpUs3EBElJia+\n94tf/OJrrVZr1Gq1xrNnzwYQEQmCIEtJSdnIcVx5YGBgidFo1IrPtW3btl9zHFfOcVz59u3bnxXH\nz5w5M16r1Ro5jitfsmTJG+L4lStXhoaGhuZrNJrSsLCww1evXr3n7r98AAAAAAkSBKHL5caNGwMF\nQaCWlhbXoKCgUx9//PH0xMTErJycnDntt923b1/M7NmzDwqCQCUlJdqAgACTIAhUXV09fOTIkV/a\nbDZ3m83mPnLkyC8vX77sLQgCqdXq0pKSEq0gCDR79uyD+/fvf0IQBFq8ePH/btiw4feCINCGDRt+\nn5KS8sat5ooFCxYsWLBgweIMyy0voQ4cOLCBiKi5ublfa2uri7e3d833hZ+s/ba5ubmRCQkJO4iI\ntFqt0W63u/I8r8zPzw+NiIjIc3d3v+7u7n49PDzccOTIkceqqqrudzgccq1WayQiio+P1+n1+qj2\nz9V2HAAAAMDZud5qA4fDIQ8MDCz56quvRi5atGgzx3HlREQrV658+S9/+ctfZ8yY8dH69ev/2L9/\n/yae55UqlcoiPlapVPI8zyutVqtCqVTyHY233V6hUFh5nlcSEdXW1np5eHjUExF5enrW1dTUeLef\nm0wmw89IAAAAgGR0dALsTtzyDJxcLneYTKZxPM8rjx49OrWwsDBk7dq1yz/77LMxZ8+eDWhoaBj4\n0ksvvXi3J3a7xFOJxcXF1L+/N7m5LW2z/D9ycenX66c5f0pLampqr8/B2RZkjsydYUHmyNwZlrvp\ntr+FOmTIkG+joqL0p06dmixeRu3Xr1/z/PnztxYXF08kunlmzWKxqMTHiGfk2o9bLBZVR+Ntz+B5\neXnV1tXVeRLdPBsn7rMrAwaoqKXl9TbLa7f78uA2VVZW9vYUnA4yZw+Zs4fM2UPm0tZlAVdfX+9h\ns9kGERE1NDQMzM/PD1Wr1WbxcqYgCLL9+/fPES+rRkZG5mZnZ88jIiopKQl0cXFpVSgU1hkzZnxk\nMBjCbTbbIJvNNshgMITPnDnznyqVyiKXyx3it1Wzs7PnRURE5InPpdPp4omIdDpdfGRkZG7PxQAA\nAAAgHV32wFVXV9/37LPPbhcEQdbY2Dhg7ty5O6OiovSPPvrox1euXBna0NAwUKvVGt9+++3niIhi\nYmJyCgoKpnMcV96/f/+mrKysJCKi++67r3rlypUvBwUFFRERrVq1KsPHx+cyEVFWVlZScnJyZnNz\nc78ZM2Z8NGfOnP1EROnp6amxsbF7MjMzk4cNG3Zp7969T/dsFHA7EhMTe3sKTgeZs4fM2UPm7CFz\naZPd7WuyLMlkMkGc/+
nTp2nmzIX07ben22zRRC4ug8lub+qdCQIAAAB8TyaTkcDqSwwAbRUWFvb2\nFJwOMmcPmbOHzNlD5tKGAg4AAABAYnAJFQAAAIABXEIFAAAAcGIo4KBb0DPBHjJnD5mzh8zZQ+bS\nhgIOAAAAQGLQAwcAAADAAHrgAAAAAJwYCjjoFvRMsIfM2UPm7CFz9pC5tKGAAwAAAJAY9MABAAAA\nMIAeOAAAAAAnhgIOugU9E+whc/aQOXvInD1kLm0o4AAAAAAkBj1wAAAAAAygBw4AAADAiaGAg25B\nzwR7yJw9ZM4eMmcPmUsbCjgAAAAAiUEPHAAAAAAD6IEDAAAAcGIo4KBb0DPBHjJnD5mzh8zZQ+bS\nhgIOAAAAQGLQAwcAAADAALMeuMbGxgETJ04s1mq1xlGjRn2xdOnSDUREFRUVI4KDg0+q1WpzXFzc\n7paWFjcioqampv6xsbF71Gq1+eGHHz5+4cKFB8TnWr169Qo/P79zarXafOTIkcfEcYPBEK5Wq81+\nfn7n1q5du1wc72wfAAAAAM6uywJuwIABjUePHp1qNBq1586d8zt58mRwQUHB9JSUlI3Lly9fazab\n1cOGDbu0adOmxUREmzZtWjx8+PCLZrNZvWzZstdSUlI2EhGdOXNm/P79++eYzWa1wWAIX7BgwZaW\nlha3pqam/osWLdpsMBjCS0tLNfv27XvSaDRqiYg62wf0LvRMsIfM2UPm7CFz9pC5tN2yB27gwIEN\nRETNzc39WltbXby9vWtOnTo1OTo6+iARUXx8vE6v10cREeXm5kYmJCTsICKaNWvWBydOnHjI4XDI\n9Xp9VFxc3G4XF5dWhUJh5TiuvKioKKioqCiI47hyhUJhdXV1tcfGxu7R6/VRdrvdtbN9tJeYmEhp\naWn09ttvU1PTZSIqbPPXoyQIjv+sFRYW/uCAxXr3100mU5+aD9axjvWfxrrJZOpT83GGdXye9/x6\nYWEhpaWlUWJiIiUmJtJdJQhCl0tra6s8ICDA5O7ublu2bNmrVqv1vjFjxpwX/15dXT189OjRnwmC\nQKNGjfr88uXL3uLfRo8e/dnFixeHPffcc1t2794dK44vWLDgrV27dsXt3LnzmYULF24Wx3ft2hW3\nYMGCt6qrq4d3to+2y83p31RcXCwMGTJeIBLaLI2Ci0s/AQAAAKC3fV+33LL2up3F9VYFnlwud5hM\npnHffvvtkLCwsMPjxo0z3eoxAAAAANBzbvs2IkOGDPk2KipK//XXX/+irq7OUxzneV6pVCp5IiKl\nUslXVVXdT0TkcDjk9fX1Hl5eXrVKpZK3WCyqto9RqVSW9uMWi0WlUqks3t7eNZ3tA3pX21PEwAYy\nZw+Zs4fM2UPm0tZlAVdfX+9hs9kGERE1NDQMzM/PDx03bpxp8uTJpw4ePBhNRKTT6eIjIyNziYgi\nIyNzdTpdPBHRoUOHZgcHB590cXFpjYyMzN2zZ0+s3W535XleWVZW5j9p0qRPJ06cWFxWVuZvtVoV\nLS0tbnv37n06IiIiz8XFpbWzfQAAAAA4uy7vA2c2m9XPPvvsdkEQZI2NjQPmzp27c9WqVRkVFRUj\n5s6du/P69evuHMeV79ixI8HNza2lqampf0JCwo7z58+PHTRokG3nzp1zfX19K4mIXnnllT/rdLp4\nuVzuWL9+/R/DwsIOExHl5eVFLFu27DWHwyFPSEjYsWLFitVEN28j0tE+fjB53AcOAAAAJOJu3gcO\nN/IFAAAAYAA/Zg+9Bj0T7CFz9pA5e8icPWQubSjgAAAAACQGl1ABAAAAGMAlVAAAAAAnhgIOugU9\nE+whc/aQOXvInD1kLm0o4AAAAAAkBj1wAAAAAAygBw4AAADAiaGAg25BzwR7yJw9ZM4eMmcPmUsb\nCjgAAAAAiUEPHAAAAAAD6IEDAAAAcGIo4KBb0DPBHjJnD5mzh8zZQ+bShgIOAAAAQGLQ
AwcAAADA\nAHrgAAAAAJwYCjjoFvRMsIfM2UPm7CFz9pC5tKGAAwAAAJAY9MABAAAAMIAeOAAAAAAnhgIOugU9\nE+whc/aQOXvInD1kLm0o4AAAAAAkpssCzmKxqKZOnXpUrVabR48e/fmrr776P0REaWlpaUqlktdq\ntUatVmvMy8uLEB+zevXqFX5+fufUarX5yJEjj4njBoMhXK1Wm/38/M6tXbt2uTheUVExIjg4+KRa\nrTbHxcXtbmlpcSMiampq6h8bG7tHrVabH3744eMXLlx44O6/fOiukJCQ3p6C00Hm7CFz9pA5e8hc\n4gRB6HS5dOmSj9ls9hcEgWw2m/uDDz74hclkCkhLS0tdv379H9pvf/r06fETJkwottvtLjzPK3x9\nfSuam5vdGhsb+/v6+lbwPK9oaWlxnTBhQnFJSYlWEAT61a9+9Y8DBw5EC4JAS5Ys+dvrr7++VBAE\nWrdu3R+XLFnyN0EQ6MCBA9GzZs061H5/N6d/U3FxsTBkyHiBSGizNAouLv0EAAAAgN72fd3SZe11\nu0uXZ+B8fHwu+/v7lxERubu7X9doNKVWq1XxfeX0o29R6PX6qLi4uN0uLi6tCoXCynFceVFRUVBR\nUVEQx3HlCoXC6urqao+Njd2j1+uj7Ha766lTpyZHR0cfJCKKj4/X6fX6KCKi3NzcyISEhB1ERLNm\nzfrgxIkTD3W0T2ALPRPsIXP2kDl7yJw9ZC5trre7YWVlpW9xcfHErKyspOLi4olvvvnm8+++++5v\nxo8ff2bjxo0pQ4cOvWK1WhWPPvrox+JjlEolz/O8UhAEmUqlsrQdLywsDKmtrfXy9PSsE8cVCoWV\n53klERHP80rxMXK53OHh4VFfU1Pj7ePjc7ntvBITE8nX15eqq6upqekyERUSUcj3fz1KguD4z7bi\nwSqeNsZ699dNJlOfmo8zrIv6ynywjvWeWDeZTH1qPs6wjs9zNp/fhYWFVFlZSXfd7Zyms9ls7hMm\nTCgWL3XW1tZ6OhwOmcPhkK1atSp93rx5OkEQ6Lnnntuye/fuWPFxCxYseGvXrl1xO3fufGbhwoWb\nxfFdu3bFLViw4K3q6urhY8aMOS+OV1dXDx89evRngiDQqFGjPr98+bK3+LfRo0d/dunSJZ+28yJc\nQgUAAACJIFaXUImIWlpa3GJiYnLmzp27U7zU6enpWSeTyQSZTCYsWLBgS3Fx8USim2fWLBaLSnys\neBat/bjFYlGpVCqLt7d3TV1dnWfb7ZVKJS8+V1VV1f1ERA6HQ15fX+/h5eVVe5fqVgAAAADJ6rKA\nEwRBNn/+/K1+fn7nli5dukEcr6mp8Rb/OScnJ4bjuHIiosjIyNw9e/bE2u12V57nlWVlZf6TJk36\ndOLEicVlZWX+VqtV0dLS4rZ3796nIyIi8lxcXFonT5586uDBg9FERDqdLj4yMjJXfC6dThdPRHTo\n0KHZwcHBJ+VyuYOgV7U9LQxsIHP2kDl7yJw9ZC5tXfbAHT9+/GGdThev0WhKtVqtkYjolVde+fPO\nnTvnlpaWapqbm/s98MADF7Zu3TqfiGj8+PFnnnjiiQMajaZULpc7tmzZssDNza3Fzc2tZfPmzYvC\nwsIOOxwOeUJCwo7AwMASIqKNGzemzJ07d+eLL774Esdx5evWrfsTEdHixYs3JSQk7FCr1eZBgwbZ\ndu7cObenwwAAAACQAvwWKgAAAAAD+C1UAAAAACeGAg66BT0T7CFz9pA5e8icPWQubSjgAAAAACQG\nPXAAAAAADKAHDgAAAMCJoYCDbkHPBHvInD1kzh4yZw+ZSxsKOAAAAACJQQ8cAAAAAAPogQMAAABw\nYijgoFvQM8EeMmcPmbOHzNlD5tKGAg4AAABAYtADBwAAAMAAeuAAAAAAnBgKOOgW9Eywh8zZQ+bs\nIXP2kLm0oYADAAAAkBj0wAEAAAAwgB44AAAAACeG
(remainder of base64-encoded histogram PNG omitted)",
189 |       "text/plain": [
190 |        ""
191 |       ]
192 |      },
193 |      "metadata": {},
194 |      "output_type": "display_data"
195 |     }
196 |    ],
197 |    "source": [
198 |     "plt.figure(figsize=(10, 4))\n",
199 |     "usercount.hist(bins=100)\n",
200 |     "plt.xlabel('number of songs each user listens to')"
201 |    ]
202 |   },
203 |   {
204 |    "cell_type": "code",
205 |    "execution_count": 10,
206 |    "metadata": {
207 |     "collapsed": false
208 |    },
209 |    "outputs": [
210 |     {
211 |      "data": {
212 |       "text/plain": [
213 |        ""
214 |       ]
215 |      },
216 |      "execution_count": 10,
217 |      "metadata": {},
218 |      "output_type": "execute_result"
219 |     },
220 |     {
221 |      "data": {
222 |       "image/png": "iVBORw0KGgoAAAANSUhEUgAAAm0AAAEMCAYAAACWS4HcAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz
Fx\n4sSM5cuXr2y5XkhISNquXbsWtPYcj9snre3ztvbt47T2u/KfbA9gbnCmDaAN0j/yCoVCPHfu3Egi\nooMHD774n2wr/bx///5ZRERnzpzx7dmzZ529vX1VUFDQ4b///e//K60nzetp6z+XcePGfXXo0KHp\nDQ0N3R8+fNjj0KFD0ydMmHCyrbEoFArx/PnzHowxC71eLz979uwoIqKampq+jDGLGTNmHFi5cuXy\nc+fOjbS3t69ycHCo/PLLL6dIY7lw4YJLa89bVlbmJAhCybJly9aNGjXqrDT3zVh5efmg3NxcbyKi\n5OTksHHjxn1FRDRgwIAbAwcOvP6nP/3pvcjIyPjWnn/8+PGnNm7c+MaYMWOy7O3tq6qrq+2uXLky\nTBCEkrbyGgsMDEz/xz/+8ZrUbuuTjy097vg8fPiwR79+/W4REe3evXue9GGRwMDA9B07drzS0NDQ\nnYjozp07T/3U1xJFUeHo6FixaNGijxctWvTxuXPnRvr6+p45fvz4pNu3b/cxGAyyffv2zX7SsR49\nenS29B6V5oP9p6/7U8Z78+bN/r169Xowf/78pLfeemu9tF3Pnj3rpLmg48aN+yojI2NieXn5IKJH\n++3bb78d0tbzPm6fjxs37qvk5OQwIqKLFy86FxYWurW2vfHrjx49Orvl/nv++edP/JR8AOYIRRuA\nkZZnrKT2smXL1m3YsOFNb2/v3IqKCkepv+Un2Fr+bLxe9+7dG3x8fHIiIyPjt2/fvpCIaNWqVX+s\nqBKi/P0AAAJISURBVKhwlC73xMTErG3teY35+vqeCQsLS3Z3d/9arVYXREREJHh7e+e2Nn6Jn59f\n5sCBA68PHz78cnR09CYvL688IiKdTqeUJtRHREQkxMXFvUP0qLhav379W25uboWurq7Fxv/5G7/G\nBx988Ds3N7dCd3f3r62srJpCQ0NTWu6/4cOHX/7www9fd3V1Ldbr9XLjW3HMmzdv96BBg8pbnp2T\njBo16mxFRYWjVKhIk95bW7e1/S/t4/Ly8kEuLi4XPDw8zh87dsy/tXVb23ePOz5LlizZ+q9//et/\nvLy88kpKSgRpUv+0adM+CwwMTHdzcytUq9UF0odE2hqn5NixY/7u7u5fe3p65u/bt292dHT0JoVC\nIa5cuXK5r6/vGUEQSpydnS++9NJLn7Q1/s2bN0etWrXqj97e3rlXrlwZ1rNnz7q2xnD06NGAlq/7\nuP1rvF1hYaGb9IGblStXLn/vvff+RES0aNGij6UPAvTv3//mP//5z19PnTr1cw8Pj/OjRo0629of\nAMbjf9w+j46O3nTjxo0BgiCULF++fOXIkSPPtTY+49dXKpW6x+0/gM4IXxgPAB0mKipqs7u7+9eL\nFi36uKPH0lU8fPiwh3S5du/evXN27dq1IDU1VdPR4wKA/x7mtAFAh/Dx8cnp0aPHQ+P5afDfy83N\n9X799dc/bGho6N67d++7j5tPBgCdD860AQAAAHQCmNMGAAAA0AmgaAMAAADoBFC0AQAAAHQCKNoA\nAAAAOgEUbQAAAACdAIo2AAAAgE7g/wHFJmCZ38sK3QAAAABJRU5ErkJggg==\n", 223 | "text/plain": [ 224 | "" 225 | ] 226 | }, 227 | "metadata": {}, 228 | "output_type": "display_data" 229 | } 230 | ], 231 | "source": [ 232 | "plt.figure(figsize=(10, 4))\n", 233 | "songcount.hist(bins=100)\n", 234 | "plt.xlabel('number of users by which each song is listened to')" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": 11, 
240 | "metadata": { 241 | "collapsed": false 242 | }, 243 | "outputs": [ 244 | { 245 | "name": "stdout", 246 | "output_type": "stream", 247 | "text": [ 248 | "Sehr kosmisch BY Harmonia -- count: 82524\n", 249 | "Dog Days Are Over (Radio Edit) BY Florence + The Machine -- count: 73359\n", 250 | "Undo BY Björk -- count: 64711\n", 251 | "Secrets BY OneRepublic -- count: 62270\n", 252 | "You're The One BY Dwight Yoakam -- count: 61191\n", 253 | "Revelry BY Kings Of Leon -- count: 60286\n", 254 | "Fireflies BY Charttraxx Karaoke -- count: 51811\n", 255 | "Hey_ Soul Sister BY Train -- count: 51280\n", 256 | "Horn Concerto No. 4 in E flat K495: II. Romance (Andante cantabile) BY Barry Tuckwell/Academy of St Martin-in-the-Fields/Sir Neville Marriner -- count: 50840\n", 257 | "Tive Sim BY Cartola -- count: 45128\n", 258 | "Use Somebody BY Kings Of Leon -- count: 44781\n", 259 | "OMG BY Usher featuring will.i.am -- count: 42510\n", 260 | "Drop The World BY Lil Wayne / Eminem -- count: 40193\n", 261 | "The Scientist BY Coldplay -- count: 39659\n", 262 | "Canada BY Five Iron Frenzy -- count: 37996\n", 263 | "Marry Me BY Train -- count: 37645\n", 264 | "Clocks BY Coldplay -- count: 37591\n", 265 | "Catch You Baby (Steve Pitron & Max Sanna Radio Edit) BY Lonnie Gordon -- count: 35633\n", 266 | "Pursuit Of Happiness (nightmare) BY Kid Cudi / MGMT / Ratatat -- count: 34730\n", 267 | "Yellow BY Coldplay -- count: 34264\n", 268 | "Lucky (Album Version) BY Jason Mraz & Colbie Caillat -- count: 33838\n", 269 | "Bulletproof BY La Roux -- count: 33440\n", 270 | "Alejandro BY Lady GaGa -- count: 32987\n", 271 | "Billionaire [feat. 
Bruno Mars] (Explicit Album Version) BY Travie McCoy -- count: 32717\n", 272 | "Creep (Explicit) BY Radiohead -- count: 32638\n", 273 | "Just Dance BY Lady GaGa / Colby O'Donis -- count: 32562\n", 274 | "Sincerité Et Jalousie BY Alliance Ethnik -- count: 32001\n", 275 | "Représente BY Alliance Ethnik -- count: 31249\n", 276 | "The Only Exception (Album Version) BY Paramore -- count: 30272\n", 277 | "Somebody To Love BY Justin Bieber -- count: 29014\n", 278 | "Bleed It Out [Live At Milton Keynes] BY Linkin Park -- count: 28787\n", 279 | "Invalid BY Tub Ring -- count: 28707\n", 280 | "I Gotta Feeling BY Black Eyed Peas -- count: 28593\n", 281 | "Ain't Misbehavin BY Sam Cooke -- count: 27655\n", 282 | "Livin' On A Prayer BY Bon Jovi -- count: 26903\n", 283 | "Heartbreak Warfare BY John Mayer -- count: 26855\n", 284 | "Fix You BY Coldplay -- count: 26811\n", 285 | "When You Were Young BY The Killers -- count: 26781\n", 286 | "The Gift BY Angels and Airwaves -- count: 26021\n", 287 | "Love Story BY Taylor Swift -- count: 25872\n", 288 | "Float On BY Modest Mouse -- count: 25470\n", 289 | "Cosmic Love BY Florence + The Machine -- count: 25423\n", 290 | "Halo BY Beyoncé -- count: 25291\n", 291 | "Nothin' On You [feat. Bruno Mars] (Album Version) BY B.o.B -- count: 25239\n", 292 | "Kryptonite BY 3 Doors Down -- count: 25170\n", 293 | "Supermassive Black Hole (Twilight Soundtrack Version) BY Muse -- count: 25163\n", 294 | "Uprising BY Muse -- count: 25061\n", 295 | "Party In The U.S.A. 
BY Miley Cyrus -- count: 24984\n", 296 | "Sample Track 2 BY Simon Harris -- count: 24557\n", 297 | "I CAN'T GET STARTED BY Ron Carter -- count: 24442\n" 298 | ] 299 | } 300 | ], 301 | "source": [ 302 | "# take a look at the top 50 most listened songs\n", 303 | "def get_song_info_from_sid(conn, sid):\n", 304 | " cur = conn.cursor()\n", 305 | " cur.execute(\"SELECT title, artist_name FROM songs WHERE song_id = ?\", (sid,))\n", 306 | " title, artist = cur.fetchone()\n", 307 | " return title, artist\n", 308 | "\n", 309 | "songcount = songcount.sort_values(ascending=False)\n", 310 | "\n", 311 | "with sqlite3.connect(md_dbfile) as conn:\n", 312 | " for i in xrange(50):\n", 313 | " sid = songcount.index[i]\n", 314 | " title, artist = get_song_info_from_sid(conn, sid)\n", 315 | " print \"%s BY %s -- count: %d\" % (title, artist, songcount[i])" 316 | ] 317 | }, 318 | { 319 | "cell_type": "markdown", 320 | "metadata": {}, 321 | "source": [ 322 | "You might wonder why \"Sehr kosmisch\" by Harmonia is the most popular song. There is actually a metadata matching error in the dataset that affects about 1% of the songs. But since collaborative filtering doesn't make use of metadata, this error will not affect us. Read more about it [here](http://labrosa.ee.columbia.edu/millionsong/blog/12-2-12-fixing-matching-errors)."
323 | ] 324 | }, 325 | { 326 | "cell_type": "markdown", 327 | "metadata": {}, 328 | "source": [ 329 | "## Subsample ~20000 songs and ~200000 users" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": {}, 335 | "source": [ 336 | "* First, sample 250000 users weighted by listening count, and keep only the data from those 250000 users\n", 337 | "* Then, sample 25000 songs, again weighted by listening count, from the listening history of the pre-selected users\n", 338 | "* Finally, keep only the users who listened to at least 20 songs and the songs listened to by at least 50 users" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": 12, 344 | "metadata": { 345 | "collapsed": false 346 | }, 347 | "outputs": [], 348 | "source": [ 349 | "unique_uid = usercount.index\n", 350 | "\n", 351 | "np.random.seed(98765)\n", 352 | "\n", 353 | "n_users = 250000\n", 354 | "p_users = usercount / usercount.sum()\n", 355 | "idx = np.random.choice(len(unique_uid), size=n_users, replace=False, p=p_users.tolist())\n", 356 | "unique_uid = unique_uid[idx]" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": 13, 362 | "metadata": { 363 | "collapsed": false 364 | }, 365 | "outputs": [], 366 | "source": [ 367 | "tp = tp[tp['uid'].isin(unique_uid)]" 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": 14, 373 | "metadata": { 374 | "collapsed": false 375 | }, 376 | "outputs": [], 377 | "source": [ 378 | "unique_sid = songcount.index\n", 379 | "\n", 380 | "n_songs = 25000\n", 381 | "p_songs = songcount / songcount.sum()\n", 382 | "idx = np.random.choice(len(unique_sid), size=n_songs, replace=False, p=p_songs.tolist())\n", 383 | "unique_sid = unique_sid[idx]" 384 | ] 385 | }, 386 | { 387 | "cell_type": "code", 388 | "execution_count": 15, 389 | "metadata": { 390 | "collapsed": false 391 | }, 392 | "outputs": [], 393 | "source": [ 394 | "tp = tp[tp['sid'].isin(unique_sid)]" 395 | ] 396 | }, 397 | { 398 | "cell_type": "code", 399 |
"execution_count": 16, 400 | "metadata": { 401 | "collapsed": false 402 | }, 403 | "outputs": [], 404 | "source": [ 405 | "tp, usercount, songcount = filter_triplets(tp, min_uc=20, min_sc=50)\n", 406 | "unique_uid = usercount.index\n", 407 | "unique_sid = songcount.index" 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": 17, 413 | "metadata": { 414 | "collapsed": false 415 | }, 416 | "outputs": [ 417 | { 418 | "name": "stdout", 419 | "output_type": "stream", 420 | "text": [ 421 | "After subsampling and filtering, there are 13964169 triplets from 211830 users and 22781 songs (sparsity level 0.289%)\n" 422 | ] 423 | } 424 | ], 425 | "source": [ 426 | "sparsity_level = float(tp.shape[0]) / (usercount.shape[0] * songcount.shape[0])\n", 427 | "print \"After subsampling and filtering, there are %d triplets from %d users and %d songs (sparsity level %.3f%%)\" % \\\n", 428 | "(tp.shape[0], usercount.shape[0], songcount.shape[0], sparsity_level * 100)" 429 | ] 430 | }, 431 | { 432 | "cell_type": "code", 433 | "execution_count": 18, 434 | "metadata": { 435 | "collapsed": false 436 | }, 437 | "outputs": [], 438 | "source": [ 439 | "song2id = dict((sid, i) for (i, sid) in enumerate(unique_sid))\n", 440 | "user2id = dict((uid, i) for (i, uid) in enumerate(unique_uid))" 441 | ] 442 | }, 443 | { 444 | "cell_type": "code", 445 | "execution_count": 19, 446 | "metadata": { 447 | "collapsed": false 448 | }, 449 | "outputs": [], 450 | "source": [ 451 | "with open(os.path.join(TPS_DIR, 'unique_uid_sub.txt'), 'w') as f:\n", 452 | " for uid in unique_uid:\n", 453 | " f.write('%s\\n' % uid)" 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": 20, 459 | "metadata": { 460 | "collapsed": false 461 | }, 462 | "outputs": [], 463 | "source": [ 464 | "with open(os.path.join(TPS_DIR, 'unique_sid_sub.txt'), 'w') as f:\n", 465 | " for sid in unique_sid:\n", 466 | " f.write('%s\\n' % sid)" 467 | ] 468 | }, 469 | { 470 | "cell_type": "markdown", 471 
| "metadata": {}, 472 | "source": [ 473 | "## Generate train/test/vad sets" 474 | ] 475 | }, 476 | { 477 | "cell_type": "markdown", 478 | "metadata": {}, 479 | "source": [ 480 | "Pick out 20% of the ratings as the heldout test set" 481 | ] 482 | }, 483 | { 484 | "cell_type": "code", 485 | "execution_count": 21, 486 | "metadata": { 487 | "collapsed": false 488 | }, 489 | "outputs": [], 490 | "source": [ 491 | "np.random.seed(12345)\n", 492 | "n_ratings = tp.shape[0]\n", 493 | "test = np.random.choice(n_ratings, size=int(0.20 * n_ratings), replace=False)" 494 | ] 495 | }, 496 | { 497 | "cell_type": "code", 498 | "execution_count": 22, 499 | "metadata": { 500 | "collapsed": false 501 | }, 502 | "outputs": [], 503 | "source": [ 504 | "test_idx = np.zeros(n_ratings, dtype=bool)\n", 505 | "test_idx[test] = True\n", 506 | "\n", 507 | "test_tp = tp[test_idx]\n", 508 | "train_tp = tp[~test_idx]" 509 | ] 510 | }, 511 | { 512 | "cell_type": "markdown", 513 | "metadata": {}, 514 | "source": [ 515 | "Make sure there is no empty row or column in the training data" 516 | ] 517 | }, 518 | { 519 | "cell_type": "code", 520 | "execution_count": 23, 521 | "metadata": { 522 | "collapsed": false 523 | }, 524 | "outputs": [ 525 | { 526 | "name": "stdout", 527 | "output_type": "stream", 528 | "text": [ 529 | "There are a total of 211830 unique users in the training set and 211830 unique users in the entire dataset\n" 530 | ] 531 | } 532 | ], 533 | "source": [ 534 | "print \"There are a total of %d unique users in the training set and %d unique users in the entire dataset\" % \\\n", 535 | "(len(pd.unique(train_tp['uid'])), len(pd.unique(tp['uid'])))" 536 | ] 537 | }, 538 | { 539 | "cell_type": "code", 540 | "execution_count": 24, 541 | "metadata": { 542 | "collapsed": false 543 | }, 544 | "outputs": [ 545 | { 546 | "name": "stdout", 547 | "output_type": "stream", 548 | "text": [ 549 | "There are a total of 22781 unique items in the training set and 22781 unique items in the entire dataset\n" 550 | ] 551
| } 552 | ], 553 | "source": [ 554 | "print \"There are a total of %d unique items in the training set and %d unique items in the entire dataset\" % \\\n", 555 | "(len(pd.unique(train_tp['sid'])), len(pd.unique(tp['sid'])))" 556 | ] 557 | }, 558 | { 559 | "cell_type": "markdown", 560 | "metadata": {}, 561 | "source": [ 562 | "Pick out 10% of the training ratings as the validation set" 563 | ] 564 | }, 565 | { 566 | "cell_type": "code", 567 | "execution_count": 25, 568 | "metadata": { 569 | "collapsed": false 570 | }, 571 | "outputs": [], 572 | "source": [ 573 | "np.random.seed(13579)\n", 574 | "n_ratings = train_tp.shape[0]\n", 575 | "vad = np.random.choice(n_ratings, size=int(0.10 * n_ratings), replace=False)" 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": 26, 581 | "metadata": { 582 | "collapsed": false 583 | }, 584 | "outputs": [], 585 | "source": [ 586 | "vad_idx = np.zeros(n_ratings, dtype=bool)\n", 587 | "vad_idx[vad] = True\n", 588 | "\n", 589 | "vad_tp = train_tp[vad_idx]\n", 590 | "train_tp = train_tp[~vad_idx]" 591 | ] 592 | }, 593 | { 594 | "cell_type": "markdown", 595 | "metadata": {}, 596 | "source": [ 597 | "Again make sure there is no empty row or column in the training data" 598 | ] 599 | }, 600 | { 601 | "cell_type": "code", 602 | "execution_count": 27, 603 | "metadata": { 604 | "collapsed": false 605 | }, 606 | "outputs": [ 607 | { 608 | "name": "stdout", 609 | "output_type": "stream", 610 | "text": [ 611 | "There are a total of 211830 unique users in the training set and 211830 unique users in the entire dataset\n" 612 | ] 613 | } 614 | ], 615 | "source": [ 616 | "print \"There are a total of %d unique users in the training set and %d unique users in the entire dataset\" % \\\n", 617 | "(len(pd.unique(train_tp['uid'])), len(pd.unique(tp['uid'])))" 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": 28, 623 | "metadata": { 624 | "collapsed": false 625 | }, 626 | "outputs": [ 627 | { 628 | "name": 
"stdout", 629 | "output_type": "stream", 630 | "text": [ 631 | "There are a total of 22781 unique items in the training set and 22781 unique items in the entire dataset\n" 632 | ] 633 | } 634 | ], 635 | "source": [ 636 | "print \"There are a total of %d unique items in the training set and %d unique items in the entire dataset\" % \\\n", 637 | "(len(pd.unique(train_tp['sid'])), len(pd.unique(tp['sid'])))" 638 | ] 639 | }, 640 | { 641 | "cell_type": "markdown", 642 | "metadata": {}, 643 | "source": [ 644 | "## Numerize the data into (user_index, item_index, count) format" 645 | ] 646 | }, 647 | { 648 | "cell_type": "code", 649 | "execution_count": 33, 650 | "metadata": { 651 | "collapsed": false 652 | }, 653 | "outputs": [], 654 | "source": [ 655 | "def numerize(tp):\n", 656 | " # operate on a copy to avoid pandas' SettingWithCopyWarning\n", 657 | " tp = tp.copy()\n", 658 | " tp['uid'] = map(lambda x: user2id[x], tp['uid'])\n", 659 | " tp['sid'] = map(lambda x: song2id[x], tp['sid'])\n", 660 | " return tp" 661 | ] 662 | }, 663 | { 664 | "cell_type": "code", 665 | "execution_count": 30, 666 | "metadata": { 667 | "collapsed": false 668 | }, 669 | "outputs": [], 670 | "source": [ 671 | "train_tp = numerize(train_tp)\n", 672 | "train_tp.to_csv(os.path.join(TPS_DIR, 'train.num.sub.csv'), index=False)" 673 | ] 674 | }, 675 | { 676 | "cell_type": "code", 677 | "execution_count": 31, 678 | "metadata": { 679 | "collapsed": false 680 | }, 681 | "outputs": [ 682 | { 683 | "name": "stderr", 684 | "output_type": "stream", 685 | "text": [ 686 | "-c:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.\n", 687 | "Try using .loc[row_index,col_indexer] = value instead\n", 688 | "-c:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.\n", 689 | "Try using .loc[row_index,col_indexer] = value instead\n" 690 | ] 691 | } 692 | ], 693 | "source": [ 694 | "test_tp = numerize(test_tp)\n", 695 | "test_tp.to_csv(os.path.join(TPS_DIR, 'test.num.sub.csv'), index=False)"
696 | ] 697 | }, 698 | { 699 | "cell_type": "code", 700 | "execution_count": 32, 701 | "metadata": { 702 | "collapsed": false 703 | }, 704 | "outputs": [], 705 | "source": [ 706 | "vad_tp = numerize(vad_tp)\n", 707 | "vad_tp.to_csv(os.path.join(TPS_DIR, 'vad.num.sub.csv'), index=False)" 708 | ] 709 | } 710 | ], 711 | "metadata": { 712 | "kernelspec": { 713 | "display_name": "Python 2", 714 | "language": "python", 715 | "name": "python2" 716 | }, 717 | "language_info": { 718 | "codemirror_mode": { 719 | "name": "ipython", 720 | "version": 2 721 | }, 722 | "file_extension": ".py", 723 | "mimetype": "text/x-python", 724 | "name": "python", 725 | "nbconvert_exporter": "python", 726 | "pygments_lexer": "ipython2", 727 | "version": "2.7.6" 728 | } 729 | }, 730 | "nbformat": 4, 731 | "nbformat_minor": 0 732 | } 733 | --------------------------------------------------------------------------------