├── .DS_Store
├── .ipynb_checkpoints
└── loss_functions-checkpoint.ipynb
├── README.md
├── catboost
├── catboost_version_diff.ipynb
├── leaderboard.ipynb
├── ohc_dtreeviz.ipynb
├── proof.ipynb
└── utils.py
├── class_notes
├── .ipynb_checkpoints
│   ├── Lecture 3-checkpoint.ipynb
│   ├── Lecture 4-checkpoint.ipynb
│   ├── Lecture 5-checkpoint.ipynb
│   ├── Lecture 6-checkpoint.ipynb
│   └── Lecture 7-checkpoint.ipynb
├── Lecture 10 | 11.ipynb
├── Lecture 3.ipynb
├── Lecture 4.ipynb
├── Lecture 5.ipynb
├── Lecture 6.ipynb
├── Lecture 7.ipynb
├── Lecture 8.ipynb
├── Lecture 9|10.ipynb
├── Untitled.ipynb
├── fastai
└── waterfall.ipynb
├── images
├── Expo_Loss.png
├── Huber_Loss.png
├── Logcosh_Loss.png
├── MAE_Loss.png
├── MSE_Loss.png
├── Quantile_Loss.png
├── all_regression.png
├── huber.png
├── mse.png
├── roc_segments.png
└── tileshop.jpeg
├── ml_slides
├── Poster_DeepOdds.pdf
└── Semi-Supervised Learning.pptx
└── notebooks
├── .ipynb_checkpoints
├── 08_kmeans_scratch-checkpoint.ipynb
└── 09_Quantile_Regression-checkpoint.ipynb
├── 01_Gradient_Boosting_Scratch.ipynb
├── 02_Collaborative_Filtering.ipynb
├── 03_Random_Forest_Interpretetion.ipynb
├── 04_Neural_Net_Scratch.ipynb
├── 05_Loss_Functions.ipynb
├── 06_NLP_Fastai.ipynb
├── 07_Eigenfaces.ipynb
├── 08_kmeans_scratch.ipynb
├── 09_Quantile_Regression.ipynb
├── 10_Transfer_Learn_MXNet.ipynb
├── 11_2_Partial_AUC_Range_Simulation.ipynb
└── 11_Applications_of_different_parts_of_an_ROC_curve.ipynb

/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/groverpr/Machine-Learning/0f27de263c2f27d1b8024a2d619e212f457f75a4/.DS_Store
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
1 | 
2 | This repo contains tutorials that implement various ML algorithms from scratch or using pre-built libraries. It is a living repo and I will keep adding tutorials as I learn more. I hope it is helpful for anyone who wants to understand these algorithms conceptually as well as learn how to implement them in Python.
3 | 
4 | * [01_Gradient_Boosting_Scratch.ipynb](notebooks/01_Gradient_Boosting_Scratch.ipynb) 
5 | This Jupyter notebook implements a basic gradient boosting algorithm with an intuitive example. Learn about decision trees and the intuition behind gradient boosting trees.
6 | 
7 | * [02_Collaborative_Filtering.ipynb](notebooks/02_Collaborative_Filtering.ipynb) 
8 | Building a MovieLens recommendation system with collaborative filtering using PyTorch and fast.ai.
9 | 
10 | * [03_Random_Forest_Interpretetion.ipynb](notebooks/03_Random_Forest_Interpretetion.ipynb) 
11 | How to interpret a seemingly black-box algorithm: feature importance, tree interpreter, and confidence intervals for predictions.
12 | 
13 | * [04_Neural_Net_Scratch.ipynb](notebooks/04_Neural_Net_Scratch.ipynb) 
14 | Using MNIST data, this notebook implements a neural net from scratch using PyTorch.
15 | 
16 | * [05_Loss_Functions.ipynb](notebooks/05_Loss_Functions.ipynb) 
17 | Exploring regression and classification loss functions.
18 | 
19 | * [06_NLP_Fastai.ipynb](notebooks/06_NLP_Fastai.ipynb) 
20 | Naive Bayes, logistic regression, and bag-of-words on IMDB data.
21 | 
22 | * [07_Eigenfaces.ipynb](notebooks/07_Eigenfaces.ipynb) 
23 | Preprocessing of faces and PCA analysis on the data to reconstruct faces and see similarities among different faces.
24 | 25 | * [08_kmeans_scratch.ipynb](notebooks/08_kmeans_scratch.ipynb) 26 | Implementation and visualization of kmeans algorithm from scratch. 27 | 28 | * [09_Quantile_Regression.ipynb](notebooks/09_Quantile_Regression.ipynb) 29 | Implementation of quantile regression using sklearn. 30 | 31 | * [10_Transfer_Learn_MXNet.ipynb](notebooks/10_Transfer_Learn_MXNet.ipynb) 32 | Tutorial on how to perform transfer learning using MXNet. Notebook used in [this blogpost](https://groverpr.github.io/2020/02/18/Transfer-Learning-Using-MXNet.html#step-4-training-base-model). 33 | 34 | * [11_Applications_of_different_parts_of_an_ROC_curve.ipynb](notebooks/11_Applications_of_different_parts_of_an_ROC_curve.ipynb) 35 | Understanding the importance of different parts of an ROC curve and exploring variants of AUC for ML applications 36 | -------------------------------------------------------------------------------- /catboost/ohc_dtreeviz.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### One Hot Encoding" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 223, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "from sklearn.datasets import *\n", 17 | "from sklearn import tree\n", 18 | "from dtreeviz.trees import *\n", 19 | "import pandas as pd\n", 20 | "import numpy as np\n", 21 | "from sklearn.preprocessing import OneHotEncoder" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 224, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "binary_feature_with_high_cardinality = np.random.randint(0, 100, 1000) # 100 cateogries for 1000 observations\n", 31 | "X = pd.DataFrame({\"x\": binary_feature_with_high_cardinality})\n", 32 | "\n", 33 | "# target_labels = np.random.binomial(1, 0.5, size=1000) # 0 or 1 with 50-50 ratio" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 225, 39 | "metadata": {}, 40 | "outputs": [ 41 | { 42 | "data": { 43 | "text/plain": [ 44 | "array([ 0., 11., 22., 33., 44., 55., 66., 77., 88., 99.])" 45 | ] 46 | }, 47 | "execution_count": 225, 48 | "metadata": {}, 49 | "output_type": "execute_result" 50 | } 51 | ], 52 | "source": [ 53 | "np.linspace(0,99,10)" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 226, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "target = []\n", 63 | "for i in binary_feature_with_high_cardinality:\n", 64 | " if i in np.linspace(0,99,10):\n", 65 | " target.append(1)\n", 66 | " else: target.append(0)\n", 67 | " \n", 68 | "target = np.array(target)" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 227, 74 | "metadata": {}, 75 | "outputs": [ 76 | { 77 | "data": { 78 | "text/plain": [ 79 | "99" 80 | ] 81 | }, 82 | "execution_count": 227, 83 | "metadata": {}, 84 | "output_type": "execute_result" 85 | } 86 | ], 87 | "source": [ 88 | "sum(target)" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": 230, 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "enc = OneHotEncoder(handle_unknown='ignore')\n", 98 | "enc.fit(X)\n", 99 | "X_ohc = enc.transform(X)" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 59, 105 | "metadata": { 106 | "scrolled": true 107 | }, 108 | "outputs": [ 109 | { 110 | "data": { 111 | "text/plain": [ 112 | "DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,\n", 113 | " 
max_features=None, max_leaf_nodes=None,\n", 114 | " min_impurity_decrease=0.0, min_impurity_split=None,\n", 115 | " min_samples_leaf=1, min_samples_split=2,\n", 116 | " min_weight_fraction_leaf=0.0, presort=False, random_state=None,\n", 117 | " splitter='best')" 118 | ] 119 | }, 120 | "execution_count": 59, 121 | "metadata": {}, 122 | "output_type": "execute_result" 123 | } 124 | ], 125 | "source": [ 126 | "classifier = tree.DecisionTreeClassifier(max_depth=5) # limit depth of tree\n", 127 | "classifier.fit(X_ohc_noisy, target)\n", 128 | "\n", 129 | "viz = dtreeviz(classifier, \n", 130 | " X_ohc_noisy, \n", 131 | " target,\n", 132 | " target_name='target',\n", 133 | " feature_names= X_ohc_noisy.columns, \n", 134 | " class_names = [\"label-0\", \"label-1\"] # need class_names for classifier\n", 135 | " ) \n", 136 | " \n", 137 | "viz.view() " 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "### K-Fold Target Encoding" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 162, 150 | "metadata": {}, 151 | "outputs": [], 152 | "source": [ 153 | "from category_encoders.target_encoder import TargetEncoder\n", 154 | "from sklearn.model_selection import KFold" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 194, 160 | "metadata": {}, 161 | "outputs": [], 162 | "source": [ 163 | "def target_encoder_regularized(train, cols_encode, target, folds=4):\n", 164 | " \"\"\"\n", 165 | " Mean regularized target encoding based on kfold\n", 166 | " \"\"\"\n", 167 | "\n", 168 | " kf = KFold(n_splits=folds, random_state=1)\n", 169 | "\n", 170 | " for col in cols_encode:\n", 171 | " global_mean = train[target].mean()\n", 172 | "\n", 173 | " for train_index, test_index in kf.split(train):\n", 174 | " mean_target = train.iloc[train_index].groupby(col)[target].mean()\n", 175 | " train.loc[test_index, col + \"_mean_enc\"] = train.loc[test_index, col].map(mean_target)\n", 176 | " train[col + \"_mean_enc\"].fillna(global_mean, inplace=True)\n", 177 | " return train" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 204, 183 | "metadata": {}, 184 | "outputs": [], 185 | "source": [ 186 | "X_encode = target_encoder_regularized(pd.concat([X, pd.DataFrame({\"y\":target})], axis=1), \"x\", \"y\")\n", 187 | "X_encode = pd.DataFrame({\"x\": X_encode[\"x_mean_enc\"]})" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": 140, 193 | "metadata": {}, 194 | "outputs": [], 195 | "source": [ 196 | "encoder = TargetEncoder(cols=\"x\")\n", 197 | "encoder.fit(X, target)\n", 198 | "X_encode = encoder.transform(X, target)" 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": 143, 204 | "metadata": {}, 205 | "outputs": [], 206 | "source": [ 207 | "classifier = tree.DecisionTreeClassifier(max_depth=5) # limit depth of tree\n", 208 | "classifier.fit(X_encode, target)\n", 209 | "\n", 210 | "viz = dtreeviz(classifier, \n", 211 | " X_encode, \n", 212 | " target,\n", 213 | " target_name='target',\n", 214 | " feature_names= X_encode.columns, \n", 215 | " class_names = [\"label-0\", \"label-1\"] # need class_names for classifier\n", 216 | " ) \n", 217 | " \n", 218 | "viz.view() " 219 | ] 220 | }, 221 | { 222 | "cell_type": "markdown", 223 | "metadata": {}, 224 | "source": [ 225 | "### Catboost Encoding" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": 150, 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [ 234 | "def 
catboost_target_encoder(train, cols_encode, target):\n", 235 | " train_new = train.copy()\n", 236 | " for column in cols_encode:\n", 237 | " global_mean = train[target].mean()\n", 238 | " cumulative_sum = train.groupby(column)[target].cumsum() - train[target]\n", 239 | " cumulative_count = train.groupby(column).cumcount()\n", 240 | " train_new[column + \"_cat_mean_target\"] = cumulative_sum/cumulative_count\n", 241 | " train_new[column + \"_cat_mean_target\"].fillna(global_mean, inplace=True)\n", 242 | " return train_new" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": 153, 248 | "metadata": {}, 249 | "outputs": [], 250 | "source": [ 251 | "X_cat = catboost_target_encoder(pd.concat([X, pd.DataFrame({\"y\":target})], axis=1), \"x\", \"y\")\n", 252 | "X_cat = pd.DataFrame({\"x\": X_cat[\"x_cat_mean_target\"]})" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": null, 258 | "metadata": {}, 259 | "outputs": [], 260 | "source": [ 261 | "classifier = tree.DecisionTreeClassifier(max_depth=5) # limit depth of tree\n", 262 | "classifier.fit(X_cat, target)\n", 263 | "\n", 264 | "viz = dtreeviz(classifier, \n", 265 | " X_cat, \n", 266 | " target,\n", 267 | " target_name='target',\n", 268 | " feature_names= \"x\", \n", 269 | " class_names = [\"label-0\", \"label-1\"] # need class_names for classifier\n", 270 | " ) \n", 271 | " \n", 272 | "viz.view() " 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": null, 278 | "metadata": {}, 279 | "outputs": [], 280 | "source": [] 281 | } 282 | ], 283 | "metadata": { 284 | "kernelspec": { 285 | "display_name": "Python 3", 286 | "language": "python", 287 | "name": "python3" 288 | }, 289 | "language_info": { 290 | "codemirror_mode": { 291 | "name": "ipython", 292 | "version": 3 293 | }, 294 | "file_extension": ".py", 295 | "mimetype": "text/x-python", 296 | "name": "python", 297 | "nbconvert_exporter": "python", 298 | "pygments_lexer": "ipython3", 299 | "version": "3.7.3" 300 | } 301 | }, 302 | "nbformat": 4, 303 | "nbformat_minor": 2 304 | } 305 | -------------------------------------------------------------------------------- /catboost/utils.py: -------------------------------------------------------------------------------- 1 | """ 2 | Helper functions for categorical encodings 3 | """ 4 | from pandas.api.types import is_string_dtype, is_numeric_dtype 5 | from sklearn.preprocessing import OneHotEncoder 6 | from sklearn.feature_extraction import FeatureHasher 7 | from sklearn.model_selection import KFold 8 | import pandas as pd 9 | from sklearn.ensemble import RandomForestClassifier 10 | from sklearn.metrics import roc_auc_score 11 | 12 | def kfold_target_encoder(train, test, cols_encode, target, folds=10): 13 | """ 14 | Mean regularized target encoding based on kfold 15 | """ 16 | train_new = train.copy() 17 | test_new = test.copy() 18 | kf = KFold(n_splits=folds, random_state=1) 19 | for col in cols_encode: 20 | global_mean = train_new[target].mean() 21 | for train_index, test_index in kf.split(train): 22 | mean_target = train_new.iloc[train_index].groupby(col)[target].mean() 23 | train_new.loc[test_index, col + "_mean_enc"] = train_new.loc[test_index, col].map(mean_target) 24 | train_new[col + "_mean_enc"].fillna(global_mean, inplace=True) 25 | # making test encoding using full training data 26 | col_mean = train_new.groupby(col)[target].mean() 27 | test_new[col + "_mean_enc"] = test_new[col].map(col_mean) 28 | test_new[col + "_mean_enc"].fillna(global_mean, inplace=True) 29 | 30 | 
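    # Each training row gets an out-of-fold mean from the k-fold loop above, so a row's own
    # target never leaks into its encoding; test rows use category means from the full training set.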
# filtering only mean enc cols 31 | train_new = train_new.filter(like="mean_enc", axis=1) 32 | test_new = test_new.filter(like="mean_enc", axis=1) 33 | return train_new, test_new 34 | 35 | def catboost_target_encoder(train, test, cols_encode, target): 36 | """ 37 | Encoding based on ordering principle 38 | """ 39 | train_new = train.copy() 40 | test_new = test.copy() 41 | for column in cols_encode: 42 | global_mean = train[target].mean() 43 | cumulative_sum = train.groupby(column)[target].cumsum() - train[target] 44 | cumulative_count = train.groupby(column).cumcount() 45 | train_new[column + "_cat_mean_enc"] = cumulative_sum/cumulative_count 46 | train_new[column + "_cat_mean_enc"].fillna(global_mean, inplace=True) 47 | # making test encoding using full training data 48 | col_mean = train_new.groupby(column).mean()[column + "_cat_mean_enc"] # 49 | test_new[column + "_cat_mean_enc"] = test[column].map(col_mean) 50 | test_new[column + "_cat_mean_enc"].fillna(global_mean, inplace=True) 51 | # filtering only mean enc cols 52 | train_new = train_new.filter(like="cat_mean_enc", axis=1) 53 | test_new = test_new.filter(like="cat_mean_enc", axis=1) 54 | return train_new, test_new 55 | 56 | def one_hot_encoder(train, test, cols_encode, target=None): 57 | """ one hot encoding""" 58 | ohc_enc = OneHotEncoder(handle_unknown='ignore') 59 | ohc_enc.fit(train[cols_encode]) 60 | train_ohc = ohc_enc.transform(train[cols_encode]) 61 | test_ohc = ohc_enc.transform(test[cols_encode]) 62 | return train_ohc, test_ohc 63 | 64 | def label_encoder(train, test, cols_encode=None, target=None): 65 | """ 66 | Code borrowed from fast.ai and is tweaked a little. 67 | Convert columns in a training and test dataframe into numeric labels 68 | """ 69 | train_new = train.drop(target, axis=1).copy() 70 | test_new = test.drop(target, axis=1).copy() 71 | 72 | for n,c in train_new.items(): 73 | if is_string_dtype(c) or n in cols_encode : train_new[n] = c.astype('category').cat.as_ordered() 74 | 75 | if test_new is not None: 76 | for n,c in test_new.items(): 77 | if (n in train_new.columns) and (train_new[n].dtype.name=='category'): 78 | test_new[n] = pd.Categorical(c, categories=train_new[n].cat.categories, ordered=True) 79 | 80 | cols = list(train_new.columns[train_new.dtypes == 'category']) 81 | for c in cols: 82 | train_new[c] = train_new[c].astype('category').cat.codes 83 | if test_new is not None: test_new[c] = test_new[c].astype('category').cat.codes 84 | return train_new, test_new 85 | 86 | def hash_encoder(train, test, cols_encode, target=None, n_features=10): 87 | """hash encoder""" 88 | h = FeatureHasher(n_features=n_features, input_type="string") 89 | for col_encode in cols_encode: 90 | h.fit(train[col_encode]) 91 | train_hash = h.transform(train[col_encode]) 92 | test_hash = h.transform(test[col_encode]) 93 | return train_hash, test_hash 94 | 95 | 96 | def fitmodel_and_auc_score(encoder, train, test, cols_encode, target, **kwargs): 97 | """ 98 | Fits and returns scores of a random forest model. 
Uses ROCAUC as scoring metric 99 | """ 100 | model = RandomForestClassifier(n_estimators=500, 101 | n_jobs=-1, 102 | class_weight="balanced", 103 | max_depth=10) 104 | if encoder: 105 | train_encoder, test_encoder = encoder(train, test, cols_encode=cols_encode, target=target) 106 | else: 107 | train_encoder, test_encoder = train.drop(target, axis=1), test.drop(target, axis=1) 108 | model.fit(train_encoder, train[target]) 109 | train_score = roc_auc_score(train[target], model.predict(train_encoder)) 110 | valid_score = roc_auc_score(test[target], model.predict(test_encoder)) 111 | return train_score, valid_score 112 | -------------------------------------------------------------------------------- /class_notes/.ipynb_checkpoints/Lecture 3-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Grocery sales data :" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "* Data is avaiable in form of **star schema** (central table and other tables with metadata). Sometimes we have **snowflake schema**. \n", 15 | "* To reduce the space -- \n", 16 | " * Define datatypes before reading data (read datatypes by loading only first 2 rows of data in bash/ use `shuf` to get random sample of data\n", 17 | " * Use feather format\n", 18 | " * `onpromotion` saved as `object` because it is boolean with NAs --> Replaced later in code and converted to bool\n", 19 | "* EDA of data --\n", 20 | " * Use only last few weeks/months data\n", 21 | " * Transform `sales` to log as Kaggle is checking accuracy on `room mean square log`\n", 22 | " " 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## Interpretetion from Random Forest result :" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "### Confidence interval :\n", 37 | "\n", 38 | "* Standard deviation of trees of prediction gives how confident we are for the predicitons\n", 39 | "* Use **Parallel trees** (fastai) --> To stack trees parallely (specific to random forest)\n", 40 | "* Feedback tips --> Group confidence intervals and look at which group (categorical feature) is contributing to low prediction accurac" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "### Feature importance :\n", 48 | "\n", 49 | "Which features matter in random forest ?\n", 50 | "\n", 51 | "* Use `rf_feat_importance` and plot top features based on their importance\n", 52 | "\n", 53 | "What to do with important features --\n", 54 | "* Gather domain knowledge about important feature\n", 55 | "* Redo random forest using only top features (select a cutoff threshold) --> It will make model **slightly** better and speed up modeling speed\n", 56 | "* We want **independent** variables for interpretetion\n" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "### A technique for working with features:\n", 64 | "* **Shuffling rows** of one column one by one, keeping same rf model and doing predictions to find out how the results change" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "### Questions :\n", 72 | "\n", 73 | "* **How does 1 decision tree in default random forest takes sub sample or does it train on complete data ? 
** \n", 74 | "*Answer*- If `bootstraping` = False then it takes all samples without replacement. So it will have all the raws. If `bootstraping` = True then it will take len(df) rows but with replacement. So there will be duplicates which make each tree different. Default is *True* \n", 75 | " \n", 76 | "\n", 77 | "* **Because Kaggle is validating on `log error`, does that necessarily mean that log of our independent variable is dependent on dependent variables ?** \n", 78 | "*Answer*- The way Random Forests are built is invariant to monotonic transformations of the independent variables \n", 79 | " \n", 80 | "\n", 81 | "* **Why not put `oob_score` while using `set_rf_sample` ?** \n", 82 | "*Answer*- Because we are passing full data to random forest model and trees are built on subset of data defined in `set_rf_sample`. So each tree will try to calculcate score on remaining data (len(df) - set_rf_sample) which will consume more time. \n", 83 | " \n", 84 | " \n", 85 | "* **What did he talk about data leakage problem ?** \n", 86 | "*Answer*- Sometimes, we make model on some variables which might not be available during real time predictions. This causes data leakage. \n", 87 | " \n", 88 | " " 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "### HW :\n", 96 | "\n", 97 | "* Get insights about predictors by playing around with data" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": { 104 | "collapsed": true 105 | }, 106 | "outputs": [], 107 | "source": [] 108 | } 109 | ], 110 | "metadata": { 111 | "kernelspec": { 112 | "display_name": "Python 3", 113 | "language": "python", 114 | "name": "python3" 115 | }, 116 | "language_info": { 117 | "codemirror_mode": { 118 | "name": "ipython", 119 | "version": 3 120 | }, 121 | "file_extension": ".py", 122 | "mimetype": "text/x-python", 123 | "name": "python", 124 | "nbconvert_exporter": "python", 125 | "pygments_lexer": "ipython3", 126 | "version": "3.6.2" 127 | } 128 | }, 129 | "nbformat": 4, 130 | "nbformat_minor": 2 131 | } 132 | -------------------------------------------------------------------------------- /class_notes/.ipynb_checkpoints/Lecture 4-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "#### Bagging and Boosting" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "If different are uncorrelated with each other, residuals are also uncorrelated with each other and on an average it goes to zero by bagging" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": null, 29 | "metadata": { 30 | "collapsed": true 31 | }, 32 | "outputs": [], 33 | "source": [] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "#### Checklist for RF to make it work\n", 40 | "* data in numerical format -> means handle categorical variables\n", 41 | "* handle missing values\n" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "Information game\n", 49 | "* Keep on checking how validation score improves with next split in the tree" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "#### 
Hyperparameters:\n", 57 | "* **max_depth** : default depth of tree = log2(n) where n is number of rows in data. {! think like each node split up in 2 subnodes and tree will end with 1 row in each node, i.e. n leaf nodes}. If I chose min_leaves = 2, then our max_depth will be log2(n) - 1 \n", 58 | "* **max_features** : 0.5 means 50% of total features and it is different features for different split\n", 59 | "* **min_sample_leaf** : Minimum number of number of rows/observations in each lowest leaf node\n", 60 | "\n", 61 | "#### Why subsample? \n", 62 | "##### What to we need for a good RF :\n", 63 | "1. Each tree better --> minus for subsampling\n", 64 | "2. b/w trees less correlation --> plus for subsampling\n", 65 | "\n", 66 | "Therefore, chosing right hyperparamters and subsample, we want to reduce correlation among trees of a forest, so that on an average they perform well!" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "* Look at 5 scores printed from rf print function. It can help to determine some feature which has high feature importance but is decreasing validation score\n", 74 | "* " 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "### Interpretetion of RF\n", 82 | "* Look at the std dev of all different trees\n", 83 | "* \n", 84 | "\n", 85 | "#### Important features \n", 86 | "(can use some threshold cut-off)\n", 87 | "* It is not necessary that removing important features will improve the accuracy. But by getting rid of features that are not important is a way to get rid of \n", 88 | "\n", 89 | "**Why can't we take only 1 feature per tree and make forest? ** \n", 90 | "-- we will not be capturing interactions then. e.g. prob. of claim depends on how old the car is, then we need both year claimed and year sold. Taking only 1 feature in each tree will not capture that.\n", 91 | "\n", 92 | "**Why need one hot encoding**? -- Rule of thumb: One hot encoding for column of cardinality > x (x = 7)\n", 93 | "-- Let's say we have 5 levels of feature C1 = {VL, L, M, H, VH} and we are only interested in VL level, then using C1 will make a lot of nodes for C1 and reduce it's importance. But we could have 5 differnt columns (0,1) and only VL will come out to be important. \n", 94 | "\n", 95 | "Doing this can give particular level of some feature which turn out to be important.\n", 96 | "\n", 97 | "** Rank correlation (spearmanr) ** -- As correlation coefficient can't capture non-linear relationship, we can do a rank correlation to check if the 2 things are related, regardless linear or not. The idea is to first convert all to rank, then calculate correlation coefficient\n", 98 | "\n", 99 | "#### Plot for interpretetion and insights :\n", 100 | "* pdp \n", 101 | "* use of univariate plots\n", 102 | "* partial dependence plots\n" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "### Questions:\n", 110 | "1. 
" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": { 117 | "collapsed": true 118 | }, 119 | "outputs": [], 120 | "source": [] 121 | } 122 | ], 123 | "metadata": { 124 | "kernelspec": { 125 | "display_name": "Python 3", 126 | "language": "python", 127 | "name": "python3" 128 | }, 129 | "language_info": { 130 | "codemirror_mode": { 131 | "name": "ipython", 132 | "version": 3 133 | }, 134 | "file_extension": ".py", 135 | "mimetype": "text/x-python", 136 | "name": "python", 137 | "nbconvert_exporter": "python", 138 | "pygments_lexer": "ipython3", 139 | "version": "3.6.2" 140 | } 141 | }, 142 | "nbformat": 4, 143 | "nbformat_minor": 2 144 | } 145 | -------------------------------------------------------------------------------- /class_notes/.ipynb_checkpoints/Lecture 5-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 2 6 | } 7 | -------------------------------------------------------------------------------- /class_notes/.ipynb_checkpoints/Lecture 6-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Why need feature importance?" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Defined objective ->> Levers *(what inputs can we control, intersection of levers and random forest important features)* ->> Data ->> Model" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "* What is causing something to happen? Why is that happening? For all these questions, feature interpretetion is importance. \n", 22 | "* We use **simulation** to predict the outcome of things that can happen from our predictions. \n", 23 | " * E.g. of simulation -> What would be change in Jeremy's behaviour as customer if we send a this customer service guy to him? \n", 24 | " \n", 25 | "* Some industries measure the success on AUC-ROC, which is different than finding what would be the outcome of it \n", 26 | " * E.g. about tree interpretor. In a hospital, we predicted that this patient is likely to be readmitted in 2 weeks. But main aim of model is to tell what can we do about it? for e.g. this patient is highly likely to come back to hospital, but what we are interested in is what can we do about this patient now? Maybe he is highly likely to come back because of his high BP, so fix that now. \n", 27 | "\n", 28 | "*" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "*** -- thought cloud -- ***\n", 36 | "\n", 37 | "* If you building " 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "### Feature importance\n", 45 | "\n", 46 | "How is it calculated? \n", 47 | "\n", 48 | "* Let our original more score is s1 -> Randomly shuffle a feature f1 -> make prediction on same model with only f1 shuffled -> new score s2 -> importance is s2 - s1\n", 49 | "\n", 50 | "For e.g. if some feature is 1000 times more important than any other feature, then focus on using that one variable. \n", 51 | "\n", 52 | "* Look at feature importance line plot. See where it flatten out. \n", 53 | "\n" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "### Tree interpretor\n", 61 | "\n", 62 | "e.g. 
\n", 63 | "\n", 64 | "10(mean/bias) -**route1**-> 9.4(contribution = 10-9.4 = **-0.6**) -**route2**-> 9.7(contribution = 9.7-9.4 = **0.3**) -**route3**-> 9.1(contribution = 9.7-9.1 = **0.6**\n" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "### Interaction term contributions\n", 72 | "\n", 73 | "e.g. in above tree interpretor case, **contribution of rout1(*)route2 = 9.7-10 = -0.3**.\n", 74 | "\n", 75 | "Intercation here means splitting on some branch (don't think about multiplication, addition or ratio)" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "### Coding for feature importance'" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 18, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "import numpy as np\n", 92 | "import matplotlib.pyplot as plt\n", 93 | "from sklearn.ensemble import RandomForestRegressor" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 5, 99 | "metadata": { 100 | "collapsed": true 101 | }, 102 | "outputs": [], 103 | "source": [ 104 | "x = np.linspace(0,1)" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 8, 110 | "metadata": { 111 | "collapsed": true 112 | }, 113 | "outputs": [], 114 | "source": [ 115 | "y = x + np.random.uniform(-.2, .2, x.shape)" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": { 122 | "collapsed": true 123 | }, 124 | "outputs": [], 125 | "source": [ 126 | "plt.scatter(x,y)" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 15, 132 | "metadata": {}, 133 | "outputs": [ 134 | { 135 | "data": { 136 | "text/plain": [ 137 | "(50, 1)" 138 | ] 139 | }, 140 | "execution_count": 15, 141 | "metadata": {}, 142 | "output_type": "execute_result" 143 | } 144 | ], 145 | "source": [ 146 | "x[:, None].shape" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 16, 152 | "metadata": {}, 153 | "outputs": [ 154 | { 155 | "data": { 156 | "text/plain": [ 157 | "(50, 1)" 158 | ] 159 | }, 160 | "execution_count": 16, 161 | "metadata": {}, 162 | "output_type": "execute_result" 163 | } 164 | ], 165 | "source": [ 166 | "x[...,None].shape" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 17, 172 | "metadata": {}, 173 | "outputs": [ 174 | { 175 | "data": { 176 | "text/plain": [ 177 | "(1, 50)" 178 | ] 179 | }, 180 | "execution_count": 17, 181 | "metadata": {}, 182 | "output_type": "execute_result" 183 | } 184 | ], 185 | "source": [ 186 | "x[None,:].shape" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "metadata": { 193 | "collapsed": true 194 | }, 195 | "outputs": [], 196 | "source": [ 197 | "x_trn, x_tst = " 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": null, 203 | "metadata": { 204 | "collapsed": true 205 | }, 206 | "outputs": [], 207 | "source": [ 208 | "# creating random forest\n", 209 | "m = RandomForestRegressor().fit(x_trn, y_trn)" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": null, 215 | "metadata": { 216 | "collapsed": true 217 | }, 218 | "outputs": [], 219 | "source": [] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": null, 224 | "metadata": { 225 | "collapsed": true 226 | }, 227 | "outputs": [], 228 | "source": [] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": null, 233 | "metadata": { 234 | "collapsed": true 235 
| }, 236 | "outputs": [], 237 | "source": [] 238 | }, 239 | { 240 | "cell_type": "markdown", 241 | "metadata": {}, 242 | "source": [ 243 | "*** -- thought cloud -- ***\n", 244 | "\n", 245 | "* " 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": null, 251 | "metadata": { 252 | "collapsed": true 253 | }, 254 | "outputs": [], 255 | "source": [] 256 | } 257 | ], 258 | "metadata": { 259 | "kernelspec": { 260 | "display_name": "Python 3", 261 | "language": "python", 262 | "name": "python3" 263 | }, 264 | "language_info": { 265 | "codemirror_mode": { 266 | "name": "ipython", 267 | "version": 3 268 | }, 269 | "file_extension": ".py", 270 | "mimetype": "text/x-python", 271 | "name": "python", 272 | "nbconvert_exporter": "python", 273 | "pygments_lexer": "ipython3", 274 | "version": "3.6.2" 275 | } 276 | }, 277 | "nbformat": 4, 278 | "nbformat_minor": 2 279 | } 280 | -------------------------------------------------------------------------------- /class_notes/.ipynb_checkpoints/Lecture 7-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Random forest from scratch" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "1. Take average prediction from all the trees \n", 15 | " \n", 16 | "2. indexes to keep track which row goes to right, which to left hand side of tree . \n", 17 | " \n", 18 | "3. prediction in each node = mean of dependent variables that are in that node (branch) of that tree. \n", 19 | " \n", 20 | "5. __repr__ to change default print of method to helpful formatted stuff (e.g. where that method is present) .\n", 21 | " \n", 22 | "6. **@ notation** -> decorator. Think of **flask** from data acquisition class. here using **@property** decorator i.e. we don't need to put any parenthesis anyore in function (mostly with no arguments)\n", 23 | " \n", 24 | "7. Score similar to minimizing RMSE is minimizing group standard deviatons. think of cat and dog example to think intuitively. \n", 25 | " \n", 26 | "4. How to find which variable to split on? --> minimize weighted group std deviations for each split.\n", 27 | " \n", 28 | "8. What is computaition complexity of **find_better_split** ? n square. (n loops x check each lhs(i) n times --> n squared) --> changed to order n in next section\n", 29 | " \n", 30 | "9. ** %prun ** similar to ** %time ** : gives internal processes time\n", 31 | " \n", 32 | "10. **alpha** in scatter plot helps if dots are sitting on top of each other \n", 33 | " \n" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "#### Explore Cython \n", 41 | "\n", 42 | "* Make stuff faster and easy to edit python codes. How/why? \n", 43 | "* similar imports --> **cimport numpy as np** " 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "**-- thought cloud -- **\n", 51 | " \n", 52 | "* Algorithms go obsolete. No point of using SVM today \n", 53 | "* Magic number for T to normal distribution = 22. This we should have both in validation/train set. \n", 54 | "* Downside of tree based algos --> They don't extrapolate \n", 55 | "* Size of validation set? --> first answer how much accuracy we want . For e.g. 
for fraud detection, even 0.2% change in accuracy matter, maybe not for differentiating cat and dog.\n", 56 | " * A way to think about it --> even with 0.2% differece in accuracy, we could get 50% of change in accuracy from 0.4% . \n", 57 | "* set_rf_sample also does with replacement " 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "### Questions -- \n", 65 | "\n", 66 | "* 22 number or 22% ?\n", 67 | "* How is cython faster? (what does c++ has to make it faster)" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "### HW --\n", 75 | "\n", 76 | "* Write code (from scratch) for removing redundant features, partial dependence and tree interpretor. \n", 77 | "* Add gist and nb extension on jupyter notebook" 78 | ] 79 | } 80 | ], 81 | "metadata": { 82 | "kernelspec": { 83 | "display_name": "Python 3", 84 | "language": "python", 85 | "name": "python3" 86 | }, 87 | "language_info": { 88 | "codemirror_mode": { 89 | "name": "ipython", 90 | "version": 3 91 | }, 92 | "file_extension": ".py", 93 | "mimetype": "text/x-python", 94 | "name": "python", 95 | "nbconvert_exporter": "python", 96 | "pygments_lexer": "ipython3", 97 | "version": "3.6.2" 98 | } 99 | }, 100 | "nbformat": 4, 101 | "nbformat_minor": 2 102 | } 103 | -------------------------------------------------------------------------------- /class_notes/Lecture 10 | 11.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "#### kaggle insurance winner\n", 8 | "\n", 9 | "* **auto encoder** -- \n", 10 | " *creating neural network features (embeddings, matrices) when we don't have dependent varibiable\n", 11 | " * denoise auto-encoder to learn embeddings (15% randomly replace with some other row)\n", 12 | " \n", 13 | "\n" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "### IMDB\n", 21 | "\n", 22 | "* **vocablory** - list of all unique words that appear in all the documents. \n", 23 | "* **bag of words** representation. i.e. of term document matrix. which words appear , how many times they appear. (no order). In **RNN**, can take use of order of words also. \n", 24 | "* **tokenization** - convert bag into list of words. convert piece of text into list of tokens . every token is either word or single piece of punctuation. \n", 25 | " e.g. this \"movie\" isn't good will be tokenized to (this \" movie \" isn ' t good .) we don't want `good.` as an object. It means nothing. \n", 26 | "* Fit tranform on training (to make vocublary from training) -> transform on test (apply in same order). but if test has some new words `unknown` category that we build in train. \n", 27 | " \n", 28 | "* **Countvectorizer** : term document matrix \n", 29 | "* **fit transform** term doc matrix form . stores sparse matrices in form (docID, wordID, no. of occur) . \n", 30 | "* **transform** \n", 31 | "* **vecrz.get_feature()** get feature names, words in this case . \n", 32 | "* **tokenization** splitting words in doc to make term doc matrix (seperate words in each row) . \n", 33 | "* **logistic regression** on val_term_doc handles when naive bayes does not follow independent assumptions. It handles linear relationship b/w variables (here variable are our words) . \n", 34 | "* **duel = True** in logistic regression. 
when you have more cols and rows\n", 35 | "* **ngrams** max_features - throws which ngrams do not exist much. like once etc. \n", 36 | "* **solver** -- (sag , \n", 37 | " * conjugate gradient \n", 38 | " * bfgs (" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "## L.11" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "* weight decay -- e.g. L2 penaly = a*x^2. Derivate = weight decay = 2*a*x . \n", 53 | "* cross entropy for classification? --> \n", 54 | "* weights of 0 or 1 end up making it non opinionistic (posterior probability) . \n", 55 | "* **weight decay = 0.4** works pretty well in most of cases \n", 56 | "\n", 57 | "* Sparse way of storing is to list out the indexes\n", 58 | "* one hot embedding -- mathematically identical to multiplying indexs matrix by one hot matrix. basically embedding means make a multiplication by one hot encoded matrix faster by replacing it with simple array lookup. \n", 59 | "\n", 60 | "* process - now have high dimentionality categorical variable -> convert to number from 0 to numb of levels --> learn a linear layer from that as if we had 1 hot encoded it without ever actually contructing the 1 hot encoded version and without doing matrix multiply -> just saving as index and doing array lookup -> in one hot encoded version everything was 0, no gradient. --> now gradient that are flowing back will update the particular row of the embedding matrix we have used ->" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "## HW" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "* Understand gradients \n", 75 | "* Look at `DotProdNB` function . \n", 76 | "* **Embedding matrix** see video . " 77 | ] 78 | } 79 | ], 80 | "metadata": { 81 | "kernelspec": { 82 | "display_name": "Python 3", 83 | "language": "python", 84 | "name": "python3" 85 | }, 86 | "language_info": { 87 | "codemirror_mode": { 88 | "name": "ipython", 89 | "version": 3 90 | }, 91 | "file_extension": ".py", 92 | "mimetype": "text/x-python", 93 | "name": "python", 94 | "nbconvert_exporter": "python", 95 | "pygments_lexer": "ipython3", 96 | "version": "3.6.3" 97 | } 98 | }, 99 | "nbformat": 4, 100 | "nbformat_minor": 2 101 | } 102 | -------------------------------------------------------------------------------- /class_notes/Lecture 3.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Grocery sales data :" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "* Data is avaiable in form of **star schema** (central table and other tables with metadata). Sometimes we have **snowflake schema**. 
\n", 15 | "* To reduce the space -- \n", 16 | " * Define datatypes before reading data (read datatypes by loading only first 2 rows of data in bash/ use `shuf` to get random sample of data\n", 17 | " * Use feather format\n", 18 | " * `onpromotion` saved as `object` because it is boolean with NAs --> Replaced later in code and converted to bool\n", 19 | "* EDA of data --\n", 20 | " * Use only last few weeks/months data\n", 21 | " * Transform `sales` to log as Kaggle is checking accuracy on `room mean square log`\n", 22 | " " 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## Interpretetion from Random Forest result :" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "### Confidence interval :\n", 37 | "\n", 38 | "* Standard deviation of trees of prediction gives how confident we are for the predicitons\n", 39 | "* Use **Parallel trees** (fastai) --> To stack trees parallely (specific to random forest)\n", 40 | "* Feedback tips --> Group confidence intervals and look at which group (categorical feature) is contributing to low prediction accurac" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "### Feature importance :\n", 48 | "\n", 49 | "Which features matter in random forest ?\n", 50 | "\n", 51 | "* Use `rf_feat_importance` and plot top features based on their importance\n", 52 | "\n", 53 | "What to do with important features --\n", 54 | "* Gather domain knowledge about important feature\n", 55 | "* Redo random forest using only top features (select a cutoff threshold) --> It will make model **slightly** better and speed up modeling speed\n", 56 | "* We want **independent** variables for interpretetion\n" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "### A technique for working with features:\n", 64 | "* **Shuffling rows** of one column one by one, keeping same rf model and doing predictions to find out how the results change" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "### Questions :\n", 72 | "\n", 73 | "* **How does 1 decision tree in default random forest takes sub sample or does it train on complete data ? ** \n", 74 | "*Answer*- If `bootstraping` = False then it takes all samples without replacement. So it will have all the raws. If `bootstraping` = True then it will take len(df) rows but with replacement. So there will be duplicates which make each tree different. Default is *True* \n", 75 | " \n", 76 | "\n", 77 | "* **Because Kaggle is validating on `log error`, does that necessarily mean that log of our independent variable is dependent on dependent variables ?** \n", 78 | "*Answer*- The way Random Forests are built is invariant to monotonic transformations of the independent variables \n", 79 | " \n", 80 | "\n", 81 | "* **Why not put `oob_score` while using `set_rf_sample` ?** \n", 82 | "*Answer*- Because we are passing full data to random forest model and trees are built on subset of data defined in `set_rf_sample`. So each tree will try to calculcate score on remaining data (len(df) - set_rf_sample) which will consume more time. \n", 83 | " \n", 84 | " \n", 85 | "* **What did he talk about data leakage problem ?** \n", 86 | "*Answer*- Sometimes, we make model on some variables which might not be available during real time predictions. This causes data leakage. 
\n", 87 | " \n", 88 | " " 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "### HW :\n", 96 | "\n", 97 | "* Get insights about predictors by playing around with data" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": { 104 | "collapsed": true 105 | }, 106 | "outputs": [], 107 | "source": [] 108 | } 109 | ], 110 | "metadata": { 111 | "kernelspec": { 112 | "display_name": "Python 3", 113 | "language": "python", 114 | "name": "python3" 115 | }, 116 | "language_info": { 117 | "codemirror_mode": { 118 | "name": "ipython", 119 | "version": 3 120 | }, 121 | "file_extension": ".py", 122 | "mimetype": "text/x-python", 123 | "name": "python", 124 | "nbconvert_exporter": "python", 125 | "pygments_lexer": "ipython3", 126 | "version": "3.6.2" 127 | } 128 | }, 129 | "nbformat": 4, 130 | "nbformat_minor": 2 131 | } 132 | -------------------------------------------------------------------------------- /class_notes/Lecture 4.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "#### Bagging and Boosting" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "If different are uncorrelated with each other, residuals are also uncorrelated with each other and on an average it goes to zero by bagging" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": null, 29 | "metadata": { 30 | "collapsed": true 31 | }, 32 | "outputs": [], 33 | "source": [] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "#### Checklist for RF to make it work\n", 40 | "* data in numerical format -> means handle categorical variables\n", 41 | "* handle missing values\n" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "Information game\n", 49 | "* Keep on checking how validation score improves with next split in the tree" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "#### Hyperparameters:\n", 57 | "* **max_depth** : default depth of tree = log2(n) where n is number of rows in data. {! think like each node split up in 2 subnodes and tree will end with 1 row in each node, i.e. n leaf nodes}. If I chose min_leaves = 2, then our max_depth will be log2(n) - 1 \n", 58 | "* **max_features** : 0.5 means 50% of total features and it is different features for different split\n", 59 | "* **min_sample_leaf** : Minimum number of number of rows/observations in each lowest leaf node\n", 60 | "\n", 61 | "#### Why subsample? \n", 62 | "##### What to we need for a good RF :\n", 63 | "1. Each tree better --> minus for subsampling\n", 64 | "2. b/w trees less correlation --> plus for subsampling\n", 65 | "\n", 66 | "Therefore, chosing right hyperparamters and subsample, we want to reduce correlation among trees of a forest, so that on an average they perform well!" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "* Look at 5 scores printed from rf print function. 
It can help to determine some feature which has high feature importance but is decreasing validation score\n", 74 | "* " 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "### Interpretetion of RF\n", 82 | "* Look at the std dev of all different trees\n", 83 | "* \n", 84 | "\n", 85 | "#### Important features \n", 86 | "(can use some threshold cut-off)\n", 87 | "* It is not necessary that removing important features will improve the accuracy. But by getting rid of features that are not important is a way to get rid of \n", 88 | "\n", 89 | "**Why can't we take only 1 feature per tree and make forest? ** \n", 90 | "-- we will not be capturing interactions then. e.g. prob. of claim depends on how old the car is, then we need both year claimed and year sold. Taking only 1 feature in each tree will not capture that.\n", 91 | "\n", 92 | "**Why need one hot encoding**? -- Rule of thumb: One hot encoding for column of cardinality > x (x = 7)\n", 93 | "-- Let's say we have 5 levels of feature C1 = {VL, L, M, H, VH} and we are only interested in VL level, then using C1 will make a lot of nodes for C1 and reduce it's importance. But we could have 5 differnt columns (0,1) and only VL will come out to be important. \n", 94 | "\n", 95 | "Doing this can give particular level of some feature which turn out to be important.\n", 96 | "\n", 97 | "** Rank correlation (spearmanr) ** -- As correlation coefficient can't capture non-linear relationship, we can do a rank correlation to check if the 2 things are related, regardless linear or not. The idea is to first convert all to rank, then calculate correlation coefficient\n", 98 | "\n", 99 | "#### Plot for interpretetion and insights :\n", 100 | "* pdp \n", 101 | "* use of univariate plots\n", 102 | "* partial dependence plots\n" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "### Questions:\n", 110 | "1. " 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": { 117 | "collapsed": true 118 | }, 119 | "outputs": [], 120 | "source": [] 121 | } 122 | ], 123 | "metadata": { 124 | "kernelspec": { 125 | "display_name": "Python 3", 126 | "language": "python", 127 | "name": "python3" 128 | }, 129 | "language_info": { 130 | "codemirror_mode": { 131 | "name": "ipython", 132 | "version": 3 133 | }, 134 | "file_extension": ".py", 135 | "mimetype": "text/x-python", 136 | "name": "python", 137 | "nbconvert_exporter": "python", 138 | "pygments_lexer": "ipython3", 139 | "version": "3.6.2" 140 | } 141 | }, 142 | "nbformat": 4, 143 | "nbformat_minor": 2 144 | } 145 | -------------------------------------------------------------------------------- /class_notes/Lecture 5.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Class discussion --\n", 8 | "\n", 9 | "Q - How does random forest deal with outliers? \n", 10 | "A - In general they are resilient about random outliers as they will be no split occuring for a few outlier observations. For consistent outliers, it will give a consisent signal how dependent variable is dependent on some outlier value. \n", 11 | "\n", 12 | "Q - Which score is used for classification problem like R^sq for regression? \n", 13 | "A - For classification use **cross entropy/ loss loss** \n", 14 | "\n", 15 | "Q - Creating an appropriate **test set** is the most important. Why? 
\n", 16 | "A - If test set is good indicator of how well your model will work in production, . This is something you will keep in mind even after getting promoted to high position, even if you don't do hands on coding. Wrong test set can lead to heavy loss in business in real world. \n", 17 | "\n", 18 | "Use sequential (e.g. last 3 months) of data as test set for time related data (e.g. grocery)\n", 19 | "Think like what, how and when are you trying to predict\n", 20 | "\n", 21 | "Q - Regarding point above, we may not have data related to time in our observation, but our data might be dependent on time. How to do train-validate split then? What should our test set look like? \n", 22 | "A - Think of type of problem at hand. \n", 23 | "\n", 24 | "Q - How to deal with seasonality if we take validation and test just last few months?\n", 25 | "A - think what factors lead to have seasonality. why is sales higher in summers? what factor impact that?\n", 26 | "\n", 27 | "Q - Is OOB score better or worse estimator for model as compared to validation set score? \n", 28 | "A - Generally worse because OOB only saw subset of trees and training data saw full forest.\n", 29 | "\n" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "#### Waterfall chart\n", 37 | "Widely used in business (don't use python, would be cool if someone can make it. Not that difficult thought)\n", 38 | "\n", 39 | "#### Tree interpretor\n", 40 | "Can be used to debug why some prediction came out to be bad. (e.g. Jeremy like Citizen Kane) \n", 41 | "\n", 42 | "-------\n", 43 | "Comparing test/ validation set with training set. (is_valid flag)\n" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "### My questions -- \n", 51 | "\n", 52 | "Q - Again, what is bias? \n", 53 | "A - ? -- Maybe difference between prediction and average of all y's (easiest guess)\n", 54 | "\n", 55 | "Q - How is particular feature come out most important for one row, different for other row?\n", 56 | "A - ?\n", 57 | "\n", 58 | "Q - About PDP, why only 1 point in 1960 (not 500 lines). 
Learn more about what is there on Y axis.\n", 59 | "A\n", 60 | "\n" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "### Random Forest from scratch -- \n", 68 | "Will compare it with sklearn random forest to see how well we did\n", 69 | "\n", 70 | "\n" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": {}, 76 | "source": [ 77 | "#### HW\n", 78 | "* Try to replicate whatever we learnt in class today for some kaggle problem or any other dataset" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": { 85 | "collapsed": true 86 | }, 87 | "outputs": [], 88 | "source": [] 89 | } 90 | ], 91 | "metadata": { 92 | "kernelspec": { 93 | "display_name": "Python 2", 94 | "language": "python", 95 | "name": "python2" 96 | }, 97 | "language_info": { 98 | "codemirror_mode": { 99 | "name": "ipython", 100 | "version": 2 101 | }, 102 | "file_extension": ".py", 103 | "mimetype": "text/x-python", 104 | "name": "python", 105 | "nbconvert_exporter": "python", 106 | "pygments_lexer": "ipython2", 107 | "version": "2.7.13" 108 | } 109 | }, 110 | "nbformat": 4, 111 | "nbformat_minor": 2 112 | } 113 | -------------------------------------------------------------------------------- /class_notes/Lecture 6.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Why need feature importance?" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Defined objective ->> Levers *(what inputs can we control, intersection of levers and random forest important features)* ->> Data ->> Model" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "* What is causing something to happen? Why is that happening? For all these questions, feature interpretetion is importance. \n", 22 | "* We use **simulation** to predict the outcome of things that can happen from our predictions. \n", 23 | " * E.g. of simulation -> What would be change in Jeremy's behaviour as customer if we send a this customer service guy to him? \n", 24 | " \n", 25 | "* Some industries measure the success on AUC-ROC, which is different than finding what would be the outcome of it \n", 26 | " * E.g. about tree interpretor. In a hospital, we predicted that this patient is likely to be readmitted in 2 weeks. But main aim of model is to tell what can we do about it? for e.g. this patient is highly likely to come back to hospital, but what we are interested in is what can we do about this patient now? Maybe he is highly likely to come back because of his high BP, so fix that now. \n", 27 | "\n", 28 | "*" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "*** -- thought cloud -- ***\n", 36 | "\n", 37 | "* If you building " 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "### Feature importance\n", 45 | "\n", 46 | "How is it calculated? \n", 47 | "\n", 48 | "* Let our original more score is s1 -> Randomly shuffle a feature f1 -> make prediction on same model with only f1 shuffled -> new score s2 -> importance is s2 - s1\n", 49 | "\n", 50 | "For e.g. if some feature is 1000 times more important than any other feature, then focus on using that one variable. \n", 51 | "\n", 52 | "* Look at feature importance line plot. See where it flatten out. 
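The sketch below makes that "see where it flattens out" step concrete: sort `feature_importances_` from a fitted forest, plot them, and keep everything above a cutoff. The diabetes dataset and the 0.02 threshold are stand-ins chosen only so the snippet runs (assuming a reasonably recent scikit-learn); they are not from the notes.

```python
# Sketch: sort feature importances, look for the elbow, keep features above a cutoff.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

data = load_diabetes(as_frame=True)
X, y = data.data, data.target
m = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

fi = (pd.DataFrame({"col": X.columns, "imp": m.feature_importances_})
        .sort_values("imp", ascending=False)
        .reset_index(drop=True))

fi.plot(y="imp", figsize=(8, 3), legend=False)  # line plot: look for where it flattens out
plt.xlabel("feature rank"); plt.ylabel("importance"); plt.show()

to_keep = fi.loc[fi.imp > 0.02, "col"].tolist()  # keep only features above the chosen cutoff
print(to_keep)
```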
\n", 53 | "\n" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "### Tree interpretor\n", 61 | "\n", 62 | "e.g. \n", 63 | "\n", 64 | "10(mean/bias) -**route1**-> 9.4(contribution = 10-9.4 = **-0.6**) -**route2**-> 9.7(contribution = 9.7-9.4 = **0.3**) -**route3**-> 9.1(contribution = 9.7-9.1 = **0.6**\n" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "### Interaction term contributions\n", 72 | "\n", 73 | "e.g. in above tree interpretor case, **contribution of rout1(*)route2 = 9.7-10 = -0.3**.\n", 74 | "\n", 75 | "Intercation here means splitting on some branch (don't think about multiplication, addition or ratio)" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "### Coding for feature importance'" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 18, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "import numpy as np\n", 92 | "import matplotlib.pyplot as plt\n", 93 | "from sklearn.ensemble import RandomForestRegressor" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 5, 99 | "metadata": { 100 | "collapsed": true 101 | }, 102 | "outputs": [], 103 | "source": [ 104 | "x = np.linspace(0,1)" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 8, 110 | "metadata": { 111 | "collapsed": true 112 | }, 113 | "outputs": [], 114 | "source": [ 115 | "y = x + np.random.uniform(-.2, .2, x.shape)" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": { 122 | "collapsed": true 123 | }, 124 | "outputs": [], 125 | "source": [ 126 | "plt.scatter(x,y)" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 15, 132 | "metadata": {}, 133 | "outputs": [ 134 | { 135 | "data": { 136 | "text/plain": [ 137 | "(50, 1)" 138 | ] 139 | }, 140 | "execution_count": 15, 141 | "metadata": {}, 142 | "output_type": "execute_result" 143 | } 144 | ], 145 | "source": [ 146 | "x[:, None].shape" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 16, 152 | "metadata": {}, 153 | "outputs": [ 154 | { 155 | "data": { 156 | "text/plain": [ 157 | "(50, 1)" 158 | ] 159 | }, 160 | "execution_count": 16, 161 | "metadata": {}, 162 | "output_type": "execute_result" 163 | } 164 | ], 165 | "source": [ 166 | "x[...,None].shape" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 17, 172 | "metadata": {}, 173 | "outputs": [ 174 | { 175 | "data": { 176 | "text/plain": [ 177 | "(1, 50)" 178 | ] 179 | }, 180 | "execution_count": 17, 181 | "metadata": {}, 182 | "output_type": "execute_result" 183 | } 184 | ], 185 | "source": [ 186 | "x[None,:].shape" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "metadata": { 193 | "collapsed": true 194 | }, 195 | "outputs": [], 196 | "source": [ 197 | "x_trn, x_tst = " 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": null, 203 | "metadata": { 204 | "collapsed": true 205 | }, 206 | "outputs": [], 207 | "source": [ 208 | "# creating random forest\n", 209 | "m = RandomForestRegressor().fit(x_trn, y_trn)" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": null, 215 | "metadata": { 216 | "collapsed": true 217 | }, 218 | "outputs": [], 219 | "source": [] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": null, 224 | "metadata": { 225 | "collapsed": true 226 | }, 227 | 
"outputs": [], 228 | "source": [] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": null, 233 | "metadata": { 234 | "collapsed": true 235 | }, 236 | "outputs": [], 237 | "source": [] 238 | }, 239 | { 240 | "cell_type": "markdown", 241 | "metadata": {}, 242 | "source": [ 243 | "*** -- thought cloud -- ***\n", 244 | "\n", 245 | "* " 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": null, 251 | "metadata": { 252 | "collapsed": true 253 | }, 254 | "outputs": [], 255 | "source": [] 256 | } 257 | ], 258 | "metadata": { 259 | "kernelspec": { 260 | "display_name": "Python 3", 261 | "language": "python", 262 | "name": "python3" 263 | }, 264 | "language_info": { 265 | "codemirror_mode": { 266 | "name": "ipython", 267 | "version": 3 268 | }, 269 | "file_extension": ".py", 270 | "mimetype": "text/x-python", 271 | "name": "python", 272 | "nbconvert_exporter": "python", 273 | "pygments_lexer": "ipython3", 274 | "version": "3.6.2" 275 | } 276 | }, 277 | "nbformat": 4, 278 | "nbformat_minor": 2 279 | } 280 | -------------------------------------------------------------------------------- /class_notes/Lecture 7.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Random forest from scratch" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "1. Take average prediction from all the trees \n", 15 | " \n", 16 | "2. indexes to keep track which row goes to right, which to left hand side of tree . \n", 17 | " \n", 18 | "3. prediction in each node = mean of dependent variables that are in that node (branch) of that tree. \n", 19 | " \n", 20 | "5. __repr__ to change default print of method to helpful formatted stuff (e.g. where that method is present) .\n", 21 | " \n", 22 | "6. **@ notation** -> decorator. Think of **flask** from data acquisition class. here using **@property** decorator i.e. we don't need to put any parenthesis anyore in function (mostly with no arguments)\n", 23 | " \n", 24 | "7. Score similar to minimizing RMSE is minimizing group standard deviatons. think of cat and dog example to think intuitively. \n", 25 | " \n", 26 | "4. How to find which variable to split on? --> minimize weighted group std deviations for each split.\n", 27 | " \n", 28 | "8. What is computaition complexity of **find_better_split** ? n square. (n loops x check each lhs(i) n times --> n squared) --> changed to order n in next section\n", 29 | " \n", 30 | "9. ** %prun ** similar to ** %time ** : gives internal processes time\n", 31 | " \n", 32 | "10. **alpha** in scatter plot helps if dots are sitting on top of each other \n", 33 | " \n" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "#### Explore Cython \n", 41 | "\n", 42 | "* Make stuff faster and easy to edit python codes. How/why? \n", 43 | "* similar imports --> **cimport numpy as np** " 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "**-- thought cloud -- **\n", 51 | " \n", 52 | "* Algorithms go obsolete. No point of using SVM today \n", 53 | "* Magic number for T to normal distribution = 22. This we should have both in validation/train set. \n", 54 | "* Downside of tree based algos --> They don't extrapolate. Linear algos but they are not very accurate. Neural nets are best. \n", 55 | "* Size of validation set? 
--> first answer how much accuracy we want . For e.g. for fraud detection, even 0.2% change in accuracy matter, maybe not for differentiating cat and dog.\n", 56 | " * A way to think about it --> even with 0.2% differece in accuracy, we could get 50% of change in accuracy from 0.4% . \n", 57 | "* set_rf_sample also does with replacement " 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "### Questions -- \n", 65 | "\n", 66 | "* 22 number or 22% ?\n", 67 | "* How is cython faster? (what does c++ has to make it faster)" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "### HW --\n", 75 | "\n", 76 | "* Write code (from scratch) for removing redundant features, partial dependence and tree interpretor. \n", 77 | "* Add gist and nb extension on jupyter notebook" 78 | ] 79 | } 80 | ], 81 | "metadata": { 82 | "kernelspec": { 83 | "display_name": "Python 3", 84 | "language": "python", 85 | "name": "python3" 86 | }, 87 | "language_info": { 88 | "codemirror_mode": { 89 | "name": "ipython", 90 | "version": 3 91 | }, 92 | "file_extension": ".py", 93 | "mimetype": "text/x-python", 94 | "name": "python", 95 | "nbconvert_exporter": "python", 96 | "pygments_lexer": "ipython3", 97 | "version": "3.6.2" 98 | } 99 | }, 100 | "nbformat": 4, 101 | "nbformat_minor": 2 102 | } 103 | -------------------------------------------------------------------------------- /class_notes/Lecture 8.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "#### Learn how to sell, write blogposts, show your work" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Drawbacks of random forest" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "* Rf works with clever nearest neighbours, so not good where we need extrapolation. \n", 22 | "* Grocery sales competetion where there is time series component . \n", 23 | "* " 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "## Intro to neural nets" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "* **pickle** - saving some object, for any python object (not optimal for all e.g. not for pandas) \n", 38 | "* **destructuring** -- splitting x,y \n", 39 | "* vector = rank 1 tensor, matrix = rank 2 tensor, 3D matrix = rank 3 tensor . \n", 40 | "* row = dimension 0 (axis = 0), columns = dimension 1 (axis = 1) . \n", 41 | "* Random forest is one algo for which we can completely ignore normalization as it depends on the order not actual values (think of splits) (general idea - algos needing trees) . \n", 42 | "* Normalization matters in neural nets/ DL . (matters for k-near neighbours) \n", 43 | "* **reshape(-1)** - figure out itself about number of dimensions . \n", 44 | "* **logistic regression** is literally a neural net with 1 layer \n", 45 | "* think **torch** like **numpy** for pytorch . \n", 46 | "* we have **view** in torch like **reshape** in numpy \n", 47 | "* " 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "### Caveats of neural network \n", 55 | "(michael neelson universal approx)\n", 56 | "\n", 57 | "* for continuous functions only. 
not with discountinuour or with jumps \n", 58 | "* " 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "### Blog ideas --\n", 66 | "* why normalization matters in neural net, not in rf. Also talk about same normalization for same training and validation . \n", 67 | "* can talk about where need -1 in reshape, slice, reorder . \n", 68 | "* something from rf from scratch . (about tree interpretor, feature importance, pdp) . \n", 69 | "* -ve log likelihood cost = cross entroy (logistic loss fn) -- can be binary and multi class . \n", 70 | "* " 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": {}, 76 | "source": [ 77 | "## HW\n", 78 | "* rewrite loss fn with if statement \n", 79 | "* read from Michael Nielsen blog \n", 80 | "* " 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 | "metadata": { 87 | "collapsed": true 88 | }, 89 | "outputs": [], 90 | "source": [ 91 | "Dl questions\n", 92 | "sumproduct, matmul\n", 93 | "\n" 94 | ] 95 | } 96 | ], 97 | "metadata": { 98 | "kernelspec": { 99 | "display_name": "Python 3", 100 | "language": "python", 101 | "name": "python3" 102 | }, 103 | "language_info": { 104 | "codemirror_mode": { 105 | "name": "ipython", 106 | "version": 3 107 | }, 108 | "file_extension": ".py", 109 | "mimetype": "text/x-python", 110 | "name": "python", 111 | "nbconvert_exporter": "python", 112 | "pygments_lexer": "ipython3", 113 | "version": "3.6.2" 114 | } 115 | }, 116 | "nbformat": 4, 117 | "nbformat_minor": 2 118 | } 119 | -------------------------------------------------------------------------------- /class_notes/Lecture 9|10.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "* Optimizer - SGD, Adam \n", 8 | "* @ python 3 matrix multiplication \n", 9 | "* **Iterators** / **Generator** (Iter to make -> next to grab \n", 10 | "Every generator is iterator, but not vice versa\n", 11 | "\n", 12 | "Iterator - any object whose class has __next__ method and __iter__ method.\n", 13 | "Generator - object build by calling a fn. that has `yield` in it (yield is like return but it remembers local variable even after exiting fn. and can be started from there with __next__ statement)\n", 14 | "\n", 15 | "-- can use `for o in `\n", 16 | "-- or `next()` --> stream processing (because after 1st, its NEXT (2nd) thing. not the 3rd thing \n", 17 | "\n", 18 | "any dataset can be converted into iterator by wrapping in **DataLoader(d, shuffle = True, bs = 64)** {pytorch} (by default -- without replacement) \n", 19 | "\n", 20 | "* **Variable** i pytorch has similar api as tensor, but keeps track of what we did. Therefore can take derivative \n", 21 | " * `.backward` gives gradient (do it on `loss` to calculate it's lowest point) \n", 22 | " * goes inside functions like `chain rule`\n", 23 | "\n", 24 | "* **Function_()** here _ does inplace = True \n", 25 | "\n", 26 | "* **Optimizer** -- `a.data -= learning_rate * a.grad.data` . \n", 27 | "\n", 28 | "* **Momentum** -- keep track of \n", 29 | "\n", 30 | "* If we have layer inside the net, we don't need to 0 the gradient in SGD. \n", 31 | " \n", 32 | "* Why we do RELU (non linear activation funtion) ? If not, we will end up with combination of multiple linear layers which is not useful as it is just 1 linear layer with different parameters . \n", 33 | "* Final non linear layer (**softmax** - want prob. 
distribution over all the classes, so use it for multiclass; **sigmoid** - probability of 1 class, use it for binary) \n", 34 | "* For hidden non linear layers, pretty much always use **ReLU** or **leaky ReLU** (small non-zero slope for negative inputs; found over time to work well) . \n", 35 | "\n", 36 | "* **Broadcasting** - \n", 37 | "the smaller array is “broadcast” across the larger array so that they have compatible shapes (smaller array means the one of lower rank) \n", 38 | "\n", 39 | "* **expand_dims** to change a vector into a 1-column matrix \n", 40 | "* c[None] - adds 1 axis of len 1 at the start. c[:,None] - adds a new axis of len 1 at the end . c[None,:,None] - adds new axes at the start and the end \n", 41 | "* **backpropagation** - using the chain rule to find derivatives . \n", 42 | "\n", 43 | "* **Reduce overfitting** -- once a weight penalty is added, the loss can be lowered either by predicting better or by shrinking the weights. E.g. `L1` or `L2` penalty = sum of abs weights or sum of squared weights, times a multiplier. Common values of that multiplier {lambda} = 1e-6 to 1e-4 . \n", 44 | "\n", 45 | "derivative of the `L2` penalty a*sum(w^2) -> 2*a*w for each weight w. Called **weight decay** ; used to reduce overfitting on the training set \n", 46 | "\n", 47 | "\n", 48 | "\n", 49 | "\n", 50 | "\n" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "## Broadcasting" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "the smaller array is “broadcast” across the larger array so that they have compatible shapes (smaller array means the one of lower rank)" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "Let's start with the rules of broadcasting --\n", 72 | "\n", 73 | "When operating on two arrays, Numpy/PyTorch compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when :\n", 74 | "\n", 75 | "* they are equal, or\n", 76 | "* one of them is 1 \n", 77 | "\n", 78 | "What does this mean? Let's see with examples (a short sketch follows after the questions below)" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "Questions -- \n", 86 | "* Why use logsoftmax, not softmax? \n", 87 | "* nn.Parameter(torch.randn(*dims)/dims[0]) --> what is the reason for dividing by dims[0]? \n", 88 | "* torch.log(torch.exp(x)/(torch.exp(x).sum(dim=0))) --> why go back to log after the non-linear layer? 
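The examples promised above never made it into this cell, so here is a small sketch of the broadcasting rules and of weight decay, purely as an illustration:

```python
import numpy as np

c = np.array([1., 2., 3.])             # shape (3,)   -> rank-1 tensor
mat = np.array([[10., 20., 30.],
                [40., 50., 60.]])      # shape (2, 3) -> rank-2 tensor

# Shapes are compared from the trailing dimension: (3,) vs (2, 3).
# The last dims match (3 == 3), so c is broadcast across the 2 rows.
print(mat + c)                  # c gets added to every row of mat

print(c[None, :].shape)         # (1, 3): new length-1 axis at the start
print(c[:, None].shape)         # (3, 1): new length-1 axis at the end
print(c[None, :, None].shape)   # (1, 3, 1): new axes at both ends

# Weight decay: L2 penalty = a * (w ** 2).sum(), so d(penalty)/dw = 2 * a * w.
# In plain SGD this just shrinks every weight a little on each update.
lr, a = 0.1, 1e-4
w = np.random.randn(5)
grad = np.random.randn(5)       # stand-in for dL/dw from the data loss
w -= lr * (grad + 2 * a * w)    # gradient step plus weight decay
```

As for the last question above: we take the log of the softmax output mainly because the negative log-likelihood loss expects log-probabilities and because log space is more numerically stable, even though the softmax itself already made everything positive 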
(we had aim to make +ves) \n", 89 | "* " 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "HW \n", 97 | "* Try adding non linear RELU" 98 | ] 99 | } 100 | ], 101 | "metadata": { 102 | "kernelspec": { 103 | "display_name": "Python 3", 104 | "language": "python", 105 | "name": "python3" 106 | }, 107 | "language_info": { 108 | "codemirror_mode": { 109 | "name": "ipython", 110 | "version": 3 111 | }, 112 | "file_extension": ".py", 113 | "mimetype": "text/x-python", 114 | "name": "python", 115 | "nbconvert_exporter": "python", 116 | "pygments_lexer": "ipython3", 117 | "version": "3.6.3" 118 | } 119 | }, 120 | "nbformat": 4, 121 | "nbformat_minor": 2 122 | } 123 | -------------------------------------------------------------------------------- /class_notes/Untitled.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 2 6 | } 7 | -------------------------------------------------------------------------------- /class_notes/fastai: -------------------------------------------------------------------------------- 1 | ../fastai/fastai/ -------------------------------------------------------------------------------- /class_notes/waterfall.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "from waterfallcharts import quick_charts as qc" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 8, 17 | "metadata": {}, 18 | "outputs": [ 19 | { 20 | "data": { 21 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAagAAAEYCAYAAAAJeGK1AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xmc3dP9x/HXOyERaxLSIoRYWqW1dYgtFbUlSi1VO7VG\nai8/LeVX1VJL6qeJLVU0CFW1L7GVWkKJIFQsFbsIIiGCyPr5/XHOxM2YmdzE3Pu9k3k/H495zP0u\n997P3Jm5n3vO93POUURgZmZWa9oVHYCZmVljnKDMzKwmOUGZmVlNcoIyM7Oa5ARlZmY1yQnKzMxq\nkhOUWY2R9GtJlxUdh1nRnKDMWoCkNyRNlfSppPclDZW0ZBn36yPpndJ9EfGHiDi0BWJaVVJIWqSM\ncw/M5+75dZ/XrKU4QZm1nJ0iYklgQ6AOOLXgeObHz4BJwAFFB2JWzwnKrIVFxDjgLuC7AJIOkvSi\npCmSXpN0eN6/RD5vxdzy+lTSipJ+K2lY/eNJ2kTSY5I+lvSspD4lxx6U9HtJj+bHv1fScvnww/n7\nx/mxN20sXkmrAFsC/YHtJS3fsq+I2YJxgjJrYZJWBnYAnsm7PgB2BJYGDgLOl7RhRHwG9APejYgl\n89e7DR6rO3AncAbQFfgf4EZJ3UpO2yc/7jeADvkcgB/k753zY/+7iZAPAEZFxI3Ai8C+C/ijm7Uo\nJyizlnOLpI+BEcBDwB8AIuLOiHg1koeAe4HeZT7mfsDwiBgeEbMj4j5gFCkB1vtrRPw3IqYC1wPr\nz2fcBwDX5tvX4m4+qxFOUGYtZ5eI6BwRq0TEETlhIKmfpMclTcoJbAdgueYfao5VgJ/m7r2P8/23\nAFYoOee9ktufA/MszqgnaXOgJ3Bd3nUt8D1J85vkzFrcPKt7zGzBSeoI3EhqldwaETMk3QIonzKv\n5QTeBq6OiMMW4OnLWargZzmW0ZIa7h+9AM9p1mLcgjKrrA5AR2ACMFNSP2C7kuPvA8tKWqaJ+w8D\ndpK0vaT2khbLpekrlfHcE4DZwGqNHZS0GLAHqThi/ZKvo4F9yilPN6skJyizCoqIKcAxpGtDH5EK\nGm4rOf4S8DfgtdyFt2KD+78N7Az8mpRw3gZOpIz/3Yj4HDgTeDQ/9iYNTtkFmApcFRHv1X8BV5B6\nV/ouwI9s1mLkBQvNzKwWuQVlZmY1yQnKzMxqkhOUmZnVJCcoMzOrSW2qjHS55ZaLVVddtegwzMza\ntKeeeurDiOg2r/PaVIJaddVVGTVqVNFhmJm1aZLeLOc8d/GZmVlNcoIyM7Oa5ARlZmY1yQnKzMxq\nkhOUmZnVJCcoMzOrSU5QZmZWk5ygzMysJjlBmZlZTarZBCXpCkkfSHq+ieOSNFjSWEnPSdqw2jGa\nmVnl1GyCAobS/Iqe/YA181d/4JIqxGRmZlVSswkqIh4GJjVzys6kpaojIh4HOktaoTrRmZlZpdVs\ngipDd+Dtku138r65SOovaZSkURMmTKhacFa+oaOHstnlm7H5FZvz9Pinv3L8nBHnsM1V29BnaB8e\neP2BAiI0syIs9LOZR8SlwKUAdXV1UXA41sBHUz9i8BODefzQxxn3yTj2v3l/Rhw8Ys7xu165i8nT\nJvPPA/5ZYJRmVoTW3IIaB6xcsr1S3metyMhxI+ndozcd2negZ5eeTJk+hWkzp805fv0L1/PFzC/Y\n+qqt2f/m/Zn8xeQCozWzamrNCeo24IBczbcJMDkixhcdlM2fiVMn0qVTlznbnRfrzKSpX156
fHfK\nu7RTO+4/4H56de/FWSPOKiJMMytAzSYoSX8D/g18W9I7kg6RNEDSgHzKcOA1YCzwF+CIgkK1r6Fr\np658/MXHc7YnfzGZrp26znW87xqpmLPvGn157v3nqh6jmRWjZq9BRcTe8zgewJFVCscqpFf3Xpz6\nwKnMmDWD8Z+OZ8kOS9JxkY5zjvdZpQ+j3h3FNqttw6h3R7FG1zUKjNbMqknpfb5tqKurCy/5Xnuu\neOYKLnv6MiQxqO8gFmm3CPe9eh8nbn4i02ZO47DbD+PtT95m0XaLctWuV7H8kssXHbKZfQ2SnoqI\nunme5wRlZmbVVG6CqtlrUGZm1rY5QZmZWU1ygjIzs5rkBGVmZjWpZsvMrXVZe8KmRYdQlhe6/buq\nzzez9yFVfb4FtcgjlxcdgtlXuAVlZmY1aZ4JStLRkrrM6zwzM7OWVE4L6pvAk5Kul9RXkiodlJmZ\n2TwTVEScSlq19nLgQOAVSX+QtHqFYzMzszasrGtQed679/LXTKALcIOkcysYm5mZtWHzrOKTdCxw\nAPAhcBlwYkTMkNQOeAX4ZWVDNDOztqicMvOuwG4R8WbpzoiYLWnHyoRlZmZtXTldfKs1TE6SrgaI\niBcrElV6jr6SXpY0VtJJjRxfRtLtkp6VNEbSQZWKxczMqq+cBLVO6Yak9sD3KxPOXM9xEdAPWBvY\nW9LaDU47EnghItYD+gDnSepQybjMzKx6mkxQkk6WNAVYV9In+WsK8AFwa4Xj2hgYGxGvRcR04Dpg\n5wbnBLBULntfEphEKuAwM7OFQJPXoCLiLOAsSWdFxMlVjAmgO/B2yfY7QK8G51wI3Aa8CywF7BkR\nsxs+kKT+QH+AHj16VCRYs6Z4CqHGeQooK0dzLai18s1/SNqw4VeV4mvO9sBoYEVgfeBCSUs3PCki\nLo2Iuoio69atW7VjNDOzBdRcFd8JwGHAeY0cC+CHFYkoGQesXLK9Ut5X6iDg7DxGa6yk14G1gJEV\njMvMzKqkuS6+w/L3raoXzhxPAmtK6klKTHsB+zQ45y1ga+ARSd8Evg28VtUozcysYppMUJJ2a+6O\nEXFTy4cz57FnSjoKuAdoD1wREWMkDcjHhwC/B4ZK+g8g4FcR8WGlYjIzs+pqrotvp2aOBVCxBAUQ\nEcOB4Q32DSm5/S6wXSVjMDOz4jTXxeeBr2ZmVpjmqvj2y9+Pb+yreiGatVFPPw2bbw6bbQZDh371\n+JQpsOmm0LkzDBtW9fDMKq25Lr4l8velqhGImTVw9NEp8XTvDptsAjvvDF1K1g7t1AluvhmGDGn6\nMcxasea6+P6cv59evXDMDIBp0+Czz6Bnz7TduzeMHAnbb//lOYssAssvX0x8ZlVQzpLvq+VJWSdI\n+kDSrZJWq0ZwZm3WxImp665e584waVJx8ZgVoJzlNq4lTdy6a97eC/gbX516yMy+rgsvhBtugDXW\ngI8//nL/5MnQtWtxcZkVoJzZzBePiKsjYmb+GgYsVunAzNqko46CBx+Eyy6DJZaAt96CGTNgxAjY\neOOio7MaNnT0UDa7fDM2v2Jznh7/9FeOnzPiHLa5ahv6DO3DA68/UECE86+5gbr1H9fuyusxXUca\n/7QnDcYnmVkFDBoEe+8NEXDEEV8WSOy7L1xzTbq9004wZgwsvnhKYi6YaJM+mvoRg58YzOOHPs64\nT8ax/837M+LgEXOO3/XKXUyeNpl/HvDPAqOcf8118T1FSkjK24eXHAug2jOcm7UtdXXw6KNf3V+f\nnABuv7168VjNGjluJL179KZD+w707NKTKdOnMG3mNDou0hGA61+4ni6LdWHrq7ZmxaVW5MJ+F7LM\nYssUHPW8NdnFFxE9I2K1/L3hl4skzMxqxMSpE+nS6cshCJ0X68ykqV8W1bw75V3aqR33H3A/vbr3\n4qwRZxUR5nwrp0gCSd8lrWw759pTRFxVqaDMzKx8XTt15eMvviyqmfzFZLp26jrX8b5r9AWg7xp9\nOeauY6oe44Iop8z8NOCC/LUVcC7w4wrHZWZmZerVvRcj3hrBjFkzeGvyWyzZYck53XsAfVbpw6h3\nRwEw6t1RrNF1jaJCnS/ltKB2B9YDnomIg/LSFp5XxcysRnTp1IUjNjqCLYduiSQG9R3E6PdGc9+r\n93Hi5idy4PoHctjth7HVlVuxaLtFuWrX1tEBVk6CmhoRsyXNzCvWfsDciwmamVnBDt7gYA7e4OC5\n9q2//PoAdFykY6tJSqXKSVCjJHUG/kKq7PsU+HdFozIzszZvntegIuKIiPg4r8W0LfCzaizFIamv\npJcljc3jsBo7p4+k0ZLGSHqo0jGZmVn1lFvFtxuwBWn80wjguUoGJak9aXqlbYF3gCcl3RYRL5Sc\n0xm4GOgbEW9J+kYlYzIzs+oqp4rvYmAA8B/geeBwSRdVOK6NgbER8VpETCfNYrFzg3P2AW6KiLcA\nIuKDCsdkZmZVVE4L6ofAdyIiACRdCYypaFTQHXi7ZPsdvjo57beARSU9SFqzalBjY7Mk9Qf6A/To\n0aMiwZrZ/FnkkcuLDqEmrflk0RGU55WNqvM85UwWOxYofWdfOe8r2iLA94EfAdsD/yvpWw1PiohL\nI6IuIuq6detW7RjNzGwBNTdZ7O2ka05LAS9KGpkPbQyMbOp+LWQcc5eyr5T3lXoHmBgRnwGfSXqY\nNF7rvxWOzczMqqC5Lr4/Vi2Kr3oSWFNST1Ji2ot0zanUrcCFkhYBOpC6AM+vapRmZlYxzS35Pqds\nO88eUd/rOLLSBQkRMVPSUcA9QHvgiogYI2lAPj4kIl6UdDeponA2cFlEPF/JuMzMrHrmWSQhaQ9g\nIPAgaemNCySdGBE3VDKwiBhOg3Wn8lis0u2BOTYzM1vIlFMkcQqwUUT8LCIOIF2D+t/KhmVm1oSn\nn4bNN4fNNoOhQ796/J57YJNNYMstYYcdYOLEqodoLaOcBNWuQZfexDLvZ2bW8o4+GoYNgwcfhMGD\n4aOP5j7+ne/AQw+lrx13hD/9qZAw7esrJ9HcLekeSQdKOhC4Ey/5bmZFmDYNPvsMevaEDh2gd28Y\n2aCouEcP6JiXmujYERYpa8Icq0Hz/M1FxIklUx0BXBoRN1c2LDOzRkycCJ07f7nduTNMmtT4ue+/\nDxdemLr8rFVqNkHlOfH+GRFbATdVJyQzswYuvBBuuAHWWAM+/nLlWCZPhq5dv3r+J5/A7rvDkCHw\nDU/T2Vo128UXEbOA2ZKWqVI8ZmZfddRR6ZrTZZfBEkvAW2/BjBkwYgRsvPHc506dCrvuCqecAr0a\nzpBmrUk516A+Bf4j6XJJg+u/Kh2YmVmjBg2CvfdOVXpHHAFduqT9++6bvl90ETz7LJx9NvTpA2ee\nWVio9vWUc/XwJty9Z2a1oq4OHn30q/uvuSZ9/5//SV/W6pVTJHGlpA7AWqS5+V7OS2CYmZlVTDkz\nSewA/Bl4lTSTRE9Jh0fEXZUOzszM2q5yuvj+D9g
qIsYCSFqdNBbKCcrMzCqmnCKJKfXJKXsNmFKh\neMzMzIDyWlCjJA0Hriddg/op8GQevEtEuIDCzMxaXDkJajHgfWDLvD0B6ATsREpYTlBmZtbiyqni\nO6gagTQkqS8wiLQe1GURcXYT520E/BvYq9JLgJiZWfXU5KzkeYqli4B+wNrA3pLWbuK8c4B7qxuh\nmZlVWk0mKNKaU2Mj4rU85uo6YOdGzjsauBGo6Aq/ZmZWfbWaoLoDb5dsv5P3zSGpO7ArcElzDySp\nv6RRkkZNmDChxQM1M7PKaPIalKTjm7tjRPxfy4czX/4E/CoiZktq8qSIuBS4FKCuri6qFFujho4e\nyqVPXYokLuh3ARuusOGcY9c9fx0XjryQdmrH0h2X5tqfXMvSHZcuMFozs2I114JaKn/VAT8ntWC6\nAwOADZu5X0sYB6xcsr1S3leqDrhO0hvA7sDFknapcFwL7KOpHzH4icE8eOCDDNt1GMfcdcxcx3f7\nzm6MOHgEDx/0MBuusCFXP3t1QZGamdWGJltQEXE6gKSHgQ0jYkre/i1pJolKehJYU1JPUmLaC9in\nQXw9629LGgrcERG3VDiuBTZy3Eh69+hNh/Yd6NmlJ1OmT2HazGl0XCSt/NmhfYc55342/TPW6blO\nUaGamdWEcq5BfRMonRx2et5XMRExEzgKuAd4Ebg+IsZIGiBpQCWfu1ImTp1Il05d5mx3Xqwzk6bO\nvRLo5U9fzvcu+R6PvPUI63RzgjKztq2cgbpXASMl1S/zvgtwZeVCSiJiODC8wb4hTZx7YKXj+bq6\ndurKx198uRLo5C8m07XT3CuBHrLhIRyy4SGc++i5DHxsIOdue261wzQzqxnzbEFFxJnAQcBH+eug\niPhDpQNb2PTq3osRb41gxqwZvDX5LZbssOSc7j2AL2Z+Med258U6s/iiixcRpplZzSinBQWwOPBJ\nRPxVUjdJPSPi9UoGtrDp0qkLR2x0BFsO3RJJDOo7iNHvjea+V+/jxM1PZOCjA7n/9fuB1Nq6Yucr\nCo7YzKxYimi+8lrSaaSKuW9HxLckrQj8IyI2r0aALamuri5GjRpVdBgLpbUnbFp0CGV5odu/iw7B\nrElrPll0BOV5ZaOvd39JT0VE3bzOK6dIYlfgx8BnABHxLqn83MzMrGLKSVDTIzWzAkDSEpUNyczM\nrLwEdb2kPwOdJR0G/BO4rLJhmZlZW1fOcht/lLQt8AnwbeA3EXFfxSOrUb7WUhvPZ2YLv3kmKEnn\nRMSvgPsa2WdmZlYR5XTxbdvIvn4tHYiZmVmp5mYz/zlwBLC6pOdKDi0FPFbpwMzMrG1rrovvWuAu\n4CzgpJL9UyJiUuN3MTMzaxlNdvFFxOSIeAMYBEyKiDcj4k1gpqRe1QrQzMzapnKuQV0CfFqy/Snz\nWMXWzMzs6yonQSlK5kOKiNmUP4efmZnZAiknQb0m6RhJi+avY4HXKh2YmZm1beUkqAHAZqSVbd8B\negH9KxkUgKS+kl6WNFbSSY0c31fSc5L+I+kxSetVOiYzM6uecmaS+IC05HrVSGoPXEQag/UO8KSk\n2yLihZLTXge2jIiPJPUDLiUlTzMzWwg0Nw7qlxFxrqQLyBPFloqIYyoY18bA2Ih4LcdyHbAzMCdB\nRUTpWKzHgZUqGI+ZmVVZcy2oF/P3IhZQ6g68XbJd37XYlENIY7a+QlJ/cpdkjx49Wio+MzOrsCYT\nVETcnr9fWb1w5p+krUgJaovGjkfEpaTuP+rq6ppfndHMzGpGc118t9NI1169iPhxRSJKxgErl2yv\nlPfNRdK6pKU/+kXExArGY2ZmVdZcF98f8/fdgOWBYXl7b+D9SgYFPAmsKaknKTHtBexTeoKkHsBN\nwP4R8d8Kx2NmZlXWXBffQwCSzmuwdvztkip6XSoiZko6CrgHaA9cERFjJA3Ix4cAvwGWBS6WBDCz\nnDXuzcysdShnRoglJK1WUlHXE6j4su8RMRwY3mDfkJLbhwKHVjoOMzMrRjkJ6hfAg5JeAwSsAhxe\n0ajMzKzNK2eg7t2S1gTWyrteiohplQ3LzMzaunlOdSRpceBE4KiIeBboIWnHikdmZmZtWjlz8f0V\nmA5smrfHAWdULCIzMzPKS1CrR8S5wAyAiPicdC3KzMysYspJUNMldSIP2pW0OuBrUGZmVlHlVPGd\nBtwNrCzpGmBz4MBKBmVmZtZsglIaAfsSaTaJTUhde8dGxIdViM3MzNqwZhNURISk4RHxPeDOKsVk\nZmZW1jWopyVtVPFIzMzMSpRzDaoXsJ+kN4DPSN18ERHrVjIwMzNr28pJUNtXPAozM7MGmlsPajFg\nALAG8B/g8oiYWa3AzMysbWvuGtSVQB0pOfUDzqtKRGZmZjSfoNaOiP0i4s/A7kDvKsUEgKS+kl6W\nNFbSSY0cl6TB+fhzkjasZnxmZlZZzSWoGfU3qt21J6k9cBGp5bY2sLektRuc1g9YM3/1By6pZoxm\nZlZZzRVJrCfpk3xbQKe8XV/Ft3QF49oYGFuySOJ1wM7ACyXn7AxcFREBPC6ps6QVImJ8BeMyM7Mq\naW7J9/bVDKSB7sDbJdvvkMrd53VOd2CuBCWpP6mFRbdu3ejfvz/rrLMOvXv3ZsiQISyzzDKceOKJ\nnHrqqQCcd955nHzyyUyfPp1f/OIX3HLLLbz++uvssccejB8/ni0e+R6bbbYZq6yyCn/729/o0aMH\ne+65JwMHDqR9+/acd955HHfccQCcfvrpDBo0iEmTJnHYYYfxxBNP8Nxzz9GvXz8A7rrrLtZdd116\n9erFX/7yF7p27cqxxx7LaaedBsCf/vQnTjjhBGbNmsWJJ57I3//+d9566y323ntv3nzzTR577DF6\n9+7NCiuswPXXX0/Pnj3ZZZddOP/88zmqw1GcddZZnHDCCQCcccYZDBw4kMmTJzNgwAAeeeQRxowZ\nw0477cS0adO499572WCDDdhggw244oor6NatGz//+c/53e9+B8AFF1zA0UcfDcBJJ53E1Vdfzbhx\n49hvv/145ZVXeOKJJ+jTpw/LLrssN954I2ussQY77LADgwcPplOnTpx++un88pe/BOCss87izDPP\n5NNPP+XII4/k/vvv56WXXmKXXXZhypQp3H///dTV1bHOOutw5ZVXsvzyy3PIIYdw5plnfiWWU045\nhcsvv5z33nuPn/3sZ4wZM4ZRo0ax9dZbs9RSS3HLLbew1lprsfXWW3PRRRex5JJLcsopp3DyyScD\ncO6553LaaacxdepUjjnmGIYPH87YsWP5yU9+wsSJE3nwwQfp1asXa665JsOGDaN79+7sv//+nH32\n2V+J5Te/+Q2XXHIJEyZM4OCDD+aZZ57hmWeeYbvttqNjx47cfvvtX+tv75FHHmkVf3sdOnTw394C\n/u1t1Ur+9s645+v97ZVLqQFSWyTtDvTNy7ojaX+gV0QcVXLOHcDZETEib98P/CoiRjX1uHV1dTFq\nVJOHzc
ysCiQ9FRF18zqvnJkkijAOWLlke6W8b37PMTOzVqpWE9STwJqSekrqAOwF3NbgnNuAA3I1\n3ybAZF9/MjNbeJQzk0TVRcRMSUcB9wDtgSsiYoykAfn4EGA4sAMwFvgcOKioeM3MrOXVZIICiIjh\npCRUum9Iye0Ajqx2XGZmVh212sVnZmZtnBOUmZnVJCcoMzOrSU5QZmZWk5ygzMysJjlBmZlZTXKC\nMjOzmuQEZWZmNckJyszMapITlJmZ1SQnKDMzq0lOUGZmVpOcoMzMrCY5QZmZWU2quQSVFyAcLGms\npOckbdjEeddIelnS85KukLRotWM1M7PKqbkEBfQD1sxf/YFLmjjvGmAt4HtAJ+DQqkRnZmZVUYsJ\namfgqkgeBzpLWqHhSRExPJ8TwEhgpWoHamZmlVOLCao78HbJ9jt5X6Ny197+wN0VjsvMzKqoFhPU\n/LoYeDgiHmnsoKT+kkZJGjVhwoQqh2ZmZguqJhKUpCMljZY0GhgPrFxyeCVgXBP3Ow3oBhzf1GNH\nxKURURcRdd26dWvJsM3MrIKULuHUDkk/Ao4CdgB6AYMjYuNGzjsUOBjYOiKmlvnYE4A3WzDclrIc\n8GHRQdQgvy6N8+vSOL8ujavF12WViJhni6EWE5SAC4G+wOfAQRExKh8bDhwaEe9KmklKNlPyXW+K\niN8VEfPXJWlURNQVHUet8evSOL8ujfPr0rjW/LosUnQADeWqvCObOLZDye2ai93MzFpOTVyDMjMz\na8gJqjZcWnQANcqvS+P8ujTOr0vjWu3rUnPXoMzMzMAtKDMzq1FOUGZmVpOcoMzMrCY5QdWIPP7L\nmuHXyKqpsb83SX7P5KuvTaX+N/1i1wBJyuO/kPQDSW1+Zvb6P3hJK0nqJKlTRIST1JckbS+pd9Fx\nLIwa/E/uIKmvpPUiYnbRsRWtwWvzfUnfoEK5xAmqBpT8so8BBgPti42oeDkZ9QNuBE4GhklaMtpw\n2WlJ0pakxYF9gKWKjWrhVPI/eQTwv0BP4BlJ3ys0sII1SE5HAMOAm4GDJfVo6edzgqoRkrYFDgR6\nR8SbktaV9P2CwyqMpHWBP5CWUvkCWJ6SxN0WW1IlyXkZYCpwP2nRTsDdTy2h9O9KUk/gR8B2edeD\nwAulHxSqHmDBSpLTzsAWwHrAGcCGwM4tnaQ8XVBBSj+JZONJbzi/zn/32wKvSLo6IoYXEWPBZpNW\nU+4B7ALsFRGTJW0GPBkRMwqNrgD5DbE3MBDoAEwAlpX0WL79BU3M/G/lKXkDriNNsDoCOIX0RrxD\nRMySdLikOyKiTb7WuUtvf6BnREwH7pIUpMVmO0m6NiLeaYnn8ieuAjRoJi8jaWlgLGny21WAW0mT\n5b4BLF1UnEWQ1F3SisAnwG+Ay4EtI+I1SVsCxwJdi4yxmko/pecFpB8mJak9gJuA1YFtgOuBsyW1\nmdemUiTtA5wKTAd+AOwbEf0i4gtJe5F6OtqMhi3FiPgAOBOYIOmcvO9u4C5gReCzFnvuNtylX4gG\nyel40pvNEsAFEXF7/XFJuwMnAftExH8LDLniSn7mXqRPq6OAc4AdgROAQcA04LfAaRFxa1GxFkVS\nf9I6aTOAqyLiDUkrAP8AdgIWAz6NiCnNPIzNg6RdSH9310bEA5KWBR7NX7OBDUgrLPynwDALIelg\nYA1S4r6ClIwOA96PiF/nc5aIiBZLUG5BVVlJcvo56Y1lP+Aj4GZJh+U36u2BI0j/CAt1coI5BRE7\nkLqu3gP2BQ4FniG1ovYlXQf4dUTc2hb6/nMRRP3tY0gtpvtIfzMHSmoXEeOBD4ClImK8k1OLWIn0\noXHNXDk6kbQu3Z2kFsJP21Jyqr+uKWl/4BfAv4CNgAGk/HERsI7S4rGQlkhqMb4GVSWS1gMOiIgT\n8q4vgL2Aw4EA+gF3Svo8Iq7Ja7hMLCjcqsqfUo8Gfps/tW5PWoxySeDciPhnybkNr90tdHKy3k7S\necC7pJbT9qTuzQmki9IdldZEewFYtKhYFxa5XL8uIs6XNA3YFRgt6ZmImEzqTm0zJG0DvBgR4yQt\nAmwGnBkR90l6HDgN2C8ijsjJ6QOYq5CnRbgFVT3/Bf4oqVd+k/0r6Y2lL/CriLgPuBc4N5dTL7TJ\nSdK3Je0laWWA/LO+DXxHUvuIuIf0ifUoUgtzzie5NpCcdgTOAh6MiLdJ3UorkSrItgB2joiZwAFA\nv4g4NSJeLSre1qqkEq/+PXAV4FuSDo+IvwB3k7qbN26j1ZGrAovm96KZwGuk12L53FI/jfT/2jUi\nRkfEu5UIoi2+8FUlaTlJXSJiau6S+TVwR05S75CqrjbOYwpeAzaKiE+LjLmS8htDf+BqUjIeJGlJ\nYDSpxbQSQElKAAASzklEQVRFPnUUqXDkV5LWagsDJCUtT7rmdmhE3CJpsZyQhwIrAMMiYoakA0nd\nLWMKC7aVK/mgs2b+fg3wALC2pAERMRh4DDiGVDHZpkTEZaQPR5MlrQrcQirY2kHSasAPSR+wK1pN\n6y6+CspdNb8F3pD0SkScQmoRXEgagLobqYx1K1K/9z6V+iRSK/L1prtJ/fqnAv9HGoj7DUDAipIO\nJ12M/hFpdeWVgJeKibiqppH+4b+QtBhwUq5cnAJMAi5VGry8PrCbW07zr0GR0srA3ZJ+FxF/lXQj\nKRkdImmRiDg7f7j8otCgqyQX3fSIiCck7QrcBvye1HrvRaqo3QvYm9S4ObLS1z1dxVchkvqS3oAH\nksrHTwD6R8RUSR1In4pnRsQB+fxlcl93myDpFuCpiPi9pINI11U+JnWtdAT+SOpmuBTYLiLeKCjU\nqsmty+NJBSHrAP8kfYB5gTQW7L+kUfvtImJCUXG2Vg2S02p56EJfUsn0oIi4Kh+7E3gFOD0iPiou\n4uqS9E1SMvovaTD4HhHxQb7GdCjQKyLezcNAvoiISZWOyS2oCshjUYYDP8lVZxuTxqqcl6+xHJ67\naW6SNCwi9iON+1nolbxJnAn8OBePnEDq034d2Jz0JtyRVF6+W1tITjCndflnUtfSysCtETEN5pSZ\nP7cwX5ustJLk9AtgF0m7R8TdkmYDAyUtRaoi7UAqzmkTyan+fzIi3pd0EXA28MecnNpHxOlKA3Hf\nzN3tVWu5uwVVIZJ+RGoVHEhqDTwGXAbcALweEXtJWgJYZmHv1muM0mj0q0kDIY+LiD/n/YtHxOf5\n9jcj4v0Cw6wJkn5KGhO3h7v1vp5cLn0UsGNETMjX/SaRWqwDSdW1p0TEswWGWTUNWpV1QHdST8aN\nwBkR8aeScw8FHq7m0BcnqArK3QfDSeN3zs77liTNFLFHW/80LGkj0uS4u0XE+Dy2Z3b996LjK1q+\nJrAnaTDknhHxfMEhtToNhyVI+hlpUPPHwGqkMXb3ksbbzSY1tKYWEWuRJJ0A/Jg09vI1SRuQuphP\nBCYDO5CKd6qaMFzFV0GRpv/YHjhIUue8+6dAJ9Jo7LZuNKkSrXd
pUnJymuNj0rWQnZ2c5l+D1sGe\nudX+Hmmg6SGkv72jgc6k4oDP20pyktSx5HZvYHfS39lruVvvGVKlXn9SodKgIoZ4+BpUheWBbccB\nIyRdTKqC6V/p6pfWIJdM/xlY1Enpq/Kb5Z1Fx9FalSSnY4GDgGci4h6lyXWnR8S03BX/PWChHdrR\nkKRvk2YjOSX/380CXo6Ij/PQhi9yknpW0lak/89CrpG7i69K8gDMm4ANIsLjV8yqIBfhDAG2jYhP\nlWbDn0qqjNydVKDzs2hb0xd1Iq0jtgrwPqkoZAipe+/tfM6+wLLARRExq7BYnaCqp7QAwMxaXiPX\nnDYgrSt2J6kQ4tukN+a9SZWzn9W/KS/slNZYOy0ifpK3B5MmfD2ANOXaLqSxT4vlfbtExIsFhQv4\nGlRVOTmZVU7DijRJqwPPAneQliUZFhE/BP4ObBYRL7WV5JS9AcyWdE3e/i1ptpYhpLXXziEtCros\n8OOikxO4BWVmCxlJRwL7AI+QFv7crGQ82d6kAfS7RMQrxUVZPQ0S91akYS/PRMShkpYhvR7dgJMi\n4r2GrdAiuQVlZgsNSZuSphDbnrRw3kfk+eIkbUKqSNujrSQnmKtY5ETSeLo7SBPjXp9nrzmTVCTy\nB6WZy2tmORu3oMxsoSCpO6l7ahNSEcD2wE65Wm/HiLgjz63XJmaIKJULI24Cjo2I/+ZhL38BPomI\nQ3JLqmOk1XJrhltQZtYq5bkL628fTFopIID/JQ0q3S4npwOAI9pScip9bep3kVbuXiNvf0JajXkH\nSZdExORaS07gcVBm1kqVdF3tDawLXBwRY3KJ9LWSfkl6U/4RqYS6zSSnktdmW9Jigh+QqhkvkjQp\nIh5XmrT6EuDK4qJtnrv4zKxVyRf6vwvMioiLJQ0D+pDm1xudz9mIlJimATdFxMtFxVsUSccAe5Am\nXz6INF3RtsDppCnY+gLbVHNuvfnlBGVmrUae3/Jc0nyW6wCfRsQBuXS6A7B3pBVg25wGLae1SKsB\n1C8nsg5pzstZeSaJjsDkiHizsIDL4ARlZq1CXrbmOlJ33UOSugFXAcdHxIuS7iBVo+0fERVd6bXW\nNEhOPUnzfR5KminiB8Dukdai+ynwUC1eb2qMiyTMrDWZBHRTWvF2Aml5jOUBImJH0swIfykwvqpr\nkJz2Ji0b8ippVeqjIuJHOTkdCBxBmrW9VXALysxqXv2bsKQ+pMUtLwPWI01dtGvpZMOSVm5jM0QA\nIGkAsCnwf3mi1/WA35HmHhwD7ExqfbaaeQedoMys5jUyG8LpwOLA1hExWVI7oF1buv4k6Yeka0xv\nk+bQ60uqytsmIh7Ig247kVpN44EnWluxiBOUmdWkPLnp9Ih4KW+XJqlepKXJLwHuLmo5iKJI2p70\n899BStSrk2Zn/yVwPFAXEW8UFmAL8TgoM6tVhwE9JP0qT+wa9QNQI+IJSacD55Oq94YVGWg1SdqQ\nNOHtNhExStKqpGTVKyL+IGlR4GFJW0XEqwWG+rW5SMLMakruriMijiZV5Z0kac28b06XT0Q8SFoR\n95ECwizSf4HXgZ8A5JZSR1KBCBFxOqna8U5JizQyq0Sr4S4+M6tJefqiHUgr3r4PHFZ/DaWWZtyu\nply9ODPPnXcvMIJUBLE+qVhkRsm5y0XEhwWF2iKcoMys5uSZxy8DNo2IKZIuIrUQToyIscVGV4yS\nSsb2ecDtMqRZIlaPiFXyOYsAsyNi9sKQxN3FZ2aFa6Qb6kPgFdKs5ETEkaSZyq+V9K0qh1coSVtI\n2rg+2eTk1D4vlfFjYLyks/KxmfUl9609OYETlJkVrEF13uK5ZfAO8DmwoaRl86nXkGbh/riYSAuz\nIXCzpLr6HSVJ6lPSsiK7STq7sAgrxF18ZlYTJB1PmpanJ3AKsBywC2mpckjrPO0dEa8XEmCB8irB\nxwD7RcSTJfvru/uWBrouDKXlpZygzKxwkn5EWsdpJ1IiOpI0xuklUgtibeCa+jFRC7vGrh9JOpY0\n6Hb/iBhZsr99RMyqdozV4HFQZlYLlgFG5/n1bpf0CfA30kwRfys2tOpq0OX5Q9Jrc29EDJI0HRgm\nad/6ltTCmpzA16DMrMqaGJfzOtBJUs9cSv0QcAtpqp42pSQ5HUNaKmMb4CFJm0XEJcAfgbskfb/A\nMKvCLSgzq5oGrYMBQA+gM3AGMAM4FhiTc9i2wFkFhVooSdsBuwJbkAYj/wg4VdIfIuLS3JJa6ItF\nfA3KzKqmZCzPfqQ5444C/gd4F7iUdP1pTaA7cEZEvFBYsFXU8JpTXo59BWBz4MCI2E7SX0lFJPtF\nxL8LCrWq3IIys4qTtDmweETcl3dtAlwUEY+RSqQvAH4bEbvl8ztGxLSCwq2qBq3Kb5EaDi8Db0o6\nAHgsn/owqbKxzVQx+hqUmVXDasClkrbJ26+SJoJdBubMu7doXiUXYHoBMVadpHYlyel40iSw9+WK\nPYAngC0kXUWq4DsuIt4rJtrqcwvKzCoqtxCuzteVzs/Xnm4DLgB2kvQ08B3SVEbTYeGYBaEc9bM+\nSPoBsBVQB3wXuELSDNIs7aeTBuOe1dpnJ59fvgZlZhUhqR/Qj7TE+DkRMV7S/qQ1i/YDBPwc6EYq\npT6uNa32+nVI+h7QNyIGSupJKgb5JrBjRHyWK/T+DFwXEX8sMtYiOUGZWYuTtC2pHHowaWn2zyPi\npHzsIFKBxEF5PaOlgUUjYmJhAVdRXq9pFWAy0C0iXshVewcD9wM3RsSkvCjjQGCXiJhUXMTFcYIy\nsxaVB5feCmwQEWMl7QHsCIwE7oqIVyX9jDTG54CIeKDAcKtK0k7A94HfA0sAlwNvR8Tx+Vhf4FlS\nkprYlopFGuMiCTNraR+SliFfI2//GphCmmPvn5LWiogrSa2oNwqJsAC5Vfl74PGImJWXqT8dWEbS\n2RFxOzCcVFr+47xwY5soFmmKW1Bm1uIkbURaUG8WcEREXJ/3n0MqhjigrRRCwJxW5W3AhhHx37xM\n+0YR8Q9J65EGKI+PiFMk9SVN+9RmqvWa4haUmbW4PE/cD4D2wKIlh94G2uL1lA9J0zatmltG15KK\nQwD+A5wPfEvSaRFxt5NT4haUmVVMSUtqAPAB6aL/gRHxfKGBFaBBq/KoiLiuZGaNdqRS+0kRMb7Q\nQGuIx0GZWcVExJP52stIYALQJyJeLDisQuTX4gekGSHqtZNUPyP5mIJCq1luQZlZxUlaG5iVp/Bp\n00paUidHxJCi46llTlBmZlWWB+I+CRwSEX8tOp5a5QRlZlYASRuQBjC3+VZlU5ygzMysJrnM3MzM\napITlJmZ1SQnKDMzq0lOUGZmVpOcoMzMrCY5QZmZWU1ygjIzs5rkBGVmZjXJCcrMzGqSE5SZmdUk\nJygzM6tJTlBmZlaTnKDMzKwmOUGZmVlNcoIyM7Oa5ARlZmY1yQnKapqk5SVdJ+lVSU9JGi7pWwv4\nWMdJWryZ45dJWjvf/n
Q+H3t9STuUbP9Y0kkLEmcjj72WpNGSnpG0+gLcv9mfuxKUPCBp6QW8/4GS\nVizZfkPSco2ct6Ok332dWK12OUFZzZIk4GbgwYhYPSK+D5wMfHMBH/I4oNE3akntI+LQiHhhAR97\nfWBOgoqI2yLi7AV8rIZ2AW6IiA0i4tUFuH+TP3dTJC2yAM9Tagfg2Yj4ZAHvfyCw4rxOAu4Edqp2\nArbqcIKyWrYVMCMihtTviIhnI+KR/Al9oKTnJf1H0p4AkvpIelDSDZJeknRNPvcY0hvevyT9K5/7\nqaTzJD0LbJrvV1f/XJLOlzRG0v2SuuV9c86RtFz+ZN8B+B2wZ27p7JlbABfm84ZKGizpMUmvSdo9\n728n6eIc5325dbh76QuQW2XHAT8viXs/SSPzc/1ZUvu8/xJJo3LMp+d9jf7cJY+/u6ShJXEOkfQE\ncK6kJSRdkZ/rGUk75/PWKXn+5ySt2cjvbl/g1nz+qiW/ixfz72bxfOw3kp7Mv8dL8+9qd6AOuCY/\nR6f8mEdLejr/vtfKfw8BPAjsOK8/Jmt9nKCsln0XeKqJY7uRWi3rAdsAAyWtkI9tQHpTXxtYDdg8\nIgYD7wJbRcRW+bwlgCciYr2IGNHg8ZcARkXEOsBDwGlNBRkR04HfAH+PiPUj4u+NnLYCsAXpjbS+\nZbUbsGqOc39g00YeezgwBDg/IraS9B1gz/wzrQ/MIiUDgFMiog5YF9hS0rpN/NzNWQnYLCKOB04B\nHoiIjUkfFgZKWgIYAAzKz18HvNPI42zO3L+7bwMXR8R3gE+AI/L+CyNio4j4LtAJ2DEibgBGAfvm\n13NqPvfDiNgQuAT4n5LHHgX0LuNns1bGCcpaqy2Av0XErIh4n5RENsrHRkbEOxExGxhNSgKNmQXc\n2MSx2UB9ohmWn+/ruCUiZucuxPouyi2Af+T97wH/KuNxtga+DzwpaXTeXi0f20PS08AzwDqkxDe/\n/hERs/Lt7YCT8vM8CCwG9AD+Dfxa0q+AVUoSSKmuETGlZPvtiHg03y59PbeS9ISk/wA/zHE35ab8\n/Snm/p1+QHndgdbKfN1+ZrNKGgPsPs+zvmpaye1ZNP13/kXJm/G8RP4+ky8/2C22gDFpPu7XkIAr\nI+LkuXZKPUmtio0i4qPcbddUfFFyu+E5nzV4rp9ExMsNznkxdwP+CBgu6fCIeKDBOTMltcsfEho+\nJ0BIWgy4GKiLiLcl/baZmOHL17Dh73QxoLEkaa2cW1BWyx4AOkrqX79D0rqSegOPkK75tM/Xh34A\njJzH400BlirzudvxZXLcB6jvAnyD1IKBuZPn/Dx2vUeBn+RrUd8E+pRxn/uB3SV9A0BSV0mrAEuT\nksvk/Fj9montfUnfkdQO2LWZ57qHdN1H+bk2yN9XA17L3Ye3kroUG3qZL1t2AD0k1Xdh1r+e9cno\nQ0lLsuCv57eA58s811oRJyirWfkC+K7ANkpl5mOAs4D3SNV9zwHPkhLZL3M3WXMuBe6uLxaYh8+A\njSU9T+p6qi9l/iOpYOEZoLTs+V/A2vVFEuX9hNxIun7zAqnb62lgcnN3yF2EpwL3SnoOuA9YISKe\nJXXtvQRcS0p+9Rr+3CcBdwCPAeObebrfA4sCz+XX/vd5/x7A87nr77vAVY3c907mTrgvA0dKehHo\nAlwSER8DfyEll3uAJ0vOHwoMaVAk0ZSt8vPZQkbpPcDMiiBpyYj4VNKypBbg5mUk2pqXC1auioht\nJa0K3JELIVr6eb4JXBsRW7f0Y1vxfA3KrFh3SOoMdAB+vzAkJ4CIGC/pL1rAgbrzoQdwQoWfwwri\nFpSZmdUkX4MyM7Oa5ARlZmY1yQnKzMxqkhOUmZnVJCcoMzOrSf8P1j9Njs3eX+QAAAAASUVORK5C\nYII=\n", 22 | "text/plain": [ 23 | "" 24 | ] 25 | }, 26 | "metadata": {}, 27 | "output_type": "display_data" 28 | } 29 | ], 30 | "source": [ 31 | "a = ['Bias', 'Age', 'Sex', 'Blood Pressure']\n", 32 | "b = [0.3, 0.6, -0.1, -0.2]\n", 33 | "\n", 34 | "plot = qc.waterfall(a,b, Title= 'Patient A', y_lab= 'Predicted probability', x_lab= 'Contributing features (path)',\n", 35 | " net_label = 'Final Prediction')\n", 36 | "plot.show()" 37 | ] 38 | } 39 | ], 40 | "metadata": { 41 | "kernelspec": { 42 | "display_name": "Python 3", 43 | "language": "python", 44 | "name": "python3" 45 | }, 46 | "language_info": { 47 | "codemirror_mode": { 48 | "name": "ipython", 49 | "version": 3 50 | }, 51 | "file_extension": ".py", 52 | "mimetype": "text/x-python", 53 | "name": "python", 54 | "nbconvert_exporter": "python", 55 | "pygments_lexer": "ipython3", 56 | "version": "3.6.2" 57 | } 58 | }, 59 | "nbformat": 4, 60 | "nbformat_minor": 2 61 | } 62 | -------------------------------------------------------------------------------- /images/Expo_Loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/groverpr/Machine-Learning/0f27de263c2f27d1b8024a2d619e212f457f75a4/images/Expo_Loss.png -------------------------------------------------------------------------------- /images/Huber_Loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/groverpr/Machine-Learning/0f27de263c2f27d1b8024a2d619e212f457f75a4/images/Huber_Loss.png 
-------------------------------------------------------------------------------- /images/Logcosh_Loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/groverpr/Machine-Learning/0f27de263c2f27d1b8024a2d619e212f457f75a4/images/Logcosh_Loss.png -------------------------------------------------------------------------------- /images/MAE_Loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/groverpr/Machine-Learning/0f27de263c2f27d1b8024a2d619e212f457f75a4/images/MAE_Loss.png -------------------------------------------------------------------------------- /images/MSE_Loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/groverpr/Machine-Learning/0f27de263c2f27d1b8024a2d619e212f457f75a4/images/MSE_Loss.png -------------------------------------------------------------------------------- /images/Quantile_Loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/groverpr/Machine-Learning/0f27de263c2f27d1b8024a2d619e212f457f75a4/images/Quantile_Loss.png -------------------------------------------------------------------------------- /images/all_regression.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/groverpr/Machine-Learning/0f27de263c2f27d1b8024a2d619e212f457f75a4/images/all_regression.png -------------------------------------------------------------------------------- /images/huber.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/groverpr/Machine-Learning/0f27de263c2f27d1b8024a2d619e212f457f75a4/images/huber.png -------------------------------------------------------------------------------- /images/mse.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/groverpr/Machine-Learning/0f27de263c2f27d1b8024a2d619e212f457f75a4/images/mse.png -------------------------------------------------------------------------------- /images/roc_segments.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/groverpr/Machine-Learning/0f27de263c2f27d1b8024a2d619e212f457f75a4/images/roc_segments.png -------------------------------------------------------------------------------- /images/tileshop.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/groverpr/Machine-Learning/0f27de263c2f27d1b8024a2d619e212f457f75a4/images/tileshop.jpeg -------------------------------------------------------------------------------- /ml_slides/Poster_DeepOdds.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/groverpr/Machine-Learning/0f27de263c2f27d1b8024a2d619e212f457f75a4/ml_slides/Poster_DeepOdds.pdf -------------------------------------------------------------------------------- /ml_slides/Semi-Supervised Learning.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/groverpr/Machine-Learning/0f27de263c2f27d1b8024a2d619e212f457f75a4/ml_slides/Semi-Supervised Learning.pptx -------------------------------------------------------------------------------- 
/notebooks/02_Collaborative_Filtering.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Recommendation engine using collaborating filtering on Movielens" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "import torch" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 2, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "%reload_ext autoreload\n", 26 | "%autoreload 2\n", 27 | "%matplotlib inline\n", 28 | "\n", 29 | "from fastai.learner import *\n", 30 | "from fastai.column_data import *\n", 31 | "from fastai.imports import *" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 3, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "path = '.'" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 4, 46 | "metadata": {}, 47 | "outputs": [ 48 | { 49 | "name": "stdout", 50 | "output_type": "stream", 51 | "text": [ 52 | "collaborating filter.ipynb ml-latest-small.zip movielens.ipynb tmp\r\n", 53 | "ml-latest-small\t\t models\t\t ratings_small.csv\r\n" 54 | ] 55 | } 56 | ], 57 | "source": [ 58 | "! ls ." 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 5, 64 | "metadata": {}, 65 | "outputs": [], 66 | "source": [ 67 | "ratings = pd.read_csv('ratings_small.csv')" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": 6, 73 | "metadata": {}, 74 | "outputs": [ 75 | { 76 | "data": { 77 | "text/html": [ 78 | "
\n", 79 | "\n", 92 | "\n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | "
userIdmovieIdratingtimestamp
01312.51260759144
1110293.01260759179
2110613.01260759182
3111292.01260759185
4111724.01260759205
\n", 140 | "
" 141 | ], 142 | "text/plain": [ 143 | " userId movieId rating timestamp\n", 144 | "0 1 31 2.5 1260759144\n", 145 | "1 1 1029 3.0 1260759179\n", 146 | "2 1 1061 3.0 1260759182\n", 147 | "3 1 1129 2.0 1260759185\n", 148 | "4 1 1172 4.0 1260759205" 149 | ] 150 | }, 151 | "execution_count": 6, 152 | "metadata": {}, 153 | "output_type": "execute_result" 154 | } 155 | ], 156 | "source": [ 157 | "ratings.head()" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 7, 163 | "metadata": {}, 164 | "outputs": [ 165 | { 166 | "data": { 167 | "text/plain": [ 168 | "(100004, 4)" 169 | ] 170 | }, 171 | "execution_count": 7, 172 | "metadata": {}, 173 | "output_type": "execute_result" 174 | } 175 | ], 176 | "source": [ 177 | "ratings.shape" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": {}, 183 | "source": [ 184 | " There are no NAs" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": 8, 190 | "metadata": {}, 191 | "outputs": [], 192 | "source": [ 193 | "n_users=int(ratings.userId.nunique())\n", 194 | "n_movies=int(ratings.movieId.nunique())" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": 9, 200 | "metadata": {}, 201 | "outputs": [ 202 | { 203 | "name": "stdout", 204 | "output_type": "stream", 205 | "text": [ 206 | "n_users = 671 || n_movies = 9066\n" 207 | ] 208 | } 209 | ], 210 | "source": [ 211 | "print(\"n_users = \",n_users, \"||\", \"n_movies = \", n_movies )" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": {}, 217 | "source": [ 218 | "Let's create a cross-tab for better visualization of user ids and item ids." 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": 10, 224 | "metadata": {}, 225 | "outputs": [], 226 | "source": [ 227 | "g = ratings.groupby('userId')['rating'].count()\n", 228 | "topg = g.sort_values(ascending = False)[:15]\n", 229 | "\n", 230 | "i = ratings.groupby('movieId')['rating'].count()\n", 231 | "topi = i.sort_values(ascending = False)[:15]" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": 11, 237 | "metadata": {}, 238 | "outputs": [ 239 | { 240 | "data": { 241 | "text/html": [ 242 | "
\n", 243 | "\n", 256 | "\n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 
528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | "
movieId11102602963183564805275895936081196119812702571
userId
152.03.05.05.02.01.03.04.04.05.05.05.04.05.05.0
304.05.04.05.05.05.04.05.04.04.05.04.05.05.03.0
735.04.04.55.05.05.04.05.03.04.54.05.05.05.04.5
2123.05.04.04.04.54.03.05.03.04.0NaNNaN3.03.05.0
2133.02.55.0NaNNaN2.05.0NaN4.02.52.05.03.03.04.0
2944.03.04.0NaN3.04.04.04.03.0NaNNaN4.04.54.04.5
3113.03.04.03.04.55.04.55.04.52.04.03.04.54.54.0
3804.05.04.05.04.05.04.0NaN4.05.04.04.0NaN3.05.0
4523.54.04.05.05.04.05.04.04.05.05.04.04.04.02.0
4684.03.03.53.53.53.02.5NaNNaN3.04.03.03.53.03.0
5093.05.05.05.04.04.03.05.02.04.04.55.05.03.04.5
5473.5NaNNaN5.05.02.03.05.0NaN5.05.02.52.03.53.5
5644.01.02.05.0NaN3.05.04.05.05.05.05.05.03.03.0
5804.04.54.04.54.03.53.04.04.54.04.54.03.53.04.5
6245.0NaN5.05.0NaN3.03.0NaN3.05.04.05.05.05.02.0
\n", 568 | "
" 569 | ], 570 | "text/plain": [ 571 | "movieId 1 110 260 296 318 356 480 527 589 593 608 \\\n", 572 | "userId \n", 573 | "15 2.0 3.0 5.0 5.0 2.0 1.0 3.0 4.0 4.0 5.0 5.0 \n", 574 | "30 4.0 5.0 4.0 5.0 5.0 5.0 4.0 5.0 4.0 4.0 5.0 \n", 575 | "73 5.0 4.0 4.5 5.0 5.0 5.0 4.0 5.0 3.0 4.5 4.0 \n", 576 | "212 3.0 5.0 4.0 4.0 4.5 4.0 3.0 5.0 3.0 4.0 NaN \n", 577 | "213 3.0 2.5 5.0 NaN NaN 2.0 5.0 NaN 4.0 2.5 2.0 \n", 578 | "294 4.0 3.0 4.0 NaN 3.0 4.0 4.0 4.0 3.0 NaN NaN \n", 579 | "311 3.0 3.0 4.0 3.0 4.5 5.0 4.5 5.0 4.5 2.0 4.0 \n", 580 | "380 4.0 5.0 4.0 5.0 4.0 5.0 4.0 NaN 4.0 5.0 4.0 \n", 581 | "452 3.5 4.0 4.0 5.0 5.0 4.0 5.0 4.0 4.0 5.0 5.0 \n", 582 | "468 4.0 3.0 3.5 3.5 3.5 3.0 2.5 NaN NaN 3.0 4.0 \n", 583 | "509 3.0 5.0 5.0 5.0 4.0 4.0 3.0 5.0 2.0 4.0 4.5 \n", 584 | "547 3.5 NaN NaN 5.0 5.0 2.0 3.0 5.0 NaN 5.0 5.0 \n", 585 | "564 4.0 1.0 2.0 5.0 NaN 3.0 5.0 4.0 5.0 5.0 5.0 \n", 586 | "580 4.0 4.5 4.0 4.5 4.0 3.5 3.0 4.0 4.5 4.0 4.5 \n", 587 | "624 5.0 NaN 5.0 5.0 NaN 3.0 3.0 NaN 3.0 5.0 4.0 \n", 588 | "\n", 589 | "movieId 1196 1198 1270 2571 \n", 590 | "userId \n", 591 | "15 5.0 4.0 5.0 5.0 \n", 592 | "30 4.0 5.0 5.0 3.0 \n", 593 | "73 5.0 5.0 5.0 4.5 \n", 594 | "212 NaN 3.0 3.0 5.0 \n", 595 | "213 5.0 3.0 3.0 4.0 \n", 596 | "294 4.0 4.5 4.0 4.5 \n", 597 | "311 3.0 4.5 4.5 4.0 \n", 598 | "380 4.0 NaN 3.0 5.0 \n", 599 | "452 4.0 4.0 4.0 2.0 \n", 600 | "468 3.0 3.5 3.0 3.0 \n", 601 | "509 5.0 5.0 3.0 4.5 \n", 602 | "547 2.5 2.0 3.5 3.5 \n", 603 | "564 5.0 5.0 3.0 3.0 \n", 604 | "580 4.0 3.5 3.0 4.5 \n", 605 | "624 5.0 5.0 5.0 2.0 " 606 | ] 607 | }, 608 | "execution_count": 11, 609 | "metadata": {}, 610 | "output_type": "execute_result" 611 | } 612 | ], 613 | "source": [ 614 | "# gettings ratings of top users and top items\n", 615 | "\n", 616 | "join1 = ratings.join(topg, on='userId', how = 'inner', rsuffix='_r')\n", 617 | "join1 = join1.join(topi, on='movieId', how = 'inner', rsuffix = '_r')\n", 618 | "\n", 619 | "pd.crosstab(join1.userId, join1.movieId, join1.rating, aggfunc=np.sum)" 620 | ] 621 | }, 622 | { 623 | "cell_type": "markdown", 624 | "metadata": {}, 625 | "source": [ 626 | "### Collaborative filtering" 627 | ] 628 | }, 629 | { 630 | "cell_type": "code", 631 | "execution_count": 10, 632 | "metadata": {}, 633 | "outputs": [], 634 | "source": [ 635 | "val_indx = get_cv_idxs(len(ratings)) # index for validation set\n", 636 | "wd = 2e-4 # weight decay\n", 637 | "n_factors = 50 # n_factors" 638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "execution_count": 11, 643 | "metadata": {}, 644 | "outputs": [], 645 | "source": [ 646 | "# data loader\n", 647 | "cf = CollabFilterDataset.from_csv(path, 'ratings_small.csv', 'userId', 'movieId', 'rating')" 648 | ] 649 | }, 650 | { 651 | "cell_type": "code", 652 | "execution_count": 12, 653 | "metadata": {}, 654 | "outputs": [], 655 | "source": [ 656 | "learn = cf.get_learner(n_factors, val_indx, bs=64, opt_fn=optim.Adam)" 657 | ] 658 | }, 659 | { 660 | "cell_type": "code", 661 | "execution_count": 13, 662 | "metadata": {}, 663 | "outputs": [ 664 | { 665 | "data": { 666 | "application/vnd.jupyter.widget-view+json": { 667 | "model_id": "065d1fd5a51945359526689db634fd06", 668 | "version_major": 2, 669 | "version_minor": 0 670 | }, 671 | "text/html": [ 672 | "

Failed to display Jupyter Widget of type HBox.

\n", 673 | "

\n", 674 | " If you're reading this message in Jupyter Notebook or JupyterLab, it may mean\n", 675 | " that the widgets JavaScript is still loading. If this message persists, it\n", 676 | " likely means that the widgets JavaScript library is either not installed or\n", 677 | " not enabled. See the Jupyter\n", 678 | " Widgets Documentation for setup instructions.\n", 679 | "

\n", 680 | "

\n", 681 | " If you're reading this message in another notebook frontend (for example, a static\n", 682 | " rendering on GitHub or NBViewer),\n", 683 | " it may mean that your frontend doesn't currently support widgets.\n", 684 | "

\n" 685 | ], 686 | "text/plain": [ 687 | "HBox(children=(IntProgress(value=0, description='Epoch', max=3), HTML(value='')))" 688 | ] 689 | }, 690 | "metadata": {}, 691 | "output_type": "display_data" 692 | }, 693 | { 694 | "name": "stdout", 695 | "output_type": "stream", 696 | "text": [ 697 | "[ 0. 0.7727 0.80396] \n", 698 | "[ 1. 0.77782 0.77585] \n", 699 | "[ 2. 0.58389 0.76542] \n", 700 | "\n" 701 | ] 702 | } 703 | ], 704 | "source": [ 705 | "learn.fit(1e-2,2, wds = wd, cycle_len=1, cycle_mult=2)" 706 | ] 707 | }, 708 | { 709 | "cell_type": "markdown", 710 | "metadata": {}, 711 | "source": [ 712 | "We got .76" 713 | ] 714 | }, 715 | { 716 | "cell_type": "markdown", 717 | "metadata": {}, 718 | "source": [ 719 | "### Collaborating filter from scratch" 720 | ] 721 | }, 722 | { 723 | "cell_type": "code", 724 | "execution_count": 14, 725 | "metadata": {}, 726 | "outputs": [], 727 | "source": [ 728 | "u_uniq = ratings.userId.unique()\n", 729 | "user2idx = {o:i for i,o in enumerate(u_uniq)}\n", 730 | "ratings.userId = ratings.userId.apply(lambda x: user2idx[x])\n", 731 | "\n", 732 | "m_uniq = ratings.movieId.unique()\n", 733 | "movie2idx = {o:i for i,o in enumerate(m_uniq)}\n", 734 | "ratings.movieId = ratings.movieId.apply(lambda x: movie2idx[x])\n" 735 | ] 736 | }, 737 | { 738 | "cell_type": "code", 739 | "execution_count": 15, 740 | "metadata": {}, 741 | "outputs": [ 742 | { 743 | "data": { 744 | "text/plain": [ 745 | "(671, 9066)" 746 | ] 747 | }, 748 | "execution_count": 15, 749 | "metadata": {}, 750 | "output_type": "execute_result" 751 | } 752 | ], 753 | "source": [ 754 | "n_users, n_movies" 755 | ] 756 | }, 757 | { 758 | "cell_type": "markdown", 759 | "metadata": {}, 760 | "source": [ 761 | "`nn.Embedding` creates a lookup table that stores embeddings of a fixed dictionary and size. So word embeddings once stored can be retrieved using indices. After making `embeddings`, we get free `u.weights` which are correspondings weights of ebeddings" 762 | ] 763 | }, 764 | { 765 | "cell_type": "code", 766 | "execution_count": 16, 767 | "metadata": {}, 768 | "outputs": [], 769 | "source": [ 770 | "val_indx = get_cv_idxs(len(ratings)) # index for validation set\n", 771 | "wd = 2e-4 # weight decay\n", 772 | "n_factors = 50 # n_factors i.e. 
1 dimension of embeddings (random)" 773 | ] 774 | }, 775 | { 776 | "cell_type": "code", 777 | "execution_count": 17, 778 | "metadata": {}, 779 | "outputs": [ 780 | { 781 | "data": { 782 | "text/plain": [ 783 | "(0.5, 5.0)" 784 | ] 785 | }, 786 | "execution_count": 17, 787 | "metadata": {}, 788 | "output_type": "execute_result" 789 | } 790 | ], 791 | "source": [ 792 | "min_rating,max_rating = ratings.rating.min(),ratings.rating.max()\n", 793 | "min_rating,max_rating" 794 | ] 795 | }, 796 | { 797 | "cell_type": "code", 798 | "execution_count": 18, 799 | "metadata": {}, 800 | "outputs": [], 801 | "source": [ 802 | "def get_emb(ni,nf):\n", 803 | " e = nn.Embedding(ni, nf)\n", 804 | " e.weight.data.uniform_(-0.01,0.01)\n", 805 | " #e.weight.data.normal_(0,0.003)\n", 806 | "\n", 807 | " return e" 808 | ] 809 | }, 810 | { 811 | "cell_type": "code", 812 | "execution_count": 19, 813 | "metadata": {}, 814 | "outputs": [], 815 | "source": [ 816 | "x = ratings.drop(['rating'],axis=1)\n", 817 | "y = ratings['rating'].astype(np.float32)\n", 818 | "\n", 819 | "data = ColumnarModelData.from_data_frame(path, val_indx, x, y, ['userId', 'movieId'], 64)" 820 | ] 821 | }, 822 | { 823 | "cell_type": "code", 824 | "execution_count": 20, 825 | "metadata": {}, 826 | "outputs": [], 827 | "source": [ 828 | "# nh = dimension of hidden linear layer\n", 829 | "# p1 = dropout1\n", 830 | "# p2 = dropout2\n", 831 | "\n", 832 | "class EmbeddingNet(nn.Module):\n", 833 | " def __init__(self, n_users, _n_movies, nh = 10, p1 = 0.05, p2= 0.5):\n", 834 | " super().__init__()\n", 835 | " (self.u, self.m, self.ub, self.mb) = [get_emb(*o) for o in [\n", 836 | " (n_users, n_factors), (n_movies, n_factors),\n", 837 | " (n_users,1), (n_movies,1)\n", 838 | " ]]\n", 839 | " \n", 840 | " self.lin1 = nn.Linear(n_factors*2, nh) # bias is True by default\n", 841 | " self.lin2 = nn.Linear(nh, 1)\n", 842 | " self.drop1 = nn.Dropout(p = p1)\n", 843 | " self.drop2 = nn.Dropout(p = p2)\n", 844 | " \n", 845 | " def forward(self, cats, conts): # forward pass i.e. dot product of vector from movie embedding matrixx\n", 846 | " # and vector from user embeddings matrix\n", 847 | " \n", 848 | " # torch.cat : concatenates both embedding matrix to make more columns, same rows i.e. n_factors*2, n : rows\n", 849 | " # u(users) is doing lookup for indexed mentioned in users\n", 850 | " # users has indexes to lookup in embedding matrix. 
\n", 851 | " \n", 852 | " users,movies = cats[:,0],cats[:,1]\n", 853 | " u2,m2 = self.u(users) , self.m(movies)\n", 854 | " \n", 855 | " x = self.drop1(torch.cat([u2,m2], 1)) # drop initialized weights\n", 856 | " x = self.drop2(F.relu(self.lin1(x))) # drop 1st linear + nonlinear wt\n", 857 | " r = F.sigmoid(self.lin2(x)) * (max_rating - min_rating) + min_rating \n", 858 | " return r" 859 | ] 860 | }, 861 | { 862 | "cell_type": "code", 863 | "execution_count": 24, 864 | "metadata": {}, 865 | "outputs": [], 866 | "source": [ 867 | "wd=1e-5\n", 868 | "model = EmbeddingNet(n_users, n_movies)\n", 869 | "model = model.cuda()\n", 870 | "opt = optim.Adam(model.parameters(), 1e-3, weight_decay=wd) # got parameter() for free , lr = 1e-3" 871 | ] 872 | }, 873 | { 874 | "cell_type": "code", 875 | "execution_count": 25, 876 | "metadata": {}, 877 | "outputs": [ 878 | { 879 | "data": { 880 | "text/plain": [ 881 | "EmbeddingNet (\n", 882 | " (u): Embedding(671, 50)\n", 883 | " (m): Embedding(9066, 50)\n", 884 | " (ub): Embedding(671, 1)\n", 885 | " (mb): Embedding(9066, 1)\n", 886 | " (lin1): Linear (100 -> 10)\n", 887 | " (lin2): Linear (10 -> 1)\n", 888 | " (drop1): Dropout (p = 0.05)\n", 889 | " (drop2): Dropout (p = 0.5)\n", 890 | ")" 891 | ] 892 | }, 893 | "execution_count": 25, 894 | "metadata": {}, 895 | "output_type": "execute_result" 896 | } 897 | ], 898 | "source": [ 899 | "model" 900 | ] 901 | }, 902 | { 903 | "cell_type": "code", 904 | "execution_count": 28, 905 | "metadata": { 906 | "scrolled": true 907 | }, 908 | "outputs": [ 909 | { 910 | "data": { 911 | "application/vnd.jupyter.widget-view+json": { 912 | "model_id": "2f45cf8456bc4101a27367f9c26d9f6a", 913 | "version_major": 2, 914 | "version_minor": 0 915 | }, 916 | "text/html": [ 917 | "

\n" 930 | ], 931 | "text/plain": [ 932 | "HBox(children=(IntProgress(value=0, description='Epoch', max=3), HTML(value='')))" 933 | ] 934 | }, 935 | "metadata": {}, 936 | "output_type": "display_data" 937 | }, 938 | { 939 | "name": "stdout", 940 | "output_type": "stream", 941 | "text": [ 942 | "[ 0. 0.74293 0.79247] \n", 943 | "[ 1. 0.74748 0.79483] \n", 944 | "[ 2. 0.75364 0.79638] \n", 945 | "\n" 946 | ] 947 | } 948 | ], 949 | "source": [ 950 | "fit(model, data, 3, opt, F.mse_loss)" 951 | ] 952 | }, 953 | { 954 | "cell_type": "code", 955 | "execution_count": 27, 956 | "metadata": {}, 957 | "outputs": [ 958 | { 959 | "name": "stdout", 960 | "output_type": "stream", 961 | "text": [ 962 | " \r" 963 | ] 964 | } 965 | ], 966 | "source": [ 967 | "# from tqdm import tqdm as tqdm_cls\n", 968 | "\n", 969 | "# inst = tqdm_cls._instances\n", 970 | "# for i in range(len(inst)): inst.pop().close()" 971 | ] 972 | }, 973 | { 974 | "cell_type": "code", 975 | "execution_count": 31, 976 | "metadata": {}, 977 | "outputs": [], 978 | "source": [ 979 | "set_lrs(opt, 1e-3)" 980 | ] 981 | }, 982 | { 983 | "cell_type": "code", 984 | "execution_count": 24, 985 | "metadata": { 986 | "scrolled": true 987 | }, 988 | "outputs": [ 989 | { 990 | "data": { 991 | "application/vnd.jupyter.widget-view+json": { 992 | "model_id": "ea1cd58e12574699969fd0040c18f1c2", 993 | "version_major": 2, 994 | "version_minor": 0 995 | }, 996 | "text/html": [ 997 | "

\n" 1010 | ], 1011 | "text/plain": [ 1012 | "HBox(children=(IntProgress(value=0, description='Epoch', max=3), HTML(value='')))" 1013 | ] 1014 | }, 1015 | "metadata": {}, 1016 | "output_type": "display_data" 1017 | }, 1018 | { 1019 | "name": "stdout", 1020 | "output_type": "stream", 1021 | "text": [ 1022 | "[ 0. 0.79631 0.78994] \n", 1023 | "[ 1. 0.78677 0.79127] \n", 1024 | "[ 2. 0.7614 0.7906] \n", 1025 | "\n" 1026 | ] 1027 | } 1028 | ], 1029 | "source": [ 1030 | "fit(model, data, 3, opt, F.mse_loss)" 1031 | ] 1032 | }, 1033 | { 1034 | "cell_type": "markdown", 1035 | "metadata": {}, 1036 | "source": [ 1037 | "## Surprise package" 1038 | ] 1039 | }, 1040 | { 1041 | "cell_type": "code", 1042 | "execution_count": 12, 1043 | "metadata": {}, 1044 | "outputs": [ 1045 | { 1046 | "name": "stdout", 1047 | "output_type": "stream", 1048 | "text": [ 1049 | "collaborating filter.ipynb ml-latest-small.zip movielens.ipynb tmp\r\n", 1050 | "ml-latest-small\t\t models\t\t ratings_small.csv\r\n" 1051 | ] 1052 | } 1053 | ], 1054 | "source": [ 1055 | "! ls ." 1056 | ] 1057 | }, 1058 | { 1059 | "cell_type": "code", 1060 | "execution_count": 13, 1061 | "metadata": {}, 1062 | "outputs": [ 1063 | { 1064 | "name": "stdout", 1065 | "output_type": "stream", 1066 | "text": [ 1067 | "196\t242\t3\t881250949\r\n", 1068 | "186\t302\t3\t891717742\r\n", 1069 | "22\t377\t1\t878887116\r\n", 1070 | "244\t51\t2\t880606923\r\n", 1071 | "166\t346\t1\t886397596\r\n" 1072 | ] 1073 | } 1074 | ], 1075 | "source": [ 1076 | "! head -5 '/home/ubuntu/.surprise_data/ml-100k/ml-100k/u.data'" 1077 | ] 1078 | }, 1079 | { 1080 | "cell_type": "code", 1081 | "execution_count": 14, 1082 | "metadata": {}, 1083 | "outputs": [ 1084 | { 1085 | "name": "stdout", 1086 | "output_type": "stream", 1087 | "text": [ 1088 | "userId,movieId,rating,timestamp\r", 1089 | "\r\n", 1090 | "1,31,2.5,1260759144\r", 1091 | "\r\n", 1092 | "1,1029,3.0,1260759179\r", 1093 | "\r\n", 1094 | "1,1061,3.0,1260759182\r", 1095 | "\r\n", 1096 | "1,1129,2.0,1260759185\r", 1097 | "\r\n" 1098 | ] 1099 | } 1100 | ], 1101 | "source": [ 1102 | "! head -5 'ratings_small.csv'" 1103 | ] 1104 | }, 1105 | { 1106 | "cell_type": "code", 1107 | "execution_count": 11, 1108 | "metadata": {}, 1109 | "outputs": [], 1110 | "source": [ 1111 | "from surprise import Reader, Dataset\n", 1112 | "# Define the format\n", 1113 | "\n", 1114 | "reader = Reader(line_format='user item rating timestamp', sep='\\t')\n", 1115 | "# Load the data from the file using the reader format\n", 1116 | "\n", 1117 | "data = Dataset.load_from_file('/home/ubuntu/.surprise_data/ml-100k/ml-100k/u.data', reader=reader)" 1118 | ] 1119 | }, 1120 | { 1121 | "cell_type": "code", 1122 | "execution_count": 12, 1123 | "metadata": {}, 1124 | "outputs": [ 1125 | { 1126 | "data": { 1127 | "text/html": [ 1128 | "
\n", 1129 | "\n", 1142 | "\n", 1143 | " \n", 1144 | " \n", 1145 | " \n", 1146 | " \n", 1147 | " \n", 1148 | " \n", 1149 | " \n", 1150 | " \n", 1151 | " \n", 1152 | " \n", 1153 | " \n", 1154 | " \n", 1155 | " \n", 1156 | " \n", 1157 | " \n", 1158 | " \n", 1159 | " \n", 1160 | " \n", 1161 | " \n", 1162 | " \n", 1163 | " \n", 1164 | " \n", 1165 | " \n", 1166 | " \n", 1167 | " \n", 1168 | "
userIdmovieIdratingtimestamp
01312.51260759144
1110293.01260759179
\n", 1169 | "
" 1170 | ], 1171 | "text/plain": [ 1172 | " userId movieId rating timestamp\n", 1173 | "0 1 31 2.5 1260759144\n", 1174 | "1 1 1029 3.0 1260759179" 1175 | ] 1176 | }, 1177 | "execution_count": 12, 1178 | "metadata": {}, 1179 | "output_type": "execute_result" 1180 | } 1181 | ], 1182 | "source": [ 1183 | "ratings[:2]" 1184 | ] 1185 | }, 1186 | { 1187 | "cell_type": "code", 1188 | "execution_count": 13, 1189 | "metadata": {}, 1190 | "outputs": [], 1191 | "source": [ 1192 | "ratings_dict = {'itemID': list(ratings.movieId),\n", 1193 | " 'userID': list(ratings.userId),\n", 1194 | " 'rating': list(ratings.rating)}\n", 1195 | "df = pd.DataFrame(ratings_dict)\n", 1196 | "\n", 1197 | "# A reader is still needed but only the rating_scale param is requiered.\n", 1198 | "reader = Reader(rating_scale=(0.5, 5.0))\n", 1199 | "# The columns must correspond to user id, item id and ratings (in that order).\n", 1200 | "data = Dataset.load_from_df(df[['userID', 'itemID', 'rating']], reader)" 1201 | ] 1202 | }, 1203 | { 1204 | "cell_type": "code", 1205 | "execution_count": 14, 1206 | "metadata": {}, 1207 | "outputs": [], 1208 | "source": [ 1209 | "# Split data into 5 folds\n", 1210 | "\n", 1211 | "data.split(n_folds=5)" 1212 | ] 1213 | }, 1214 | { 1215 | "cell_type": "code", 1216 | "execution_count": 15, 1217 | "metadata": {}, 1218 | "outputs": [], 1219 | "source": [ 1220 | "from surprise import SVD, evaluate" 1221 | ] 1222 | }, 1223 | { 1224 | "cell_type": "code", 1225 | "execution_count": 16, 1226 | "metadata": {}, 1227 | "outputs": [], 1228 | "source": [ 1229 | "from surprise import GridSearch" 1230 | ] 1231 | }, 1232 | { 1233 | "cell_type": "code", 1234 | "execution_count": 42, 1235 | "metadata": {}, 1236 | "outputs": [ 1237 | { 1238 | "name": "stdout", 1239 | "output_type": "stream", 1240 | "text": [ 1241 | "[{'lr_all': 0.002, 'reg_all': 0.4}, {'lr_all': 0.002, 'reg_all': 0.6}, {'lr_all': 0.005, 'reg_all': 0.4}, {'lr_all': 0.005, 'reg_all': 0.6}]\n", 1242 | "------------\n", 1243 | "Parameters combination 1 of 4\n", 1244 | "params: {'lr_all': 0.002, 'reg_all': 0.4}\n", 1245 | "------------\n", 1246 | "Mean RMSE: 0.9133\n", 1247 | "------------\n", 1248 | "------------\n", 1249 | "Parameters combination 2 of 4\n", 1250 | "params: {'lr_all': 0.002, 'reg_all': 0.6}\n", 1251 | "------------\n", 1252 | "Mean RMSE: 0.9214\n", 1253 | "------------\n", 1254 | "------------\n", 1255 | "Parameters combination 3 of 4\n", 1256 | "params: {'lr_all': 0.005, 'reg_all': 0.4}\n", 1257 | "------------\n", 1258 | "Mean RMSE: 0.9031\n", 1259 | "------------\n", 1260 | "------------\n", 1261 | "Parameters combination 4 of 4\n", 1262 | "params: {'lr_all': 0.005, 'reg_all': 0.6}\n", 1263 | "------------\n", 1264 | "Mean RMSE: 0.9121\n", 1265 | "------------\n" 1266 | ] 1267 | } 1268 | ], 1269 | "source": [ 1270 | "param_grid = {'lr_all': [0.002, 0.005],\n", 1271 | " 'reg_all': [0.4, 0.6]}\n", 1272 | "grid_search = GridSearch(SVD, param_grid, measures=['RMSE'])\n", 1273 | "grid_search.evaluate(data)" 1274 | ] 1275 | }, 1276 | { 1277 | "cell_type": "code", 1278 | "execution_count": 52, 1279 | "metadata": {}, 1280 | "outputs": [ 1281 | { 1282 | "name": "stdout", 1283 | "output_type": "stream", 1284 | "text": [ 1285 | "Evaluating RMSE of algorithm SVD.\n", 1286 | "\n", 1287 | "------------\n", 1288 | "Fold 1\n", 1289 | "RMSE: 0.8990\n", 1290 | "------------\n", 1291 | "Fold 2\n", 1292 | "RMSE: 0.8983\n", 1293 | "------------\n", 1294 | "Fold 3\n", 1295 | "RMSE: 0.8941\n", 1296 | "------------\n", 1297 | "Fold 4\n", 1298 | "RMSE: 
0.8962\n", 1299 | "------------\n", 1300 | "Fold 5\n", 1301 | "RMSE: 0.8962\n", 1302 | "------------\n", 1303 | "------------\n", 1304 | "Mean RMSE: 0.8967\n", 1305 | "------------\n", 1306 | "------------\n" 1307 | ] 1308 | }, 1309 | { 1310 | "data": { 1311 | "text/plain": [ 1312 | "CaseInsensitiveDefaultDict(list,\n", 1313 | " {'rmse': [0.89895181594737417,\n", 1314 | " 0.89831051013903251,\n", 1315 | " 0.89405859774725671,\n", 1316 | " 0.89621812893141306,\n", 1317 | " 0.89617318551492264]})" 1318 | ] 1319 | }, 1320 | "execution_count": 52, 1321 | "metadata": {}, 1322 | "output_type": "execute_result" 1323 | } 1324 | ], 1325 | "source": [ 1326 | "algo = SVD()\n", 1327 | "evaluate(algo, data, measures=['RMSE'])" 1328 | ] 1329 | }, 1330 | { 1331 | "cell_type": "markdown", 1332 | "metadata": {}, 1333 | "source": [ 1334 | "As noticed above best RMSE from SVD is still higher than default result from fast.ai's neural net version for collaborative filtering using embeddings" 1335 | ] 1336 | }, 1337 | { 1338 | "cell_type": "markdown", 1339 | "metadata": {}, 1340 | "source": [ 1341 | "### Let's try `KNN` algorithm also. " 1342 | ] 1343 | }, 1344 | { 1345 | "cell_type": "code", 1346 | "execution_count": 10, 1347 | "metadata": {}, 1348 | "outputs": [], 1349 | "source": [ 1350 | "from surprise import KNNBasic" 1351 | ] 1352 | }, 1353 | { 1354 | "cell_type": "code", 1355 | "execution_count": 25, 1356 | "metadata": {}, 1357 | "outputs": [ 1358 | { 1359 | "name": "stdout", 1360 | "output_type": "stream", 1361 | "text": [ 1362 | "Evaluating RMSE, MAE of algorithm KNNBasic.\n", 1363 | "\n", 1364 | "------------\n", 1365 | "Fold 1\n", 1366 | "Computing the msd similarity matrix...\n", 1367 | "Done computing similarity matrix.\n", 1368 | "RMSE: 0.9662\n", 1369 | "MAE: 0.7645\n", 1370 | "------------\n", 1371 | "Fold 2\n", 1372 | "Computing the msd similarity matrix...\n", 1373 | "Done computing similarity matrix.\n", 1374 | "RMSE: 0.9834\n", 1375 | "MAE: 0.7787\n", 1376 | "------------\n", 1377 | "Fold 3\n", 1378 | "Computing the msd similarity matrix...\n", 1379 | "Done computing similarity matrix.\n", 1380 | "RMSE: 0.9802\n", 1381 | "MAE: 0.7744\n", 1382 | "------------\n", 1383 | "Fold 4\n", 1384 | "Computing the msd similarity matrix...\n", 1385 | "Done computing similarity matrix.\n", 1386 | "RMSE: 0.9812\n", 1387 | "MAE: 0.7728\n", 1388 | "------------\n", 1389 | "Fold 5\n", 1390 | "Computing the msd similarity matrix...\n", 1391 | "Done computing similarity matrix.\n", 1392 | "RMSE: 0.9804\n", 1393 | "MAE: 0.7735\n", 1394 | "------------\n", 1395 | "------------\n", 1396 | "Mean RMSE: 0.9783\n", 1397 | "Mean MAE : 0.7728\n", 1398 | "------------\n", 1399 | "------------\n" 1400 | ] 1401 | }, 1402 | { 1403 | "data": { 1404 | "text/plain": [ 1405 | "CaseInsensitiveDefaultDict(list,\n", 1406 | " {'mae': [0.76447687302283862,\n", 1407 | " 0.77871336218916276,\n", 1408 | " 0.77444253761129189,\n", 1409 | " 0.77277756247233054,\n", 1410 | " 0.77353073380751081],\n", 1411 | " 'rmse': [0.96618541819639647,\n", 1412 | " 0.98337516247695278,\n", 1413 | " 0.98018440899082937,\n", 1414 | " 0.98120591146396685,\n", 1415 | " 0.98038668816669572]})" 1416 | ] 1417 | }, 1418 | "execution_count": 25, 1419 | "metadata": {}, 1420 | "output_type": "execute_result" 1421 | } 1422 | ], 1423 | "source": [ 1424 | "algo = KNNBasic()\n", 1425 | "evaluate(algo, data, measures=['RMSE', 'MAE'])" 1426 | ] 1427 | }, 1428 | { 1429 | "cell_type": "markdown", 1430 | "metadata": {}, 1431 | "source": [ 1432 | "## NMF" 1433 | ] 1434 | 
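NMF factorises the user-item rating matrix into two non-negative low-rank matrices, so a predicted rating is just the dot product of a user factor row with an item factor column. Below is a minimal sketch of that idea (not a cell from the original notebook; it uses `sklearn.decomposition.NMF` on a tiny dense toy matrix, whereas the `surprise` NMF used next fits only the observed ratings with SGD):

```python
import numpy as np
from sklearn.decomposition import NMF

# toy user x item rating matrix (0 = missing; sklearn treats zeros as real values,
# so this only illustrates the factorisation, not proper handling of missing ratings)
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

model = NMF(n_components=2, init='random', random_state=0, max_iter=500)
W = model.fit_transform(R)   # user factors, shape (n_users, n_components)
H = model.components_        # item factors, shape (n_components, n_items)

R_hat = W @ H                # reconstructed / predicted ratings
print(np.round(R_hat, 2))
```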
}, 1435 | { 1436 | "cell_type": "code", 1437 | "execution_count": 19, 1438 | "metadata": {}, 1439 | "outputs": [], 1440 | "source": [ 1441 | "from surprise import NMF" 1442 | ] 1443 | }, 1444 | { 1445 | "cell_type": "code", 1446 | "execution_count": 21, 1447 | "metadata": {}, 1448 | "outputs": [ 1449 | { 1450 | "name": "stdout", 1451 | "output_type": "stream", 1452 | "text": [ 1453 | "Evaluating RMSE of algorithm NMF.\n", 1454 | "\n", 1455 | "------------\n", 1456 | "Fold 1\n", 1457 | "RMSE: 0.9476\n", 1458 | "------------\n", 1459 | "Fold 2\n", 1460 | "RMSE: 0.9449\n", 1461 | "------------\n", 1462 | "Fold 3\n", 1463 | "RMSE: 0.9479\n", 1464 | "------------\n", 1465 | "Fold 4\n", 1466 | "RMSE: 0.9494\n", 1467 | "------------\n", 1468 | "Fold 5\n", 1469 | "RMSE: 0.9450\n", 1470 | "------------\n", 1471 | "------------\n", 1472 | "Mean RMSE: 0.9469\n", 1473 | "------------\n", 1474 | "------------\n" 1475 | ] 1476 | }, 1477 | { 1478 | "data": { 1479 | "text/plain": [ 1480 | "CaseInsensitiveDefaultDict(list,\n", 1481 | " {'rmse': [0.9475771765522677,\n", 1482 | " 0.94487435132530351,\n", 1483 | " 0.94786484545358385,\n", 1484 | " 0.94936598409066575,\n", 1485 | " 0.94501542053063314]})" 1486 | ] 1487 | }, 1488 | "execution_count": 21, 1489 | "metadata": {}, 1490 | "output_type": "execute_result" 1491 | } 1492 | ], 1493 | "source": [ 1494 | "algo = NMF()\n", 1495 | "evaluate(algo, data, measures=['RMSE'])" 1496 | ] 1497 | }, 1498 | { 1499 | "cell_type": "markdown", 1500 | "metadata": {}, 1501 | "source": [ 1502 | "## cosine distance" 1503 | ] 1504 | }, 1505 | { 1506 | "cell_type": "code", 1507 | "execution_count": 33, 1508 | "metadata": {}, 1509 | "outputs": [], 1510 | "source": [ 1511 | "ratings = pd.read_csv('ratings_small.csv')" 1512 | ] 1513 | }, 1514 | { 1515 | "cell_type": "code", 1516 | "execution_count": 4, 1517 | "metadata": { 1518 | "scrolled": true 1519 | }, 1520 | "outputs": [ 1521 | { 1522 | "data": { 1523 | "text/html": [ 1524 | "
\n", 1525 | "\n", 1538 | "\n", 1539 | " \n", 1540 | " \n", 1541 | " \n", 1542 | " \n", 1543 | " \n", 1544 | " \n", 1545 | " \n", 1546 | " \n", 1547 | " \n", 1548 | " \n", 1549 | " \n", 1550 | " \n", 1551 | " \n", 1552 | " \n", 1553 | " \n", 1554 | " \n", 1555 | " \n", 1556 | " \n", 1557 | " \n", 1558 | " \n", 1559 | " \n", 1560 | " \n", 1561 | " \n", 1562 | " \n", 1563 | " \n", 1564 | "
userIdmovieIdratingtimestamp
01312.51260759144
1110293.01260759179
\n", 1565 | "
" 1566 | ], 1567 | "text/plain": [ 1568 | " userId movieId rating timestamp\n", 1569 | "0 1 31 2.5 1260759144\n", 1570 | "1 1 1029 3.0 1260759179" 1571 | ] 1572 | }, 1573 | "execution_count": 4, 1574 | "metadata": {}, 1575 | "output_type": "execute_result" 1576 | } 1577 | ], 1578 | "source": [ 1579 | "ratings[:2]" 1580 | ] 1581 | }, 1582 | { 1583 | "cell_type": "code", 1584 | "execution_count": 10, 1585 | "metadata": {}, 1586 | "outputs": [], 1587 | "source": [ 1588 | "ratings2 = ratings.copy()" 1589 | ] 1590 | }, 1591 | { 1592 | "cell_type": "code", 1593 | "execution_count": 11, 1594 | "metadata": {}, 1595 | "outputs": [], 1596 | "source": [ 1597 | "col = ['movieId', 'userId']" 1598 | ] 1599 | }, 1600 | { 1601 | "cell_type": "code", 1602 | "execution_count": 12, 1603 | "metadata": {}, 1604 | "outputs": [], 1605 | "source": [ 1606 | "for c in col:\n", 1607 | " ratings2[c].replace({val: i for i, val in enumerate(ratings2[c].unique())}, inplace=True)" 1608 | ] 1609 | }, 1610 | { 1611 | "cell_type": "code", 1612 | "execution_count": 13, 1613 | "metadata": {}, 1614 | "outputs": [ 1615 | { 1616 | "data": { 1617 | "text/html": [ 1618 | "
\n", 1619 | "\n", 1632 | "\n", 1633 | " \n", 1634 | " \n", 1635 | " \n", 1636 | " \n", 1637 | " \n", 1638 | " \n", 1639 | " \n", 1640 | " \n", 1641 | " \n", 1642 | " \n", 1643 | " \n", 1644 | " \n", 1645 | " \n", 1646 | " \n", 1647 | " \n", 1648 | " \n", 1649 | " \n", 1650 | " \n", 1651 | " \n", 1652 | " \n", 1653 | " \n", 1654 | " \n", 1655 | " \n", 1656 | " \n", 1657 | " \n", 1658 | "
userIdmovieIdratingtimestamp
0002.51260759144
1013.01260759179
\n", 1659 | "
" 1660 | ], 1661 | "text/plain": [ 1662 | " userId movieId rating timestamp\n", 1663 | "0 0 0 2.5 1260759144\n", 1664 | "1 0 1 3.0 1260759179" 1665 | ] 1666 | }, 1667 | "execution_count": 13, 1668 | "metadata": {}, 1669 | "output_type": "execute_result" 1670 | } 1671 | ], 1672 | "source": [ 1673 | "ratings2[:2]" 1674 | ] 1675 | }, 1676 | { 1677 | "cell_type": "code", 1678 | "execution_count": 14, 1679 | "metadata": {}, 1680 | "outputs": [], 1681 | "source": [ 1682 | "n_users=int(ratings2.userId.nunique())\n", 1683 | "n_items=int(ratings2.movieId.nunique())" 1684 | ] 1685 | }, 1686 | { 1687 | "cell_type": "code", 1688 | "execution_count": 15, 1689 | "metadata": {}, 1690 | "outputs": [ 1691 | { 1692 | "name": "stdout", 1693 | "output_type": "stream", 1694 | "text": [ 1695 | "n_users = 671 || n_items = 9066\n" 1696 | ] 1697 | } 1698 | ], 1699 | "source": [ 1700 | "print(\"n_users = \",n_users, \"||\", \"n_items = \", n_items )" 1701 | ] 1702 | }, 1703 | { 1704 | "cell_type": "code", 1705 | "execution_count": 16, 1706 | "metadata": {}, 1707 | "outputs": [ 1708 | { 1709 | "name": "stderr", 1710 | "output_type": "stream", 1711 | "text": [ 1712 | "/home/ubuntu/anaconda3/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n", 1713 | " \"This module will be removed in 0.20.\", DeprecationWarning)\n" 1714 | ] 1715 | } 1716 | ], 1717 | "source": [ 1718 | "from sklearn import cross_validation as cv\n", 1719 | "train_data, test_data = cv.train_test_split(ratings2, test_size=0.25)" 1720 | ] 1721 | }, 1722 | { 1723 | "cell_type": "code", 1724 | "execution_count": 17, 1725 | "metadata": {}, 1726 | "outputs": [], 1727 | "source": [ 1728 | "#Create two user-item matrices, one for training and another for testing\n", 1729 | "train_data_matrix = np.zeros((n_users, n_items))\n", 1730 | "for line in train_data.itertuples():\n", 1731 | " train_data_matrix[line[1]-1, line[2]-1] = line[3]\n", 1732 | " \n", 1733 | "test_data_matrix = np.zeros((n_users, n_items))\n", 1734 | "for line in test_data.itertuples():\n", 1735 | " test_data_matrix[line[1]-1, line[2]-1] = line[3]" 1736 | ] 1737 | }, 1738 | { 1739 | "cell_type": "code", 1740 | "execution_count": 18, 1741 | "metadata": {}, 1742 | "outputs": [], 1743 | "source": [ 1744 | "def predict(ratings, similarity, type='user'):\n", 1745 | " if type == 'user':\n", 1746 | " mean_user_rating = ratings.mean(axis=1)\n", 1747 | " #You use np.newaxis so that mean_user_rating has same format as ratings\n", 1748 | " ratings_diff = (ratings - mean_user_rating[:, np.newaxis])\n", 1749 | " pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T\n", 1750 | " elif type == 'item':\n", 1751 | " pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])\n", 1752 | " return pred" 1753 | ] 1754 | }, 1755 | { 1756 | "cell_type": "code", 1757 | "execution_count": 19, 1758 | "metadata": {}, 1759 | "outputs": [], 1760 | "source": [ 1761 | "from sklearn.metrics.pairwise import pairwise_distances\n", 1762 | "user_similarity = pairwise_distances(train_data_matrix, metric='cosine')\n", 1763 | "item_similarity = pairwise_distances(train_data_matrix.T, metric='cosine')" 1764 | ] 1765 | }, 1766 | { 1767 | "cell_type": "code", 
1768 | "execution_count": 21, 1769 | "metadata": {}, 1770 | "outputs": [], 1771 | "source": [ 1772 | "item_prediction = predict(train_data_matrix, item_similarity, type='item')\n", 1773 | "user_prediction = predict(train_data_matrix, user_similarity, type='user')" 1774 | ] 1775 | }, 1776 | { 1777 | "cell_type": "code", 1778 | "execution_count": 79, 1779 | "metadata": {}, 1780 | "outputs": [], 1781 | "source": [ 1782 | "from sklearn.metrics import mean_squared_error\n", 1783 | "from math import sqrt\n", 1784 | "def mse(prediction, ground_truth):\n", 1785 | " prediction = prediction[ground_truth.nonzero()].flatten()\n", 1786 | " ground_truth = ground_truth[ground_truth.nonzero()].flatten()\n", 1787 | " return mean_squared_error(prediction, ground_truth)" 1788 | ] 1789 | }, 1790 | { 1791 | "cell_type": "code", 1792 | "execution_count": 84, 1793 | "metadata": {}, 1794 | "outputs": [ 1795 | { 1796 | "name": "stdout", 1797 | "output_type": "stream", 1798 | "text": [ 1799 | "User-based CF MSE: 11.3668710905\n", 1800 | "Item-based CF MSE: 12.8400786831\n" 1801 | ] 1802 | } 1803 | ], 1804 | "source": [ 1805 | "print('User-based CF MSE: ' , str(mse(user_prediction, test_data_matrix)))\n", 1806 | "print('Item-based CF MSE: ' , str(mse(item_prediction, test_data_matrix)))" 1807 | ] 1808 | }, 1809 | { 1810 | "cell_type": "markdown", 1811 | "metadata": {}, 1812 | "source": [ 1813 | "## plot comparison" 1814 | ] 1815 | }, 1816 | { 1817 | "cell_type": "code", 1818 | "execution_count": 1, 1819 | "metadata": {}, 1820 | "outputs": [], 1821 | "source": [ 1822 | "import matplotlib.pyplot as plt" 1823 | ] 1824 | }, 1825 | { 1826 | "cell_type": "code", 1827 | "execution_count": 11, 1828 | "metadata": {}, 1829 | "outputs": [ 1830 | { 1831 | "data": { 1832 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZAAAAELCAYAAAD3HtBMAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4wLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvpW3flQAAFY5JREFUeJzt3XuYXVV9xvH3dSZFLoqtDJaKOt6l\noEIyBhFBBEQwlIpKgYIKoqOMIojaassjJNrSei+2QVNuKhhUKpYSiyAkeAECMxBIAtQL5RKrZvLg\nhUQfLpNf/9jrJIfhnDmZldnn+v08Tx722Xufs36bOTPvWWvvs7YjQgAATNeTWl0AAKAzESAAgCwE\nCAAgCwECAMhCgAAAshAgAIAsBAgAIAsBAgDIQoAAALL0t7qAajvttFMMDg62ugwA6BhjY2PrImKg\nFW23VYAMDg5qdHS01WUAQMewfV+r2mYICwCQhQABAGQhQAAAWQgQAEAWAgQAkIUAAYAWGVkyov4F\n/fJ8q39Bv0aWjLS6pGlpq8t4AaBXjCwZ0bmj5256PBETmx4vnLewVWVNCz0QAGiBRWOLprW+HREg\nANACEzExrfXtiAABgBboc9+01rcjAgQAWmB4zvC01rcjTqIDQAtUTpQvGlukiZhQn/s0PGe4Y06g\nS5IjotU1bDI0NBRMpggAW872WEQMtaJthrAAAFkIEABAFgIEAJCFAAEAZCFAAABZCBAAQBYCBACQ\nhQABAGQpNUBsP832Zbbvtn2X7X3KbA8A0DxlT2XyL5Kuioi32P4jSduV3B4AoElKCxDbO0raX9IJ\nkhQRj0h6pKz2AADNVeYQ1nMljUu60PZtts+zvX2J7QEAmqjMAOmXNFvSuRGxl6QNkj4yeSfbw7ZH\nbY+Oj4+XWA4AYCaVGSBrJK2JiOXp8WUqAuVxImJRRAxFxNDAwECJ5QAAZlJpARIRv5T0gO0Xp1UH\nSbqzrPYAAM1V9lVYp0i6JF2BdY+kE0tuDwDQJKUGSESskNSSG50AAMrFN9EBAFkIEABAFgIEAJCF\nAAEAZCFAAABZCBAAQBYCBACQhQABAGQhQAAAWQgQAEAWAgQAkIUAAQBkIUAAAFkIEABAFgIEAJCF\nAAEAZCFAAABZCBAAQBYCBACQhQABAGQhQAAAWQgQAEAWAgQAkIUAAQBkIUAAAFkIEABAlv4yX9z2\nvZIekjQh6bGIGCqzPQBA85QaIMlrI2JdE9oBADQRQ1gAgCxlB0hIutr2mO3hWjvYHrY9ant0fHy8\n5HIAADOl7AB5dUTMlnSYpPfa3n/yDhGxKCKGImJoYGCg5HIAADOl1ACJiJ+n/66VdLmkuWW2BwBo\nntICxPb2tp9SWZZ0iKRVZbUHAGiuMq/Ceoaky21X2vlaRFxVYnsAgCYqLUAi4h5JLy/r9QEArcVl\nvACALAQIACALAQIAyEKAAACyECAAgCwECAAgCwECAMhCgAAAshAgAIAsBAgAIAsBAgDIQoAAALIQ\nIACALAQIACALAQIAyEKAAACyECAAgCwECAAgCwECAMhCgAAAshAgAIAsBAgAIAsBAgDIQoAAALIQ\nIACALAQIACBL6QFiu8/2bbavLLstAEDzNKMHcqqku5rQDgCgiUoNENu7Spon6bwy2wEANF/ZPZDP\nS/obSRvr7WB72Pao7dHx8fGSywEAzJTSAsT24ZLWRsTYVPtFxKKIGIqIoYGBg
bLKAQDMsDJ7IPtK\nOsL2vZIulXSg7YtLbA8A0ESlBUhEfDQido2IQUnHSLouIo4vqz0AQHPxPRAAQJYpA8T28VXL+07a\n9r4tbSQilkXE4dMvDwDQrhr1QE6vWv7CpG3vmOFaAAAdpFGAuM5yrccAgB7SKECiznKtxwCAHtLf\nYPtLbN+horfx/LSs9Ph5pVYGAGhrjQJkt6ZUAQDoOFMGSETcV/3Y9tMl7S/p/kbfMAcAdLdGl/Fe\naXuPtLyLpFUqrr76qu3TmlAfAKBNNTqJ/tyIWJWWT5R0TUT8haS9xWW8ANDTGgXIo1XLB0n6jiRF\nxEOaYoZdAED3a3QS/QHbp0haI2m2pKskyfa2kmaVXBsAoI016oGcJGl3SSdIOjoifpPWv1LShSXW\nBQBoc42uwlor6T011i+VtLSsogAA7W/KALF9xVTbI+KImS0HANApGp0D2UfSA5IWS1ou5r8CACSN\nAuRPJb1O0rGS/lrSEkmLI2J12YUBANrblCfRI2IiIq6KiLerOHH+U0nLpnMvEABAd2rUA5HtbSTN\nU9ELGZR0jqTLyy0LANDuGp1E/4qkPVR8gXB+1bfSAQA9rlEP5HhJGySdKun99qZz6JYUEfHUEmsD\nALSxRt8DafRFQwBAjyIgAABZCBAAQBYCBACQhQABAGQhQAAAWUoLENtPtn2z7dttr7Y9v6y2AADN\n1/Cb6FvhYUkHRsR627Mk/dD2f0fETSW2CQBoktICJCJC0vr0cFb6F2W1BwBorlLPgdjus71C0lpJ\n10TE8jLbAwA0T6kBkmbz3VPSrpLm2t5j8j62h22P2h4dHx8vsxwAwAxqylVY6V7qSyUdWmPboogY\nioihgYGBZpQDAJgBZV6FNWD7aWl5WxU3prq7rPYAAM1V5lVYu0j6su0+FUH1jYi4ssT2AABNVOZV\nWHdI2qus1wcAtBbfRAcAZCFAAABZCBAAQBYCBACQhQABAGQhQAAAWQgQAEAWAgQAkIUAAQBkIUAA\nAFkIEABAFgIEAJCFAAEAZCFAAABZCBAAQBYCBACQhQABAGQhQAAAWQgQAEAWAgQAkIUAAQBkIUAA\nAFkIEABAFgIEAJCFAAEAZCFAAABZSgsQ28+yvdT2nbZX2z61rLYAAM3XX+JrPybpgxFxq+2nSBqz\nfU1E3FlimwCAJimtBxIRv4iIW9PyQ5LukvTMstoDADRXU86B2B6UtJek5c1oDwBQvtIDxPYOkv5D\n0mkR8bsa24dtj9oeHR8fL7scAMAMKTVAbM9SER6XRMS3au0TEYsiYigihgYGBsosBwAwg8q8CsuS\nzpd0V0R8tqx2AACtUWYPZF9Jb5V0oO0V6d8bSmwPANBEpV3GGxE/lOSyXh8A0Fp8Ex0AkIUAAQBk\nIUAAAFkIEABAFgIEAJCFAAEAZCFAAABZCBAAQBYCBACQhQABAGQhQAAAWQgQAEAWAgQAkIUAAQBk\nIUAAAFkIEABAFgIEAJCFAAEAZCFAAABZCBAAQBYCBACQhQABAGQhQAAAWQgQAEAWAgQAkIUAAQBk\nKS1AbF9ge63tVWW1AQBonTJ7IBdJOrTE1wcAtFBpARIR35f0YFmv38tGloyof0G/PN/qX9CvkSUj\nrS4JQA/qb3UBmJ6RJSM6d/TcTY8nYmLT44XzFraqLAA9qOUn0W0P2x61PTo+Pt7qctreorFF01rf\nLeh1Ae2n5QESEYsiYigihgYGBlpdTtubiIlpre8GlV5X5RgrvS5CBGitlgcIpqfPfdNa3w16tdcF\ntLsyL+NdLOlGSS+2vcb2SWW11UuG5wxPa3036MVeF9AJyrwK69iI2CUiZkXErhFxfhnt9NrY+MJ5\nC3Xy0Mmbehx97tPJQyd39Qn0Xux19dr7WurNY+50HX0VVq9ekbRw3sKuPr7JhucMP+7nXL2+G/Xi\n+7oXj7kbOCJaXcMmQ0NDMTo6usX79y/orzmM0ec+Pfaxx2ayNLTYyJIRLRpbpImYUJ/7NDxnuGv/\nsPTi+7oXj3mm2B6LiKFWtN3RPRDGxntHL/W6evF93YvH3A06+iqsXhwbR/frxfd1Lx5zN+joAOnF\nK5LQ/Xrxfd2Lx9wNOnoIqzKk0Stj4+gNvfi+7sVj7gYdfRIdAHpdK0+id/QQFgCgdQgQAEAWAgQA\nkIUAAQBkIUAAAFna6ios2+OS7st8+k6S1s1gOZ2AY+5+vXa8Esc8Xc+JiJbcTKmtAmRr2B5t1aVs\nrcIxd79eO16JY+4kDGEBALIQIACALN0UIL14f1OOufv12vFKHHPH6JpzIACA5uqmHggAoIk6LkBs\nT9heYXu17dttf9D2k9K2A2xf2eoat5bt9VXLb7D9Y9vPsX2W7d/b3rnOvmH7M1WPP2T7rKYVvhWm\nqj0dd9h+QdX209K6ofT4Xtsr03tjhe1XNf0gpqnqvbzK9jdtb5fWh+2Lq/brtz1eeW/bPiE9rhzr\nV1p1DNNl++/T7+4dqfYzbZ89aZ89bd+Vlis/15W277T9CdtPbk31mKzjAkTSHyJiz4jYXdLrJB0m\n6cwW11QK2wdJOkfSYRFR+X7MOkkfrPOUhyW9yfZOzahvhjWqfaWkY6oeHyVp9aR9XpveG3tGxA1l\nFDnDKu/lPSQ9Iuk9af0GSXvY3jY9fp2kn0967terjvVtTap3q9jeR9LhkmZHxMskHSxpqaSjJ+16\njKTFVY9fGxEvlTRX0vMkfakJ5T6B7SHb57Si7XbViQGySUSslTQs6X223ep6ZpLt/SX9u6TDI+Jn\nVZsukHS07T+p8bTHVJyM+0ATSpxpjWr/tqS/lCTbz5f0W3XXl81+IOkFVY+/I2leWj5Wj/+D2ql2\nkbQuIh6WpIhYFxHfl/Rr23tX7fdXqnG8EbFeRci+sc77v1QRMRoR7292u81gO+veUB0dIJIUEfdI\n6pO0c6N9O8g2Kv5gvjEi7p60bb2KEDm1znP/TdJxtncssb6yTFX77yQ9YHsPFZ9Qv15jn6VpWGR5\nmUXOtPTLe5iKXlbFpZKOScM1L5M0+ZiOrhrCOrFJpW6tqyU9Kw3JLrT9mrR+sVLv0vYrJT0YET+p\n9QIR8TtJ/yvphTkF2H5bGj673fZXbQ/avi6tu9b2s9N+R6Whxdttfz+t2zREnoZVL7C9zPY9tt9f\n1cbxtm9OP5sv2fXvy2t7ve1PpWG979meW/WaR6R9+tI+t6Q6311Vz/W2/zPt/0+2j0ttr0wftDTF\nMV5k+4vp9+WTtn9ieyBte5Ltn1Ye19PxAdKlHpV0g6ST6mw/R9LbbT9l8ob0C/YVSR33SWkLar9U\nxR+aN0q6vMb2yhDW3jW2taNtba+QNCrpfknnVzZExB2SBlX0Pr5T47nVQ1gXNqPYrZV6EHNUjBqM\nS/q67RNUfBh4i4tzmZOHr2rJGm2wvbukMyQdGBEvV/Eh7AuSvpyG1C5R8bslSR+T9Pq03xF1XvIl\nkl6vYmjtTNuzbO+mYkhu34jYU9KE
pOOmKGt7SdelIfmHJH1CxZDlkZIWpH1OkvTbiHiFpFdIepft\n56ZtL1fRK9tN0lslvSgi5ko6T9IpaZ96xyhJu0p6VUScLuniqloPlnR7RIxPUXvnB4jt56n4Ia1t\ndS0zaKOKbvxc2383eWNE/EbS1yS9t87zP6/iTbd9aRWWZ6rar1TxS3J/CptO94eqEDglIh6ZtP0K\nSZ9WdwxfSZIiYiIilkXEmZLeJ+nNEfGAil7FayS9WbV7l5Kk9KFpUNKPM5o/UNI3I2JdquVBSfuo\n+F2SpK9KenVa/pGki2y/S8UIRy1LIuLh9HprJT1D0kEqQvKW9OHgIBXnbep5RNJVaXmlpOsj4tG0\nPJjWHyLpben1lkt6ujb3wG6JiF+kYcGfqejladLz6x2j0v+PibR8gaTK+bR3SGr4waSj74meuldf\nlPSvERHddBokIn5ve56kH9j+VUScP2mXz0q6RTV+hhHxoO1vqPhDfEH51c6cqWpP/0/+Vnl/PDrR\nBZJ+ExErbR/Q6mK2lu0XS9pYNTy1pzZPnrpY0uck3RMRa+o8fwdJCyV9OyJ+XWatEfGedF5mnqQx\n23Nq7PZw1fKEit9Fq/i0/9EtbOrR2PxlvI2V14yIjVXnJSzplIj4bvUT03uiuoaNVY83asv+vm+o\nLETEA7Z/ZftAFb2qqXpOkjqzB7JtGltcLel7KhJ3ftX2g2yvqfq3T2vK3HrpE9Khks6ojIdWbVun\nYhhnmzpP/4yKGT47Ud3aI+LSiLi1yfW0RESsiYhuuupnB0lfdnE57h2S/lzSWWnbNyXtrtq9raW2\nV0m6WcVQ37sz279O0lG2ny5J6UT8Ddp8dd9xKi5mkO3nR8TyiPiYiuG2Z21hG9eqGI7budKG7edk\n1lvxXUkn256VXvNFtqczulDzGOs4T8VQVnXPpK6O64FERN0TUhGxTNK29bZ3iojYoWr5AUmV8c4r\nJu13uqTT6zzvV5K2K7fSmTNV7RFxVp3nHFC1PFhedeWoPuZG69N7e1lavkjSReVVVo6IGJNU8/s5\n6QPRrBrrB2ew/dW2/0HS9bYnJN2m4jzBhbY/rCIoKhckfMr2C1V8+r9W0u0qhtgatXGn7TMkXZ3O\n6TyqYqg59zYVUvFHfVDSrS6GWcZVnAfcUvWOsZYrVAxdbdF5NaYyAQBIKr7rIulzEbHfluzfcT0Q\nAMDMs/0RSSdrC859bHoOPRAAKFf6rsXk85VvjYiVtfbvFAQIACBLJ16FBQBoAwQIACALAYK25Kmn\n7d80K6rtbdIcQitsH217v/ScFd48m20Z9R3gaU4Z7y653QBQwVVYaFd/SHMJKX0p62uSnirpzIgY\nVTF/lCTtJUlV+35R0tkRcfETX/KJ0nX1joiN06zvABUTW3bCtPFAKeiBoO1Nnra/8kk+BcvFkl6R\nehzvVjGH2MdtXyJJtj9cNYvp/LRu0Pb/uLgR0yoVM8QeYvtG27e6uLnTDmnfe23PT+tX2n6J7UEV\nE9h9ILX7uGvmXcyoeqPt22zfkKbw0KR9Bmxfk3pL59m+z+leKLZPdzET7Crbp6V129teknpjq2xP\nvocG0HT0QNARIuIeF9Ni71y1bq3td0r6UEQcLm26adGVEXGZ7UNUTDo3V8U3iq9wcZ+V+9P6t0fE\nTekP9xmSDo6IDWm+rdO1eTbUdREx2/ZIauudqaezPiI+XaPcuyXtFxGP2T5Y0j+qmCSw2pkqZmE9\n2/ahSjMvpzmXTpS0d6p5ue3rVUzI938RMS/t14nT9aPLECDoZoekf7elxzuoCI77Jd0XETel9a9U\nMS/Tj4oRLf2RpBurXudb6b9jkt60Be3uqGLOpxdKCtWYokPFjKhHSlJEXGX711XrL4+IDZJk+1uS\n9lMxY+tnbP+zioCcaj4joCkIEHQEP37a/t229Gkqzoc87haoaQhqw6T9romIY+u8TmWG08qMq418\nXNLSiDgytbVsC+utKyJ+bHu2pDdI+oTtayNiQaPnAWXiHAjanidN2z+Np35X0juqzmc8szJL6iQ3\nSdrX9gvSftvbflGD135I0hNu6JXsqM33MD+hzj4/UnG+Rmmo7Y/T+h+ouGXrdmnG1SNVTOn/Z5J+\nny4O+JSk2Q3qA0pHgKBdNZq2v6GIuFrF1Vs32l4p6TLV+KOf7rp2gqTFLqYZv1HF3eam8l+Sjqx1\nEl3SJyWdbfs21e+xzJd0iItpyo+S9EtJD6Wp6i9SMXX5cknnRcRtkl4q6WYXNxU6U8Wd64CWYioT\noAVsbyNpIp1o30fSuZVLkYFOwTkQoDWeLekb6cuRj0h6V4vrAaaNHggAIAvnQAAAWQgQAEAWAgQA\nkIUAAQBkIUAAAFkIEABAlv8H/z6zOk63tPwAAAAASUVORK5CYII=\n", 1833 | "text/plain": [ 1834 | "" 1835 | ] 1836 | }, 1837 | "metadata": {}, 1838 | "output_type": "display_data" 1839 | } 1840 | ], 1841 | "source": [ 1842 | "mses = [6.47, .957, .897, .804, .801, .79]\n", 1843 | "algos = ['cosine_memory', 'KNN', \"NMF\", 'SVD', 'PMF', 'DL']\n", 1844 | "plt.plot(algos, mses, 'go', )\n", 1845 | "plt.xlabel(\"Different algos\")\n", 1846 | "plt.ylabel(\"MSE\")\n", 1847 | "plt.show()" 1848 | ] 1849 | }, 1850 | { 1851 | "cell_type": "code", 1852 | "execution_count": 30, 1853 | "metadata": {}, 1854 | "outputs": [ 1855 | { 1856 | "data": { 1857 | "text/html": [ 1858 | "collaborating_filter.ipynb
" 1859 | ], 1860 | "text/plain": [ 1861 | "/home/ubuntu/collaborate_filter/collaborating_filter.ipynb" 1862 | ] 1863 | }, 1864 | "execution_count": 30, 1865 | "metadata": {}, 1866 | "output_type": "execute_result" 1867 | } 1868 | ], 1869 | "source": [ 1870 | "FileLink('collaborating_filter.ipynb')" 1871 | ] 1872 | } 1873 | ], 1874 | "metadata": { 1875 | "kernelspec": { 1876 | "display_name": "Python 3", 1877 | "language": "python", 1878 | "name": "python3" 1879 | }, 1880 | "language_info": { 1881 | "codemirror_mode": { 1882 | "name": "ipython", 1883 | "version": 3 1884 | }, 1885 | "file_extension": ".py", 1886 | "mimetype": "text/x-python", 1887 | "name": "python", 1888 | "nbconvert_exporter": "python", 1889 | "pygments_lexer": "ipython3", 1890 | "version": "3.6.3" 1891 | } 1892 | }, 1893 | "nbformat": 4, 1894 | "nbformat_minor": 2 1895 | } 1896 | -------------------------------------------------------------------------------- /notebooks/06_NLP_Fastai.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Why we use `Naive bayes` ?\n", 8 | "\n", 9 | "**Step 1** - We are trying to calculate what is probability of class 1 (other will be just 1- p(1)) given a document x. i.e. P(c=1|x). \n", 10 | "**Step 2** - Using naive bayes formula, we can rewrite it as: P(c=1|x) = {P(x|c=1) P(c=1)} / P(x) . Why we did this transformation? Because from data we know that how many of total docs are class = 1, and how many of class 1 have words that we have in our document. We can rewrite -> P(x|c=1) as P(x1|c=1) P(x2|c=1) P(x3|c=1) if x1,x2,x3 are 3 words in our doc x (considering independence of x1,x2,x3). But we can not rewrite, P(c=1|x) as P(c=1|x1) P(c=1|x2) P(c=1|x3). So just a mathematical transformaiton to easily calculate probabilities. \n", 11 | "\n", 12 | "**Step 3** So from data (term doc matrix), what we need are no. of times class = 1, no. of times some word appear whereever class = 1. 
(same with class 0 too)" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 1, 18 | "metadata": {}, 19 | "outputs": [ 20 | { 21 | "ename": "ModuleNotFoundError", 22 | "evalue": "No module named 'fastai.models'", 23 | "output_type": "error", 24 | "traceback": [ 25 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 26 | "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", 27 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 12\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 13\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0mfastai\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnlp\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 14\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msklearn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlinear_model\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mLogisticRegression\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mtorchtext\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mvocab\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdatasets\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 28 | "\u001b[0;32m~/anaconda/envs/fastai/lib/python3.6/site-packages/fastai/nlp.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0mimports\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0mtorch_imports\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0mcore\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0mmodel\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0mdataset\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 29 | "\u001b[0;32m~/anaconda/envs/fastai/lib/python3.6/site-packages/fastai/torch_imports.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mtorchvision\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmodels\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mdensenet121\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdensenet161\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdensenet169\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdensenet201\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 12\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 13\u001b[0;31m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0mmodels\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mresnext_50_32x4d\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mresnext_50_32x4d\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 14\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0mmodels\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mresnext_101_32x4d\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mresnext_101_32x4d\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0;32mfrom\u001b[0m 
\u001b[0;34m.\u001b[0m\u001b[0mmodels\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mresnext_101_64x4d\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mresnext_101_64x4d\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 30 | "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'fastai.models'" 31 | ] 32 | } 33 | ], 34 | "source": [ 35 | "%reload_ext autoreload\n", 36 | "%autoreload 2\n", 37 | "%matplotlib inline\n", 38 | "\n", 39 | "from sklearn.linear_model import LogisticRegression\n", 40 | "from sklearn.model_selection import ParameterGrid\n", 41 | "from joblib import Parallel, delayed\n", 42 | "import warnings\n", 43 | "warnings.filterwarnings(\"ignore\", category=DeprecationWarning)\n", 44 | "warnings.filterwarnings(\"ignore\", category=UserWarning)\n", 45 | "\n", 46 | "\n", 47 | "from fastai.nlp import *\n", 48 | "from sklearn.linear_model import LogisticRegression\n", 49 | "from torchtext import vocab, data, datasets\n", 50 | "from IPython.core.display import Image " 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 61, 56 | "metadata": { 57 | "scrolled": true 58 | }, 59 | "outputs": [], 60 | "source": [ 61 | "#! ls /Users/groverprince/Documents/msan/msan_ml/data/aclImdb/" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "The sentiment classification task consists of predicting the polarity (positive or negative) of a given text.\n", 69 | "To get the dataset, in your terminal run the following commands: \n", 70 | "`wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz` \n", 71 | "`gunzip aclImdb_v1.tar.gz` \n", 72 | "`tar -xvf aclImdb_v1.tar` " 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 4, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "PATH='/Users/groverprince/Documents/msan/msan_ml/data/aclImdb/'\n", 82 | "names = ['neg','pos']" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 66, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "desk = '/Users/groverprince/Desktop/'" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 5, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "trn, trn_Y = texts_from_folders(f'{PATH}train', names) # taking text from each file, saving in list. saving 0 or 1 in other list" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 6, 106 | "metadata": {}, 107 | "outputs": [], 108 | "source": [ 109 | "val,val_y = texts_from_folders(f'{PATH}test',names)" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 7, 115 | "metadata": {}, 116 | "outputs": [ 117 | { 118 | "data": { 119 | "text/plain": [ 120 | "(\"Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. 
Future stars Sally Kirkland and Frederic Forrest can be seen briefly.\",\n", 121 | " 0,\n", 122 | " 25000)" 123 | ] 124 | }, 125 | "execution_count": 7, 126 | "metadata": {}, 127 | "output_type": "execute_result" 128 | } 129 | ], 130 | "source": [ 131 | "trn[0], trn_Y[0], len(trn)" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 8, 137 | "metadata": { 138 | "scrolled": true 139 | }, 140 | "outputs": [ 141 | { 142 | "data": { 143 | "text/plain": [ 144 | "(\"Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. His only obstacle appears to be winning over Costner. Finally when we are well past the half way point of this stinker, Costner tells us all about Kutcher's ghosts. We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing. No magic here, it was all I could do to keep from turning it off an hour in.\",\n", 145 | " 0,\n", 146 | " 25000)" 147 | ] 148 | }, 149 | "execution_count": 8, 150 | "metadata": {}, 151 | "output_type": "execute_result" 152 | } 153 | ], 154 | "source": [ 155 | "val[0],val_y[0], len(val)" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 9, 161 | "metadata": {}, 162 | "outputs": [], 163 | "source": [ 164 | "# this is just like \"this is good\" --> this = 1, is = 1, good = 1\n", 165 | "# but it is just an object\n", 166 | "\n", 167 | "veczr = CountVectorizer(tokenizer=tokenize) # making tokenizer (bag of words representation) - sparse matrix" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 10, 173 | "metadata": {}, 174 | "outputs": [], 175 | "source": [ 176 | "# using veczr object to make term doc matrix\n", 177 | "\n", 178 | "trn_term_doc = veczr.fit_transform(trn) \n", 179 | "val_term_doc = veczr.transform(val)" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": 11, 185 | "metadata": { 186 | "scrolled": true 187 | }, 188 | "outputs": [ 189 | { 190 | "data": { 191 | "text/plain": [ 192 | "128" 193 | ] 194 | }, 195 | "execution_count": 11, 196 | "metadata": {}, 197 | "output_type": "execute_result" 198 | } 199 | ], 200 | "source": [ 201 | "len(set(val[0].split()))" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 12, 207 | "metadata": { 208 | "scrolled": true 209 | }, 210 | "outputs": [ 211 | { 212 | "data": { 213 | "text/plain": [ 214 | "(<1x75132 sparse matrix of type ''\n", 215 | " \twith 93 stored elements in Compressed Sparse Row format>,\n", 216 | " <25000x75132 sparse matrix of type ''\n", 217 | " \twith 3640465 stored elements in Compressed Sparse Row format>)" 218 | ] 219 | }, 220 | "execution_count": 12, 221 | "metadata": {}, 222 | "output_type": "execute_result" 223 | } 224 | ], 225 | "source": [ 226 | "trn_term_doc[0], val_term_doc" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "* 25000 = number of training documents . 
\n", 234 | "* 75132 = number of unique words (represent columns of tokenized version) in all documents combined . keys of dict\n", 235 | "* training term document has 93 words out of those 75132 unique words. \n", 236 | "* 128 should technically have been 93. maybe because of ./', signs" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 13, 242 | "metadata": {}, 243 | "outputs": [ 244 | { 245 | "data": { 246 | "text/plain": [ 247 | "1205.7655009765356" 248 | ] 249 | }, 250 | "execution_count": 13, 251 | "metadata": {}, 252 | "output_type": "execute_result" 253 | } 254 | ], 255 | "source": [ 256 | "100000*(146**-0.75)*(np.log10(1+np.log10(162)))" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "to get `feature_names` in particular column index" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": 14, 269 | "metadata": {}, 270 | "outputs": [ 271 | { 272 | "data": { 273 | "text/plain": [ 274 | "['ameteurish', 'ametuer', 'amfortas', 'amg', 'ami']" 275 | ] 276 | }, 277 | "execution_count": 14, 278 | "metadata": {}, 279 | "output_type": "execute_result" 280 | } 281 | ], 282 | "source": [ 283 | "vocab = veczr.get_feature_names(); vocab[3000:3005]" 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "metadata": {}, 289 | "source": [ 290 | "opposite of above, to get index of particular `column name (vocublary)`. _ is just used to replace inplace (not here) what is it for ?" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": 15, 296 | "metadata": {}, 297 | "outputs": [ 298 | { 299 | "data": { 300 | "text/plain": [ 301 | "3001" 302 | ] 303 | }, 304 | "execution_count": 15, 305 | "metadata": {}, 306 | "output_type": "execute_result" 307 | } 308 | ], 309 | "source": [ 310 | "veczr.vocabulary_['ametuer']" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "what is going on here?" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 16, 323 | "metadata": {}, 324 | "outputs": [ 325 | { 326 | "data": { 327 | "text/plain": [ 328 | "0" 329 | ] 330 | }, 331 | "execution_count": 16, 332 | "metadata": {}, 333 | "output_type": "execute_result" 334 | } 335 | ], 336 | "source": [ 337 | "trn_term_doc[0,2297]" 338 | ] 339 | }, 340 | { 341 | "cell_type": "markdown", 342 | "metadata": {}, 343 | "source": [ 344 | "### Use of Naive Bayes to fit a model" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": 17, 350 | "metadata": {}, 351 | "outputs": [], 352 | "source": [ 353 | "x=trn_term_doc\n", 354 | "y=trn_Y" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": 18, 360 | "metadata": {}, 361 | "outputs": [ 362 | { 363 | "data": { 364 | "text/plain": [ 365 | "(25000, 75132)" 366 | ] 367 | }, 368 | "execution_count": 18, 369 | "metadata": {}, 370 | "output_type": "execute_result" 371 | } 372 | ], 373 | "source": [ 374 | "x.shape # 75312 columns, distinct words, 25000 documents/rows" 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": 19, 380 | "metadata": {}, 381 | "outputs": [], 382 | "source": [ 383 | "p = x[y==1].sum(0)+1 # no of times a word appear, given class = 1. 
\n", 384 | "q = x[y==0].sum(0)+1 # no of times a word appear, given class = 0" 385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "execution_count": 20, 390 | "metadata": {}, 391 | "outputs": [ 392 | { 393 | "data": { 394 | "text/plain": [ 395 | "((1, 75132), (1, 75132))" 396 | ] 397 | }, 398 | "execution_count": 20, 399 | "metadata": {}, 400 | "output_type": "execute_result" 401 | } 402 | ], 403 | "source": [ 404 | "p.shape, q.shape" 405 | ] 406 | }, 407 | { 408 | "cell_type": "code", 409 | "execution_count": 21, 410 | "metadata": {}, 411 | "outputs": [ 412 | { 413 | "data": { 414 | "text/plain": [ 415 | "(matrix([[ 2, 1, 11820, ..., 2, 2, 1]], dtype=int64),\n", 416 | " matrix([[ 1, 2, 12742, ..., 1, 2, 8]], dtype=int64))" 417 | ] 418 | }, 419 | "execution_count": 21, 420 | "metadata": {}, 421 | "output_type": "execute_result" 422 | } 423 | ], 424 | "source": [ 425 | "p, q" 426 | ] 427 | }, 428 | { 429 | "cell_type": "markdown", 430 | "metadata": {}, 431 | "source": [ 432 | "p/psum = (total times word a1 appear in all docs)/(total words appearning in all the docs) (+ 1 is added for each word, ignoring that for understanding for now). (all this given class = 1)\n", 433 | "What does it mean? It means, given class = 1, what is probability of word a1 appearing. \n", 434 | "\n", 435 | "we calculate ratio of this wrt when class = 0 as ratio gives whether this word has higher chance of being from class 1 or 0." 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": 22, 441 | "metadata": {}, 442 | "outputs": [ 443 | { 444 | "data": { 445 | "text/plain": [ 446 | "(matrix([[3747495]], dtype=int64), matrix([[3789022]], dtype=int64))" 447 | ] 448 | }, 449 | "execution_count": 22, 450 | "metadata": {}, 451 | "output_type": "execute_result" 452 | } 453 | ], 454 | "source": [ 455 | "q.sum(1) , p.sum(1) # total words in 0 class are more" 456 | ] 457 | }, 458 | { 459 | "cell_type": "code", 460 | "execution_count": 23, 461 | "metadata": {}, 462 | "outputs": [ 463 | { 464 | "data": { 465 | "text/plain": [ 466 | "((1, 75132),\n", 467 | " matrix([[ 0.68213, -0.70417, -0.08613, ..., 0.68213, -0.01102, -2.09046]]))" 468 | ] 469 | }, 470 | "execution_count": 23, 471 | "metadata": {}, 472 | "output_type": "execute_result" 473 | } 474 | ], 475 | "source": [ 476 | "r = np.log((p/p.sum(1))/(q/q.sum(1))) \n", 477 | "r.shape, r" 478 | ] 479 | }, 480 | { 481 | "cell_type": "code", 482 | "execution_count": 30, 483 | "metadata": {}, 484 | "outputs": [ 485 | { 486 | "data": { 487 | "text/plain": [ 488 | "((1, 75132),\n", 489 | " matrix([[ 0.69315, -0.69315, -0.07511, ..., 0.69315, 0. 
, -2.07944]]))" 490 | ] 491 | }, 492 | "execution_count": 30, 493 | "metadata": {}, 494 | "output_type": "execute_result" 495 | } 496 | ], 497 | "source": [ 498 | "b = np.log(p/q)\n", 499 | "b.shape, b" 500 | ] 501 | }, 502 | { 503 | "cell_type": "markdown", 504 | "metadata": {}, 505 | "source": [ 506 | "by using naive bayes prediction, we are sort of directly making a model and those predictions can help us to predict" 507 | ] 508 | }, 509 | { 510 | "cell_type": "code", 511 | "execution_count": 31, 512 | "metadata": {}, 513 | "outputs": [ 514 | { 515 | "data": { 516 | "text/plain": [ 517 | "(75132, 1)" 518 | ] 519 | }, 520 | "execution_count": 31, 521 | "metadata": {}, 522 | "output_type": "execute_result" 523 | } 524 | ], 525 | "source": [ 526 | "r.T.shape" 527 | ] 528 | }, 529 | { 530 | "cell_type": "code", 531 | "execution_count": 32, 532 | "metadata": {}, 533 | "outputs": [ 534 | { 535 | "data": { 536 | "text/plain": [ 537 | "<25000x75132 sparse matrix of type ''\n", 538 | "\twith 3640465 stored elements in Compressed Sparse Row format>" 539 | ] 540 | }, 541 | "execution_count": 32, 542 | "metadata": {}, 543 | "output_type": "execute_result" 544 | } 545 | ], 546 | "source": [ 547 | "val_term_doc" 548 | ] 549 | }, 550 | { 551 | "cell_type": "markdown", 552 | "metadata": {}, 553 | "source": [ 554 | "validation has 1 or no. of times occurence of word in doc where that word appears, 0 otherwise. so it will multiply `word count in doc with chances of word appearning given class 1 probability is higher` " 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": 33, 560 | "metadata": {}, 561 | "outputs": [], 562 | "source": [ 563 | "pre_preds = val_term_doc @ r.T " 564 | ] 565 | }, 566 | { 567 | "cell_type": "code", 568 | "execution_count": 34, 569 | "metadata": {}, 570 | "outputs": [ 571 | { 572 | "data": { 573 | "text/plain": [ 574 | "matrix([[ -8.33253, -22.4765 , -26.97431, ..., 97.91732, 18.31415, 21.76217]])" 575 | ] 576 | }, 577 | "execution_count": 34, 578 | "metadata": {}, 579 | "output_type": "execute_result" 580 | } 581 | ], 582 | "source": [ 583 | "pre_preds.T" 584 | ] 585 | }, 586 | { 587 | "cell_type": "code", 588 | "execution_count": 36, 589 | "metadata": {}, 590 | "outputs": [ 591 | { 592 | "data": { 593 | "text/plain": [ 594 | "0.80740000000000001" 595 | ] 596 | }, 597 | "execution_count": 36, 598 | "metadata": {}, 599 | "output_type": "execute_result" 600 | } 601 | ], 602 | "source": [ 603 | "pre_preds = val_term_doc @ r.T\n", 604 | "preds = pre_preds.T>0 # because if sum of log ratios is > 0 means class 1\n", 605 | "(preds==val_y).mean()" 606 | ] 607 | }, 608 | { 609 | "cell_type": "markdown", 610 | "metadata": {}, 611 | "source": [ 612 | "binarised naive bayes. Funda is don't need to take number of occureces of word in a document. Just whether is appears or not is fine. Think of a case where a word appears many many times but only in 1 document for class 1 and once in many documents of class 0. Where should it be assigned. Class 1 or 0? 
Here it is trying to assign it to class 0 which makes more sense and that appearance in only 1 doc for class 1 might just be an outlier" 613 | ] 614 | }, 615 | { 616 | "cell_type": "code", 617 | "execution_count": 29, 618 | "metadata": {}, 619 | "outputs": [ 620 | { 621 | "data": { 622 | "text/plain": [ 623 | "0.82623999999999997" 624 | ] 625 | }, 626 | "execution_count": 29, 627 | "metadata": {}, 628 | "output_type": "execute_result" 629 | } 630 | ], 631 | "source": [ 632 | "pre_preds = val_term_doc.sign() @ r.T + b\n", 633 | "preds = pre_preds.T>0\n", 634 | "(preds==val_y).mean()" 635 | ] 636 | }, 637 | { 638 | "cell_type": "code", 639 | "execution_count": 30, 640 | "metadata": { 641 | "scrolled": true 642 | }, 643 | "outputs": [ 644 | { 645 | "data": { 646 | "text/plain": [ 647 | "matrix([[ -4.35523, -19.08058, -23.69877, ..., 54.90807, 14.71441, 18.40103]])" 648 | ] 649 | }, 650 | "execution_count": 30, 651 | "metadata": {}, 652 | "output_type": "execute_result" 653 | } 654 | ], 655 | "source": [ 656 | "pre_preds.T" 657 | ] 658 | }, 659 | { 660 | "cell_type": "code", 661 | "execution_count": 31, 662 | "metadata": { 663 | "scrolled": true 664 | }, 665 | "outputs": [ 666 | { 667 | "data": { 668 | "text/plain": [ 669 | "matrix([[ 0.68213, -0.70417, -0.08613, ..., 0.68213, -0.01102, -2.09046]])" 670 | ] 671 | }, 672 | "execution_count": 31, 673 | "metadata": {}, 674 | "output_type": "execute_result" 675 | } 676 | ], 677 | "source": [ 678 | "r" 679 | ] 680 | }, 681 | { 682 | "cell_type": "markdown", 683 | "metadata": {}, 684 | "source": [ 685 | "Now let's try same thing using **Logistic regression**. (Considering onyl unigrams like this first)\n", 686 | "\n", 687 | "`C` in Logistic is inverse of regularization constant ($ \\lambda $). So lower c means high regularization. Let's try first without any regularization. Then we will move on to regularization to see how it improves (by regularizing parameters we aim to hangle overfitting)" 688 | ] 689 | }, 690 | { 691 | "cell_type": "code", 692 | "execution_count": 32, 693 | "metadata": {}, 694 | "outputs": [ 695 | { 696 | "data": { 697 | "text/plain": [ 698 | "0.85387999999999997" 699 | ] 700 | }, 701 | "execution_count": 32, 702 | "metadata": {}, 703 | "output_type": "execute_result" 704 | } 705 | ], 706 | "source": [ 707 | "m = LogisticRegression(C = 1e5)\n", 708 | "m.fit(x,y) # feeding raw form of sparse matrix and target y without changin anything -- so easy :D\n", 709 | "preds = m.predict(val_term_doc)\n", 710 | "np.mean(preds == val_y)" 711 | ] 712 | }, 713 | { 714 | "cell_type": "markdown", 715 | "metadata": {}, 716 | "source": [ 717 | "One thing that we tried in NBC, we can try here too. Ignore no. of times a word appears in document. Even if it word `a1` appears 5 times in a doc, just take 1. That what `val_term_doc.sign()` can do. It return 1 for +ve, 0 for 0 and -1 for all negative values. " 718 | ] 719 | }, 720 | { 721 | "cell_type": "code", 722 | "execution_count": 33, 723 | "metadata": {}, 724 | "outputs": [ 725 | { 726 | "data": { 727 | "text/plain": [ 728 | "0.85768" 729 | ] 730 | }, 731 | "execution_count": 33, 732 | "metadata": {}, 733 | "output_type": "execute_result" 734 | } 735 | ], 736 | "source": [ 737 | "m = LogisticRegression(C = 1e5)\n", 738 | "m.fit(x.sign(),y) # sign() in training data\n", 739 | "preds = m.predict(val_term_doc.sign()) # sign in test data\n", 740 | "np.mean(preds == val_y)" 741 | ] 742 | }, 743 | { 744 | "cell_type": "markdown", 745 | "metadata": {}, 746 | "source": [ 747 | "So, that's observable! 
binarising our matrix values improves accuracy. .85768 > .85387" 748 | ] 749 | }, 750 | { 751 | "cell_type": "markdown", 752 | "metadata": {}, 753 | "source": [ 754 | "Now let's try some regularization. (maybe we are overfitting). But let's first check if we are actually overfitting or not. Way to do that is to check both training and validation score. If training score is very high, then there is chance of overfitting. Let's check that" 755 | ] 756 | }, 757 | { 758 | "cell_type": "code", 759 | "execution_count": 34, 760 | "metadata": {}, 761 | "outputs": [ 762 | { 763 | "data": { 764 | "text/plain": [ 765 | "1.0" 766 | ] 767 | }, 768 | "execution_count": 34, 769 | "metadata": {}, 770 | "output_type": "execute_result" 771 | } 772 | ], 773 | "source": [ 774 | "preds = m.predict(x.sign())\n", 775 | "np.mean(preds == y)" 776 | ] 777 | }, 778 | { 779 | "cell_type": "markdown", 780 | "metadata": {}, 781 | "source": [ 782 | "Wow. training accuracy is 100%. Definitely it is overfitting. Let's reduce it by introducing some regularization. \n", 783 | "Starting with `c = 0.1` first" 784 | ] 785 | }, 786 | { 787 | "cell_type": "code", 788 | "execution_count": 35, 789 | "metadata": {}, 790 | "outputs": [ 791 | { 792 | "data": { 793 | "text/plain": [ 794 | "(0.88400000000000001, 0.96736)" 795 | ] 796 | }, 797 | "execution_count": 35, 798 | "metadata": {}, 799 | "output_type": "execute_result" 800 | } 801 | ], 802 | "source": [ 803 | "m = LogisticRegression(C = 0.1)\n", 804 | "m.fit(x.sign(),y) # sign() in training data\n", 805 | "preds = m.predict(val_term_doc.sign()) # sign in test data\n", 806 | "preds2 = m.predict(x.sign())\n", 807 | "\n", 808 | "np.mean(preds == y), np.mean(preds2 == y)" 809 | ] 810 | }, 811 | { 812 | "cell_type": "markdown", 813 | "metadata": {}, 814 | "source": [ 815 | "Ok. So our validation accuracy has definitely improved using some regularization and training has reduced too. 
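(For reference: with the `l2` penalty, scikit-learn's `LogisticRegression` minimises $ \frac{1}{2}\|w\|_2^2 + C\sum_i \log(1 + e^{-y_i(x_i^\top w + w_0)}) $, so $C$ multiplies the data-fit term rather than the penalty. A small $C$, equivalently a large $ \lambda = 1/C $, therefore shrinks the weights more aggressively.)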
Let's try to tune `C` to find best value of it by using a gridsearch" 816 | ] 817 | }, 818 | { 819 | "cell_type": "code", 820 | "execution_count": 36, 821 | "metadata": {}, 822 | "outputs": [], 823 | "source": [ 824 | "def fitOne(model, X, y, params):\n", 825 | " m = model(**params)\n", 826 | " return m.fit(X, y)\n", 827 | "\n", 828 | "def fitModels(model, paramGrid, X, y, n_jobs=-1, verbose=10):\n", 829 | " return Parallel(n_jobs=n_jobs, verbose=verbose)(delayed(fitOne)(model,\n", 830 | " X,\n", 831 | " y,\n", 832 | " params) for params in paramGrid)" 833 | ] 834 | }, 835 | { 836 | "cell_type": "code", 837 | "execution_count": 51, 838 | "metadata": {}, 839 | "outputs": [ 840 | { 841 | "name": "stderr", 842 | "output_type": "stream", 843 | "text": [ 844 | "[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 1.1s\n", 845 | "[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 1.9s remaining: 4.8s\n", 846 | "[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 2.8s remaining: 3.8s\n", 847 | "[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 4.6s remaining: 3.5s\n", 848 | "[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 7.7s remaining: 3.1s\n", 849 | "[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 10.0s remaining: 0.0s\n", 850 | "[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 10.0s finished\n" 851 | ] 852 | } 853 | ], 854 | "source": [ 855 | "model = LogisticRegression\n", 856 | "grid = {\n", 857 | " 'C': [1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100], # regularization\n", 858 | " 'penalty': ['l2'], # penalty type\n", 859 | " 'n_jobs': [-1] # parallelize within each fit over all cores\n", 860 | "}\n", 861 | "paramGrid = ParameterGrid(grid)\n", 862 | "myModels = fitModels(model, paramGrid, x.sign(), y)" 863 | ] 864 | }, 865 | { 866 | "cell_type": "code", 867 | "execution_count": 52, 868 | "metadata": {}, 869 | "outputs": [], 870 | "source": [ 871 | "def print_score(m):\n", 872 | " preds = m.predict(val_term_doc.sign()) # sign in test data\n", 873 | " preds2 = m.predict(x.sign())\n", 874 | "\n", 875 | " print(np.mean(preds == y), np.mean(preds2 == y))" 876 | ] 877 | }, 878 | { 879 | "cell_type": "code", 880 | "execution_count": 53, 881 | "metadata": {}, 882 | "outputs": [ 883 | { 884 | "name": "stdout", 885 | "output_type": "stream", 886 | "text": [ 887 | "0.80576 0.81452\n", 888 | "0.85056 0.8644\n", 889 | "0.8798 0.91468\n", 890 | "0.884 0.96736\n", 891 | "0.87384 0.99784\n", 892 | "0.86824 1.0\n", 893 | "0.8638 1.0\n" 894 | ] 895 | } 896 | ], 897 | "source": [ 898 | "for i in myModels:\n", 899 | " print_score(i)" 900 | ] 901 | }, 902 | { 903 | "cell_type": "markdown", 904 | "metadata": {}, 905 | "source": [ 906 | "In above print, lhs is validation score and rhs is train score. Validation score is highest for `C` = 0.1 and we can chose 0.1 as our regularization " 907 | ] 908 | }, 909 | { 910 | "cell_type": "markdown", 911 | "metadata": {}, 912 | "source": [ 913 | "What else can be tried to improve the accuracy? One thing that can be tried is tweaking Naive bayes a little bit. We assumed our words/features to be independent of each other. Let's try some pair interaction that is something like making new features which are made using combinations of old independent features. \n", 914 | "\n", 915 | "We can use `bigrams` or `trigrams`. What are `bigrams`. Very simple concept. Some words tend to appear together very frequently. Like `this is`, `what are`, `you are` etc. One major drawback of `naive` bayes (that's why it is called naive :P) is that it assumes all our features to be independent. 
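Concretely, the multinomial naive bayes assumption is $ P(d \mid y) \approx \prod_w P(w \mid y)^{c_w(d)} $, where $ c_w(d) $ is how many times word $w$ occurs in document $d$ and every word is treated as independent given the class. Taking the log of the posterior ratio turns this into the linear score used above, $ \log \frac{P(y=1 \mid d)}{P(y=0 \mid d)} = b + \sum_w c_w(d)\, r_w $, i.e. exactly `val_term_doc @ r.T + b`.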
But in actual sense they are not. Like frequenct occuring `what are`. these will come together and be dependent of each other. \n", 916 | "\n", 917 | "Bigram example -- `what are you doing` -> `what are`, `are you`, `you doing` . Adding bigrams will add both sinle word double adjacent word features" 918 | ] 919 | }, 920 | { 921 | "cell_type": "code", 922 | "execution_count": 37, 923 | "metadata": {}, 924 | "outputs": [], 925 | "source": [ 926 | "# ngram_range=(1,2) does unigram, bigram and trigram. \n", 927 | "# max_features is limiting documents to not make each and every combination of adjacent 3 features. \n", 928 | "# It first sorts tri/bi grams based on number of together occurences and only takes ngrams which appear most.\n", 929 | "\n", 930 | "veczr = CountVectorizer(ngram_range=(1,3), tokenizer=tokenize, max_features=800000)\n", 931 | "trn_term_doc = veczr.fit_transform(trn)\n", 932 | "val_term_doc = veczr.transform(val)" 933 | ] 934 | }, 935 | { 936 | "cell_type": "code", 937 | "execution_count": 38, 938 | "metadata": {}, 939 | "outputs": [ 940 | { 941 | "data": { 942 | "text/plain": [ 943 | "((25000, 800000), (25000, 800000))" 944 | ] 945 | }, 946 | "execution_count": 38, 947 | "metadata": {}, 948 | "output_type": "execute_result" 949 | } 950 | ], 951 | "source": [ 952 | "trn_term_doc.shape, val_term_doc.shape # 25k rows and max_feature (800k) columns" 953 | ] 954 | }, 955 | { 956 | "cell_type": "markdown", 957 | "metadata": {}, 958 | "source": [ 959 | "let's see how vocublary look like now" 960 | ] 961 | }, 962 | { 963 | "cell_type": "code", 964 | "execution_count": 39, 965 | "metadata": {}, 966 | "outputs": [], 967 | "source": [ 968 | "vocab = veczr.get_feature_names()" 969 | ] 970 | }, 971 | { 972 | "cell_type": "code", 973 | "execution_count": 40, 974 | "metadata": {}, 975 | "outputs": [ 976 | { 977 | "data": { 978 | "text/plain": [ 979 | "['! the camera',\n", 980 | " '! the cast',\n", 981 | " '! the character',\n", 982 | " '! the characters',\n", 983 | " '! the cinematography',\n", 984 | " '! the climax',\n", 985 | " '! the dialog',\n", 986 | " '! the direction']" 987 | ] 988 | }, 989 | "execution_count": 40, 990 | "metadata": {}, 991 | "output_type": "execute_result" 992 | } 993 | ], 994 | "source": [ 995 | "vocab[1002:1010]" 996 | ] 997 | }, 998 | { 999 | "cell_type": "markdown", 1000 | "metadata": {}, 1001 | "source": [ 1002 | "these words tend to appear together a lot." 1003 | ] 1004 | }, 1005 | { 1006 | "cell_type": "code", 1007 | "execution_count": 41, 1008 | "metadata": {}, 1009 | "outputs": [], 1010 | "source": [ 1011 | "# repeating same steps to extract x, y \n", 1012 | "\n", 1013 | "x = trn_term_doc.sign()\n", 1014 | "y = trn_Y\n", 1015 | "val_x = val_term_doc.sign()\n", 1016 | "\n", 1017 | "p = x[y == 1].sum(0) + 1 # i need sum of words appearing in class 1. i.e in all docs\n", 1018 | "q = x[y == 0].sum(0) + 1\n", 1019 | "r = np.log((p/p.sum()) / (q/q.sum())) # r is log count ratio. 
log of count ratios\n", 1020 | "b = np.log((y==1).sum()/(y==0).sum())" 1021 | ] 1022 | }, 1023 | { 1024 | "cell_type": "code", 1025 | "execution_count": 42, 1026 | "metadata": {}, 1027 | "outputs": [ 1028 | { 1029 | "data": { 1030 | "text/plain": [ 1031 | "0.88044" 1032 | ] 1033 | }, 1034 | "execution_count": 42, 1035 | "metadata": {}, 1036 | "output_type": "execute_result" 1037 | } 1038 | ], 1039 | "source": [ 1040 | "# naive bayes again\n", 1041 | "\n", 1042 | "pre_preds = val_term_doc.sign() @ r.T\n", 1043 | "preds = pre_preds.T>0\n", 1044 | "(preds==val_y).mean()" 1045 | ] 1046 | }, 1047 | { 1048 | "cell_type": "code", 1049 | "execution_count": 74, 1050 | "metadata": {}, 1051 | "outputs": [ 1052 | { 1053 | "data": { 1054 | "text/plain": [ 1055 | "0.90500000000000003" 1056 | ] 1057 | }, 1058 | "execution_count": 74, 1059 | "metadata": {}, 1060 | "output_type": "execute_result" 1061 | } 1062 | ], 1063 | "source": [ 1064 | "m = LogisticRegression(C=0.1, dual=True)\n", 1065 | "m.fit(x, y);\n", 1066 | "\n", 1067 | "preds = m.predict(val_x)\n", 1068 | "(preds.T==val_y).mean()" 1069 | ] 1070 | }, 1071 | { 1072 | "cell_type": "markdown", 1073 | "metadata": {}, 1074 | "source": [ 1075 | "Now the idea is to fit a logistic regression model with updated input matrix. We are updating input x to x.mult.r where r is log count ratio. Why? So, in NB solution x@r.T was our output (matrix mult) which was saying that `r` is a good weight matrix for our linear layer. Now" 1076 | ] 1077 | }, 1078 | { 1079 | "cell_type": "code", 1080 | "execution_count": 80, 1081 | "metadata": {}, 1082 | "outputs": [ 1083 | { 1084 | "data": { 1085 | "text/plain": [ 1086 | "<25000x800000 sparse matrix of type ''\n", 1087 | "\twith 12589101 stored elements in COOrdinate format>" 1088 | ] 1089 | }, 1090 | "execution_count": 80, 1091 | "metadata": {}, 1092 | "output_type": "execute_result" 1093 | } 1094 | ], 1095 | "source": [ 1096 | "x.multiply(r)" 1097 | ] 1098 | }, 1099 | { 1100 | "cell_type": "code", 1101 | "execution_count": 84, 1102 | "metadata": {}, 1103 | "outputs": [ 1104 | { 1105 | "data": { 1106 | "text/plain": [ 1107 | "matrix([[ -19.31273],\n", 1108 | " [-322.31574],\n", 1109 | " [ -36.51194],\n", 1110 | " ..., \n", 1111 | " [ 101.11562],\n", 1112 | " [ 99.71359],\n", 1113 | " [ 42.95914]])" 1114 | ] 1115 | }, 1116 | "execution_count": 84, 1117 | "metadata": {}, 1118 | "output_type": "execute_result" 1119 | } 1120 | ], 1121 | "source": [ 1122 | "x@r.T" 1123 | ] 1124 | }, 1125 | { 1126 | "cell_type": "code", 1127 | "execution_count": 85, 1128 | "metadata": {}, 1129 | "outputs": [ 1130 | { 1131 | "data": { 1132 | "text/plain": [ 1133 | "0.91768000000000005" 1134 | ] 1135 | }, 1136 | "execution_count": 85, 1137 | "metadata": {}, 1138 | "output_type": "execute_result" 1139 | } 1140 | ], 1141 | "source": [ 1142 | "x_nb = x.multiply(r)\n", 1143 | "\n", 1144 | "m = LogisticRegression(dual=True, C=0.1)\n", 1145 | "m.fit(x_nb, y);\n", 1146 | "\n", 1147 | "val_x_nb = val_x.multiply(r)\n", 1148 | "\n", 1149 | "preds = m.predict(val_x_nb)\n", 1150 | "(preds.T==val_y).mean()" 1151 | ] 1152 | }, 1153 | { 1154 | "cell_type": "markdown", 1155 | "metadata": {}, 1156 | "source": [ 1157 | "Accuracy improved from the previous version" 1158 | ] 1159 | }, 1160 | { 1161 | "cell_type": "markdown", 1162 | "metadata": {}, 1163 | "source": [ 1164 | "### fasti NB SVM" 1165 | ] 1166 | }, 1167 | { 1168 | "cell_type": "code", 1169 | "execution_count": 43, 1170 | "metadata": {}, 1171 | "outputs": [], 1172 | "source": [ 1173 | "si = 2000 #(stop 
iteration)" 1174 | ] 1175 | }, 1176 | { 1177 | "cell_type": "code", 1178 | "execution_count": 47, 1179 | "metadata": {}, 1180 | "outputs": [], 1181 | "source": [ 1182 | "??TextClassifierData" 1183 | ] 1184 | }, 1185 | { 1186 | "cell_type": "code", 1187 | "execution_count": 53, 1188 | "metadata": {}, 1189 | "outputs": [], 1190 | "source": [ 1191 | "??dotprod_nb_learner() # gets model . dot product with naive bayes like in above case " 1192 | ] 1193 | }, 1194 | { 1195 | "cell_type": "code", 1196 | "execution_count": 46, 1197 | "metadata": {}, 1198 | "outputs": [], 1199 | "source": [ 1200 | "# Here is how we get a model from a bag of words\n", 1201 | "md = TextClassifierData.from_bow(trn_term_doc, trn_Y, val_term_doc, val_y, si)" 1202 | ] 1203 | }, 1204 | { 1205 | "cell_type": "code", 1206 | "execution_count": 54, 1207 | "metadata": {}, 1208 | "outputs": [ 1209 | { 1210 | "data": { 1211 | "application/vnd.jupyter.widget-view+json": { 1212 | "model_id": "6487dca33f1547489d107b13992a7b34", 1213 | "version_major": 2, 1214 | "version_minor": 0 1215 | }, 1216 | "text/html": [ 1217 | "

\n" 1230 | ], 1231 | "text/plain": [ 1232 | "HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))" 1233 | ] 1234 | }, 1235 | "metadata": {}, 1236 | "output_type": "display_data" 1237 | }, 1238 | { 1239 | "name": "stdout", 1240 | "output_type": "stream", 1241 | "text": [ 1242 | "[ 0. 0.02432 0.11949 0.91768] \n", 1243 | "\n" 1244 | ] 1245 | } 1246 | ], 1247 | "source": [ 1248 | "learner = md.dotprod_nb_learner() # \n", 1249 | "learner.fit(0.02, 1, wds=1e-6, cycle_len=1)" 1250 | ] 1251 | }, 1252 | { 1253 | "cell_type": "code", 1254 | "execution_count": 55, 1255 | "metadata": {}, 1256 | "outputs": [ 1257 | { 1258 | "data": { 1259 | "application/vnd.jupyter.widget-view+json": { 1260 | "model_id": "ae4bd499be00469eb2a7afb1199d4215", 1261 | "version_major": 2, 1262 | "version_minor": 0 1263 | }, 1264 | "text/html": [ 1265 | "

\n" 1278 | ], 1279 | "text/plain": [ 1280 | "HBox(children=(IntProgress(value=0, description='Epoch', max=2), HTML(value='')))" 1281 | ] 1282 | }, 1283 | "metadata": {}, 1284 | "output_type": "display_data" 1285 | }, 1286 | { 1287 | "name": "stdout", 1288 | "output_type": "stream", 1289 | "text": [ 1290 | "[ 0. 0.01933 0.11368 0.92148] \n", 1291 | "[ 1. 0.01358 0.1118 0.92104] \n", 1292 | "\n" 1293 | ] 1294 | } 1295 | ], 1296 | "source": [ 1297 | "learner.fit(0.02, 2, wds=1e-6, cycle_len=1)" 1298 | ] 1299 | }, 1300 | { 1301 | "cell_type": "markdown", 1302 | "metadata": {}, 1303 | "source": [ 1304 | "** 92.1% accurately predict whether IMDB review is good or bad**" 1305 | ] 1306 | } 1307 | ], 1308 | "metadata": { 1309 | "kernelspec": { 1310 | "display_name": "Python 3", 1311 | "language": "python", 1312 | "name": "python3" 1313 | }, 1314 | "language_info": { 1315 | "codemirror_mode": { 1316 | "name": "ipython", 1317 | "version": 3 1318 | }, 1319 | "file_extension": ".py", 1320 | "mimetype": "text/x-python", 1321 | "name": "python", 1322 | "nbconvert_exporter": "python", 1323 | "pygments_lexer": "ipython3", 1324 | "version": "3.6.3" 1325 | } 1326 | }, 1327 | "nbformat": 4, 1328 | "nbformat_minor": 2 1329 | } 1330 | --------------------------------------------------------------------------------
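As a compact recap of the main trick built up in the notebook above, here is a minimal self-contained sketch: binarise the term-document matrix, compute the log-count ratio `r` and log prior `b`, then fit a regularised logistic regression on the `r`-scaled features. It uses a tiny synthetic corpus rather than the IMDB data, and the settings are simplified (uni+bi-grams, default solver) compared to the notebook's `ngram_range=(1,3)`, `max_features=800000`, `dual=True` run; only `C=0.1` mirrors the tuned value.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["loved this movie", "great acting and a great story",
        "terrible plot", "boring and terrible acting"]
y = np.array([1, 1, 0, 0])  # 1 = positive review, 0 = negative review

# Bag of uni+bi-grams, binarised: keep only "does this n-gram occur", not how often
vec = CountVectorizer(ngram_range=(1, 2))
x = vec.fit_transform(docs).sign()

# Log-count ratio r and log prior b, with +1 smoothing
p = np.asarray(x[y == 1].sum(0)) + 1
q = np.asarray(x[y == 0].sum(0)) + 1
r = np.log((p / p.sum()) / (q / q.sum()))
b = np.log((y == 1).sum() / (y == 0).sum())

# Plain naive Bayes decision: predict 1 when the summed log ratios plus the log prior are > 0
nb_pred = (np.asarray(x @ r.T).ravel() + b > 0).astype(int)

# Naive-Bayes-weighted logistic regression: r-scaled binarised counts as features
clf = LogisticRegression(C=0.1).fit(x.multiply(r), y)
print(nb_pred, clf.predict(x.multiply(r)))
```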