"
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {
22 | "button": false,
23 | "deletable": true,
24 | "new_sheet": false,
25 | "run_control": {
26 | "read_only": false
27 | }
28 | },
29 | "source": [
30 | "In this lab exercise, you will learn a popular machine learning algorithm, Decision Tree. You will use this classification algorithm to build a model from historical data of patients, and their response to different medications. Then you use the trained decision tree to predict the class of a unknown patient, or to find a proper drug for a new patient."
31 | ]
32 | },
33 | {
34 | "cell_type": "markdown",
35 | "metadata": {},
36 | "source": [
37 | "
\n",
105 | " Imagine that you are a medical researcher compiling data for a study. You have collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of 5 medications, Drug A, Drug B, Drug c, Drug x and y. \n",
106 | " \n",
107 | " \n",
108 | " Part of your job is to build a model to find out which drug might be appropriate for a future patient with the same illness. The feature sets of this dataset are Age, Sex, Blood Pressure, and Cholesterol of patients, and the target is the drug that each patient responded to.\n",
109 | " \n",
110 | " \n",
111 | " It is a sample of binary classifier, and you can use the training part of the dataset \n",
112 | " to build a decision tree, and then use it to predict the class of a unknown patient, or to prescribe it to a new patient.\n",
113 | "
\n",
765 | " We will first create an instance of the DecisionTreeClassifier called drugTree. \n",
766 | " Inside of the classifier, specify criterion=\"entropy\" so we can see the information gain of each node.\n",
767 | "
\n",
1118 | "\n",
1119 | "IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n",
1120 | "\n",
1121 | "Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n",
1122 | "\n",
1123 | "
Saeed Aghabozorgi, PhD is a Data Scientist in IBM with a track record of developing enterprise level applications that substantially increases clients’ ability to turn data into actionable knowledge. He is a researcher in data mining field and expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.
"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "In this notebook, you will learn Logistic Regression, and then, you'll create a model for a telecommunication company, to predict when its customers will leave for a competitor, so that they can take some action to retain the customers."
23 | ]
24 | },
25 | {
26 | "cell_type": "markdown",
27 | "metadata": {},
28 | "source": [
29 | "
\n",
40 | " \n",
41 | ""
42 | ]
43 | },
44 | {
45 | "cell_type": "markdown",
46 | "metadata": {
47 | "button": false,
48 | "new_sheet": false,
49 | "run_control": {
50 | "read_only": false
51 | }
52 | },
53 | "source": [
54 | "\n",
55 | "## What is the difference between Linear and Logistic Regression?\n",
56 | "\n",
57 | "While Linear Regression is suited for estimating continuous values (e.g. estimating house price), it is not the best tool for predicting the class of an observed data point. In order to estimate the class of a data point, we need some sort of guidance on what would be the most probable class for that data point. For this, we use Logistic Regression.\n",
58 | "\n",
59 | "
\n",
60 | "Recall linear regression:\n",
61 | " \n",
62 | " \n",
63 | " As you know, Linear regression finds a function that relates a continuous dependent variable, y, to some predictors (independent variables $x_1$, $x_2$, etc.). For example, Simple linear regression assumes a function of the form:\n",
64 | "
\n",
65 | "$$\n",
66 | "y = \\theta_0 + \\theta_1 x_1 + \\theta_2 x_2 + \\cdots\n",
67 | "$$\n",
68 | " \n",
69 | "and finds the values of parameters $\\theta_0, \\theta_1, \\theta_2$, etc, where the term $\\theta_0$ is the \"intercept\". It can be generally shown as:\n",
70 | "
\n",
77 | "\n",
78 | "Logistic Regression is a variation of Linear Regression, useful when the observed dependent variable, y, is categorical. It produces a formula that predicts the probability of the class label as a function of the independent variables.\n",
79 | "\n",
80 | "Logistic regression fits a special s-shaped curve by taking the linear regression and transforming the numeric estimate into a probability with the following function, which is called sigmoid function 𝜎:\n",
81 | "\n",
82 | "$$\n",
83 | "ℎ_\\theta(𝑥) = \\sigma({\\theta^TX}) = \\frac {e^{(\\theta_0 + \\theta_1 x_1 + \\theta_2 x_2 +...)}}{1 + e^{(\\theta_0 + \\theta_1 x_1 + \\theta_2 x_2 +\\cdots)}}\n",
84 | "$$\n",
85 | "Or:\n",
86 | "$$\n",
87 | "ProbabilityOfaClass_1 = P(Y=1|X) = \\sigma({\\theta^TX}) = \\frac{e^{\\theta^TX}}{1+e^{\\theta^TX}} \n",
88 | "$$\n",
89 | "\n",
90 | "In this equation, ${\\theta^TX}$ is the regression result (the sum of the variables weighted by the coefficients), `exp` is the exponential function and $\\sigma(\\theta^TX)$ is the sigmoid or [logistic function](http://en.wikipedia.org/wiki/Logistic_function), also called logistic curve. It is a common \"S\" shape (sigmoid curve).\n",
91 | "\n",
92 | "So, briefly, Logistic Regression passes the input through the logistic/sigmoid but then treats the result as a probability:\n",
93 | "\n",
94 | "\n",
96 | "\n",
97 | "\n",
98 | "The objective of __Logistic Regression__ algorithm, is to find the best parameters θ, for $ℎ_\\theta(𝑥)$ = $\\sigma({\\theta^TX})$, in such a way that the model best predicts the class of each case."
99 | ]
100 | },
101 | {
102 | "cell_type": "markdown",
103 | "metadata": {},
104 | "source": [
105 | "### Customer churn with Logistic Regression\n",
106 | "A telecommunications company is concerned about the number of customers leaving their land-line business for cable competitors. They need to understand who is leaving. Imagine that you are an analyst at this company and you have to find out who is leaving and why."
107 | ]
108 | },
109 | {
110 | "cell_type": "markdown",
111 | "metadata": {
112 | "button": false,
113 | "new_sheet": false,
114 | "run_control": {
115 | "read_only": false
116 | }
117 | },
118 | "source": [
119 | "Lets first import required libraries:"
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": 2,
125 | "metadata": {
126 | "button": false,
127 | "new_sheet": false,
128 | "run_control": {
129 | "read_only": false
130 | }
131 | },
132 | "outputs": [],
133 | "source": [
134 | "import pandas as pd\n",
135 | "import pylab as pl\n",
136 | "import numpy as np\n",
137 | "import scipy.optimize as opt\n",
138 | "from sklearn import preprocessing\n",
139 | "%matplotlib inline \n",
140 | "import matplotlib.pyplot as plt"
141 | ]
142 | },
143 | {
144 | "cell_type": "markdown",
145 | "metadata": {
146 | "button": false,
147 | "new_sheet": false,
148 | "run_control": {
149 | "read_only": false
150 | }
151 | },
152 | "source": [
153 | "
About the dataset
\n",
154 | "We will use a telecommunications dataset for predicting customer churn. This is a historical customer dataset where each row represents one customer. The data is relatively easy to understand, and you may uncover insights you can use immediately. Typically it is less expensive to keep customers than acquire new ones, so the focus of this analysis is to predict the customers who will stay with the company. \n",
155 | "\n",
156 | "\n",
157 | "This data set provides information to help you predict what behavior will help you to retain customers. You can analyze all relevant customer data and develop focused customer retention programs.\n",
158 | "\n",
159 | "\n",
160 | "\n",
161 | "The dataset includes information about:\n",
162 | "\n",
163 | "- Customers who left within the last month – the column is called Churn\n",
164 | "- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies\n",
165 | "- Customer account information – how long they had been a customer, contract, payment method, paperless billing, monthly charges, and total charges\n",
166 | "- Demographic info about customers – gender, age range, and if they have partners and dependents\n"
167 | ]
168 | },
169 | {
170 | "cell_type": "markdown",
171 | "metadata": {
172 | "button": false,
173 | "new_sheet": false,
174 | "run_control": {
175 | "read_only": false
176 | }
177 | },
178 | "source": [
179 | "### Load the Telco Churn data \n",
180 | "Telco Churn is a hypothetical data file that concerns a telecommunications company's efforts to reduce turnover in its customer base. Each case corresponds to a separate customer and it records various demographic and service usage information. Before you can work with the data, you must use the URL to get the ChurnData.csv.\n",
181 | "\n",
182 | "To download the data, we will use `!wget` to download it from IBM Object Storage."
183 | ]
184 | },
185 | {
186 | "cell_type": "code",
187 | "execution_count": 3,
188 | "metadata": {
189 | "button": false,
190 | "new_sheet": false,
191 | "run_control": {
192 | "read_only": false
193 | }
194 | },
195 | "outputs": [
196 | {
197 | "name": "stdout",
198 | "output_type": "stream",
199 | "text": [
200 | "--2019-07-11 02:13:17-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/ChurnData.csv\n",
201 | "Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193\n",
202 | "Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.193|:443... connected.\n",
203 | "HTTP request sent, awaiting response... 200 OK\n",
204 | "Length: 36144 (35K) [text/csv]\n",
205 | "Saving to: ‘ChurnData.csv’\n",
206 | "\n",
207 | "ChurnData.csv 100%[===================>] 35.30K --.-KB/s in 0.02s \n",
208 | "\n",
209 | "2019-07-11 02:13:17 (1.63 MB/s) - ‘ChurnData.csv’ saved [36144/36144]\n",
210 | "\n"
211 | ]
212 | }
213 | ],
214 | "source": [
215 | "#Click here and press Shift+Enter\n",
216 | "!wget -O ChurnData.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/ChurnData.csv"
217 | ]
218 | },
219 | {
220 | "cell_type": "markdown",
221 | "metadata": {},
222 | "source": [
223 | "__Did you know?__ When it comes to Machine Learning, you will likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](http://cocl.us/ML0101EN-IBM-Offer-CC)"
224 | ]
225 | },
226 | {
227 | "cell_type": "markdown",
228 | "metadata": {
229 | "button": false,
230 | "new_sheet": false,
231 | "run_control": {
232 | "read_only": false
233 | }
234 | },
235 | "source": [
236 | "### Load Data From CSV File "
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": 4,
242 | "metadata": {
243 | "button": false,
244 | "new_sheet": false,
245 | "run_control": {
246 | "read_only": false
247 | }
248 | },
249 | "outputs": [
250 | {
251 | "data": {
252 | "text/html": [
253 | "
"
1001 | ]
1002 | },
1003 | {
1004 | "cell_type": "markdown",
1005 | "metadata": {},
1006 | "source": [
1007 | "### jaccard index\n",
1008 | "Lets try jaccard index for accuracy evaluation. we can define jaccard as the size of the intersection divided by the size of the union of two label sets. If the entire set of predicted labels for a sample strictly match with the true set of labels, then the subset accuracy is 1.0; otherwise it is 0.0.\n",
1009 | "\n"
1010 | ]
1011 | },
1012 | {
1013 | "cell_type": "code",
1014 | "execution_count": null,
1015 | "metadata": {},
1016 | "outputs": [],
1017 | "source": [
1018 | "from sklearn.metrics import jaccard_similarity_score\n",
1019 | "jaccard_similarity_score(y_test, yhat)"
1020 | ]
1021 | },
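{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: `jaccard_similarity_score` was deprecated and later removed from scikit-learn (in 0.23). In newer versions, a close replacement for binary labels is `jaccard_score` (a sketch; `pos_label` selects which class is treated as positive):\n",
"\n",
"```python\n",
"from sklearn.metrics import jaccard_score\n",
"jaccard_score(y_test, yhat, pos_label=1)\n",
"```"
]
},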
1022 | {
1023 | "cell_type": "markdown",
1024 | "metadata": {},
1025 | "source": [
1026 | "### confusion matrix\n",
1027 | "Another way of looking at accuracy of classifier is to look at __confusion matrix__."
1028 | ]
1029 | },
1030 | {
1031 | "cell_type": "code",
1032 | "execution_count": null,
1033 | "metadata": {},
1034 | "outputs": [],
1035 | "source": [
1036 | "from sklearn.metrics import classification_report, confusion_matrix\n",
1037 | "import itertools\n",
1038 | "def plot_confusion_matrix(cm, classes,\n",
1039 | " normalize=False,\n",
1040 | " title='Confusion matrix',\n",
1041 | " cmap=plt.cm.Blues):\n",
1042 | " \"\"\"\n",
1043 | " This function prints and plots the confusion matrix.\n",
1044 | " Normalization can be applied by setting `normalize=True`.\n",
1045 | " \"\"\"\n",
1046 | " if normalize:\n",
1047 | " cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n",
1048 | " print(\"Normalized confusion matrix\")\n",
1049 | " else:\n",
1050 | " print('Confusion matrix, without normalization')\n",
1051 | "\n",
1052 | " print(cm)\n",
1053 | "\n",
1054 | " plt.imshow(cm, interpolation='nearest', cmap=cmap)\n",
1055 | " plt.title(title)\n",
1056 | " plt.colorbar()\n",
1057 | " tick_marks = np.arange(len(classes))\n",
1058 | " plt.xticks(tick_marks, classes, rotation=45)\n",
1059 | " plt.yticks(tick_marks, classes)\n",
1060 | "\n",
1061 | " fmt = '.2f' if normalize else 'd'\n",
1062 | " thresh = cm.max() / 2.\n",
1063 | " for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n",
1064 | " plt.text(j, i, format(cm[i, j], fmt),\n",
1065 | " horizontalalignment=\"center\",\n",
1066 | " color=\"white\" if cm[i, j] > thresh else \"black\")\n",
1067 | "\n",
1068 | " plt.tight_layout()\n",
1069 | " plt.ylabel('True label')\n",
1070 | " plt.xlabel('Predicted label')\n",
1071 | "print(confusion_matrix(y_test, yhat, labels=[1,0]))"
1072 | ]
1073 | },
1074 | {
1075 | "cell_type": "code",
1076 | "execution_count": null,
1077 | "metadata": {},
1078 | "outputs": [],
1079 | "source": [
1080 | "# Compute confusion matrix\n",
1081 | "cnf_matrix = confusion_matrix(y_test, yhat, labels=[1,0])\n",
1082 | "np.set_printoptions(precision=2)\n",
1083 | "\n",
1084 | "\n",
1085 | "# Plot non-normalized confusion matrix\n",
1086 | "plt.figure()\n",
1087 | "plot_confusion_matrix(cnf_matrix, classes=['churn=1','churn=0'],normalize= False, title='Confusion matrix')"
1088 | ]
1089 | },
1090 | {
1091 | "cell_type": "markdown",
1092 | "metadata": {},
1093 | "source": [
1094 | "Look at first row. The first row is for customers whose actual churn value in test set is 1.\n",
1095 | "As you can calculate, out of 40 customers, the churn value of 15 of them is 1. \n",
1096 | "And out of these 15, the classifier correctly predicted 6 of them as 1, and 9 of them as 0. \n",
1097 | "\n",
1098 | "It means, for 6 customers, the actual churn value were 1 in test set, and classifier also correctly predicted those as 1. However, while the actual label of 9 customers were 1, the classifier predicted those as 0, which is not very good. We can consider it as error of the model for first row.\n",
1099 | "\n",
1100 | "What about the customers with churn value 0? Lets look at the second row.\n",
1101 | "It looks like there were 25 customers whom their churn value were 0. \n",
1102 | "\n",
1103 | "\n",
1104 | "The classifier correctly predicted 24 of them as 0, and one of them wrongly as 1. So, it has done a good job in predicting the customers with churn value 0. A good thing about confusion matrix is that shows the model’s ability to correctly predict or separate the classes. In specific case of binary classifier, such as this example, we can interpret these numbers as the count of true positives, false positives, true negatives, and false negatives. "
1105 | ]
1106 | },
1107 | {
1108 | "cell_type": "code",
1109 | "execution_count": null,
1110 | "metadata": {},
1111 | "outputs": [],
1112 | "source": [
1113 | "print (classification_report(y_test, yhat))\n"
1114 | ]
1115 | },
1116 | {
1117 | "cell_type": "markdown",
1118 | "metadata": {},
1119 | "source": [
1120 | "Based on the count of each section, we can calculate precision and recall of each label:\n",
1121 | "\n",
1122 | "\n",
1123 | "- __Precision__ is a measure of the accuracy provided that a class label has been predicted. It is defined by: precision = TP / (TP + FP)\n",
1124 | "\n",
1125 | "- __Recall__ is true positive rate. It is defined as: Recall = TP / (TP + FN)\n",
1126 | "\n",
1127 | " \n",
1128 | "So, we can calculate precision and recall of each class.\n",
1129 | "\n",
1130 | "__F1 score:__\n",
1131 | "Now we are in the position to calculate the F1 scores for each label based on the precision and recall of that label. \n",
1132 | "\n",
1133 | "The F1 score is the harmonic average of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0. It is a good way to show that a classifer has a good value for both recall and precision.\n",
1134 | "\n",
1135 | "\n",
1136 | "And finally, we can tell the average accuracy for this classifier is the average of the F1-score for both labels, which is 0.72 in our case."
1137 | ]
1138 | },
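{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a worked check using the counts read off the confusion matrix above (with churn=1 as the positive class: TP=6, FN=9, FP=1, TN=24):\n",
"\n",
"- Class 1: precision = 6/(6+1) ≈ 0.86, recall = 6/(6+9) = 0.40, F1 ≈ 0.55\n",
"- Class 0: precision = 24/(24+9) ≈ 0.73, recall = 24/(24+1) = 0.96, F1 ≈ 0.83\n",
"\n",
"The weighted average F1 is then (15 × 0.55 + 25 × 0.83) / 40 ≈ 0.72, which is the value quoted above."
]
},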
1139 | {
1140 | "cell_type": "markdown",
1141 | "metadata": {},
1142 | "source": [
1143 | "### log loss\n",
1144 | "Now, lets try __log loss__ for evaluation. In logistic regression, the output can be the probability of customer churn is yes (or equals to 1). This probability is a value between 0 and 1.\n",
1145 | "Log loss( Logarithmic loss) measures the performance of a classifier where the predicted output is a probability value between 0 and 1. \n"
1146 | ]
1147 | },
1148 | {
1149 | "cell_type": "code",
1150 | "execution_count": 1,
1151 | "metadata": {},
1152 | "outputs": [
1153 | {
1154 | "ename": "NameError",
1155 | "evalue": "name 'y_test' is not defined",
1156 | "output_type": "error",
1157 | "traceback": [
1158 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
1159 | "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
1160 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msklearn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmetrics\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mlog_loss\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mlog_loss\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my_test\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0myhat_prob\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
1161 | "\u001b[0;31mNameError\u001b[0m: name 'y_test' is not defined"
1162 | ]
1163 | }
1164 | ],
1165 | "source": [
1166 | "from sklearn.metrics import log_loss\n",
1167 | "log_loss(y_test, yhat_prob)"
1168 | ]
1169 | },
1170 | {
1171 | "cell_type": "markdown",
1172 | "metadata": {},
1173 | "source": [
1174 | "
Practice
\n",
1175 | "Try to build Logistic Regression model again for the same dataset, but this time, use different __solver__ and __regularization__ values? What is new __logLoss__ value?"
1176 | ]
1177 | },
1178 | {
1179 | "cell_type": "code",
1180 | "execution_count": null,
1181 | "metadata": {},
1182 | "outputs": [],
1183 | "source": [
1184 | "# write your code here\n",
1185 | "\n"
1186 | ]
1187 | },
1188 | {
1189 | "cell_type": "markdown",
1190 | "metadata": {},
1191 | "source": [
1192 | "Double-click __here__ for the solution.\n",
1193 | "\n",
1194 | ""
1201 | ]
1202 | },
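{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible approach (a sketch, not the official solution; the `solver` and `C` values are illustrative, and `LogisticRegression`, `log_loss`, and the train/test split are assumed from earlier in the notebook):\n",
"\n",
"```python\n",
"LR2 = LogisticRegression(C=0.01, solver='sag').fit(X_train, y_train)\n",
"yhat_prob2 = LR2.predict_proba(X_test)\n",
"print(\"LogLoss: %.2f\" % log_loss(y_test, yhat_prob2))\n",
"```"
]
},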
1203 | {
1204 | "cell_type": "markdown",
1205 | "metadata": {
1206 | "button": false,
1207 | "new_sheet": false,
1208 | "run_control": {
1209 | "read_only": false
1210 | }
1211 | },
1212 | "source": [
1213 | "
Want to learn more?
\n",
1214 | "\n",
1215 | "IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n",
1216 | "\n",
1217 | "Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n",
1218 | "\n",
1219 | "
Saeed Aghabozorgi, PhD is a Data Scientist in IBM with a track record of developing enterprise level applications that substantially increases clients’ ability to turn data into actionable knowledge. He is a researcher in data mining field and expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.
"
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "In this notebook, you will use SVM (Support Vector Machines) to build and train a model using human cell records, and classify cells to whether the samples are benign or malignant.\n",
17 | "\n",
18 | "SVM works by mapping data to a high-dimensional feature space so that data points can be categorized, even when the data are not otherwise linearly separable. A separator between the categories is found, then the data is transformed in such a way that the separator could be drawn as a hyperplane. Following this, characteristics of new data can be used to predict the group to which a new record should belong."
19 | ]
20 | },
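{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of this idea (illustrative only; it assumes a train/test split of the cell-records features, which the notebook builds later):\n",
"\n",
"```python\n",
"from sklearn import svm\n",
"\n",
"clf = svm.SVC(kernel='rbf')   # the RBF kernel implicitly maps data to a higher-dimensional space\n",
"clf.fit(X_train, y_train)\n",
"yhat = clf.predict(X_test)\n",
"```"
]
},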
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "
\n",
669 | "Can you rebuild the model, but this time with a __linear__ kernel? You can use __kernel='linear'__ option, when you define the svm. How the accuracy changes with the new kernel function?"
670 | ]
671 | },
672 | {
673 | "cell_type": "code",
674 | "execution_count": null,
675 | "metadata": {},
676 | "outputs": [],
677 | "source": [
678 | "# write your code here\n"
679 | ]
680 | },
681 | {
682 | "cell_type": "markdown",
683 | "metadata": {},
684 | "source": [
685 | "Double-click __here__ for the solution.\n",
686 | "\n",
687 | ""
696 | ]
697 | },
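{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible approach (a sketch, not the official solution; it assumes the `svm` module, `f1_score`, and the `X_train`/`X_test`/`y_train`/`y_test` split from earlier in the notebook):\n",
"\n",
"```python\n",
"clf2 = svm.SVC(kernel='linear')\n",
"clf2.fit(X_train, y_train)\n",
"yhat2 = clf2.predict(X_test)\n",
"print(\"Avg F1-score: %.4f\" % f1_score(y_test, yhat2, average='weighted'))\n",
"```"
]
},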
698 | {
699 | "cell_type": "markdown",
700 | "metadata": {
701 | "button": false,
702 | "new_sheet": false,
703 | "run_control": {
704 | "read_only": false
705 | }
706 | },
707 | "source": [
708 | "
Want to learn more?
\n",
709 | "\n",
710 | "IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n",
711 | "\n",
712 | "Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n",
713 | "\n",
714 | "
Saeed Aghabozorgi, PhD is a Data Scientist in IBM with a track record of developing enterprise level applications that substantially increases clients’ ability to turn data into actionable knowledge. He is a researcher in data mining field and expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.
"
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "If the data shows a curvy trend, then linear regression will not produce very accurate results when compared to a non-linear regression because, as the name implies, linear regression presumes that the data is linear. \n",
17 | "Let's learn about non linear regressions and apply an example on python. In this notebook, we fit a non-linear model to the datapoints corrensponding to China's GDP from 1960 to 2014."
18 | ]
19 | },
20 | {
21 | "cell_type": "markdown",
22 | "metadata": {},
23 | "source": [
24 | "
Importing required libraries
"
25 | ]
26 | },
27 | {
28 | "cell_type": "code",
29 | "execution_count": null,
30 | "metadata": {
31 | "collapsed": false,
32 | "jupyter": {
33 | "outputs_hidden": false
34 | }
35 | },
36 | "outputs": [],
37 | "source": [
38 | "import numpy as np\n",
39 | "import matplotlib.pyplot as plt\n",
40 | "%matplotlib inline"
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {},
46 | "source": [
47 | "Though Linear regression is very good to solve many problems, it cannot be used for all datasets. First recall how linear regression, could model a dataset. It models a linear relation between a dependent variable y and independent variable x. It had a simple equation, of degree 1, for example y = $2x$ + 3."
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": null,
53 | "metadata": {},
54 | "outputs": [],
55 | "source": [
56 | "x = np.arange(-5.0, 5.0, 0.1)\n",
57 | "\n",
58 | "##You can adjust the slope and intercept to verify the changes in the graph\n",
59 | "y = 2*(x) + 3\n",
60 | "y_noise = 2 * np.random.normal(size=x.size)\n",
61 | "ydata = y + y_noise\n",
62 | "#plt.figure(figsize=(8,6))\n",
63 | "plt.plot(x, ydata, 'bo')\n",
64 | "plt.plot(x,y, 'r') \n",
65 | "plt.ylabel('Dependent Variable')\n",
66 | "plt.xlabel('Indepdendent Variable')\n",
67 | "plt.show()"
68 | ]
69 | },
70 | {
71 | "cell_type": "markdown",
72 | "metadata": {},
73 | "source": [
74 | "Non-linear regressions are a relationship between independent variables $x$ and a dependent variable $y$ which result in a non-linear function modeled data. Essentially any relationship that is not linear can be termed as non-linear, and is usually represented by the polynomial of $k$ degrees (maximum power of $x$). \n",
75 | "\n",
76 | "$$ \\ y = a x^3 + b x^2 + c x + d \\ $$\n",
77 | "\n",
78 | "Non-linear functions can have elements like exponentials, logarithms, fractions, and others. For example: $$ y = \\log(x)$$\n",
79 | " \n",
80 | "Or even, more complicated such as :\n",
81 | "$$ y = \\log(a x^3 + b x^2 + c x + d)$$"
82 | ]
83 | },
84 | {
85 | "cell_type": "markdown",
86 | "metadata": {},
87 | "source": [
88 | "Let's take a look at a cubic function's graph."
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": null,
94 | "metadata": {
95 | "collapsed": false,
96 | "jupyter": {
97 | "outputs_hidden": false
98 | }
99 | },
100 | "outputs": [],
101 | "source": [
102 | "x = np.arange(-5.0, 5.0, 0.1)\n",
103 | "\n",
104 | "##You can adjust the slope and intercept to verify the changes in the graph\n",
105 | "y = 1*(x**3) + 1*(x**2) + 1*x + 3\n",
106 | "y_noise = 20 * np.random.normal(size=x.size)\n",
107 | "ydata = y + y_noise\n",
108 | "plt.plot(x, ydata, 'bo')\n",
109 | "plt.plot(x,y, 'r') \n",
110 | "plt.ylabel('Dependent Variable')\n",
111 | "plt.xlabel('Indepdendent Variable')\n",
112 | "plt.show()"
113 | ]
114 | },
115 | {
116 | "cell_type": "markdown",
117 | "metadata": {},
118 | "source": [
119 | "As you can see, this function has $x^3$ and $x^2$ as independent variables. Also, the graphic of this function is not a straight line over the 2D plane. So this is a non-linear function."
120 | ]
121 | },
122 | {
123 | "cell_type": "markdown",
124 | "metadata": {},
125 | "source": [
126 | "Some other types of non-linear functions are:"
127 | ]
128 | },
129 | {
130 | "cell_type": "markdown",
131 | "metadata": {},
132 | "source": [
133 | "### Quadratic"
134 | ]
135 | },
136 | {
137 | "cell_type": "markdown",
138 | "metadata": {},
139 | "source": [
140 | "$$ Y = X^2 $$"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": null,
146 | "metadata": {
147 | "collapsed": false,
148 | "jupyter": {
149 | "outputs_hidden": false
150 | }
151 | },
152 | "outputs": [],
153 | "source": [
154 | "x = np.arange(-5.0, 5.0, 0.1)\n",
155 | "\n",
156 | "##You can adjust the slope and intercept to verify the changes in the graph\n",
157 | "\n",
158 | "y = np.power(x,2)\n",
159 | "y_noise = 2 * np.random.normal(size=x.size)\n",
160 | "ydata = y + y_noise\n",
161 | "plt.plot(x, ydata, 'bo')\n",
162 | "plt.plot(x,y, 'r') \n",
163 | "plt.ylabel('Dependent Variable')\n",
164 | "plt.xlabel('Indepdendent Variable')\n",
165 | "plt.show()"
166 | ]
167 | },
168 | {
169 | "cell_type": "markdown",
170 | "metadata": {},
171 | "source": [
172 | "### Exponential"
173 | ]
174 | },
175 | {
176 | "cell_type": "markdown",
177 | "metadata": {},
178 | "source": [
179 | "An exponential function with base c is defined by $$ Y = a + b c^X$$ where b ≠0, c > 0 , c ≠1, and x is any real number. The base, c, is constant and the exponent, x, is a variable. \n",
180 | "\n"
181 | ]
182 | },
183 | {
184 | "cell_type": "code",
185 | "execution_count": null,
186 | "metadata": {
187 | "collapsed": false,
188 | "jupyter": {
189 | "outputs_hidden": false
190 | }
191 | },
192 | "outputs": [],
193 | "source": [
194 | "X = np.arange(-5.0, 5.0, 0.1)\n",
195 | "\n",
196 | "##You can adjust the slope and intercept to verify the changes in the graph\n",
197 | "\n",
198 | "Y= np.exp(X)\n",
199 | "\n",
200 | "plt.plot(X,Y) \n",
201 | "plt.ylabel('Dependent Variable')\n",
202 | "plt.xlabel('Indepdendent Variable')\n",
203 | "plt.show()"
204 | ]
205 | },
206 | {
207 | "cell_type": "markdown",
208 | "metadata": {},
209 | "source": [
210 | "### Logarithmic\n",
211 | "\n",
212 | "The response $y$ is a results of applying logarithmic map from input $x$'s to output variable $y$. It is one of the simplest form of __log()__: i.e. $$ y = \\log(x)$$\n",
213 | "\n",
214 | "Please consider that instead of $x$, we can use $X$, which can be polynomial representation of the $x$'s. In general form it would be written as \n",
215 | "\\begin{equation}\n",
216 | "y = \\log(X)\n",
217 | "\\end{equation}"
218 | ]
219 | },
220 | {
221 | "cell_type": "code",
222 | "execution_count": null,
223 | "metadata": {
224 | "collapsed": false,
225 | "jupyter": {
226 | "outputs_hidden": false
227 | }
228 | },
229 | "outputs": [],
230 | "source": [
231 | "X = np.arange(-5.0, 5.0, 0.1)\n",
232 | "\n",
233 | "Y = np.log(X)\n",
234 | "\n",
235 | "plt.plot(X,Y) \n",
236 | "plt.ylabel('Dependent Variable')\n",
237 | "plt.xlabel('Indepdendent Variable')\n",
238 | "plt.show()"
239 | ]
240 | },
241 | {
242 | "cell_type": "markdown",
243 | "metadata": {},
244 | "source": [
245 | "### Sigmoidal/Logistic"
246 | ]
247 | },
248 | {
249 | "cell_type": "markdown",
250 | "metadata": {},
251 | "source": [
252 | "$$ Y = a + \\frac{b}{1+ c^{(X-d)}}$$"
253 | ]
254 | },
255 | {
256 | "cell_type": "code",
257 | "execution_count": null,
258 | "metadata": {},
259 | "outputs": [],
260 | "source": [
261 | "X = np.arange(-5.0, 5.0, 0.1)\n",
262 | "\n",
263 | "\n",
264 | "Y = 1-4/(1+np.power(3, X-2))\n",
265 | "\n",
266 | "plt.plot(X,Y) \n",
267 | "plt.ylabel('Dependent Variable')\n",
268 | "plt.xlabel('Indepdendent Variable')\n",
269 | "plt.show()"
270 | ]
271 | },
272 | {
273 | "cell_type": "markdown",
274 | "metadata": {},
275 | "source": [
276 | "\n",
277 | "# Non-Linear Regression example"
278 | ]
279 | },
280 | {
281 | "cell_type": "markdown",
282 | "metadata": {},
283 | "source": [
284 | "For an example, we're going to try and fit a non-linear model to the datapoints corresponding to China's GDP from 1960 to 2014. We download a dataset with two columns, the first, a year between 1960 and 2014, the second, China's corresponding annual gross domestic income in US dollars for that year. "
285 | ]
286 | },
287 | {
288 | "cell_type": "code",
289 | "execution_count": null,
290 | "metadata": {
291 | "collapsed": false,
292 | "jupyter": {
293 | "outputs_hidden": false
294 | }
295 | },
296 | "outputs": [],
297 | "source": [
298 | "import numpy as np\n",
299 | "import pandas as pd\n",
300 | "\n",
301 | "#downloading dataset\n",
302 | "!wget -nv -O china_gdp.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/china_gdp.csv\n",
303 | " \n",
304 | "df = pd.read_csv(\"china_gdp.csv\")\n",
305 | "df.head(10)"
306 | ]
307 | },
308 | {
309 | "cell_type": "markdown",
310 | "metadata": {},
311 | "source": [
312 | "__Did you know?__ When it comes to Machine Learning, you will likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](http://cocl.us/ML0101EN-IBM-Offer-CC)"
313 | ]
314 | },
315 | {
316 | "cell_type": "markdown",
317 | "metadata": {},
318 | "source": [
319 | "### Plotting the Dataset ###\n",
320 | "This is what the datapoints look like. It kind of looks like an either logistic or exponential function. The growth starts off slow, then from 2005 on forward, the growth is very significant. And finally, it decelerate slightly in the 2010s."
321 | ]
322 | },
323 | {
324 | "cell_type": "code",
325 | "execution_count": null,
326 | "metadata": {
327 | "collapsed": false,
328 | "jupyter": {
329 | "outputs_hidden": false
330 | }
331 | },
332 | "outputs": [],
333 | "source": [
334 | "plt.figure(figsize=(8,5))\n",
335 | "x_data, y_data = (df[\"Year\"].values, df[\"Value\"].values)\n",
336 | "plt.plot(x_data, y_data, 'ro')\n",
337 | "plt.ylabel('GDP')\n",
338 | "plt.xlabel('Year')\n",
339 | "plt.show()"
340 | ]
341 | },
342 | {
343 | "cell_type": "markdown",
344 | "metadata": {},
345 | "source": [
346 | "### Choosing a model ###\n",
347 | "\n",
348 | "From an initial look at the plot, we determine that the logistic function could be a good approximation,\n",
349 | "since it has the property of starting with a slow growth, increasing growth in the middle, and then decreasing again at the end; as illustrated below:"
350 | ]
351 | },
352 | {
353 | "cell_type": "code",
354 | "execution_count": null,
355 | "metadata": {
356 | "collapsed": false,
357 | "jupyter": {
358 | "outputs_hidden": false
359 | }
360 | },
361 | "outputs": [],
362 | "source": [
363 | "X = np.arange(-5.0, 5.0, 0.1)\n",
364 | "Y = 1.0 / (1.0 + np.exp(-X))\n",
365 | "\n",
366 | "plt.plot(X,Y) \n",
367 | "plt.ylabel('Dependent Variable')\n",
368 | "plt.xlabel('Indepdendent Variable')\n",
369 | "plt.show()"
370 | ]
371 | },
372 | {
373 | "cell_type": "markdown",
374 | "metadata": {},
375 | "source": [
376 | "\n",
377 | "\n",
378 | "The formula for the logistic function is the following:\n",
379 | "\n",
380 | "$$ \\hat{Y} = \\frac1{1+e^{\\beta_1(X-\\beta_2)}}$$\n",
381 | "\n",
382 | "$\\beta_1$: Controls the curve's steepness,\n",
383 | "\n",
384 | "$\\beta_2$: Slides the curve on the x-axis."
385 | ]
386 | },
387 | {
388 | "cell_type": "markdown",
389 | "metadata": {},
390 | "source": [
391 | "### Building The Model ###\n",
392 | "Now, let's build our regression model and initialize its parameters. "
393 | ]
394 | },
395 | {
396 | "cell_type": "code",
397 | "execution_count": null,
398 | "metadata": {},
399 | "outputs": [],
400 | "source": [
401 | "def sigmoid(x, Beta_1, Beta_2):\n",
402 | " y = 1 / (1 + np.exp(-Beta_1*(x-Beta_2)))\n",
403 | " return y"
404 | ]
405 | },
406 | {
407 | "cell_type": "markdown",
408 | "metadata": {},
409 | "source": [
410 | "Lets look at a sample sigmoid line that might fit with the data:"
411 | ]
412 | },
413 | {
414 | "cell_type": "code",
415 | "execution_count": null,
416 | "metadata": {
417 | "collapsed": false,
418 | "jupyter": {
419 | "outputs_hidden": false
420 | }
421 | },
422 | "outputs": [],
423 | "source": [
424 | "beta_1 = 0.10\n",
425 | "beta_2 = 1990.0\n",
426 | "\n",
427 | "#logistic function\n",
428 | "Y_pred = sigmoid(x_data, beta_1 , beta_2)\n",
429 | "\n",
430 | "#plot initial prediction against datapoints\n",
431 | "plt.plot(x_data, Y_pred*15000000000000.)\n",
432 | "plt.plot(x_data, y_data, 'ro')"
433 | ]
434 | },
435 | {
436 | "cell_type": "markdown",
437 | "metadata": {},
438 | "source": [
439 | "Our task here is to find the best parameters for our model. Lets first normalize our x and y:"
440 | ]
441 | },
442 | {
443 | "cell_type": "code",
444 | "execution_count": null,
445 | "metadata": {},
446 | "outputs": [],
447 | "source": [
448 | "# Lets normalize our data\n",
449 | "xdata =x_data/max(x_data)\n",
450 | "ydata =y_data/max(y_data)"
451 | ]
452 | },
453 | {
454 | "cell_type": "markdown",
455 | "metadata": {},
456 | "source": [
457 | "#### How we find the best parameters for our fit line?\n",
458 | "we can use __curve_fit__ which uses non-linear least squares to fit our sigmoid function, to data. Optimal values for the parameters so that the sum of the squared residuals of sigmoid(xdata, *popt) - ydata is minimized.\n",
459 | "\n",
460 | "popt are our optimized parameters."
461 | ]
462 | },
463 | {
464 | "cell_type": "code",
465 | "execution_count": null,
466 | "metadata": {},
467 | "outputs": [],
468 | "source": [
469 | "from scipy.optimize import curve_fit\n",
470 | "popt, pcov = curve_fit(sigmoid, xdata, ydata)\n",
471 | "#print the final parameters\n",
472 | "print(\" beta_1 = %f, beta_2 = %f\" % (popt[0], popt[1]))"
473 | ]
474 | },
475 | {
476 | "cell_type": "markdown",
477 | "metadata": {},
478 | "source": [
479 | "Now we plot our resulting regression model."
480 | ]
481 | },
482 | {
483 | "cell_type": "code",
484 | "execution_count": null,
485 | "metadata": {},
486 | "outputs": [],
487 | "source": [
488 | "x = np.linspace(1960, 2015, 55)\n",
489 | "x = x/max(x)\n",
490 | "plt.figure(figsize=(8,5))\n",
491 | "y = sigmoid(x, *popt)\n",
492 | "plt.plot(xdata, ydata, 'ro', label='data')\n",
493 | "plt.plot(x,y, linewidth=3.0, label='fit')\n",
494 | "plt.legend(loc='best')\n",
495 | "plt.ylabel('GDP')\n",
496 | "plt.xlabel('Year')\n",
497 | "plt.show()"
498 | ]
499 | },
500 | {
501 | "cell_type": "markdown",
502 | "metadata": {},
503 | "source": [
504 | "## Practice\n",
505 | "Can you calculate what is the accuracy of our model?"
506 | ]
507 | },
508 | {
509 | "cell_type": "code",
510 | "execution_count": null,
511 | "metadata": {},
512 | "outputs": [],
513 | "source": [
514 | "# write your code here\n",
515 | "\n",
516 | "\n"
517 | ]
518 | },
519 | {
520 | "cell_type": "markdown",
521 | "metadata": {},
522 | "source": [
523 | "Double-click __here__ for the solution.\n",
524 | "\n",
525 | ""
547 | ]
548 | },
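{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible approach (a sketch, not the official solution; it reuses the normalized `xdata`/`ydata` and the `sigmoid` function defined above):\n",
"\n",
"```python\n",
"from sklearn.metrics import r2_score\n",
"\n",
"# split the normalized data into train/test sets\n",
"msk = np.random.rand(len(df)) < 0.8\n",
"train_x, test_x = xdata[msk], xdata[~msk]\n",
"train_y, test_y = ydata[msk], ydata[~msk]\n",
"\n",
"# fit on the training set, then predict on the test set\n",
"popt, pcov = curve_fit(sigmoid, train_x, train_y)\n",
"y_hat = sigmoid(test_x, *popt)\n",
"\n",
"print(\"Mean absolute error: %.2f\" % np.mean(np.absolute(y_hat - test_y)))\n",
"print(\"Mean squared error (MSE): %.2f\" % np.mean((y_hat - test_y) ** 2))\n",
"print(\"R2-score: %.2f\" % r2_score(test_y, y_hat))\n",
"```"
]
},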
549 | {
550 | "cell_type": "markdown",
551 | "metadata": {},
552 | "source": [
553 | "
Want to learn more?
\n",
554 | "\n",
555 | "IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n",
556 | "\n",
557 | "Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n",
558 | "\n",
559 | "
Saeed Aghabozorgi, PhD is a Data Scientist in IBM with a track record of developing enterprise level applications that substantially increases clients’ ability to turn data into actionable knowledge. He is a researcher in data mining field and expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.
"
567 | ]
568 | }
569 | ],
570 | "metadata": {
571 | "kernelspec": {
572 | "display_name": "Python 3",
573 | "language": "python",
574 | "name": "python3"
575 | },
576 | "language_info": {
577 | "codemirror_mode": {
578 | "name": "ipython",
579 | "version": 3
580 | },
581 | "file_extension": ".py",
582 | "mimetype": "text/x-python",
583 | "name": "python",
584 | "nbconvert_exporter": "python",
585 | "pygments_lexer": "ipython3",
586 | "version": "3.6.7"
587 | }
588 | },
589 | "nbformat": 4,
590 | "nbformat_minor": 4
591 | }
592 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## Assignment Instructions:
2 |
3 | Now that you have been equipped with the skills to use different Machine Learning algorithms, over the course of five weeks, you will have the opportunity to practice and apply them to a dataset. In this project, you will complete a notebook where you will build a classifier to predict whether a loan case will be paid off or not.
4 |
5 | You will load a historical dataset from previous loan applications, clean the data, and apply different classification algorithms to the data. You are expected to use the following algorithms to build your models:
6 |
7 | * k-Nearest Neighbour
8 | * Decision Tree
9 | * Support Vector Machine
10 | * Logistic Regression
11 |
12 | The results are reported as the accuracy of each classifier, using the following metrics when they are applicable:
13 |
14 | * Jaccard index
15 | * F1-score
16 | * LogLoss
17 | ------------
18 | ## Setup Instructions:
19 | ### A-Create an account in Watson Studio if you don't have one (If you already have it, jump to step B).
20 |
21 | * Browse into https://www.ibm.com/cloud/watson-studio
22 | * Click on 'Start your free trial'
23 | * Enter your email, and click 'Next'
24 | * Enter your Name, and choose a Password. Then click on 'Create Account'
25 | * Go to your email, and confirm your account.
26 | * Click on 'Proceed'
27 | * In "Select Organization and Space" form, leave everything as default, and click on 'Continue'
28 | * It is done. Click on 'Get started!'
29 |
30 | ### B-Sign in into Watson Studio and import your notebook
31 |
32 | * Sign in into https://www.ibm.com/cloud/watson-studio
33 | * Click on 'New Project'
34 | * Select 'Data Science' as type of project.
35 | * Give a name to your project, and a description for your reference, then set up your project as follows and click "Create".
36 |
37 | > Notice 1: because you are going to share this project with your peer for evaluation, please make sure you have unchecked `Restrict who can be a collaborator`
38 |
39 | > Notice 2: You have to create an IBM Object Storage instance if you don't have one (you can use the free Lite plan)
40 |
41 | * From the top-right, click on 'Add to project', and then select 'Notebook'.
42 |
43 | * In the 'New notebook' form, click on 'From URL', and enter the Notebook URL: https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/ML0101EN-Proj-Loan-py-v1.ipynb
44 |
45 | * Give the notebook a proper name and description and click on `Create Notebook` to initialize the notebook
46 |
47 | ### C-Complete the notebook
48 |
49 | * Start running the notebook
50 | * Complete the notebook based on the description in the notebook.
51 |
--------------------------------------------------------------------------------
/Recommender System/ML0101EN-RecSys-Collaborative-Filtering-movies-py-v1.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "button": false,
7 | "deletable": true,
8 | "new_sheet": false,
9 | "run_control": {
10 | "read_only": false
11 | }
12 | },
13 | "source": [
14 | "\n",
15 | "\n",
16 | "
COLLABORATIVE FILTERING
"
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {
22 | "button": false,
23 | "deletable": true,
24 | "new_sheet": false,
25 | "run_control": {
26 | "read_only": false
27 | }
28 | },
29 | "source": [
30 | "Recommendation systems are a collection of algorithms used to recommend items to users based on information taken from the user. These systems have become ubiquitous can be commonly seen in online stores, movies databases and job finders. In this notebook, we will explore recommendation systems based on Collaborative Filtering and implement simple version of one using Python and the Pandas library."
31 | ]
32 | },
33 | {
34 | "cell_type": "markdown",
35 | "metadata": {
36 | "button": false,
37 | "deletable": true,
38 | "new_sheet": false,
39 | "run_control": {
40 | "read_only": false
41 | }
42 | },
43 | "source": [
44 | "
\n",
53 | " \n",
54 | ""
55 | ]
56 | },
57 | {
58 | "cell_type": "markdown",
59 | "metadata": {
60 | "button": false,
61 | "deletable": true,
62 | "new_sheet": false,
63 | "run_control": {
64 | "read_only": false
65 | }
66 | },
67 | "source": [
68 | "\n",
69 | "\n",
70 | "\n",
71 | "# Acquiring the Data"
72 | ]
73 | },
74 | {
75 | "cell_type": "markdown",
76 | "metadata": {
77 | "button": false,
78 | "deletable": true,
79 | "new_sheet": false,
80 | "run_control": {
81 | "read_only": false
82 | }
83 | },
84 | "source": [
85 | "To acquire and extract the data, simply run the following Bash scripts: \n",
86 | "Dataset acquired from [GroupLens](http://grouplens.org/datasets/movielens/). Lets download the dataset. To download the data, we will use **`!wget`** to download it from IBM Object Storage. \n",
87 | "__Did you know?__ When it comes to Machine Learning, you will likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](http://cocl.us/ML0101EN-IBM-Offer-CC)"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": null,
93 | "metadata": {
94 | "button": false,
95 | "collapsed": false,
96 | "deletable": true,
97 | "jupyter": {
98 | "outputs_hidden": false
99 | },
100 | "new_sheet": false,
101 | "run_control": {
102 | "read_only": false
103 | }
104 | },
105 | "outputs": [],
106 | "source": [
107 | "!wget -O moviedataset.zip https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip\n",
108 | "print('unziping ...')\n",
109 | "!unzip -o -j moviedataset.zip "
110 | ]
111 | },
112 | {
113 | "cell_type": "markdown",
114 | "metadata": {
115 | "button": false,
116 | "deletable": true,
117 | "new_sheet": false,
118 | "run_control": {
119 | "read_only": false
120 | }
121 | },
122 | "source": [
123 | "Now you're ready to start working with the data!"
124 | ]
125 | },
126 | {
127 | "cell_type": "markdown",
128 | "metadata": {
129 | "button": false,
130 | "deletable": true,
131 | "new_sheet": false,
132 | "run_control": {
133 | "read_only": false
134 | }
135 | },
136 | "source": [
137 | "\n",
138 | "\n",
139 | "\n",
140 | "# Preprocessing"
141 | ]
142 | },
143 | {
144 | "cell_type": "markdown",
145 | "metadata": {
146 | "button": false,
147 | "deletable": true,
148 | "new_sheet": false,
149 | "run_control": {
150 | "read_only": false
151 | }
152 | },
153 | "source": [
154 | "First, let's get all of the imports out of the way:"
155 | ]
156 | },
157 | {
158 | "cell_type": "code",
159 | "execution_count": null,
160 | "metadata": {
161 | "button": false,
162 | "collapsed": false,
163 | "deletable": true,
164 | "jupyter": {
165 | "outputs_hidden": false
166 | },
167 | "new_sheet": false,
168 | "run_control": {
169 | "read_only": false
170 | }
171 | },
172 | "outputs": [],
173 | "source": [
174 | "#Dataframe manipulation library\n",
175 | "import pandas as pd\n",
176 | "#Math functions, we'll only need the sqrt function so let's import only that\n",
177 | "from math import sqrt\n",
178 | "import numpy as np\n",
179 | "import matplotlib.pyplot as plt\n",
180 | "%matplotlib inline"
181 | ]
182 | },
183 | {
184 | "cell_type": "markdown",
185 | "metadata": {
186 | "button": false,
187 | "deletable": true,
188 | "new_sheet": false,
189 | "run_control": {
190 | "read_only": false
191 | }
192 | },
193 | "source": [
194 | "Now let's read each file into their Dataframes:"
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "execution_count": null,
200 | "metadata": {
201 | "button": false,
202 | "collapsed": false,
203 | "deletable": true,
204 | "jupyter": {
205 | "outputs_hidden": false
206 | },
207 | "new_sheet": false,
208 | "run_control": {
209 | "read_only": false
210 | }
211 | },
212 | "outputs": [],
213 | "source": [
214 | "#Storing the movie information into a pandas dataframe\n",
215 | "movies_df = pd.read_csv('movies.csv')\n",
216 | "#Storing the user information into a pandas dataframe\n",
217 | "ratings_df = pd.read_csv('ratings.csv')"
218 | ]
219 | },
220 | {
221 | "cell_type": "markdown",
222 | "metadata": {
223 | "button": false,
224 | "deletable": true,
225 | "new_sheet": false,
226 | "run_control": {
227 | "read_only": false
228 | }
229 | },
230 | "source": [
231 | "Let's also take a peek at how each of them are organized:"
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": null,
237 | "metadata": {
238 | "button": false,
239 | "collapsed": false,
240 | "deletable": true,
241 | "jupyter": {
242 | "outputs_hidden": false
243 | },
244 | "new_sheet": false,
245 | "run_control": {
246 | "read_only": false
247 | }
248 | },
249 | "outputs": [],
250 | "source": [
251 | "#Head is a function that gets the first N rows of a dataframe. N's default is 5.\n",
252 | "movies_df.head()"
253 | ]
254 | },
255 | {
256 | "cell_type": "markdown",
257 | "metadata": {
258 | "button": false,
259 | "deletable": true,
260 | "new_sheet": false,
261 | "run_control": {
262 | "read_only": false
263 | }
264 | },
265 | "source": [
266 | "So each movie has a unique ID, a title with its release year along with it (Which may contain unicode characters) and several different genres in the same field. Let's remove the year from the title column and place it into its own one by using the handy [extract](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html#pandas.Series.str.extract) function that Pandas has."
267 | ]
268 | },
269 | {
270 | "cell_type": "markdown",
271 | "metadata": {
272 | "button": false,
273 | "deletable": true,
274 | "new_sheet": false,
275 | "run_control": {
276 | "read_only": false
277 | }
278 | },
279 | "source": [
280 | "Let's remove the year from the __title__ column by using pandas' replace function and store in a new __year__ column."
281 | ]
282 | },
283 | {
284 | "cell_type": "code",
285 | "execution_count": null,
286 | "metadata": {
287 | "button": false,
288 | "collapsed": false,
289 | "deletable": true,
290 | "jupyter": {
291 | "outputs_hidden": false
292 | },
293 | "new_sheet": false,
294 | "run_control": {
295 | "read_only": false
296 | }
297 | },
298 | "outputs": [],
299 | "source": [
300 | "#Using regular expressions to find a year stored between parentheses\n",
301 | "#We specify the parantheses so we don't conflict with movies that have years in their titles\n",
302 | "movies_df['year'] = movies_df.title.str.extract('(\\(\\d\\d\\d\\d\\))',expand=False)\n",
303 | "#Removing the parentheses\n",
304 | "movies_df['year'] = movies_df.year.str.extract('(\\d\\d\\d\\d)',expand=False)\n",
305 | "#Removing the years from the 'title' column\n",
306 | "movies_df['title'] = movies_df.title.str.replace('(\\(\\d\\d\\d\\d\\))', '')\n",
307 | "#Applying the strip function to get rid of any ending whitespace characters that may have appeared\n",
308 | "movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())"
309 | ]
310 | },
311 | {
312 | "cell_type": "markdown",
313 | "metadata": {
314 | "button": false,
315 | "deletable": true,
316 | "new_sheet": false,
317 | "run_control": {
318 | "read_only": false
319 | }
320 | },
321 | "source": [
322 | "Let's look at the result!"
323 | ]
324 | },
325 | {
326 | "cell_type": "code",
327 | "execution_count": null,
328 | "metadata": {
329 | "button": false,
330 | "collapsed": false,
331 | "deletable": true,
332 | "jupyter": {
333 | "outputs_hidden": false
334 | },
335 | "new_sheet": false,
336 | "run_control": {
337 | "read_only": false
338 | }
339 | },
340 | "outputs": [],
341 | "source": [
342 | "movies_df.head()"
343 | ]
344 | },
345 | {
346 | "cell_type": "markdown",
347 | "metadata": {
348 | "button": false,
349 | "deletable": true,
350 | "new_sheet": false,
351 | "run_control": {
352 | "read_only": false
353 | }
354 | },
355 | "source": [
356 | "With that, let's also drop the genres column since we won't need it for this particular recommendation system."
357 | ]
358 | },
359 | {
360 | "cell_type": "code",
361 | "execution_count": null,
362 | "metadata": {
363 | "button": false,
364 | "collapsed": false,
365 | "deletable": true,
366 | "jupyter": {
367 | "outputs_hidden": false
368 | },
369 | "new_sheet": false,
370 | "run_control": {
371 | "read_only": false
372 | }
373 | },
374 | "outputs": [],
375 | "source": [
376 | "#Dropping the genres column\n",
377 | "movies_df = movies_df.drop('genres', 1)"
378 | ]
379 | },
380 | {
381 | "cell_type": "markdown",
382 | "metadata": {
383 | "button": false,
384 | "deletable": true,
385 | "new_sheet": false,
386 | "run_control": {
387 | "read_only": false
388 | }
389 | },
390 | "source": [
391 | "Here's the final movies dataframe:"
392 | ]
393 | },
394 | {
395 | "cell_type": "code",
396 | "execution_count": null,
397 | "metadata": {
398 | "button": false,
399 | "collapsed": false,
400 | "deletable": true,
401 | "jupyter": {
402 | "outputs_hidden": false
403 | },
404 | "new_sheet": false,
405 | "run_control": {
406 | "read_only": false
407 | }
408 | },
409 | "outputs": [],
410 | "source": [
411 | "movies_df.head()"
412 | ]
413 | },
414 | {
415 | "cell_type": "markdown",
416 | "metadata": {
417 | "button": false,
418 | "deletable": true,
419 | "new_sheet": false,
420 | "run_control": {
421 | "read_only": false
422 | }
423 | },
424 | "source": [
425 | " "
426 | ]
427 | },
428 | {
429 | "cell_type": "markdown",
430 | "metadata": {
431 | "button": false,
432 | "deletable": true,
433 | "new_sheet": false,
434 | "run_control": {
435 | "read_only": false
436 | }
437 | },
438 | "source": [
439 | "Next, let's look at the ratings dataframe."
440 | ]
441 | },
442 | {
443 | "cell_type": "code",
444 | "execution_count": null,
445 | "metadata": {
446 | "button": false,
447 | "collapsed": false,
448 | "deletable": true,
449 | "jupyter": {
450 | "outputs_hidden": false
451 | },
452 | "new_sheet": false,
453 | "run_control": {
454 | "read_only": false
455 | }
456 | },
457 | "outputs": [],
458 | "source": [
459 | "ratings_df.head()"
460 | ]
461 | },
462 | {
463 | "cell_type": "markdown",
464 | "metadata": {
465 | "button": false,
466 | "deletable": true,
467 | "new_sheet": false,
468 | "run_control": {
469 | "read_only": false
470 | }
471 | },
472 | "source": [
473 | "Every row in the ratings dataframe has a user id associated with at least one movie, a rating and a timestamp showing when they reviewed it. We won't be needing the timestamp column, so let's drop it to save on memory."
474 | ]
475 | },
476 | {
477 | "cell_type": "code",
478 | "execution_count": null,
479 | "metadata": {
480 | "button": false,
481 | "collapsed": false,
482 | "deletable": true,
483 | "jupyter": {
484 | "outputs_hidden": false
485 | },
486 | "new_sheet": false,
487 | "run_control": {
488 | "read_only": false
489 | }
490 | },
491 | "outputs": [],
492 | "source": [
493 | "#Drop removes a specified row or column from a dataframe\n",
494 | "ratings_df = ratings_df.drop('timestamp', 1)"
495 | ]
496 | },
497 | {
498 | "cell_type": "markdown",
499 | "metadata": {
500 | "button": false,
501 | "deletable": true,
502 | "new_sheet": false,
503 | "run_control": {
504 | "read_only": false
505 | }
506 | },
507 | "source": [
508 | "Here's how the final ratings Dataframe looks like:"
509 | ]
510 | },
511 | {
512 | "cell_type": "code",
513 | "execution_count": null,
514 | "metadata": {
515 | "button": false,
516 | "collapsed": false,
517 | "deletable": true,
518 | "jupyter": {
519 | "outputs_hidden": false
520 | },
521 | "new_sheet": false,
522 | "run_control": {
523 | "read_only": false
524 | },
525 | "scrolled": true
526 | },
527 | "outputs": [],
528 | "source": [
529 | "ratings_df.head()"
530 | ]
531 | },
532 | {
533 | "cell_type": "markdown",
534 | "metadata": {
535 | "button": false,
536 | "deletable": true,
537 | "new_sheet": false,
538 | "run_control": {
539 | "read_only": false
540 | }
541 | },
542 | "source": [
543 | "\n",
544 | "\n",
545 | "\n",
546 | "# Collaborative Filtering"
547 | ]
548 | },
549 | {
550 | "cell_type": "markdown",
551 | "metadata": {
552 | "button": false,
553 | "deletable": true,
554 | "new_sheet": false,
555 | "run_control": {
556 | "read_only": false
557 | }
558 | },
559 | "source": [
560 | "Now, time to start our work on recommendation systems. \n",
561 | "\n",
562 | "The first technique we're going to take a look at is called __Collaborative Filtering__, which is also known as __User-User Filtering__. As hinted by its alternate name, this technique uses other users to recommend items to the input user. It attempts to find users that have similar preferences and opinions as the input and then recommends items that they have liked to the input. There are several methods of finding similar users (Even some making use of Machine Learning), and the one we will be using here is going to be based on the __Pearson Correlation Function__.\n",
563 | "\n",
564 | "\n",
565 | "\n",
566 | "\n",
567 | "The process for creating a User Based recommendation system is as follows:\n",
568 | "- Select a user with the movies the user has watched\n",
569 | "- Based on his rating to movies, find the top X neighbours \n",
570 | "- Get the watched movie record of the user for each neighbour.\n",
571 | "- Calculate a similarity score using some formula\n",
572 | "- Recommend the items with the highest score\n",
573 | "\n",
574 | "\n",
575 | "Let's begin by creating an input user to recommend movies to:\n",
576 | "\n",
577 | "Notice: To add more movies, simply increase the amount of elements in the userInput. Feel free to add more in! Just be sure to write it in with capital letters and if a movie starts with a \"The\", like \"The Matrix\" then write it in like this: 'Matrix, The' ."
578 | ]
579 | },
580 | {
581 | "cell_type": "code",
582 | "execution_count": null,
583 | "metadata": {
584 | "button": false,
585 | "collapsed": false,
586 | "deletable": true,
587 | "jupyter": {
588 | "outputs_hidden": false
589 | },
590 | "new_sheet": false,
591 | "run_control": {
592 | "read_only": false
593 | }
594 | },
595 | "outputs": [],
596 | "source": [
597 | "userInput = [\n",
598 | " {'title':'Breakfast Club, The', 'rating':5},\n",
599 | " {'title':'Toy Story', 'rating':3.5},\n",
600 | " {'title':'Jumanji', 'rating':2},\n",
601 | " {'title':\"Pulp Fiction\", 'rating':5},\n",
602 | " {'title':'Akira', 'rating':4.5}\n",
603 | " ] \n",
604 | "inputMovies = pd.DataFrame(userInput)\n",
605 | "inputMovies"
606 | ]
607 | },
608 | {
609 | "cell_type": "markdown",
610 | "metadata": {
611 | "button": false,
612 | "deletable": true,
613 | "new_sheet": false,
614 | "run_control": {
615 | "read_only": false
616 | }
617 | },
618 | "source": [
619 | "#### Add movieId to input user\n",
620 | "With the input complete, let's extract the input movies's ID's from the movies dataframe and add them into it.\n",
621 | "\n",
622 | "We can achieve this by first filtering out the rows that contain the input movies' title and then merging this subset with the input dataframe. We also drop unnecessary columns for the input to save memory space."
623 | ]
624 | },
625 | {
626 | "cell_type": "code",
627 | "execution_count": null,
628 | "metadata": {
629 | "button": false,
630 | "collapsed": false,
631 | "deletable": true,
632 | "jupyter": {
633 | "outputs_hidden": false
634 | },
635 | "new_sheet": false,
636 | "run_control": {
637 | "read_only": false
638 | },
639 | "scrolled": true
640 | },
641 | "outputs": [],
642 | "source": [
643 | "#Filtering out the movies by title\n",
644 | "inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]\n",
645 | "#Then merging it so we can get the movieId. It's implicitly merging it by title.\n",
646 | "inputMovies = pd.merge(inputId, inputMovies)\n",
647 | "#Dropping information we won't use from the input dataframe\n",
648 | "inputMovies = inputMovies.drop('year', 1)\n",
649 | "#Final input dataframe\n",
650 | "#If a movie you added in above isn't here, then it might not be in the original \n",
651 | "#dataframe or it might spelled differently, please check capitalisation.\n",
652 | "inputMovies"
653 | ]
654 | },
655 | {
656 | "cell_type": "markdown",
657 | "metadata": {
658 | "button": false,
659 | "deletable": true,
660 | "new_sheet": false,
661 | "run_control": {
662 | "read_only": false
663 | }
664 | },
665 | "source": [
666 | "#### The users who has seen the same movies\n",
667 | "Now with the movie ID's in our input, we can now get the subset of users that have watched and reviewed the movies in our input.\n"
668 | ]
669 | },
670 | {
671 | "cell_type": "code",
672 | "execution_count": null,
673 | "metadata": {
674 | "button": false,
675 | "collapsed": false,
676 | "deletable": true,
677 | "jupyter": {
678 | "outputs_hidden": false
679 | },
680 | "new_sheet": false,
681 | "run_control": {
682 | "read_only": false
683 | }
684 | },
685 | "outputs": [],
686 | "source": [
687 | "#Filtering out users that have watched movies that the input has watched and storing it\n",
688 | "userSubset = ratings_df[ratings_df['movieId'].isin(inputMovies['movieId'].tolist())]\n",
689 | "userSubset.head()"
690 | ]
691 | },
692 | {
693 | "cell_type": "markdown",
694 | "metadata": {
695 | "button": false,
696 | "deletable": true,
697 | "new_sheet": false,
698 | "run_control": {
699 | "read_only": false
700 | }
701 | },
702 | "source": [
703 | "We now group up the rows by user ID."
704 | ]
705 | },
706 | {
707 | "cell_type": "code",
708 | "execution_count": null,
709 | "metadata": {
710 | "button": false,
711 | "collapsed": false,
712 | "deletable": true,
713 | "jupyter": {
714 | "outputs_hidden": false
715 | },
716 | "new_sheet": false,
717 | "run_control": {
718 | "read_only": false
719 | }
720 | },
721 | "outputs": [],
722 | "source": [
723 | "#Groupby creates several sub dataframes where they all have the same value in the column specified as the parameter\n",
724 | "userSubsetGroup = userSubset.groupby(['userId'])"
725 | ]
726 | },
727 | {
728 | "cell_type": "markdown",
729 | "metadata": {
730 | "button": false,
731 | "deletable": true,
732 | "new_sheet": false,
733 | "run_control": {
734 | "read_only": false
735 | }
736 | },
737 | "source": [
738 | "lets look at one of the users, e.g. the one with userID=1130"
739 | ]
740 | },
741 | {
742 | "cell_type": "code",
743 | "execution_count": null,
744 | "metadata": {
745 | "button": false,
746 | "collapsed": false,
747 | "deletable": true,
748 | "jupyter": {
749 | "outputs_hidden": false
750 | },
751 | "new_sheet": false,
752 | "run_control": {
753 | "read_only": false
754 | }
755 | },
756 | "outputs": [],
757 | "source": [
758 | "userSubsetGroup.get_group(1130)"
759 | ]
760 | },
761 | {
762 | "cell_type": "markdown",
763 | "metadata": {
764 | "button": false,
765 | "deletable": true,
766 | "new_sheet": false,
767 | "run_control": {
768 | "read_only": false
769 | }
770 | },
771 | "source": [
772 | "Let's also sort these groups so the users that share the most movies in common with the input have higher priority. This provides a richer recommendation since we won't go through every single user."
773 | ]
774 | },
775 | {
776 | "cell_type": "code",
777 | "execution_count": null,
778 | "metadata": {
779 | "button": false,
780 | "collapsed": false,
781 | "deletable": true,
782 | "jupyter": {
783 | "outputs_hidden": false
784 | },
785 | "new_sheet": false,
786 | "run_control": {
787 | "read_only": false
788 | }
789 | },
790 | "outputs": [],
791 | "source": [
792 | "#Sorting it so users with movie most in common with the input will have priority\n",
793 | "userSubsetGroup = sorted(userSubsetGroup, key=lambda x: len(x[1]), reverse=True)"
794 | ]
795 | },
796 | {
797 | "cell_type": "markdown",
798 | "metadata": {
799 | "button": false,
800 | "deletable": true,
801 | "new_sheet": false,
802 | "run_control": {
803 | "read_only": false
804 | }
805 | },
806 | "source": [
807 | "Now lets look at the first user"
808 | ]
809 | },
810 | {
811 | "cell_type": "code",
812 | "execution_count": null,
813 | "metadata": {
814 | "button": false,
815 | "collapsed": false,
816 | "deletable": true,
817 | "jupyter": {
818 | "outputs_hidden": false
819 | },
820 | "new_sheet": false,
821 | "run_control": {
822 | "read_only": false
823 | }
824 | },
825 | "outputs": [],
826 | "source": [
827 | "userSubsetGroup[0:3]"
828 | ]
829 | },
830 | {
831 | "cell_type": "markdown",
832 | "metadata": {
833 | "button": false,
834 | "deletable": true,
835 | "new_sheet": false,
836 | "run_control": {
837 | "read_only": false
838 | }
839 | },
840 | "source": [
841 | "#### Similarity of users to input user\n",
842 | "Next, we are going to compare all users (not really all !!!) to our specified user and find the one that is most similar. \n",
843 | "we're going to find out how similar each user is to the input through the __Pearson Correlation Coefficient__. It is used to measure the strength of a linear association between two variables. The formula for finding this coefficient between sets X and Y with N values can be seen in the image below. \n",
844 | "\n",
845 | "Why Pearson Correlation?\n",
846 | "\n",
847 | "Pearson correlation is invariant to scaling, i.e. multiplying all elements by a nonzero constant or adding any constant to all elements. For example, if you have two vectors X and Y,then, pearson(X, Y) == pearson(X, 2 * Y + 3). This is a pretty important property in recommendation systems because for example two users might rate two series of items totally different in terms of absolute rates, but they would be similar users (i.e. with similar ideas) with similar rates in various scales .\n",
848 | "\n",
849 | "\n",
850 | "\n",
851 | "The values given by the formula vary from r = -1 to r = 1, where 1 forms a direct correlation between the two entities (it means a perfect positive correlation) and -1 forms a perfect negative correlation. \n",
852 | "\n",
853 | "In our case, a 1 means that the two users have similar tastes while a -1 means the opposite."
854 | ]
855 | },
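{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check of the scale-invariance claim above, here is a minimal sketch using only Python's standard library and made-up rating vectors (the numbers are illustrative, not from the dataset). It uses the same Sxx/Syy/Sxy form of the coefficient that the main loop further below uses:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from math import sqrt\n",
"\n",
"def pearson(x, y):\n",
"    #Same computational form as the main loop below: Sxy / sqrt(Sxx * Syy)\n",
"    n = float(len(x))\n",
"    Sxx = sum(i**2 for i in x) - sum(x)**2 / n\n",
"    Syy = sum(i**2 for i in y) - sum(y)**2 / n\n",
"    Sxy = sum(i*j for i, j in zip(x, y)) - sum(x)*sum(y) / n\n",
"    return Sxy / sqrt(Sxx*Syy) if Sxx != 0 and Syy != 0 else 0\n",
"\n",
"x = [5, 3.5, 2, 5, 4.5]    #hypothetical ratings from one user\n",
"y = [2*i + 3 for i in x]   #the same opinions on a stretched, shifted scale\n",
"print(pearson(x, y))       #prints 1.0: r is unchanged by positive scaling and shifting"
]
},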
856 | {
857 | "cell_type": "markdown",
858 | "metadata": {
859 | "button": false,
860 | "deletable": true,
861 | "new_sheet": false,
862 | "run_control": {
863 | "read_only": false
864 | }
865 | },
866 | "source": [
867 | "We will select a subset of users to iterate through. This limit is imposed because we don't want to waste too much time going through every single user."
868 | ]
869 | },
870 | {
871 | "cell_type": "code",
872 | "execution_count": null,
873 | "metadata": {
874 | "button": false,
875 | "collapsed": false,
876 | "deletable": true,
877 | "jupyter": {
878 | "outputs_hidden": false
879 | },
880 | "new_sheet": false,
881 | "run_control": {
882 | "read_only": false
883 | }
884 | },
885 | "outputs": [],
886 | "source": [
887 | "userSubsetGroup = userSubsetGroup[0:100]"
888 | ]
889 | },
890 | {
891 | "cell_type": "markdown",
892 | "metadata": {
893 | "button": false,
894 | "deletable": true,
895 | "new_sheet": false,
896 | "run_control": {
897 | "read_only": false
898 | }
899 | },
900 | "source": [
901 | "Now, we calculate the Pearson Correlation between input user and subset group, and store it in a dictionary, where the key is the user Id and the value is the coefficient\n"
902 | ]
903 | },
904 | {
905 | "cell_type": "code",
906 | "execution_count": null,
907 | "metadata": {
908 | "button": false,
909 | "collapsed": false,
910 | "deletable": true,
911 | "jupyter": {
912 | "outputs_hidden": false
913 | },
914 | "new_sheet": false,
915 | "run_control": {
916 | "read_only": false
917 | },
918 | "scrolled": true
919 | },
920 | "outputs": [],
921 | "source": [
922 | "#Store the Pearson Correlation in a dictionary, where the key is the user Id and the value is the coefficient\n",
923 | "pearsonCorrelationDict = {}\n",
924 | "\n",
925 | "#For every user group in our subset\n",
926 | "for name, group in userSubsetGroup:\n",
927 | " #Let's start by sorting the input and current user group so the values aren't mixed up later on\n",
928 | " group = group.sort_values(by='movieId')\n",
929 | " inputMovies = inputMovies.sort_values(by='movieId')\n",
930 | " #Get the N for the formula\n",
931 | " nRatings = len(group)\n",
932 | " #Get the review scores for the movies that they both have in common\n",
933 | " temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]\n",
934 | " #And then store them in a temporary buffer variable in a list format to facilitate future calculations\n",
935 | " tempRatingList = temp_df['rating'].tolist()\n",
936 | " #Let's also put the current user group reviews in a list format\n",
937 | " tempGroupList = group['rating'].tolist()\n",
938 | " #Now let's calculate the pearson correlation between two users, so called, x and y\n",
939 | " Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)\n",
940 | " Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)\n",
941 | " Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)\n",
942 | " \n",
943 | " #If the denominator is different than zero, then divide, else, 0 correlation.\n",
944 | " if Sxx != 0 and Syy != 0:\n",
945 | " pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)\n",
946 | " else:\n",
947 | " pearsonCorrelationDict[name] = 0\n"
948 | ]
949 | },
950 | {
951 | "cell_type": "code",
952 | "execution_count": null,
953 | "metadata": {},
954 | "outputs": [],
955 | "source": [
956 | "pearsonCorrelationDict.items()"
957 | ]
958 | },
959 | {
960 | "cell_type": "code",
961 | "execution_count": null,
962 | "metadata": {},
963 | "outputs": [],
964 | "source": [
965 | "pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')\n",
966 | "pearsonDF.columns = ['similarityIndex']\n",
967 | "pearsonDF['userId'] = pearsonDF.index\n",
968 | "pearsonDF.index = range(len(pearsonDF))\n",
969 | "pearsonDF.head()"
970 | ]
971 | },
972 | {
973 | "cell_type": "markdown",
974 | "metadata": {
975 | "button": false,
976 | "deletable": true,
977 | "new_sheet": false,
978 | "run_control": {
979 | "read_only": false
980 | }
981 | },
982 | "source": [
983 | "#### The top x similar users to input user\n",
984 | "Now let's get the top 50 users that are most similar to the input."
985 | ]
986 | },
987 | {
988 | "cell_type": "code",
989 | "execution_count": null,
990 | "metadata": {
991 | "button": false,
992 | "collapsed": false,
993 | "deletable": true,
994 | "jupyter": {
995 | "outputs_hidden": false
996 | },
997 | "new_sheet": false,
998 | "run_control": {
999 | "read_only": false
1000 | }
1001 | },
1002 | "outputs": [],
1003 | "source": [
1004 | "topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]\n",
1005 | "topUsers.head()"
1006 | ]
1007 | },
1008 | {
1009 | "cell_type": "markdown",
1010 | "metadata": {
1011 | "button": false,
1012 | "deletable": true,
1013 | "new_sheet": false,
1014 | "run_control": {
1015 | "read_only": false
1016 | }
1017 | },
1018 | "source": [
1019 | "Now, let's start recommending movies to the input user.\n",
1020 | "\n",
1021 | "#### Rating of selected users to all movies\n",
1022 | "We're going to do this by taking the weighted average of the ratings of the movies using the Pearson Correlation as the weight. But to do this, we first need to get the movies watched by the users in our __pearsonDF__ from the ratings dataframe and then store their correlation in a new column called _similarityIndex\". This is achieved below by merging of these two tables."
1023 | ]
1024 | },
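{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before the merge, to make the weighting concrete, here is a tiny hypothetical example (the similarities and ratings are made up, not taken from the dataset): two similar users with similarities 0.9 and 0.5 rated the same candidate movie 4.0 and 3.0, so its score is (0.9 × 4.0 + 0.5 × 3.0) / (0.9 + 0.5) ≈ 3.64."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Hypothetical similarities and ratings for one candidate movie (made-up numbers)\n",
"sims = [0.9, 0.5]\n",
"ratings = [4.0, 3.0]\n",
"#Similarity-weighted average: sum(sim * rating) / sum(sim)\n",
"score = sum(s*r for s, r in zip(sims, ratings)) / sum(sims)\n",
"print(round(score, 2))   #3.64"
]
},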
1025 | {
1026 | "cell_type": "code",
1027 | "execution_count": null,
1028 | "metadata": {
1029 | "button": false,
1030 | "collapsed": false,
1031 | "deletable": true,
1032 | "jupyter": {
1033 | "outputs_hidden": false
1034 | },
1035 | "new_sheet": false,
1036 | "run_control": {
1037 | "read_only": false
1038 | },
1039 | "scrolled": true
1040 | },
1041 | "outputs": [],
1042 | "source": [
1043 | "topUsersRating=topUsers.merge(ratings_df, left_on='userId', right_on='userId', how='inner')\n",
1044 | "topUsersRating.head()"
1045 | ]
1046 | },
1047 | {
1048 | "cell_type": "markdown",
1049 | "metadata": {
1050 | "button": false,
1051 | "deletable": true,
1052 | "new_sheet": false,
1053 | "run_control": {
1054 | "read_only": false
1055 | }
1056 | },
1057 | "source": [
1058 | "Now all we need to do is simply multiply the movie rating by its weight (The similarity index), then sum up the new ratings and divide it by the sum of the weights.\n",
1059 | "\n",
1060 | "We can easily do this by simply multiplying two columns, then grouping up the dataframe by movieId and then dividing two columns:\n",
1061 | "\n",
1062 | "It shows the idea of all similar users to candidate movies for the input user:"
1063 | ]
1064 | },
1065 | {
1066 | "cell_type": "code",
1067 | "execution_count": null,
1068 | "metadata": {
1069 | "button": false,
1070 | "collapsed": false,
1071 | "deletable": true,
1072 | "jupyter": {
1073 | "outputs_hidden": false
1074 | },
1075 | "new_sheet": false,
1076 | "run_control": {
1077 | "read_only": false
1078 | }
1079 | },
1080 | "outputs": [],
1081 | "source": [
1082 | "#Multiplies the similarity by the user's ratings\n",
1083 | "topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']\n",
1084 | "topUsersRating.head()"
1085 | ]
1086 | },
1087 | {
1088 | "cell_type": "code",
1089 | "execution_count": null,
1090 | "metadata": {
1091 | "button": false,
1092 | "collapsed": false,
1093 | "deletable": true,
1094 | "jupyter": {
1095 | "outputs_hidden": false
1096 | },
1097 | "new_sheet": false,
1098 | "run_control": {
1099 | "read_only": false
1100 | }
1101 | },
1102 | "outputs": [],
1103 | "source": [
1104 | "#Applies a sum to the topUsers after grouping it up by userId\n",
1105 | "tempTopUsersRating = topUsersRating.groupby('movieId').sum()[['similarityIndex','weightedRating']]\n",
1106 | "tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']\n",
1107 | "tempTopUsersRating.head()"
1108 | ]
1109 | },
1110 | {
1111 | "cell_type": "code",
1112 | "execution_count": null,
1113 | "metadata": {
1114 | "button": false,
1115 | "collapsed": false,
1116 | "deletable": true,
1117 | "jupyter": {
1118 | "outputs_hidden": false
1119 | },
1120 | "new_sheet": false,
1121 | "run_control": {
1122 | "read_only": false
1123 | }
1124 | },
1125 | "outputs": [],
1126 | "source": [
1127 | "#Creates an empty dataframe\n",
1128 | "recommendation_df = pd.DataFrame()\n",
1129 | "#Now we take the weighted average\n",
1130 | "recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']\n",
1131 | "recommendation_df['movieId'] = tempTopUsersRating.index\n",
1132 | "recommendation_df.head()"
1133 | ]
1134 | },
1135 | {
1136 | "cell_type": "markdown",
1137 | "metadata": {
1138 | "button": false,
1139 | "deletable": true,
1140 | "new_sheet": false,
1141 | "run_control": {
1142 | "read_only": false
1143 | }
1144 | },
1145 | "source": [
1146 | "Now let's sort it and see the top 20 movies that the algorithm recommended!"
1147 | ]
1148 | },
1149 | {
1150 | "cell_type": "code",
1151 | "execution_count": null,
1152 | "metadata": {
1153 | "button": false,
1154 | "collapsed": false,
1155 | "deletable": true,
1156 | "jupyter": {
1157 | "outputs_hidden": false
1158 | },
1159 | "new_sheet": false,
1160 | "run_control": {
1161 | "read_only": false
1162 | }
1163 | },
1164 | "outputs": [],
1165 | "source": [
1166 | "recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)\n",
1167 | "recommendation_df.head(10)"
1168 | ]
1169 | },
1170 | {
1171 | "cell_type": "code",
1172 | "execution_count": null,
1173 | "metadata": {
1174 | "button": false,
1175 | "collapsed": false,
1176 | "deletable": true,
1177 | "jupyter": {
1178 | "outputs_hidden": false
1179 | },
1180 | "new_sheet": false,
1181 | "run_control": {
1182 | "read_only": false
1183 | },
1184 | "scrolled": true
1185 | },
1186 | "outputs": [],
1187 | "source": [
1188 | "movies_df.loc[movies_df['movieId'].isin(recommendation_df.head(10)['movieId'].tolist())]"
1189 | ]
1190 | },
1191 | {
1192 | "cell_type": "markdown",
1193 | "metadata": {
1194 | "button": false,
1195 | "deletable": true,
1196 | "new_sheet": false,
1197 | "run_control": {
1198 | "read_only": false
1199 | }
1200 | },
1201 | "source": [
1202 | "### Advantages and Disadvantages of Collaborative Filtering\n",
1203 | "\n",
1204 | "##### Advantages\n",
1205 | "* Takes other user's ratings into consideration\n",
1206 | "* Doesn't need to study or extract information from the recommended item\n",
1207 | "* Adapts to the user's interests which might change over time\n",
1208 | "\n",
1209 | "##### Disadvantages\n",
1210 | "* Approximation function can be slow\n",
1211 | "* There might be a low of amount of users to approximate\n",
1212 | "* Privacy issues when trying to learn the user's preferences"
1213 | ]
1214 | },
1215 | {
1216 | "cell_type": "markdown",
1217 | "metadata": {
1218 | "button": false,
1219 | "deletable": true,
1220 | "new_sheet": false,
1221 | "run_control": {
1222 | "read_only": false
1223 | }
1224 | },
1225 | "source": [
1226 | "
Want to learn more?
\n",
1227 | "\n",
1228 | "IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n",
1229 | "\n",
1230 | "Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n",
1231 | "\n",
1232 | "
Saeed Aghabozorgi, PhD is a Data Scientist in IBM with a track record of developing enterprise level applications that substantially increases clients’ ability to turn data into actionable knowledge. He is a researcher in data mining field and expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.
"
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "Recommendation systems are a collection of algorithms used to recommend items to users based on information taken from the user. These systems have become ubiquitous, and can be commonly seen in online stores, movies databases and job finders. In this notebook, we will explore Content-based recommendation systems and implement a simple version of one using Python and the Pandas library."
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {},
22 | "source": [
23 | "### Table of contents\n",
24 | "\n",
25 | "
\n",
32 | " "
33 | ]
34 | },
35 | {
36 | "cell_type": "markdown",
37 | "metadata": {},
38 | "source": [
39 | "\n",
40 | "# Acquiring the Data"
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {},
46 | "source": [
47 | "To acquire and extract the data, simply run the following Bash scripts: \n",
48 | "Dataset acquired from [GroupLens](http://grouplens.org/datasets/movielens/). Lets download the dataset. To download the data, we will use **`!wget`** to download it from IBM Object Storage. \n",
49 | "__Did you know?__ When it comes to Machine Learning, you will likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](http://cocl.us/ML0101EN-IBM-Offer-CC)"
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": 1,
55 | "metadata": {
56 | "collapsed": false,
57 | "jupyter": {
58 | "outputs_hidden": false
59 | }
60 | },
61 | "outputs": [
62 | {
63 | "name": "stdout",
64 | "output_type": "stream",
65 | "text": [
66 | "--2019-07-11 16:36:32-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip\n",
67 | "Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193\n",
68 | "Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.193|:443... connected.\n",
69 | "HTTP request sent, awaiting response... 200 OK\n",
70 | "Length: 160301210 (153M) [application/zip]\n",
71 | "Saving to: ‘moviedataset.zip’\n",
72 | "\n",
73 | "moviedataset.zip 100%[===================>] 152.88M 19.4MB/s in 8.0s \n",
74 | "\n",
75 | "2019-07-11 16:36:41 (19.2 MB/s) - ‘moviedataset.zip’ saved [160301210/160301210]\n",
76 | "\n",
77 | "unziping ...\n",
78 | "Archive: moviedataset.zip\n",
79 | " inflating: links.csv \n",
80 | " inflating: movies.csv \n",
81 | " inflating: ratings.csv \n",
82 | " inflating: README.txt \n",
83 | " inflating: tags.csv \n"
84 | ]
85 | }
86 | ],
87 | "source": [
88 | "!wget -O moviedataset.zip https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip\n",
89 | "print('unziping ...')\n",
90 | "!unzip -o -j moviedataset.zip "
91 | ]
92 | },
93 | {
94 | "cell_type": "markdown",
95 | "metadata": {},
96 | "source": [
97 | "Now you're ready to start working with the data!"
98 | ]
99 | },
100 | {
101 | "cell_type": "markdown",
102 | "metadata": {},
103 | "source": [
104 | "\n",
105 | "# Preprocessing"
106 | ]
107 | },
108 | {
109 | "cell_type": "markdown",
110 | "metadata": {},
111 | "source": [
112 | "First, let's get all of the imports out of the way:"
113 | ]
114 | },
115 | {
116 | "cell_type": "code",
117 | "execution_count": 7,
118 | "metadata": {
119 | "collapsed": false,
120 | "jupyter": {
121 | "outputs_hidden": false
122 | }
123 | },
124 | "outputs": [],
125 | "source": [
126 | "#Dataframe manipulation library\n",
127 | "import pandas as pd\n",
128 | "#Math functions, we'll only need the sqrt function so let's import only that\n",
129 | "from math import sqrt\n",
130 | "import numpy as np\n",
131 | "import matplotlib.pyplot as plt\n",
132 | "%matplotlib inline"
133 | ]
134 | },
135 | {
136 | "cell_type": "markdown",
137 | "metadata": {},
138 | "source": [
139 | "Now let's read each file into their Dataframes:"
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": 14,
145 | "metadata": {
146 | "collapsed": false,
147 | "jupyter": {
148 | "outputs_hidden": false
149 | }
150 | },
151 | "outputs": [
152 | {
153 | "data": {
154 | "text/html": [
155 | "
\n",
156 | "\n",
169 | "
\n",
170 | " \n",
171 | "
\n",
172 | "
\n",
173 | "
movieId
\n",
174 | "
title
\n",
175 | "
genres
\n",
176 | "
\n",
177 | " \n",
178 | " \n",
179 | "
\n",
180 | "
0
\n",
181 | "
1
\n",
182 | "
Toy Story (1995)
\n",
183 | "
Adventure|Animation|Children|Comedy|Fantasy
\n",
184 | "
\n",
185 | "
\n",
186 | "
1
\n",
187 | "
2
\n",
188 | "
Jumanji (1995)
\n",
189 | "
Adventure|Children|Fantasy
\n",
190 | "
\n",
191 | "
\n",
192 | "
2
\n",
193 | "
3
\n",
194 | "
Grumpier Old Men (1995)
\n",
195 | "
Comedy|Romance
\n",
196 | "
\n",
197 | "
\n",
198 | "
3
\n",
199 | "
4
\n",
200 | "
Waiting to Exhale (1995)
\n",
201 | "
Comedy|Drama|Romance
\n",
202 | "
\n",
203 | "
\n",
204 | "
4
\n",
205 | "
5
\n",
206 | "
Father of the Bride Part II (1995)
\n",
207 | "
Comedy
\n",
208 | "
\n",
209 | " \n",
210 | "
\n",
211 | "
"
212 | ],
213 | "text/plain": [
214 | " movieId title \\\n",
215 | "0 1 Toy Story (1995) \n",
216 | "1 2 Jumanji (1995) \n",
217 | "2 3 Grumpier Old Men (1995) \n",
218 | "3 4 Waiting to Exhale (1995) \n",
219 | "4 5 Father of the Bride Part II (1995) \n",
220 | "\n",
221 | " genres \n",
222 | "0 Adventure|Animation|Children|Comedy|Fantasy \n",
223 | "1 Adventure|Children|Fantasy \n",
224 | "2 Comedy|Romance \n",
225 | "3 Comedy|Drama|Romance \n",
226 | "4 Comedy "
227 | ]
228 | },
229 | "execution_count": 14,
230 | "metadata": {},
231 | "output_type": "execute_result"
232 | }
233 | ],
234 | "source": [
235 | "#Storing the movie information into a pandas dataframe\n",
236 | "movies_df = pd.read_csv('movies.csv')\n",
237 | "#Storing the user information into a pandas dataframe\n",
238 | "ratings_df = pd.read_csv('ratings.csv')\n",
239 | "#Head is a function that gets the first N rows of a dataframe. N's default is 5.\n",
240 | "movies_df.head()"
241 | ]
242 | },
243 | {
244 | "cell_type": "markdown",
245 | "metadata": {},
246 | "source": [
247 | "Let's also remove the year from the __title__ column by using pandas' replace function and store in a new __year__ column."
248 | ]
249 | },
250 | {
251 | "cell_type": "code",
252 | "execution_count": 15,
253 | "metadata": {
254 | "collapsed": false,
255 | "jupyter": {
256 | "outputs_hidden": false
257 | }
258 | },
259 | "outputs": [
260 | {
261 | "data": {
262 | "text/html": [
263 | "
\n",
264 | "\n",
277 | "
\n",
278 | " \n",
279 | "
\n",
280 | "
\n",
281 | "
movieId
\n",
282 | "
title
\n",
283 | "
genres
\n",
284 | "
year
\n",
285 | "
\n",
286 | " \n",
287 | " \n",
288 | "
\n",
289 | "
0
\n",
290 | "
1
\n",
291 | "
Toy Story
\n",
292 | "
Adventure|Animation|Children|Comedy|Fantasy
\n",
293 | "
1995
\n",
294 | "
\n",
295 | "
\n",
296 | "
1
\n",
297 | "
2
\n",
298 | "
Jumanji
\n",
299 | "
Adventure|Children|Fantasy
\n",
300 | "
1995
\n",
301 | "
\n",
302 | "
\n",
303 | "
2
\n",
304 | "
3
\n",
305 | "
Grumpier Old Men
\n",
306 | "
Comedy|Romance
\n",
307 | "
1995
\n",
308 | "
\n",
309 | "
\n",
310 | "
3
\n",
311 | "
4
\n",
312 | "
Waiting to Exhale
\n",
313 | "
Comedy|Drama|Romance
\n",
314 | "
1995
\n",
315 | "
\n",
316 | "
\n",
317 | "
4
\n",
318 | "
5
\n",
319 | "
Father of the Bride Part II
\n",
320 | "
Comedy
\n",
321 | "
1995
\n",
322 | "
\n",
323 | " \n",
324 | "
\n",
325 | "
"
326 | ],
327 | "text/plain": [
328 | " movieId title \\\n",
329 | "0 1 Toy Story \n",
330 | "1 2 Jumanji \n",
331 | "2 3 Grumpier Old Men \n",
332 | "3 4 Waiting to Exhale \n",
333 | "4 5 Father of the Bride Part II \n",
334 | "\n",
335 | " genres year \n",
336 | "0 Adventure|Animation|Children|Comedy|Fantasy 1995 \n",
337 | "1 Adventure|Children|Fantasy 1995 \n",
338 | "2 Comedy|Romance 1995 \n",
339 | "3 Comedy|Drama|Romance 1995 \n",
340 | "4 Comedy 1995 "
341 | ]
342 | },
343 | "execution_count": 15,
344 | "metadata": {},
345 | "output_type": "execute_result"
346 | }
347 | ],
348 | "source": [
349 | "#Using regular expressions to find a year stored between parentheses\n",
350 | "#We specify the parantheses so we don't conflict with movies that have years in their titles\n",
351 | "movies_df['year'] = movies_df.title.str.extract('(\\(\\d\\d\\d\\d\\))',expand=False)\n",
352 | "#Removing the parentheses\n",
353 | "movies_df['year'] = movies_df.year.str.extract('(\\d\\d\\d\\d)',expand=False)\n",
354 | "#Removing the years from the 'title' column\n",
355 | "movies_df['title'] = movies_df.title.str.replace('(\\(\\d\\d\\d\\d\\))', '')\n",
356 | "#Applying the strip function to get rid of any ending whitespace characters that may have appeared\n",
357 | "movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())\n",
358 | "movies_df.head()"
359 | ]
360 | },
361 | {
362 | "cell_type": "markdown",
363 | "metadata": {},
364 | "source": [
365 | "With that, let's also split the values in the __Genres__ column into a __list of Genres__ to simplify future use. This can be achieved by applying Python's split string function on the correct column."
366 | ]
367 | },
368 | {
369 | "cell_type": "code",
370 | "execution_count": 16,
371 | "metadata": {
372 | "collapsed": false,
373 | "jupyter": {
374 | "outputs_hidden": false
375 | }
376 | },
377 | "outputs": [
378 | {
379 | "data": {
380 | "text/html": [
381 | "
\n",
382 | "\n",
395 | "
\n",
396 | " \n",
397 | "
\n",
398 | "
\n",
399 | "
movieId
\n",
400 | "
title
\n",
401 | "
genres
\n",
402 | "
year
\n",
403 | "
\n",
404 | " \n",
405 | " \n",
406 | "
\n",
407 | "
0
\n",
408 | "
1
\n",
409 | "
Toy Story
\n",
410 | "
[Adventure, Animation, Children, Comedy, Fantasy]
\n",
411 | "
1995
\n",
412 | "
\n",
413 | "
\n",
414 | "
1
\n",
415 | "
2
\n",
416 | "
Jumanji
\n",
417 | "
[Adventure, Children, Fantasy]
\n",
418 | "
1995
\n",
419 | "
\n",
420 | "
\n",
421 | "
2
\n",
422 | "
3
\n",
423 | "
Grumpier Old Men
\n",
424 | "
[Comedy, Romance]
\n",
425 | "
1995
\n",
426 | "
\n",
427 | "
\n",
428 | "
3
\n",
429 | "
4
\n",
430 | "
Waiting to Exhale
\n",
431 | "
[Comedy, Drama, Romance]
\n",
432 | "
1995
\n",
433 | "
\n",
434 | "
\n",
435 | "
4
\n",
436 | "
5
\n",
437 | "
Father of the Bride Part II
\n",
438 | "
[Comedy]
\n",
439 | "
1995
\n",
440 | "
\n",
441 | " \n",
442 | "
\n",
443 | "
"
444 | ],
445 | "text/plain": [
446 | " movieId title \\\n",
447 | "0 1 Toy Story \n",
448 | "1 2 Jumanji \n",
449 | "2 3 Grumpier Old Men \n",
450 | "3 4 Waiting to Exhale \n",
451 | "4 5 Father of the Bride Part II \n",
452 | "\n",
453 | " genres year \n",
454 | "0 [Adventure, Animation, Children, Comedy, Fantasy] 1995 \n",
455 | "1 [Adventure, Children, Fantasy] 1995 \n",
456 | "2 [Comedy, Romance] 1995 \n",
457 | "3 [Comedy, Drama, Romance] 1995 \n",
458 | "4 [Comedy] 1995 "
459 | ]
460 | },
461 | "execution_count": 16,
462 | "metadata": {},
463 | "output_type": "execute_result"
464 | }
465 | ],
466 | "source": [
467 | "#Every genre is separated by a | so we simply have to call the split function on |\n",
468 | "movies_df['genres'] = movies_df.genres.str.split('|')\n",
469 | "movies_df.head()"
470 | ]
471 | },
472 | {
473 | "cell_type": "markdown",
474 | "metadata": {},
475 | "source": [
476 | "Since keeping genres in a list format isn't optimal for the content-based recommendation system technique, we will use the One Hot Encoding technique to convert the list of genres to a vector where each column corresponds to one possible value of the feature. This encoding is needed for feeding categorical data. In this case, we store every different genre in columns that contain either 1 or 0. 1 shows that a movie has that genre and 0 shows that it doesn't. Let's also store this dataframe in another variable since genres won't be important for our first recommendation system."
477 | ]
478 | },
479 | {
480 | "cell_type": "code",
481 | "execution_count": 18,
482 | "metadata": {
483 | "collapsed": false,
484 | "jupyter": {
485 | "outputs_hidden": false
486 | }
487 | },
488 | "outputs": [
489 | {
490 | "data": {
491 | "text/html": [
492 | "
\n",
493 | "\n",
506 | "
\n",
507 | " \n",
508 | "
\n",
509 | "
\n",
510 | "
movieId
\n",
511 | "
title
\n",
512 | "
genres
\n",
513 | "
year
\n",
514 | "
Adventure
\n",
515 | "
Animation
\n",
516 | "
Children
\n",
517 | "
Comedy
\n",
518 | "
Fantasy
\n",
519 | "
Romance
\n",
520 | "
...
\n",
521 | "
Horror
\n",
522 | "
Mystery
\n",
523 | "
Sci-Fi
\n",
524 | "
IMAX
\n",
525 | "
Documentary
\n",
526 | "
War
\n",
527 | "
Musical
\n",
528 | "
Western
\n",
529 | "
Film-Noir
\n",
530 | "
(no genres listed)
\n",
531 | "
\n",
532 | " \n",
533 | " \n",
534 | "
\n",
535 | "
0
\n",
536 | "
1
\n",
537 | "
Toy Story
\n",
538 | "
[Adventure, Animation, Children, Comedy, Fantasy]
\n",
539 | "
1995
\n",
540 | "
1.0
\n",
541 | "
1.0
\n",
542 | "
1.0
\n",
543 | "
1.0
\n",
544 | "
1.0
\n",
545 | "
0.0
\n",
546 | "
...
\n",
547 | "
0.0
\n",
548 | "
0.0
\n",
549 | "
0.0
\n",
550 | "
0.0
\n",
551 | "
0.0
\n",
552 | "
0.0
\n",
553 | "
0.0
\n",
554 | "
0.0
\n",
555 | "
0.0
\n",
556 | "
0.0
\n",
557 | "
\n",
558 | "
\n",
559 | "
1
\n",
560 | "
2
\n",
561 | "
Jumanji
\n",
562 | "
[Adventure, Children, Fantasy]
\n",
563 | "
1995
\n",
564 | "
1.0
\n",
565 | "
0.0
\n",
566 | "
1.0
\n",
567 | "
0.0
\n",
568 | "
1.0
\n",
569 | "
0.0
\n",
570 | "
...
\n",
571 | "
0.0
\n",
572 | "
0.0
\n",
573 | "
0.0
\n",
574 | "
0.0
\n",
575 | "
0.0
\n",
576 | "
0.0
\n",
577 | "
0.0
\n",
578 | "
0.0
\n",
579 | "
0.0
\n",
580 | "
0.0
\n",
581 | "
\n",
582 | "
\n",
583 | "
2
\n",
584 | "
3
\n",
585 | "
Grumpier Old Men
\n",
586 | "
[Comedy, Romance]
\n",
587 | "
1995
\n",
588 | "
0.0
\n",
589 | "
0.0
\n",
590 | "
0.0
\n",
591 | "
1.0
\n",
592 | "
0.0
\n",
593 | "
1.0
\n",
594 | "
...
\n",
595 | "
0.0
\n",
596 | "
0.0
\n",
597 | "
0.0
\n",
598 | "
0.0
\n",
599 | "
0.0
\n",
600 | "
0.0
\n",
601 | "
0.0
\n",
602 | "
0.0
\n",
603 | "
0.0
\n",
604 | "
0.0
\n",
605 | "
\n",
606 | "
\n",
607 | "
3
\n",
608 | "
4
\n",
609 | "
Waiting to Exhale
\n",
610 | "
[Comedy, Drama, Romance]
\n",
611 | "
1995
\n",
612 | "
0.0
\n",
613 | "
0.0
\n",
614 | "
0.0
\n",
615 | "
1.0
\n",
616 | "
0.0
\n",
617 | "
1.0
\n",
618 | "
...
\n",
619 | "
0.0
\n",
620 | "
0.0
\n",
621 | "
0.0
\n",
622 | "
0.0
\n",
623 | "
0.0
\n",
624 | "
0.0
\n",
625 | "
0.0
\n",
626 | "
0.0
\n",
627 | "
0.0
\n",
628 | "
0.0
\n",
629 | "
\n",
630 | "
\n",
631 | "
4
\n",
632 | "
5
\n",
633 | "
Father of the Bride Part II
\n",
634 | "
[Comedy]
\n",
635 | "
1995
\n",
636 | "
0.0
\n",
637 | "
0.0
\n",
638 | "
0.0
\n",
639 | "
1.0
\n",
640 | "
0.0
\n",
641 | "
0.0
\n",
642 | "
...
\n",
643 | "
0.0
\n",
644 | "
0.0
\n",
645 | "
0.0
\n",
646 | "
0.0
\n",
647 | "
0.0
\n",
648 | "
0.0
\n",
649 | "
0.0
\n",
650 | "
0.0
\n",
651 | "
0.0
\n",
652 | "
0.0
\n",
653 | "
\n",
654 | " \n",
655 | "
\n",
656 | "
5 rows × 24 columns
\n",
657 | "
"
658 | ],
659 | "text/plain": [
660 | " movieId title \\\n",
661 | "0 1 Toy Story \n",
662 | "1 2 Jumanji \n",
663 | "2 3 Grumpier Old Men \n",
664 | "3 4 Waiting to Exhale \n",
665 | "4 5 Father of the Bride Part II \n",
666 | "\n",
667 | " genres year Adventure \\\n",
668 | "0 [Adventure, Animation, Children, Comedy, Fantasy] 1995 1.0 \n",
669 | "1 [Adventure, Children, Fantasy] 1995 1.0 \n",
670 | "2 [Comedy, Romance] 1995 0.0 \n",
671 | "3 [Comedy, Drama, Romance] 1995 0.0 \n",
672 | "4 [Comedy] 1995 0.0 \n",
673 | "\n",
674 | " Animation Children Comedy Fantasy Romance ... Horror Mystery \\\n",
675 | "0 1.0 1.0 1.0 1.0 0.0 ... 0.0 0.0 \n",
676 | "1 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0 \n",
677 | "2 0.0 0.0 1.0 0.0 1.0 ... 0.0 0.0 \n",
678 | "3 0.0 0.0 1.0 0.0 1.0 ... 0.0 0.0 \n",
679 | "4 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 \n",
680 | "\n",
681 | " Sci-Fi IMAX Documentary War Musical Western Film-Noir \\\n",
682 | "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
683 | "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
684 | "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
685 | "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
686 | "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
687 | "\n",
688 | " (no genres listed) \n",
689 | "0 0.0 \n",
690 | "1 0.0 \n",
691 | "2 0.0 \n",
692 | "3 0.0 \n",
693 | "4 0.0 \n",
694 | "\n",
695 | "[5 rows x 24 columns]"
696 | ]
697 | },
698 | "execution_count": 18,
699 | "metadata": {},
700 | "output_type": "execute_result"
701 | }
702 | ],
703 | "source": [
704 | "#Copying the movie dataframe into a new one since we won't need to use the genre information in our first case.\n",
705 | "moviesWithGenres_df = movies_df.copy()\n",
706 | "\n",
707 | "#For every row in the dataframe, iterate through the list of genres and place a 1 into the corresponding column\n",
708 | "for index, row in movies_df.iterrows():\n",
709 | " for genre in row['genres']:\n",
710 | " moviesWithGenres_df.at[index, genre] = 1\n",
711 | "#Filling in the NaN values with 0 to show that a movie doesn't have that column's genre\n",
712 | "moviesWithGenres_df = moviesWithGenres_df.fillna(0)\n",
713 | "moviesWithGenres_df.head()"
714 | ]
715 | },
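{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, the same 0/1 genre matrix can usually be built without an explicit loop. The sketch below is an alternative, not the notebook's original approach, and assumes a pandas version that provides Series.explode (0.25 or newer); the result should match the table above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Alternative one-hot encoding: explode the genre lists into one row per genre,\n",
"#get dummy columns, then collapse back to one row per movie with max()\n",
"genre_dummies = pd.get_dummies(movies_df['genres'].explode()).groupby(level=0).max()\n",
"moviesWithGenres_alt = movies_df.join(genre_dummies)\n",
"moviesWithGenres_alt.head()"
]
},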
716 | {
717 | "cell_type": "markdown",
718 | "metadata": {},
719 | "source": [
720 | "Next, let's look at the ratings dataframe."
721 | ]
722 | },
723 | {
724 | "cell_type": "code",
725 | "execution_count": null,
726 | "metadata": {
727 | "collapsed": false,
728 | "jupyter": {
729 | "outputs_hidden": false
730 | }
731 | },
732 | "outputs": [],
733 | "source": [
734 | "ratings_df.head()"
735 | ]
736 | },
737 | {
738 | "cell_type": "markdown",
739 | "metadata": {},
740 | "source": [
741 | "Every row in the ratings dataframe has a user id associated with at least one movie, a rating and a timestamp showing when they reviewed it. We won't be needing the timestamp column, so let's drop it to save on memory."
742 | ]
743 | },
744 | {
745 | "cell_type": "code",
746 | "execution_count": null,
747 | "metadata": {
748 | "collapsed": false,
749 | "jupyter": {
750 | "outputs_hidden": false
751 | }
752 | },
753 | "outputs": [],
754 | "source": [
755 | "#Drop removes a specified row or column from a dataframe\n",
756 | "ratings_df = ratings_df.drop('timestamp', 1)\n",
757 | "ratings_df.head()"
758 | ]
759 | },
760 | {
761 | "cell_type": "markdown",
762 | "metadata": {},
763 | "source": [
764 | "\n",
765 | "# Content-Based recommendation system"
766 | ]
767 | },
768 | {
769 | "cell_type": "markdown",
770 | "metadata": {},
771 | "source": [
772 | "Now, let's take a look at how to implement __Content-Based__ or __Item-Item recommendation systems__. This technique attempts to figure out what a user's favourite aspects of an item is, and then recommends items that present those aspects. In our case, we're going to try to figure out the input's favorite genres from the movies and ratings given.\n",
773 | "\n",
774 | "Let's begin by creating an input user to recommend movies to:\n",
775 | "\n",
776 | "Notice: To add more movies, simply increase the amount of elements in the __userInput__. Feel free to add more in! Just be sure to write it in with capital letters and if a movie starts with a \"The\", like \"The Matrix\" then write it in like this: 'Matrix, The' ."
777 | ]
778 | },
779 | {
780 | "cell_type": "code",
781 | "execution_count": null,
782 | "metadata": {
783 | "collapsed": false,
784 | "jupyter": {
785 | "outputs_hidden": false
786 | }
787 | },
788 | "outputs": [],
789 | "source": [
790 | "userInput = [\n",
791 | " {'title':'Breakfast Club, The', 'rating':5},\n",
792 | " {'title':'Toy Story', 'rating':3.5},\n",
793 | " {'title':'Jumanji', 'rating':2},\n",
794 | " {'title':\"Pulp Fiction\", 'rating':5},\n",
795 | " {'title':'Akira', 'rating':4.5}\n",
796 | " ] \n",
797 | "inputMovies = pd.DataFrame(userInput)\n",
798 | "inputMovies"
799 | ]
800 | },
801 | {
802 | "cell_type": "markdown",
803 | "metadata": {},
804 | "source": [
805 | "#### Add movieId to input user\n",
806 | "With the input complete, let's extract the input movie's ID's from the movies dataframe and add them into it.\n",
807 | "\n",
808 | "We can achieve this by first filtering out the rows that contain the input movie's title and then merging this subset with the input dataframe. We also drop unnecessary columns for the input to save memory space."
809 | ]
810 | },
811 | {
812 | "cell_type": "code",
813 | "execution_count": null,
814 | "metadata": {
815 | "collapsed": false,
816 | "jupyter": {
817 | "outputs_hidden": false
818 | },
819 | "scrolled": true
820 | },
821 | "outputs": [],
822 | "source": [
823 | "#Filtering out the movies by title\n",
824 | "inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]\n",
825 | "#Then merging it so we can get the movieId. It's implicitly merging it by title.\n",
826 | "inputMovies = pd.merge(inputId, inputMovies)\n",
827 | "#Dropping information we won't use from the input dataframe\n",
828 | "inputMovies = inputMovies.drop('genres', 1).drop('year', 1)\n",
829 | "#Final input dataframe\n",
830 | "#If a movie you added in above isn't here, then it might not be in the original \n",
831 | "#dataframe or it might spelled differently, please check capitalisation.\n",
832 | "inputMovies"
833 | ]
834 | },
835 | {
836 | "cell_type": "markdown",
837 | "metadata": {},
838 | "source": [
839 | "We're going to start by learning the input's preferences, so let's get the subset of movies that the input has watched from the Dataframe containing genres defined with binary values."
840 | ]
841 | },
842 | {
843 | "cell_type": "code",
844 | "execution_count": null,
845 | "metadata": {
846 | "collapsed": false,
847 | "jupyter": {
848 | "outputs_hidden": false
849 | }
850 | },
851 | "outputs": [],
852 | "source": [
853 | "#Filtering out the movies from the input\n",
854 | "userMovies = moviesWithGenres_df[moviesWithGenres_df['movieId'].isin(inputMovies['movieId'].tolist())]\n",
855 | "userMovies"
856 | ]
857 | },
858 | {
859 | "cell_type": "markdown",
860 | "metadata": {},
861 | "source": [
862 | "We'll only need the actual genre table, so let's clean this up a bit by resetting the index and dropping the movieId, title, genres and year columns."
863 | ]
864 | },
865 | {
866 | "cell_type": "code",
867 | "execution_count": null,
868 | "metadata": {
869 | "collapsed": false,
870 | "jupyter": {
871 | "outputs_hidden": false
872 | }
873 | },
874 | "outputs": [],
875 | "source": [
876 | "#Resetting the index to avoid future issues\n",
877 | "userMovies = userMovies.reset_index(drop=True)\n",
878 | "#Dropping unnecessary issues due to save memory and to avoid issues\n",
879 | "userGenreTable = userMovies.drop('movieId', 1).drop('title', 1).drop('genres', 1).drop('year', 1)\n",
880 | "userGenreTable"
881 | ]
882 | },
883 | {
884 | "cell_type": "markdown",
885 | "metadata": {},
886 | "source": [
887 | "Now we're ready to start learning the input's preferences!\n",
888 | "\n",
889 | "To do this, we're going to turn each genre into weights. We can do this by using the input's reviews and multiplying them into the input's genre table and then summing up the resulting table by column. This operation is actually a dot product between a matrix and a vector, so we can simply accomplish by calling Pandas's \"dot\" function."
890 | ]
891 | },
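{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what this dot product does, consider a hypothetical two-movie, three-genre example (the numbers are made up): if the user rated an Adventure|Comedy movie 5 and a Comedy-only movie 3, the Comedy weight becomes 1·5 + 1·3 = 8 while Adventure gets 1·5 + 0·3 = 5. A minimal sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd  #already imported above; repeated so this cell is self-contained\n",
"\n",
"#Toy genre table (rows: movies, columns: genres) and toy ratings; all values made up\n",
"toyGenres = pd.DataFrame({'Adventure': [1, 0], 'Comedy': [1, 1], 'Drama': [0, 0]})\n",
"toyRatings = pd.Series([5, 3])\n",
"#Transpose so genres become rows, then dot with the ratings vector\n",
"toyGenres.transpose().dot(toyRatings)   #Adventure 5, Comedy 8, Drama 0"
]
},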
892 | {
893 | "cell_type": "code",
894 | "execution_count": null,
895 | "metadata": {
896 | "collapsed": false,
897 | "jupyter": {
898 | "outputs_hidden": false
899 | }
900 | },
901 | "outputs": [],
902 | "source": [
903 | "inputMovies['rating']"
904 | ]
905 | },
906 | {
907 | "cell_type": "code",
908 | "execution_count": null,
909 | "metadata": {
910 | "collapsed": false,
911 | "jupyter": {
912 | "outputs_hidden": false
913 | }
914 | },
915 | "outputs": [],
916 | "source": [
917 | "#Dot produt to get weights\n",
918 | "userProfile = userGenreTable.transpose().dot(inputMovies['rating'])\n",
919 | "#The user profile\n",
920 | "userProfile"
921 | ]
922 | },
923 | {
924 | "cell_type": "markdown",
925 | "metadata": {},
926 | "source": [
927 | "Now, we have the weights for every of the user's preferences. This is known as the User Profile. Using this, we can recommend movies that satisfy the user's preferences."
928 | ]
929 | },
930 | {
931 | "cell_type": "markdown",
932 | "metadata": {},
933 | "source": [
934 | "Let's start by extracting the genre table from the original dataframe:"
935 | ]
936 | },
937 | {
938 | "cell_type": "code",
939 | "execution_count": null,
940 | "metadata": {
941 | "collapsed": false,
942 | "jupyter": {
943 | "outputs_hidden": false
944 | }
945 | },
946 | "outputs": [],
947 | "source": [
948 | "#Now let's get the genres of every movie in our original dataframe\n",
949 | "genreTable = moviesWithGenres_df.set_index(moviesWithGenres_df['movieId'])\n",
950 | "#And drop the unnecessary information\n",
951 | "genreTable = genreTable.drop('movieId', 1).drop('title', 1).drop('genres', 1).drop('year', 1)\n",
952 | "genreTable.head()"
953 | ]
954 | },
955 | {
956 | "cell_type": "code",
957 | "execution_count": null,
958 | "metadata": {
959 | "collapsed": false,
960 | "jupyter": {
961 | "outputs_hidden": false
962 | }
963 | },
964 | "outputs": [],
965 | "source": [
966 | "genreTable.shape"
967 | ]
968 | },
969 | {
970 | "cell_type": "markdown",
971 | "metadata": {},
972 | "source": [
973 | "With the input's profile and the complete list of movies and their genres in hand, we're going to take the weighted average of every movie based on the input profile and recommend the top twenty movies that most satisfy it."
974 | ]
975 | },
976 | {
977 | "cell_type": "code",
978 | "execution_count": null,
979 | "metadata": {
980 | "collapsed": false,
981 | "jupyter": {
982 | "outputs_hidden": false
983 | }
984 | },
985 | "outputs": [],
986 | "source": [
987 | "#Multiply the genres by the weights and then take the weighted average\n",
988 | "recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())\n",
989 | "recommendationTable_df.head()"
990 | ]
991 | },
992 | {
993 | "cell_type": "code",
994 | "execution_count": null,
995 | "metadata": {
996 | "collapsed": false,
997 | "jupyter": {
998 | "outputs_hidden": false
999 | }
1000 | },
1001 | "outputs": [],
1002 | "source": [
1003 | "#Sort our recommendations in descending order\n",
1004 | "recommendationTable_df = recommendationTable_df.sort_values(ascending=False)\n",
1005 | "#Just a peek at the values\n",
1006 | "recommendationTable_df.head()"
1007 | ]
1008 | },
1009 | {
1010 | "cell_type": "markdown",
1011 | "metadata": {},
1012 | "source": [
1013 | "Now here's the recommendation table!"
1014 | ]
1015 | },
1016 | {
1017 | "cell_type": "code",
1018 | "execution_count": null,
1019 | "metadata": {
1020 | "collapsed": false,
1021 | "jupyter": {
1022 | "outputs_hidden": false
1023 | },
1024 | "scrolled": true
1025 | },
1026 | "outputs": [],
1027 | "source": [
1028 | "#The final recommendation table\n",
1029 | "movies_df.loc[movies_df['movieId'].isin(recommendationTable_df.head(20).keys())]"
1030 | ]
1031 | },
1032 | {
1033 | "cell_type": "markdown",
1034 | "metadata": {},
1035 | "source": [
1036 | "### Advantages and Disadvantages of Content-Based Filtering\n",
1037 | "\n",
1038 | "##### Advantages\n",
1039 | "* Learns user's preferences\n",
1040 | "* Highly personalized for the user\n",
1041 | "\n",
1042 | "##### Disadvantages\n",
1043 | "* Doesn't take into account what others think of the item, so low quality item recommendations might happen\n",
1044 | "* Extracting data is not always intuitive\n",
1045 | "* Determining what characteristics of the item the user dislikes or likes is not always obvious"
1046 | ]
1047 | },
1048 | {
1049 | "cell_type": "markdown",
1050 | "metadata": {},
1051 | "source": [
1052 | "
Want to learn more?
\n",
1053 | "\n",
1054 | "IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n",
1055 | "\n",
1056 | "Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n",
1057 | "\n",
1058 | "
Saeed Aghabozorgi, PhD is a Data Scientist in IBM with a track record of developing enterprise level applications that substantially increases clients’ ability to turn data into actionable knowledge. He is a researcher in data mining field and expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.