",
33 | "metadata": {}
34 | },
35 | {
36 | "cell_type": "markdown",
37 | "source": "# Instructions\n",
38 | "metadata": {}
39 | },
40 | {
41 | "cell_type": "markdown",
42 | "source": "In this notebook, you will practice all the classification algorithms that we have learned in this course.\n\n\nBelow, is where we are going to use the classification algorithms to create a model based on our training data and evaluate our testing data using evaluation metrics learned in the course.\n\nWe will use some of the algorithms taught in the course, specifically:\n\n1. Linear Regression\n2. KNN\n3. Decision Trees\n4. Logistic Regression\n5. SVM\n\nWe will evaluate our models using:\n\n1. Accuracy Score\n2. Jaccard Index\n3. F1-Score\n4. LogLoss\n5. Mean Absolute Error\n6. Mean Squared Error\n7. R2-Score\n\nFinally, you will use your models to generate the report at the end. \n",
43 | "metadata": {}
44 | },
45 | {
46 | "cell_type": "markdown",
47 | "source": "# About The Dataset\n",
48 | "metadata": {}
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "source": "The original source of the data is Australian Government's Bureau of Meteorology and the latest data can be gathered from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).\n\nThe dataset to be used has extra columns like 'RainToday' and our target is 'RainTomorrow', which was gathered from the Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)\n\n\n",
53 | "metadata": {}
54 | },
55 | {
56 | "cell_type": "markdown",
57 | "source": "This dataset contains observations of weather metrics for each day from 2008 to 2017. The **weatherAUS.csv** dataset includes the following fields:\n\n| Field | Description | Unit | Type |\n| ------------- | ----------------------------------------------------- | --------------- | ------ |\n| Date | Date of the Observation in YYYY-MM-DD | Date | object |\n| Location | Location of the Observation | Location | object |\n| MinTemp | Minimum temperature | Celsius | float |\n| MaxTemp | Maximum temperature | Celsius | float |\n| Rainfall | Amount of rainfall | Millimeters | float |\n| Evaporation | Amount of evaporation | Millimeters | float |\n| Sunshine | Amount of bright sunshine | hours | float |\n| WindGustDir | Direction of the strongest gust | Compass Points | object |\n| WindGustSpeed | Speed of the strongest gust | Kilometers/Hour | object |\n| WindDir9am | Wind direction averaged of 10 minutes prior to 9am | Compass Points | object |\n| WindDir3pm | Wind direction averaged of 10 minutes prior to 3pm | Compass Points | object |\n| WindSpeed9am | Wind speed averaged of 10 minutes prior to 9am | Kilometers/Hour | float |\n| WindSpeed3pm | Wind speed averaged of 10 minutes prior to 3pm | Kilometers/Hour | float |\n| Humidity9am | Humidity at 9am | Percent | float |\n| Humidity3pm | Humidity at 3pm | Percent | float |\n| Pressure9am | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal | float |\n| Pressure3pm | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal | float |\n| Cloud9am | Fraction of the sky obscured by cloud at 9am | Eights | float |\n| Cloud3pm | Fraction of the sky obscured by cloud at 3pm | Eights | float |\n| Temp9am | Temperature at 9am | Celsius | float |\n| Temp3pm | Temperature at 3pm | Celsius | float |\n| RainToday | If there was rain today | Yes/No | object |\n| RainTomorrow | If there is rain tomorrow | Yes/No | float |\n\nColumn definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)\n\n",
58 | "metadata": {}
59 | },
60 | {
61 | "cell_type": "markdown",
62 | "source": "## **Import the required libraries**\n",
63 | "metadata": {}
64 | },
65 | {
66 | "cell_type": "code",
67 | "source": "# All Libraries required for this lab are listed below. The libraries pre-installed on Skills Network Labs are commented.\n# !mamba install -qy pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 scikit-learn==0.20.1\n# Note: If your environment doesn't support \"!mamba install\", use \"!pip install\"",
68 | "metadata": {
69 | "trusted": true
70 | },
71 | "outputs": [],
72 | "execution_count": null
73 | },
74 | {
75 | "cell_type": "code",
76 | "source": "# Surpress warnings:\ndef warn(*args, **kwargs):\n pass\nimport warnings\nwarnings.warn = warn",
77 | "metadata": {
78 | "trusted": true
79 | },
80 | "outputs": [],
81 | "execution_count": null
82 | },
83 | {
84 | "cell_type": "code",
85 | "source": "import pandas as pd\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.linear_model import LinearRegression\nfrom sklearn import preprocessing\nimport numpy as np\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.tree import DecisionTreeClassifier\nfrom sklearn import svm\nfrom sklearn.metrics import jaccard_score\nfrom sklearn.metrics import f1_score\nfrom sklearn.metrics import log_loss\nfrom sklearn.metrics import confusion_matrix, accuracy_score\nimport sklearn.metrics as metrics",
86 | "metadata": {
87 | "trusted": true
88 | },
89 | "outputs": [],
90 | "execution_count": null
91 | },
92 | {
93 | "cell_type": "markdown",
94 | "source": "### Importing the Dataset\n",
95 | "metadata": {}
96 | },
97 | {
98 | "cell_type": "code",
99 | "source": "from pyodide.http import pyfetch\n\nasync def download(url, filename):\n response = await pyfetch(url)\n if response.status == 200:\n with open(filename, \"wb\") as f:\n f.write(await response.bytes())",
100 | "metadata": {
101 | "trusted": true
102 | },
103 | "outputs": [],
104 | "execution_count": null
105 | },
106 | {
107 | "cell_type": "code",
108 | "source": "path='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv'",
109 | "metadata": {
110 | "trusted": true
111 | },
112 | "outputs": [],
113 | "execution_count": null
114 | },
115 | {
116 | "cell_type": "code",
117 | "source": "await download(path, \"Weather_Data.csv\")\nfilename =\"Weather_Data.csv\"",
118 | "metadata": {
119 | "trusted": true
120 | },
121 | "outputs": [],
122 | "execution_count": null
123 | },
124 | {
125 | "cell_type": "code",
126 | "source": "df = pd.read_csv(\"Weather_Data.csv\")",
127 | "metadata": {
128 | "trusted": true
129 | },
130 | "outputs": [],
131 | "execution_count": null
132 | },
133 | {
134 | "cell_type": "markdown",
135 | "source": "> Note: This version of the lab is designed for JupyterLite, which necessitates downloading the dataset to the interface. However, when working with the downloaded version of this notebook on your local machines (Jupyter Anaconda), you can simply **skip the steps above of \"Importing the Dataset\"** and use the URL directly in the `pandas.read_csv()` function. You can uncomment and run the statements in the cell below.\n",
136 | "metadata": {}
137 | },
138 | {
139 | "cell_type": "code",
140 | "source": "#filepath = \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv\"\n#df = pd.read_csv(filepath)",
141 | "metadata": {
142 | "trusted": true
143 | },
144 | "outputs": [],
145 | "execution_count": null
146 | },
147 | {
148 | "cell_type": "code",
149 | "source": "df.head()",
150 | "metadata": {
151 | "trusted": true
152 | },
153 | "outputs": [],
154 | "execution_count": null
155 | },
156 | {
157 | "cell_type": "markdown",
158 | "source": "### Data Preprocessing\n",
159 | "metadata": {}
160 | },
161 | {
162 | "cell_type": "markdown",
163 | "source": "#### One Hot Encoding\n",
164 | "metadata": {}
165 | },
166 | {
167 | "cell_type": "markdown",
168 | "source": "First, we need to perform one hot encoding to convert categorical variables to binary variables.\n",
169 | "metadata": {}
170 | },
171 | {
172 | "cell_type": "code",
173 | "source": "df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])",
174 | "metadata": {
175 | "trusted": true
176 | },
177 | "outputs": [],
178 | "execution_count": null
179 | },
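{
"cell_type": "markdown",
"source": "As a quick illustration (a minimal sketch on toy data, not part of the lab's data flow), `get_dummies` turns each category into its own indicator column:\n",
"metadata": {}
},
{
"cell_type": "code",
"source": "# Minimal sketch: one hot encoding on a toy DataFrame (illustrative data only)\ntoy = pd.DataFrame({'RainToday': ['Yes', 'No', 'Yes']})\n# Each category becomes its own indicator column (0/1, or True/False in newer pandas)\nprint(pd.get_dummies(toy, columns=['RainToday']))",
"metadata": {
"trusted": true
},
"outputs": [],
"execution_count": null
},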
180 | {
181 | "cell_type": "markdown",
182 | "source": "Next, we replace the values of the 'RainTomorrow' column changing them from a categorical column to a binary column. We do not use the `get_dummies` method because we would end up with two columns for 'RainTomorrow' and we do not want, since 'RainTomorrow' is our target.\n",
183 | "metadata": {}
184 | },
185 | {
186 | "cell_type": "code",
187 | "source": "df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)",
188 | "metadata": {
189 | "trusted": true
190 | },
191 | "outputs": [],
192 | "execution_count": null
193 | },
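{
"cell_type": "markdown",
"source": "As an optional sanity check (a minimal sketch), we can confirm that the target now contains only 0/1 values:\n",
"metadata": {}
},
{
"cell_type": "code",
"source": "# Optional sanity check: 'RainTomorrow' should now contain only the values 0 and 1\ndf_sydney_processed['RainTomorrow'].value_counts()",
"metadata": {
"trusted": true
},
"outputs": [],
"execution_count": null
},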
194 | {
195 | "cell_type": "markdown",
196 | "source": "### Training Data and Test Data\n",
197 | "metadata": {}
198 | },
199 | {
200 | "cell_type": "markdown",
201 | "source": "Now, we set our 'features' or x values and our Y or target variable.\n",
202 | "metadata": {}
203 | },
204 | {
205 | "cell_type": "code",
206 | "source": "df_sydney_processed.drop('Date',axis=1,inplace=True)",
207 | "metadata": {
208 | "trusted": true
209 | },
210 | "outputs": [],
211 | "execution_count": null
212 | },
213 | {
214 | "cell_type": "code",
215 | "source": "df_sydney_processed = df_sydney_processed.astype(float)",
216 | "metadata": {
217 | "trusted": true
218 | },
219 | "outputs": [],
220 | "execution_count": null
221 | },
222 | {
223 | "cell_type": "code",
224 | "source": "features = df_sydney_processed.drop(columns='RainTomorrow', axis=1)\nY = df_sydney_processed['RainTomorrow']",
225 | "metadata": {
226 | "trusted": true
227 | },
228 | "outputs": [],
229 | "execution_count": null
230 | },
231 | {
232 | "cell_type": "markdown",
233 | "source": "### Linear Regression\n",
234 | "metadata": {}
235 | },
236 | {
237 | "cell_type": "markdown",
238 | "source": "#### Q1) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `10`.\n",
239 | "metadata": {}
240 | },
241 | {
242 | "cell_type": "code",
243 | "source": "# Split the dataset into training and testing sets\nx_train, y_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=10)\n\n# Output the shapes of the resulting datasets to confirm the split\nprint(\"X_train shape:\", X_train.shape)\nprint(\"X_test shape:\", X_test.shape)\nprint(\"y_train shape:\", y_train.shape)\nprint(\"y_test shape:\", y_test.shape)\n\n\n",
244 | "metadata": {
245 | "trusted": true
246 | },
247 | "outputs": [],
248 | "execution_count": null
249 | },
250 | {
251 | "cell_type": "code",
252 | "source": "x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=10)",
253 | "metadata": {
254 | "trusted": true
255 | },
256 | "outputs": [],
257 | "execution_count": null
258 | },
259 | {
260 | "cell_type": "markdown",
261 | "source": "#### Q2) Create and train a Linear Regression model called LinearReg using the training data (`x_train`, `y_train`).\n",
262 | "metadata": {}
263 | },
264 | {
265 | "cell_type": "code",
266 | "source": "# Step 1: Import the LinearRegression model from sklearn\nfrom sklearn.linear_model import LinearRegression\n\n# Step 2: Create an instance of the LinearRegression model\nLinearReg = LinearRegression()\n\n# Step 3: Train the Linear Regression model using the training data\nLinearReg.fit(x_train, y_train)\n\n# Output the coefficients and intercept of the trained model\nprint(\"Coefficients:\", LinearReg.coef_)\nprint(\"Intercept:\", LinearReg.intercept_)\n",
267 | "metadata": {
268 | "trusted": true
269 | },
270 | "outputs": [],
271 | "execution_count": null
272 | },
273 | {
274 | "cell_type": "code",
275 | "source": "LinearReg = LinearRegression()",
276 | "metadata": {
277 | "trusted": true
278 | },
279 | "outputs": [],
280 | "execution_count": null
281 | },
282 | {
283 | "cell_type": "markdown",
284 | "source": "#### Q3) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.\n",
285 | "metadata": {}
286 | },
287 | {
288 | "cell_type": "code",
289 | "source": "# Step 1: Use the predict method on the testing data\npredictions = LinearReg.predict(X_test)\n\n# Step 2: Output the predictions\nprint(predictions)\n",
290 | "metadata": {
291 | "trusted": true
292 | },
293 | "outputs": [],
294 | "execution_count": null
295 | },
296 | {
297 | "cell_type": "code",
298 | "source": "predictions = LinearReg.predict(x_test)",
299 | "metadata": {
300 | "trusted": true
301 | },
302 | "outputs": [],
303 | "execution_count": null
304 | },
305 | {
306 | "cell_type": "markdown",
307 | "source": "#### Q4) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n",
308 | "metadata": {}
309 | },
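{
"cell_type": "markdown",
"source": "For reference, these are the standard definitions computed by the corresponding sklearn helpers:\n\n$$\\text{MAE} = \\frac{1}{n}\\sum_{i=1}^{n} |y_i - \\hat{y}_i|, \\qquad \\text{MSE} = \\frac{1}{n}\\sum_{i=1}^{n} (y_i - \\hat{y}_i)^2, \\qquad R^2 = 1 - \\frac{\\sum_i (y_i - \\hat{y}_i)^2}{\\sum_i (y_i - \\bar{y})^2}$$\n",
"metadata": {}
},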
310 | {
311 | "cell_type": "code",
312 | "source": "from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score\n\n# Calculate the Mean Absolute Error\nmae = mean_absolute_error(y_test, predictions)\n\n# Calculate the Mean Squared Error\nmse = mean_squared_error(y_test, predictions)\n\n# Calculate the R2 Score\nr2 = r2_score(y_test, predictions)\n\n# Output the metrics\nprint(\"Mean Absolute Error (MAE):\", mae)\nprint(\"Mean Squared Error (MSE):\", mse)\nprint(\"R2 Score:\", r2)\n",
313 | "metadata": {
314 | "trusted": true
315 | },
316 | "outputs": [],
317 | "execution_count": null
318 | },
319 | {
320 | "cell_type": "code",
321 | "source": "LinearRegression_MAE = mae\nLinearRegression_MSE = mse\nLinearRegression_R2 = r2",
322 | "metadata": {
323 | "trusted": true
324 | },
325 | "outputs": [],
326 | "execution_count": null
327 | },
328 | {
329 | "cell_type": "markdown",
330 | "source": "#### Q5) Show the MAE, MSE, and R2 in a tabular format using data frame for the linear model.\n",
331 | "metadata": {}
332 | },
333 | {
334 | "cell_type": "code",
335 | "source": "import pandas as pd\n\n# Create a dictionary with the metrics\nmetrics_dict = {\n 'Metric': ['Mean Absolute Error (MAE)', 'Mean Squared Error (MSE)', 'R2 Score'],\n 'Value': [LinearRegression_MAE, LinearRegression_MSE, LinearRegression_R2]\n}\n\n# Convert the dictionary to a DataFrame\nmetrics_df = pd.DataFrame(metrics_dict)\n\n# Display the DataFrame\nmetrics_df",
336 | "metadata": {
337 | "trusted": true
338 | },
339 | "outputs": [],
340 | "execution_count": null
341 | },
342 | {
343 | "cell_type": "code",
344 | "source": "Report = metrics_df",
345 | "metadata": {
346 | "trusted": true
347 | },
348 | "outputs": [],
349 | "execution_count": null
350 | },
351 | {
352 | "cell_type": "markdown",
353 | "source": "### KNN\n",
354 | "metadata": {}
355 | },
356 | {
357 | "cell_type": "markdown",
358 | "source": "#### Q6) Create and train a KNN model called KNN using the training data (`x_train`, `y_train`) with the `n_neighbors` parameter set to `4`.\n",
359 | "metadata": {}
360 | },
361 | {
362 | "cell_type": "code",
363 | "source": "from sklearn.neighbors import KNeighborsClassifier\n\n# Create the KNN model with n_neighbors set to 4\nKNN = KNeighborsClassifier(n_neighbors=4)\n\n# Train the model using the training data\nKNN.fit(x_train, y_train)\n",
364 | "metadata": {
365 | "trusted": true
366 | },
367 | "outputs": [],
368 | "execution_count": null
369 | },
370 | {
371 | "cell_type": "code",
372 | "source": "KNN = KNeighborsClassifier(n_neighbors=4)\n",
373 | "metadata": {
374 | "trusted": true
375 | },
376 | "outputs": [],
377 | "execution_count": null
378 | },
379 | {
380 | "cell_type": "markdown",
381 | "source": "#### Q7) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.\n",
382 | "metadata": {}
383 | },
384 | {
385 | "cell_type": "code",
386 | "source": "# Use the predict method to make predictions on the testing data\npredictions_knn = KNN.predict(x_test)\n",
387 | "metadata": {
388 | "trusted": true
389 | },
390 | "outputs": [],
391 | "execution_count": null
392 | },
393 | {
394 | "cell_type": "code",
395 | "source": "predictions = predictions_knn ",
396 | "metadata": {
397 | "trusted": true
398 | },
399 | "outputs": [],
400 | "execution_count": null
401 | },
402 | {
403 | "cell_type": "markdown",
404 | "source": "#### Q8) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n",
405 | "metadata": {}
406 | },
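{
"cell_type": "markdown",
"source": "For reference, writing $TP$, $TN$, $FP$, $FN$ for the true/false positives/negatives of the positive class, the three metrics are:\n\n$$\\text{Accuracy} = \\frac{TP + TN}{TP + TN + FP + FN}, \\qquad \\text{Jaccard} = \\frac{TP}{TP + FP + FN}, \\qquad F_1 = \\frac{2TP}{2TP + FP + FN}$$\n",
"metadata": {}
},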
407 | {
408 | "cell_type": "code",
409 | "source": "# Calculate accuracy\nknn_accuracy = accuracy_score(y_test, predictions)\n\n# Calculate F1-score\nknn_f1_score = f1_score(y_test, predictions)\n\n# Calculate Jaccard index\nknn_jaccard_index = jaccard_score(y_test, predictions)\n",
410 | "metadata": {
411 | "trusted": true
412 | },
413 | "outputs": [],
414 | "execution_count": null
415 | },
416 | {
417 | "cell_type": "code",
418 | "source": "KNN_Accuracy_Score = knn_accuracy\nKNN_JaccardIndex =knn_jaccard_index \nKNN_F1_Score = knn_f1_score",
419 | "metadata": {
420 | "trusted": true
421 | },
422 | "outputs": [],
423 | "execution_count": null
424 | },
425 | {
426 | "cell_type": "markdown",
427 | "source": "### Decision Tree\n",
428 | "metadata": {}
429 | },
430 | {
431 | "cell_type": "markdown",
432 | "source": "#### Q9) Create and train a Decision Tree model called Tree using the training data (`x_train`, `y_train`).\n",
433 | "metadata": {}
434 | },
435 | {
436 | "cell_type": "code",
437 | "source": "from sklearn.tree import DecisionTreeClassifier\n\n# Create the Decision Tree model\nTree = DecisionTreeClassifier()\n\n# Train the model using the training data\nTree.fit(x_train, y_train)\n",
438 | "metadata": {
439 | "trusted": true
440 | },
441 | "outputs": [],
442 | "execution_count": null
443 | },
444 | {
445 | "cell_type": "code",
446 | "source": "Tree = DecisionTreeClassifier()\n# Train the model using the training data\nTree.fit(x_train, y_train)",
447 | "metadata": {
448 | "trusted": true
449 | },
450 | "outputs": [],
451 | "execution_count": null
452 | },
453 | {
454 | "cell_type": "markdown",
455 | "source": "#### Q10) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.\n",
456 | "metadata": {}
457 | },
458 | {
459 | "cell_type": "code",
460 | "source": "# Use the predict method on the testing data\npredictions = Tree.predict(x_test)\n",
461 | "metadata": {
462 | "trusted": true
463 | },
464 | "outputs": [],
465 | "execution_count": null
466 | },
467 | {
468 | "cell_type": "code",
469 | "source": "predictions = Tree.predict(x_test)\n",
470 | "metadata": {
471 | "trusted": true
472 | },
473 | "outputs": [],
474 | "execution_count": null
475 | },
476 | {
477 | "cell_type": "markdown",
478 | "source": "#### Q11) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n",
479 | "metadata": {}
480 | },
481 | {
482 | "cell_type": "code",
483 | "source": "# Calculate the accuracy score\ntree_accuracy = accuracy_score(y_test, predictions)\n\n# Calculate the Jaccard Index\ntree_jaccard_index = jaccard_score(y_test, predictions)\n\n# Calculate the F1-Score\ntree_f1_score = f1_score(y_test, predictions)\n\n# Display the results\nprint(f\"Accuracy Score: {tree_accuracy}\")\nprint(f\"Jaccard Index: {tree_jaccard_index}\")\nprint(f\"F1-Score: {tree_f1_score}\")\n",
484 | "metadata": {
485 | "trusted": true
486 | },
487 | "outputs": [],
488 | "execution_count": null
489 | },
490 | {
491 | "cell_type": "code",
492 | "source": "Tree_Accuracy_Score = tree_accuracy\nTree_JaccardIndex = tree_jaccard_index\nTree_F1_Score = tree_f1_score ",
493 | "metadata": {
494 | "trusted": true
495 | },
496 | "outputs": [],
497 | "execution_count": null
498 | },
499 | {
500 | "cell_type": "markdown",
501 | "source": "### Logistic Regression\n",
502 | "metadata": {}
503 | },
504 | {
505 | "cell_type": "markdown",
506 | "source": "#### Q12) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `1`.\n",
507 | "metadata": {}
508 | },
509 | {
510 | "cell_type": "code",
511 | "source": "from sklearn.model_selection import train_test_split\n\n# Split the data\nx_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=1)\n",
512 | "metadata": {
513 | "trusted": true
514 | },
515 | "outputs": [],
516 | "execution_count": null
517 | },
518 | {
519 | "cell_type": "code",
520 | "source": "x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=1)",
521 | "metadata": {
522 | "trusted": true
523 | },
524 | "outputs": [],
525 | "execution_count": null
526 | },
527 | {
528 | "cell_type": "markdown",
529 | "source": "#### Q13) Create and train a LogisticRegression model called LR using the training data (`x_train`, `y_train`) with the `solver` parameter set to `liblinear`.\n",
530 | "metadata": {}
531 | },
532 | {
533 | "cell_type": "code",
534 | "source": "from sklearn.linear_model import LogisticRegression\n\n# Create the LogisticRegression model\nLR = LogisticRegression(solver='liblinear')\n\n# Train the model using the training data\nLR.fit(x_train, y_train)\n",
535 | "metadata": {
536 | "trusted": true
537 | },
538 | "outputs": [],
539 | "execution_count": null
540 | },
541 | {
542 | "cell_type": "code",
543 | "source": "LR = LogisticRegression(solver='liblinear')\n\n# Train the model using the training data\nLR.fit(x_train, y_train)\n",
544 | "metadata": {
545 | "trusted": true
546 | },
547 | "outputs": [],
548 | "execution_count": null
549 | },
550 | {
551 | "cell_type": "markdown",
552 | "source": "#### Q14) Now, use the `predict` and `predict_proba` methods on the testing data (`x_test`) and save it as 2 arrays `predictions` and `predict_proba`.\n",
553 | "metadata": {}
554 | },
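{
"cell_type": "markdown",
"source": "Note: for a binary problem, `predict_proba` returns an array of shape `(n_samples, 2)`, with columns ordered according to `LR.classes_`; `log_loss` accepts this two-column array directly.\n",
"metadata": {}
},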
555 | {
556 | "cell_type": "code",
557 | "source": "# Use the predict method to make predictions on the testing data\npredictions = LR.predict(x_test)\n\n# Use the predict_proba method to get the probability estimates\npredict_proba = LR.predict_proba(x_test)\n\n# Output the predictions and probabilities\nprint(\"Predictions:\", predictions)\nprint(\"Predicted Probabilities:\", predict_proba)\n",
558 | "metadata": {
559 | "trusted": true
560 | },
561 | "outputs": [],
562 | "execution_count": null
563 | },
564 | {
565 | "cell_type": "code",
566 | "source": "predictions = LR.predict(x_test)",
567 | "metadata": {
568 | "trusted": true
569 | },
570 | "outputs": [],
571 | "execution_count": null
572 | },
573 | {
574 | "cell_type": "code",
575 | "source": "predict_proba = LR.predict_proba(x_test)",
576 | "metadata": {
577 | "trusted": true
578 | },
579 | "outputs": [],
580 | "execution_count": null
581 | },
582 | {
583 | "cell_type": "markdown",
584 | "source": "#### Q15) Using the `predictions`, `predict_proba` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n",
585 | "metadata": {}
586 | },
587 | {
588 | "cell_type": "code",
589 | "source": "from sklearn.metrics import accuracy_score, f1_score, jaccard_score, log_loss\n\n# Calculate accuracy score\naccuracy = accuracy_score(y_test, predictions)\n\n# Calculate F1-score\nf1 = f1_score(y_test, predictions)\n\n# Calculate Jaccard index\njaccard = jaccard_score(y_test, predictions)\n\n# Calculate Log Loss using the predicted probabilities\n# For binary classification, use the probabilities for the positive class\nlogloss = log_loss(y_test, predict_proba)\n\n# Output the metrics\nprint(\"Accuracy Score:\", accuracy)\nprint(\"F1 Score:\", f1)\nprint(\"Jaccard Index:\", jaccard)\nprint(\"Log Loss:\", logloss)\n",
590 | "metadata": {
591 | "trusted": true
592 | },
593 | "outputs": [],
594 | "execution_count": null
595 | },
596 | {
597 | "cell_type": "code",
598 | "source": "LR_Accuracy_Score = accuracy\nLR_JaccardIndex = jaccard\nLR_F1_Score = f1\nLR_Log_Loss = logloss",
599 | "metadata": {
600 | "trusted": true
601 | },
602 | "outputs": [],
603 | "execution_count": null
604 | },
605 | {
606 | "cell_type": "markdown",
607 | "source": "### SVM\n",
608 | "metadata": {}
609 | },
610 | {
611 | "cell_type": "markdown",
612 | "source": "#### Q16) Create and train a SVM model called SVM using the training data (`x_train`, `y_train`).\n",
613 | "metadata": {}
614 | },
615 | {
616 | "cell_type": "code",
617 | "source": "from sklearn import svm\n\n# Create the SVM model\nSVM = svm.SVC()\n\n# Train the model using the training data\nSVM.fit(x_train, y_train)\n",
618 | "metadata": {
619 | "trusted": true
620 | },
621 | "outputs": [],
622 | "execution_count": null
623 | },
624 | {
625 | "cell_type": "code",
626 | "source": "SVM = svm.SVC()\nSVM.fit(x_train, y_train)",
627 | "metadata": {
628 | "trusted": true
629 | },
630 | "outputs": [],
631 | "execution_count": null
632 | },
633 | {
634 | "cell_type": "markdown",
635 | "source": "#### Q17) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.\n",
636 | "metadata": {}
637 | },
638 | {
639 | "cell_type": "code",
640 | "source": "# Use the predict method on the testing data\npredictions_svm = SVM.predict(x_test)\n\n# Output the predictions\nprint(predictions_svm)\n",
641 | "metadata": {
642 | "trusted": true
643 | },
644 | "outputs": [],
645 | "execution_count": null
646 | },
647 | {
648 | "cell_type": "code",
649 | "source": "predictions = SVM.predict(x_test)",
650 | "metadata": {
651 | "trusted": true
652 | },
653 | "outputs": [],
654 | "execution_count": null
655 | },
656 | {
657 | "cell_type": "markdown",
658 | "source": "#### Q18) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n",
659 | "metadata": {}
660 | },
661 | {
662 | "cell_type": "code",
663 | "source": "from sklearn.metrics import accuracy_score, f1_score, jaccard_score, log_loss\n\n# Calculate accuracy score\naccuracy_svm = accuracy_score(y_test, predictions_svm)\n\n# Calculate F1-score\nf1_svm = f1_score(y_test, predictions_svm)\n\n# Calculate Jaccard index\njaccard_svm = jaccard_score(y_test, predictions_svm)\n\n# Note: Log Loss requires probability estimates, so if using SVM you need to use predict_proba for this metric\n# For SVM, you need to use a decision function for probabilities or a model that supports probability estimates.\n# Assuming SVM with probability=True for demonstration\n# If `SVM` was trained with `probability=True` use the following:\npredict_proba_svm = SVM.decision_function(x_test)\nlogloss_svm = log_loss(y_test, predict_proba_svm)\n\n# Output the metrics\nprint(\"SVM Accuracy Score:\", accuracy_svm)\nprint(\"SVM F1 Score:\", f1_svm)\nprint(\"SVM Jaccard Index:\", jaccard_svm)\nprint(\"SVM Log Loss:\", logloss_svm)\n\n\nSVM_Accuracy_Score = accuracy_svm\nSVM_JaccardIndex = jaccard_svm\nSVM_F1_Score = f1_svm",
664 | "metadata": {
665 | "trusted": true
666 | },
667 | "outputs": [],
668 | "execution_count": null
669 | },
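{
"cell_type": "markdown",
"source": "If you did want probability estimates from an SVM (not required for the report), one option is to refit with `probability=True`. A minimal sketch, where `SVM_prob` is an illustrative name:\n",
"metadata": {}
},
{
"cell_type": "code",
"source": "# A sketch (optional): SVC only exposes predict_proba when trained with probability=True,\n# which is slower because it runs an internal cross-validation to calibrate probabilities.\nSVM_prob = svm.SVC(probability=True)\nSVM_prob.fit(x_train, y_train)\nprint(\"SVM Log Loss:\", log_loss(y_test, SVM_prob.predict_proba(x_test)))",
"metadata": {
"trusted": true
},
"outputs": [],
"execution_count": null
},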
670 | {
671 | "cell_type": "markdown",
672 | "source": "### Report\n",
673 | "metadata": {}
674 | },
675 | {
676 | "cell_type": "markdown",
677 | "source": "#### Q19) Show the Accuracy,Jaccard Index,F1-Score and LogLoss in a tabular format using data frame for all of the above models.\n\n\\*LogLoss is only for Logistic Regression Model\n",
678 | "metadata": {}
679 | },
680 | {
681 | "cell_type": "code",
682 | "source": "import pandas as pd\n\n# Create a dictionary with the metrics for all models\nmetrics_dict = {\n 'Model': ['Linear Regression', 'KNN', 'Decision Tree', 'Logistic Regression', 'SVM'],\n 'Accuracy': [LinearRegression_Accuracy_Score, KNN_Accuracy_Score, Tree_Accuracy_Score, LogisticRegression_Accuracy_Score, accuracy_svm],\n 'Jaccard Index': [None, KNN_JaccardIndex, Tree_JaccardIndex, LogisticRegression_JaccardIndex, jaccard_svm],\n 'F1 Score': [None, KNN_F1_Score, Tree_F1_Score, LogisticRegression_F1_Score, f1_svm],\n 'Log Loss': [None, None, None, LogisticRegression_Log_Loss, None] # Log Loss is only for Logistic Regression\n}\n\n# Convert the dictionary to a DataFrame\nmetrics_df = pd.DataFrame(metrics_dict)\n\n# Display the DataFrame\nprint(metrics_df)\n\n\nReport = metrics_df",
683 | "metadata": {
684 | "trusted": true
685 | },
686 | "outputs": [],
687 | "execution_count": null
688 | },
689 | {
690 | "cell_type": "markdown",
691 | "source": "
How to submit
\n\n
Once you complete your notebook you will have to share it. You can download the notebook by navigating to \"File\" and clicking on \"Download\" button.\n\n
This will save the (.ipynb) file on your computer. Once saved, you can upload this file in the \"My Submission\" tab, of the \"Peer-graded Assignment\" section. \n",
692 | "metadata": {}
693 | },
694 | {
695 | "cell_type": "markdown",
696 | "source": "
About the Authors:
\n\nJoseph Santarcangelo has a PhD in Electrical Engineering, his research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.\n\n### Other Contributors\n\n[Svitlana Kramar](https://www.linkedin.com/in/svitlana-kramar/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0232ENSkillsNetwork30654641-2022-01-01)\n",
697 | "metadata": {}
698 | },
699 | {
700 | "cell_type": "markdown",
701 | "source": "## Change Log\n\n| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n| ----------------- | ------- | ------------- | --------------------------- |\n| 2022-06-22 | 2.0 | Svitlana K. | Deleted GridSearch and Mock |\n\n##
\n",
35 | "\n",
36 | "\n",
37 | "\n"
38 | ]
39 | },
40 | {
41 | "cell_type": "markdown",
42 | "id": "62e2bfb2-08bd-41de-a704-c72028242793",
43 | "metadata": {},
44 | "source": [
45 | "# Instructions\n"
46 | ]
47 | },
48 | {
49 | "cell_type": "markdown",
50 | "id": "c06281fa-8a02-493f-9b69-4b6738e6c8eb",
51 | "metadata": {},
52 | "source": [
53 | "In this notebook, you will practice all the classification algorithms that we have learned in this course.\n",
54 | "\n",
55 | "\n",
56 | "Below, is where we are going to use the classification algorithms to create a model based on our training data and evaluate our testing data using evaluation metrics learned in the course.\n",
57 | "\n",
58 | "We will use some of the algorithms taught in the course, specifically:\n",
59 | "\n",
60 | "1. Linear Regression\n",
61 | "2. KNN\n",
62 | "3. Decision Trees\n",
63 | "4. Logistic Regression\n",
64 | "5. SVM\n",
65 | "\n",
66 | "We will evaluate our models using:\n",
67 | "\n",
68 | "1. Accuracy Score\n",
69 | "2. Jaccard Index\n",
70 | "3. F1-Score\n",
71 | "4. LogLoss\n",
72 | "5. Mean Absolute Error\n",
73 | "6. Mean Squared Error\n",
74 | "7. R2-Score\n",
75 | "\n",
76 | "Finally, you will use your models to generate the report at the end. \n"
77 | ]
78 | },
79 | {
80 | "cell_type": "markdown",
81 | "id": "9d4ee051-f50c-4ce5-aba1-167ffc8f5648",
82 | "metadata": {},
83 | "source": [
84 | "# About The Dataset\n"
85 | ]
86 | },
87 | {
88 | "cell_type": "markdown",
89 | "id": "4e4d2b57-e9af-4a7d-a7f9-b8c25660ba78",
90 | "metadata": {},
91 | "source": [
92 | "The original source of the data is Australian Government's Bureau of Meteorology and the latest data can be gathered from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).\n",
93 | "\n",
94 | "The dataset to be used has extra columns like 'RainToday' and our target is 'RainTomorrow', which was gathered from the Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)\n",
95 | "\n",
96 | "\n"
97 | ]
98 | },
99 | {
100 | "cell_type": "markdown",
101 | "id": "4b2d517d-9973-438d-8d32-dff28ad6ce84",
102 | "metadata": {},
103 | "source": [
104 | "This dataset contains observations of weather metrics for each day from 2008 to 2017. The **weatherAUS.csv** dataset includes the following fields:\n",
105 | "\n",
106 | "| Field | Description | Unit | Type |\n",
107 | "| ------------- | ----------------------------------------------------- | --------------- | ------ |\n",
108 | "| Date | Date of the Observation in YYYY-MM-DD | Date | object |\n",
109 | "| Location | Location of the Observation | Location | object |\n",
110 | "| MinTemp | Minimum temperature | Celsius | float |\n",
111 | "| MaxTemp | Maximum temperature | Celsius | float |\n",
112 | "| Rainfall | Amount of rainfall | Millimeters | float |\n",
113 | "| Evaporation | Amount of evaporation | Millimeters | float |\n",
114 | "| Sunshine | Amount of bright sunshine | hours | float |\n",
115 | "| WindGustDir | Direction of the strongest gust | Compass Points | object |\n",
116 | "| WindGustSpeed | Speed of the strongest gust | Kilometers/Hour | object |\n",
117 | "| WindDir9am | Wind direction averaged of 10 minutes prior to 9am | Compass Points | object |\n",
118 | "| WindDir3pm | Wind direction averaged of 10 minutes prior to 3pm | Compass Points | object |\n",
119 | "| WindSpeed9am | Wind speed averaged of 10 minutes prior to 9am | Kilometers/Hour | float |\n",
120 | "| WindSpeed3pm | Wind speed averaged of 10 minutes prior to 3pm | Kilometers/Hour | float |\n",
121 | "| Humidity9am | Humidity at 9am | Percent | float |\n",
122 | "| Humidity3pm | Humidity at 3pm | Percent | float |\n",
123 | "| Pressure9am | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal | float |\n",
124 | "| Pressure3pm | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal | float |\n",
125 | "| Cloud9am | Fraction of the sky obscured by cloud at 9am | Eights | float |\n",
126 | "| Cloud3pm | Fraction of the sky obscured by cloud at 3pm | Eights | float |\n",
127 | "| Temp9am | Temperature at 9am | Celsius | float |\n",
128 | "| Temp3pm | Temperature at 3pm | Celsius | float |\n",
129 | "| RainToday | If there was rain today | Yes/No | object |\n",
130 | "| RainTomorrow | If there is rain tomorrow | Yes/No | float |\n",
131 | "\n",
132 | "Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)\n",
133 | "\n"
134 | ]
135 | },
136 | {
137 | "cell_type": "markdown",
138 | "id": "3ad995f0-a174-49a5-aaef-76294021d5d4",
139 | "metadata": {},
140 | "source": [
141 | "## **Import the required libraries**\n"
142 | ]
143 | },
144 | {
145 | "cell_type": "code",
146 | "execution_count": null,
147 | "id": "38dca360-78ed-407c-9f48-26b405bf8695",
148 | "metadata": {},
149 | "outputs": [],
150 | "source": [
151 | "# All Libraries required for this lab are listed below. The libraries pre-installed on Skills Network Labs are commented.\n",
152 | "# !mamba install -qy pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 scikit-learn==0.20.1\n",
153 | "# Note: If your environment doesn't support \"!mamba install\", use \"!pip install\""
154 | ]
155 | },
156 | {
157 | "cell_type": "code",
158 | "execution_count": null,
159 | "id": "ece29267-503d-4de0-8c69-f905815d57a3",
160 | "metadata": {},
161 | "outputs": [],
162 | "source": [
163 | "# Surpress warnings:\n",
164 | "def warn(*args, **kwargs):\n",
165 | " pass\n",
166 | "import warnings\n",
167 | "warnings.warn = warn"
168 | ]
169 | },
170 | {
171 | "cell_type": "code",
172 | "execution_count": null,
173 | "id": "2344f678-d444-4a13-bd7f-d730f954116f",
174 | "metadata": {},
175 | "outputs": [],
176 | "source": [
177 | "import pandas as pd\n",
178 | "from sklearn.linear_model import LogisticRegression\n",
179 | "from sklearn.linear_model import LinearRegression\n",
180 | "from sklearn import preprocessing\n",
181 | "import numpy as np\n",
182 | "from sklearn.neighbors import KNeighborsClassifier\n",
183 | "from sklearn.model_selection import train_test_split\n",
184 | "from sklearn.neighbors import KNeighborsClassifier\n",
185 | "from sklearn.tree import DecisionTreeClassifier\n",
186 | "from sklearn import svm\n",
187 | "from sklearn.metrics import jaccard_score\n",
188 | "from sklearn.metrics import f1_score\n",
189 | "from sklearn.metrics import log_loss\n",
190 | "from sklearn.metrics import confusion_matrix, accuracy_score\n",
191 | "import sklearn.metrics as metrics"
192 | ]
193 | },
194 | {
195 | "cell_type": "markdown",
196 | "id": "2bdad242-edb6-4a5b-8471-f5918f3ecab7",
197 | "metadata": {},
198 | "source": [
199 | "### Importing the Dataset\n"
200 | ]
201 | },
202 | {
203 | "cell_type": "code",
204 | "execution_count": null,
205 | "id": "fa02651f-de37-4666-8ad4-b4ddc8d0b0ac",
206 | "metadata": {},
207 | "outputs": [],
208 | "source": [
209 | "from pyodide.http import pyfetch\n",
210 | "\n",
211 | "async def download(url, filename):\n",
212 | " response = await pyfetch(url)\n",
213 | " if response.status == 200:\n",
214 | " with open(filename, \"wb\") as f:\n",
215 | " f.write(await response.bytes())"
216 | ]
217 | },
218 | {
219 | "cell_type": "code",
220 | "execution_count": null,
221 | "id": "4d127aff-9339-42f9-bdfc-f2e17a19dd4c",
222 | "metadata": {},
223 | "outputs": [],
224 | "source": [
225 | "path='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv'"
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": null,
231 | "id": "5f770f7d-f967-4ae4-a96a-a6c8c0a02bcc",
232 | "metadata": {},
233 | "outputs": [],
234 | "source": [
235 | "await download(path, \"Weather_Data.csv\")\n",
236 | "filename =\"Weather_Data.csv\""
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": null,
242 | "id": "f87df4cd-160e-4878-bff0-73f1289f5b90",
243 | "metadata": {},
244 | "outputs": [],
245 | "source": [
246 | "df = pd.read_csv(\"Weather_Data.csv\")"
247 | ]
248 | },
249 | {
250 | "cell_type": "markdown",
251 | "id": "6a7acb6d-0e0e-4321-acd2-8096b272c39d",
252 | "metadata": {},
253 | "source": [
254 | "> Note: This version of the lab is designed for JupyterLite, which necessitates downloading the dataset to the interface. However, when working with the downloaded version of this notebook on your local machines (Jupyter Anaconda), you can simply **skip the steps above of \"Importing the Dataset\"** and use the URL directly in the `pandas.read_csv()` function. You can uncomment and run the statements in the cell below.\n"
255 | ]
256 | },
257 | {
258 | "cell_type": "code",
259 | "execution_count": null,
260 | "id": "f9c77ad8-8b85-4e82-af47-f63163f889c3",
261 | "metadata": {},
262 | "outputs": [],
263 | "source": [
264 | "#filepath = \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv\"\n",
265 | "#df = pd.read_csv(filepath)"
266 | ]
267 | },
268 | {
269 | "cell_type": "code",
270 | "execution_count": null,
271 | "id": "2fd31b01-a2bf-4263-8ff8-6fdce1990047",
272 | "metadata": {},
273 | "outputs": [],
274 | "source": [
275 | "df.head()"
276 | ]
277 | },
278 | {
279 | "cell_type": "markdown",
280 | "id": "eb2f4134-ab8b-48d8-aaab-85ce530aee65",
281 | "metadata": {},
282 | "source": [
283 | "### Data Preprocessing\n"
284 | ]
285 | },
286 | {
287 | "cell_type": "markdown",
288 | "id": "c70975f9-cae8-4cc6-be94-dbc1a880d3c9",
289 | "metadata": {},
290 | "source": [
291 | "#### One Hot Encoding\n"
292 | ]
293 | },
294 | {
295 | "cell_type": "markdown",
296 | "id": "cfadd018-a23d-4985-9eb3-8ed0d30abd52",
297 | "metadata": {},
298 | "source": [
299 | "First, we need to perform one hot encoding to convert categorical variables to binary variables.\n"
300 | ]
301 | },
302 | {
303 | "cell_type": "code",
304 | "execution_count": null,
305 | "id": "55968fd3-0422-4766-98fd-9397e0006e3e",
306 | "metadata": {},
307 | "outputs": [],
308 | "source": [
309 | "df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])"
310 | ]
311 | },
312 | {
313 | "cell_type": "markdown",
314 | "id": "e354a6fc-8c8b-499d-8444-5011a5146b1a",
315 | "metadata": {},
316 | "source": [
317 | "Next, we replace the values of the 'RainTomorrow' column changing them from a categorical column to a binary column. We do not use the `get_dummies` method because we would end up with two columns for 'RainTomorrow' and we do not want, since 'RainTomorrow' is our target.\n"
318 | ]
319 | },
320 | {
321 | "cell_type": "code",
322 | "execution_count": null,
323 | "id": "77f75277-a3ca-4ccc-a5b7-f95b2491cc9b",
324 | "metadata": {},
325 | "outputs": [],
326 | "source": [
327 | "df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)"
328 | ]
329 | },
330 | {
331 | "cell_type": "markdown",
332 | "id": "88ab6c18-f36a-408c-8510-fdf831abc53b",
333 | "metadata": {},
334 | "source": [
335 | "### Training Data and Test Data\n"
336 | ]
337 | },
338 | {
339 | "cell_type": "markdown",
340 | "id": "2a25156c-4080-4b15-a45c-9d4884ddce06",
341 | "metadata": {},
342 | "source": [
343 | "Now, we set our 'features' or x values and our Y or target variable.\n"
344 | ]
345 | },
346 | {
347 | "cell_type": "code",
348 | "execution_count": null,
349 | "id": "3077604d-a2f3-4e24-88ff-64b5ce69d6f0",
350 | "metadata": {},
351 | "outputs": [],
352 | "source": [
353 | "df_sydney_processed.drop('Date',axis=1,inplace=True)"
354 | ]
355 | },
356 | {
357 | "cell_type": "code",
358 | "execution_count": null,
359 | "id": "e1b66dd7-5bb7-4739-96b8-374d8e89269e",
360 | "metadata": {},
361 | "outputs": [],
362 | "source": [
363 | "df_sydney_processed = df_sydney_processed.astype(float)"
364 | ]
365 | },
366 | {
367 | "cell_type": "code",
368 | "execution_count": null,
369 | "id": "29857426-177a-4c87-8982-4fdca4ff4d78",
370 | "metadata": {},
371 | "outputs": [],
372 | "source": [
373 | "features = df_sydney_processed.drop(columns='RainTomorrow', axis=1)\n",
374 | "Y = df_sydney_processed['RainTomorrow']"
375 | ]
376 | },
377 | {
378 | "cell_type": "markdown",
379 | "id": "1be81f61-64c3-43d0-89a8-22ff1a9339cf",
380 | "metadata": {},
381 | "source": [
382 | "### Linear Regression\n"
383 | ]
384 | },
385 | {
386 | "cell_type": "markdown",
387 | "id": "60256d7f-4d49-4aed-b2df-dc44a5b0791c",
388 | "metadata": {},
389 | "source": [
390 | "#### Q1) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `10`.\n"
391 | ]
392 | },
393 | {
394 | "cell_type": "code",
395 | "execution_count": null,
396 | "id": "843e32d6-7bdd-4d12-b10a-de70bed0e974",
397 | "metadata": {},
398 | "outputs": [],
399 | "source": [
400 | "#Enter Your Code and Execute\n",
401 | "from sklearn.model_selection import train_test_split\n",
402 | "\n",
403 | "# Splitting the data into training and testing sets\n",
404 | "X_train, X_test, Y_train, Y_test = train_test_split(features, Y, test_size=0.2, random_state=10)"
405 | ]
406 | },
407 | {
408 | "cell_type": "code",
409 | "execution_count": null,
410 | "id": "c38f2196-bb38-4322-a012-9a27e6a9d8d8",
411 | "metadata": {},
412 | "outputs": [],
413 | "source": [
414 | "x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=10)\n"
415 | ]
416 | },
417 | {
418 | "cell_type": "markdown",
419 | "id": "b144e3a9-6bd2-4e75-8cae-50e1e3b3fb17",
420 | "metadata": {},
421 | "source": [
422 | "#### Q2) Create and train a Linear Regression model called LinearReg using the training data (`x_train`, `y_train`).\n"
423 | ]
424 | },
425 | {
426 | "cell_type": "code",
427 | "execution_count": null,
428 | "id": "22b77c93-ecf2-4c54-8ad4-496ef5396695",
429 | "metadata": {},
430 | "outputs": [],
431 | "source": [
432 | "#Enter Your Code and Execute"
433 | ]
434 | },
435 | {
436 | "cell_type": "code",
437 | "execution_count": null,
438 | "id": "a4293ee5-572a-45cb-a090-5f30af826630",
439 | "metadata": {},
440 | "outputs": [],
441 | "source": [
442 | "from sklearn import linear_model\n",
443 | "LinearReg = linear_model.LinearRegression()\n",
444 | "LinearReg.fit(x_train,y_train)"
445 | ]
446 | },
447 | {
448 | "cell_type": "markdown",
449 | "id": "1aa0f086-fefc-4e44-8aa2-822aa7828ad6",
450 | "metadata": {},
451 | "source": [
452 | "#### Q3) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.\n"
453 | ]
454 | },
455 | {
456 | "cell_type": "code",
457 | "execution_count": null,
458 | "id": "a63dfa3d-a957-48dc-a067-93e9b3d11431",
459 | "metadata": {},
460 | "outputs": [],
461 | "source": [
462 | "#Enter Your Code and Execute"
463 | ]
464 | },
465 | {
466 | "cell_type": "code",
467 | "execution_count": null,
468 | "id": "d6bdf734-7705-4590-aa54-6b93e20a9da7",
469 | "metadata": {},
470 | "outputs": [],
471 | "source": [
472 | "predictions = LinearReg.predict(x_test)"
473 | ]
474 | },
475 | {
476 | "cell_type": "markdown",
477 | "id": "ea13e307-0ac9-4c7c-874e-440cee795d95",
478 | "metadata": {},
479 | "source": [
480 | "#### Q4) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n"
481 | ]
482 | },
483 | {
484 | "cell_type": "code",
485 | "execution_count": null,
486 | "id": "cf2408d4-4932-487d-85f2-913c2efce09f",
487 | "metadata": {},
488 | "outputs": [],
489 | "source": [
490 | "#Enter Your Code and Execute"
491 | ]
492 | },
493 | {
494 | "cell_type": "code",
495 | "execution_count": null,
496 | "id": "aba34a58-a974-467b-8bad-10a0fd0e88d3",
497 | "metadata": {},
498 | "outputs": [],
499 | "source": [
500 | "import numpy as np\n",
501 | "from sklearn.metrics import r2_score\n",
502 | "\n",
503 | "LinearRegression_MAE = np.mean(np.absolute(predictions - y_test)\n",
504 | "LinearRegression_MSE =np.mean((predictions - y_test)**2)\n",
505 | "LinearRegression_R2 = r2_score(y_test , predictions) "
506 | ]
507 | },
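{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optional cross-check (a minimal sketch): the manual NumPy formulas above should agree with sklearn's built-in helpers.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional cross-check of the manual metrics against sklearn's helpers\n",
"from sklearn.metrics import mean_absolute_error, mean_squared_error\n",
"print(mean_absolute_error(y_test, predictions), mean_squared_error(y_test, predictions))"
]
},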
508 | {
509 | "cell_type": "markdown",
510 | "id": "4552ab70-ec8a-4455-8c5f-f75f6e2e2771",
511 | "metadata": {},
512 | "source": [
513 | "#### Q5) Show the MAE, MSE, and R2 in a tabular format using data frame for the linear model.\n"
514 | ]
515 | },
516 | {
517 | "cell_type": "code",
518 | "execution_count": null,
519 | "id": "cc932bbd-9528-45b5-b85a-8c9a0b9560ed",
520 | "metadata": {},
521 | "outputs": [],
522 | "source": [
523 | "#Enter Your Code and Execute"
524 | ]
525 | },
526 | {
527 | "cell_type": "code",
528 | "execution_count": null,
529 | "id": "edd964f0-d1aa-4e3b-9a52-577656477438",
530 | "metadata": {},
531 | "outputs": [],
532 | "source": [
533 | "from pandas import pd\n",
534 | "Report =pd.DataFrame({'Metric': ['MAE', 'MSE', 'R²'],'Linear Regression': [LinearRegression_MAE, LinearRegression_MSE, LinearRegression_R2]})"
535 | ]
536 | },
537 | {
538 | "cell_type": "markdown",
539 | "id": "55351393-abee-4af8-8944-f0079265cd56",
540 | "metadata": {},
541 | "source": [
542 | "### KNN\n"
543 | ]
544 | },
545 | {
546 | "cell_type": "markdown",
547 | "id": "b7e38ebb-7442-4980-b883-e89c6de0d351",
548 | "metadata": {},
549 | "source": [
550 | "#### Q6) Create and train a KNN model called KNN using the training data (`x_train`, `y_train`) with the `n_neighbors` parameter set to `4`.\n"
551 | ]
552 | },
553 | {
554 | "cell_type": "code",
555 | "execution_count": null,
556 | "id": "213be9cb-8c88-4099-b023-def7c0e31c6a",
557 | "metadata": {},
558 | "outputs": [],
559 | "source": [
560 | "#Enter Your Code and Execute\n",
561 | "from sklearn.neighbors import KNeighborsClassifier"
562 | ]
563 | },
564 | {
565 | "cell_type": "code",
566 | "execution_count": null,
567 | "id": "f03fe0e5-ac9f-42cd-a107-d5d46b16ae4d",
568 | "metadata": {},
569 | "outputs": [],
570 | "source": [
571 | "k = 4\n",
572 | "#Train Model and Predict \n",
573 | "KNN = KNeighborsClassifier(n_neighbors = k).fit(x_train,y_train)\n"
574 | ]
575 | },
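{
"cell_type": "markdown",
"metadata": {},
"source": [
"The question fixes `n_neighbors` at 4; as an optional sketch (illustrative, not graded), you could compare test accuracy for a few values of k:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: compare test accuracy for several values of k (illustrative names)\n",
"from sklearn.metrics import accuracy_score\n",
"for k_try in range(1, 8):\n",
"    knn_k = KNeighborsClassifier(n_neighbors=k_try).fit(x_train, y_train)\n",
"    print(k_try, accuracy_score(y_test, knn_k.predict(x_test)))"
]
},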
576 | {
577 | "cell_type": "markdown",
578 | "id": "0ef93a31-0d67-4fa5-809b-a765b13f5888",
579 | "metadata": {},
580 | "source": [
581 | "#### Q7) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.\n"
582 | ]
583 | },
584 | {
585 | "cell_type": "code",
586 | "execution_count": null,
587 | "id": "cf386d08-521b-418b-a5d6-601318bd8e93",
588 | "metadata": {},
589 | "outputs": [],
590 | "source": [
591 | "#Enter Your Code and Execute"
592 | ]
593 | },
594 | {
595 | "cell_type": "code",
596 | "execution_count": null,
597 | "id": "8f88815e-92fd-4920-983c-326502b8bc29",
598 | "metadata": {},
599 | "outputs": [],
600 | "source": [
601 | "predictions = KNN.predict(x_test)\n",
602 | "predictions[0:]"
603 | ]
604 | },
605 | {
606 | "cell_type": "markdown",
607 | "id": "9913f102-99a8-4af3-858d-1a55a2d49259",
608 | "metadata": {},
609 | "source": [
610 | "#### Q8) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n"
611 | ]
612 | },
613 | {
614 | "cell_type": "code",
615 | "execution_count": null,
616 | "id": "02eb3dd1-0c03-407a-9042-d446bd5ce557",
617 | "metadata": {},
618 | "outputs": [],
619 | "source": [
620 | "#Enter Your Code and Execute\n",
621 | "from sklearn import metrics"
622 | ]
623 | },
624 | {
625 | "cell_type": "code",
626 | "execution_count": null,
627 | "id": "47ddb040-001c-4c7f-907a-e0fe69d2bfb7",
628 | "metadata": {},
629 | "outputs": [],
630 | "source": [
631 | "KNN_Accuracy_Score = metrics.accuracy_score(y_test, predictions) \n",
632 | "KNN_JaccardIndex = metrics.jaccard_score(y_test, predictions)\n",
633 | "KNN_F1_Score = metrics.f1_score(y_test, predictions)"
634 | ]
635 | },
636 | {
637 | "cell_type": "markdown",
638 | "id": "b1b49d21-0f4b-4737-a15c-88574afb6dc5",
639 | "metadata": {},
640 | "source": [
641 | "### Decision Tree\n"
642 | ]
643 | },
644 | {
645 | "cell_type": "markdown",
646 | "id": "07aedd58-1090-48ed-8fe9-38c64f4492ad",
647 | "metadata": {},
648 | "source": [
649 | "#### Q9) Create and train a Decision Tree model called Tree using the training data (`x_train`, `y_train`).\n"
650 | ]
651 | },
652 | {
653 | "cell_type": "code",
654 | "execution_count": null,
655 | "id": "b61b238f-7880-4cd9-9764-b69b5c63f352",
656 | "metadata": {},
657 | "outputs": [],
658 | "source": [
659 | "#Enter Your Code and Execute"
660 | ]
661 | },
662 | {
663 | "cell_type": "code",
664 | "execution_count": null,
665 | "id": "f4a9ea81-7336-4cab-aa98-22cb6c6e79ca",
666 | "metadata": {},
667 | "outputs": [],
668 | "source": [
669 | "\n",
670 | "from sklearn.tree import DecisionTreeClassifier\n",
671 | "Tree = DecisionTreeClassifier()\n",
672 | "Tree.fit(x_train, y_train)"
673 | ]
674 | },
675 | {
676 | "cell_type": "markdown",
677 | "id": "d79279ba-1e22-45c9-a6a7-e1a39a61c512",
678 | "metadata": {},
679 | "source": [
680 | "#### Q10) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.\n"
681 | ]
682 | },
683 | {
684 | "cell_type": "code",
685 | "execution_count": null,
686 | "id": "5c99f9a8-2fa2-48f3-83e9-8781dc1b0d61",
687 | "metadata": {},
688 | "outputs": [],
689 | "source": [
690 | "#Enter Your Code and Execute"
691 | ]
692 | },
693 | {
694 | "cell_type": "code",
695 | "execution_count": null,
696 | "id": "4f6ea989-4778-47c8-9bc0-f6fae679f3c3",
697 | "metadata": {},
698 | "outputs": [],
699 | "source": [
700 | "predictions = Tree.predict(x_test)\n",
701 | "predictions[0:]"
702 | ]
703 | },
704 | {
705 | "cell_type": "markdown",
706 | "id": "f19bae36-1072-4baf-bcf8-3c098eeb1915",
707 | "metadata": {},
708 | "source": [
709 | "#### Q11) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n"
710 | ]
711 | },
712 | {
713 | "cell_type": "code",
714 | "execution_count": null,
715 | "id": "0c47c2e2-14c7-4e5a-aaea-86212a0fa032",
716 | "metadata": {},
717 | "outputs": [],
718 | "source": [
719 | "#Enter Your Code and Execute\n",
720 | "from sklearn import metrics"
721 | ]
722 | },
723 | {
724 | "cell_type": "code",
725 | "execution_count": null,
726 | "id": "edb09a5f-c3e2-4c4f-924b-85f6ffe96729",
727 | "metadata": {},
728 | "outputs": [],
729 | "source": [
730 | "Tree_Accuracy_Score = metrics.accuracy_score(y_test, predictions)\n",
731 | "Tree_JaccardIndex = metrics.jaccard_score(y_test, predictions)\n",
732 | "Tree_F1_Score = metrics.f1_score(y_test, predictions)"
733 | ]
734 | },
735 | {
736 | "cell_type": "markdown",
737 | "id": "f2905933-3b27-4ece-a80a-b35744f54b5f",
738 | "metadata": {},
739 | "source": [
740 | "### Logistic Regression\n"
741 | ]
742 | },
743 | {
744 | "cell_type": "markdown",
745 | "id": "490cdcfd-14aa-417d-b3ed-29577d13f7da",
746 | "metadata": {},
747 | "source": [
748 | "#### Q12) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `1`.\n"
749 | ]
750 | },
751 | {
752 | "cell_type": "code",
753 | "execution_count": null,
754 | "id": "31bb6aa1-c399-4aed-89eb-4800b1854f0f",
755 | "metadata": {},
756 | "outputs": [],
757 | "source": [
758 | "#Enter Your Code and Execute\n",
759 | "from sklearn.model_selection import train_test_split"
760 | ]
761 | },
762 | {
763 | "cell_type": "code",
764 | "execution_count": null,
765 | "id": "f7f13536-93e4-4edd-9c75-49a6d342f446",
766 | "metadata": {},
767 | "outputs": [],
768 | "source": [
769 | "x_train, x_test, y_train, y_test = (features, Y, test_size=0.2, random_state=1 )"
770 | ]
771 | },
772 | {
773 | "cell_type": "markdown",
774 | "id": "cd2f53d8-3983-4581-8363-f11950b80b85",
775 | "metadata": {},
776 | "source": [
777 | "#### Q13) Create and train a LogisticRegression model called LR using the training data (`x_train`, `y_train`) with the `solver` parameter set to `liblinear`.\n"
778 | ]
779 | },
780 | {
781 | "cell_type": "code",
782 | "execution_count": null,
783 | "id": "d8bf2bf5-8c9b-4250-beca-591e1a62dd1d",
784 | "metadata": {},
785 | "outputs": [],
786 | "source": [
787 | "#Enter Your Code and Execute"
788 | ]
789 | },
790 | {
791 | "cell_type": "code",
792 | "execution_count": null,
793 | "id": "57ef08e5-9b64-4337-92a6-9af7f98a81a4",
794 | "metadata": {},
795 | "outputs": [],
796 | "source": [
797 | "from sklearn.linear_model import LogisticRegression\n",
798 | "from sklearn.metrics import confusion_matrix\n",
799 | "LR = LogisticRegression(C=0.01, solver='liblinear').fit(x_train,y_train)\n",
800 | "LR"
801 | ]
802 | },
803 | {
804 | "cell_type": "markdown",
805 | "id": "cdaf1cdd-61de-46e4-a252-6641aa13998b",
806 | "metadata": {},
807 | "source": [
808 | "#### Q14) Now, use the `predict` and `predict_proba` methods on the testing data (`x_test`) and save it as 2 arrays `predictions` and `predict_proba`.\n"
809 | ]
810 | },
811 | {
812 | "cell_type": "code",
813 | "execution_count": null,
814 | "id": "421725a3-8a77-4239-b6ee-3f7053ad6807",
815 | "metadata": {},
816 | "outputs": [],
817 | "source": [
818 | "#Enter Your Code and Execute"
819 | ]
820 | },
821 | {
822 | "cell_type": "code",
823 | "execution_count": null,
824 | "id": "b2aad2f1-d2b7-4267-ab2e-1913d6b681a8",
825 | "metadata": {},
826 | "outputs": [],
827 | "source": [
828 | "predictions = LR.predict(x_test)"
829 | ]
830 | },
831 | {
832 | "cell_type": "code",
833 | "execution_count": null,
834 | "id": "855f934b-4098-4e23-b5dd-b1ebff6c02b8",
835 | "metadata": {},
836 | "outputs": [],
837 | "source": [
838 | "predict_proba = LR.precict_proba(x_test)"
839 | ]
840 | },
841 | {
842 | "cell_type": "markdown",
843 | "id": "08f3dc13-8be2-4d1e-9866-ff97141ae692",
844 | "metadata": {},
845 | "source": [
846 | "#### Q15) Using the `predictions`, `predict_proba` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n"
847 | ]
848 | },
849 | {
850 | "cell_type": "code",
851 | "execution_count": null,
852 | "id": "0be54747-7fa3-46e6-a4cf-bd5e820ac616",
853 | "metadata": {},
854 | "outputs": [],
855 | "source": [
856 | "#Enter Your Code and Execute"
857 | ]
858 | },
859 | {
860 | "cell_type": "code",
861 | "execution_count": null,
862 | "id": "6dc992b9-a3f8-49b5-a988-90118971013f",
863 | "metadata": {},
864 | "outputs": [],
865 | "source": [
866 | "from sklearn.metrics import accuracy_score,jaccard_score,f1_score,log_loss\n",
867 | "LR_Accuracy_Score = accuracy_score(y_test, predictions)\n",
868 | "LR_JaccardIndex = jaccard_score(y_test, predictions)\n",
869 | "LR_F1_Score = f1_score(y_test, predictions)\n",
870 | "LR_Log_Loss = log_loss(y_test, predict_proba)\n"
871 | ]
872 | },
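{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the log loss computed above is defined as\n",
"\n",
"$$\\text{LogLoss} = -\\frac{1}{N}\\sum_{i=1}^{N}\\left[y_i \\log \\hat{p}_i + (1 - y_i)\\log(1 - \\hat{p}_i)\\right]$$\n",
"\n",
"where $\\hat{p}_i$ is the predicted probability of rain on day $i$. This is why `log_loss` is given the `predict_proba` output rather than the hard `predictions`.\n"
]
},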
873 | {
874 | "cell_type": "markdown",
875 | "id": "0c7326ae-5aa6-4666-b4d6-0705e5bcb771",
876 | "metadata": {},
877 | "source": [
878 | "### SVM\n"
879 | ]
880 | },
881 | {
882 | "cell_type": "markdown",
883 | "id": "920bae21-8886-4705-b6b1-85c1ca4506ee",
884 | "metadata": {},
885 | "source": [
886 | "#### Q16) Create and train a SVM model called SVM using the training data (`x_train`, `y_train`).\n"
887 | ]
888 | },
889 | {
890 | "cell_type": "code",
891 | "execution_count": null,
892 | "id": "4ed2651e-3dd8-46bd-8a31-e7efa095a5dc",
893 | "metadata": {},
894 | "outputs": [],
895 | "source": [
896 | "#Enter Your Code and Execute"
897 | ]
898 | },
899 | {
900 | "cell_type": "code",
901 | "execution_count": null,
902 | "id": "55d94ee3-60bb-4307-8fae-8d87dfd0f5ad",
903 | "metadata": {},
904 | "outputs": [],
905 | "source": [
906 | "from sklearn import SVC\n",
907 | "SVM = SVC()\n",
908 | "SVM.fit(x_train, y_train)"
909 | ]
910 | },
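{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: called with no arguments, `SVC()` defaults to the RBF kernel with `C=1.0`; since Q16 does not specify any parameters, the defaults are used here.\n"
]
},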
911 | {
912 | "cell_type": "markdown",
913 | "id": "755cb519-2721-4674-9d21-85a154fde994",
914 | "metadata": {},
915 | "source": [
916 | "#### Q17) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.\n"
917 | ]
918 | },
919 | {
920 | "cell_type": "code",
921 | "execution_count": null,
922 | "id": "de56e316-aaca-4ed9-89eb-1d69140ff04c",
923 | "metadata": {},
924 | "outputs": [],
925 | "source": [
926 | "#Enter Your Code and Execute"
927 | ]
928 | },
929 | {
930 | "cell_type": "code",
931 | "execution_count": null,
932 | "id": "cb98d313-75b6-4bea-b79c-efec4b9e412c",
933 | "metadata": {},
934 | "outputs": [],
935 | "source": [
936 | "predictions = SVM.predict(x_test)"
937 | ]
938 | },
939 | {
940 | "cell_type": "markdown",
941 | "id": "961ccca3-1fac-476a-93d2-39d6c8b3905b",
942 | "metadata": {},
943 | "source": [
944 | "#### Q18) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n"
945 | ]
946 | },
947 | {
948 | "cell_type": "code",
949 | "execution_count": null,
950 | "id": "34922618-6a7d-494c-a1b6-5f515f29a801",
951 | "metadata": {},
952 | "outputs": [],
953 | "source": [
954 | "from sklearn.metrics import accuracy_score,jaccard_score,f1_score\n",
955 | "SVM_Accuracy_Score = accuracy_score(y_test, predictions)\n",
956 | "SVM_JaccardIndex = jaccard_score(y_test, predictions)\n",
957 | "SVM_F1_Score = f1_score(y_test, predictions)\n"
958 | ]
959 | },
960 | {
961 | "cell_type": "markdown",
962 | "id": "4e02f921-2696-4a0b-b9b6-cfc89a55f77d",
963 | "metadata": {},
964 | "source": [
965 | "### Report\n"
966 | ]
967 | },
968 | {
969 | "cell_type": "markdown",
970 | "id": "1f696bf7-a40a-404b-af35-b9a66f6304d6",
971 | "metadata": {},
972 | "source": [
973 | "#### Q19) Show the Accuracy,Jaccard Index,F1-Score and LogLoss in a tabular format using data frame for all of the above models.\n",
974 | "\n",
975 | "\\*LogLoss is only for Logistic Regression Model\n"
976 | ]
977 | },
978 | {
979 | "cell_type": "code",
980 | "execution_count": null,
981 | "id": "f7cc9f99-9da8-48e1-916e-fd642e28b773",
982 | "metadata": {},
983 | "outputs": [],
984 | "source": [
985 | "Report = pd.DataFrame({\n",
986 | " 'Model': ['Logistic Regression', 'SVM'],\n",
987 | " 'Accuracy': [LR_Accuracy_Score, SVM_Accuracy_Score],\n",
988 | " 'Jaccard Index': [LR_JaccardIndex, SVM_JaccardIndex],\n",
989 | " 'F1 Score': [LR_F1_Score, SVM_F1_Score],\n",
990 | " 'Log Loss': [LR_Log_Loss, 'N/A'] # Log Loss is not applicable to SVM\n",
991 | "})"
992 | ]
993 | },
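{
"cell_type": "markdown",
"metadata": {},
"source": [
"Evaluating `Report` renders the table. The `None` entries reflect that LogLoss requires predicted probabilities, which among the models above only Logistic Regression provides.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Report"
]
},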
994 | {
995 | "cell_type": "markdown",
996 | "id": "d7463336-6b5d-4e9e-97a2-86fdf095a9f0",
997 | "metadata": {},
998 | "source": [
999 | "
How to submit
\n",
1000 | "\n",
1001 | "
Once you complete your notebook you will have to share it. You can download the notebook by navigating to \"File\" and clicking on \"Download\" button.\n",
1002 | "\n",
1003 | "
This will save the (.ipynb) file on your computer. Once saved, you can upload this file in the \"My Submission\" tab, of the \"Peer-graded Assignment\" section. \n"
1004 | ]
1005 | },
1006 | {
1007 | "cell_type": "markdown",
1008 | "id": "b7708c87-cdca-4b2c-9edb-829d8ea8a477",
1009 | "metadata": {},
1010 | "source": [
1011 | "
About the Authors:
\n",
1012 | "\n",
1013 | "Joseph Santarcangelo has a PhD in Electrical Engineering, his research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.\n",
1014 | "\n",
1015 | "### Other Contributors\n",
1016 | "\n",
1017 | "[Svitlana Kramar](https://www.linkedin.com/in/svitlana-kramar/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0232ENSkillsNetwork30654641-2022-01-01)\n"
1018 | ]
1019 | },
1020 | {
1021 | "cell_type": "markdown",
1022 | "id": "a993db4f-58c6-4192-a296-294459698ae3",
1023 | "metadata": {},
1024 | "source": [
1025 | "## Change Log\n",
1026 | "\n",
1027 | "| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n",
1028 | "| ----------------- | ------- | ------------- | --------------------------- |\n",
1029 | "| 2022-06-22 | 2.0 | Svitlana K. | Deleted GridSearch and Mock |\n",
1030 | "\n",
1031 | "##
\n","\n","\n","\n"]},{"cell_type":"markdown","id":"62e2bfb2-08bd-41de-a704-c72028242793","metadata":{},"source":["# Instructions\n"]},{"cell_type":"markdown","id":"c06281fa-8a02-493f-9b69-4b6738e6c8eb","metadata":{},"source":["In this notebook, you will practice all the classification algorithms that we have learned in this course.\n","\n","\n","Below, is where we are going to use the classification algorithms to create a model based on our training data and evaluate our testing data using evaluation metrics learned in the course.\n","\n","We will use some of the algorithms taught in the course, specifically:\n","\n","1. Linear Regression\n","2. KNN\n","3. Decision Trees\n","4. Logistic Regression\n","5. SVM\n","\n","We will evaluate our models using:\n","\n","1. Accuracy Score\n","2. Jaccard Index\n","3. F1-Score\n","4. LogLoss\n","5. Mean Absolute Error\n","6. Mean Squared Error\n","7. R2-Score\n","\n","Finally, you will use your models to generate the report at the end. \n"]},{"cell_type":"markdown","id":"9d4ee051-f50c-4ce5-aba1-167ffc8f5648","metadata":{},"source":["# About The Dataset\n"]},{"cell_type":"markdown","id":"4e4d2b57-e9af-4a7d-a7f9-b8c25660ba78","metadata":{},"source":["The original source of the data is Australian Government's Bureau of Meteorology and the latest data can be gathered from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).\n","\n","The dataset to be used has extra columns like 'RainToday' and our target is 'RainTomorrow', which was gathered from the Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)\n","\n","\n"]},{"cell_type":"markdown","id":"4b2d517d-9973-438d-8d32-dff28ad6ce84","metadata":{},"source":["This dataset contains observations of weather metrics for each day from 2008 to 2017. 
The **weatherAUS.csv** dataset includes the following fields:\n","\n","| Field | Description | Unit | Type |\n","| ------------- | ----------------------------------------------------- | --------------- | ------ |\n","| Date | Date of the Observation in YYYY-MM-DD | Date | object |\n","| Location | Location of the Observation | Location | object |\n","| MinTemp | Minimum temperature | Celsius | float |\n","| MaxTemp | Maximum temperature | Celsius | float |\n","| Rainfall | Amount of rainfall | Millimeters | float |\n","| Evaporation | Amount of evaporation | Millimeters | float |\n","| Sunshine | Amount of bright sunshine | hours | float |\n","| WindGustDir | Direction of the strongest gust | Compass Points | object |\n","| WindGustSpeed | Speed of the strongest gust | Kilometers/Hour | object |\n","| WindDir9am | Wind direction averaged of 10 minutes prior to 9am | Compass Points | object |\n","| WindDir3pm | Wind direction averaged of 10 minutes prior to 3pm | Compass Points | object |\n","| WindSpeed9am | Wind speed averaged of 10 minutes prior to 9am | Kilometers/Hour | float |\n","| WindSpeed3pm | Wind speed averaged of 10 minutes prior to 3pm | Kilometers/Hour | float |\n","| Humidity9am | Humidity at 9am | Percent | float |\n","| Humidity3pm | Humidity at 3pm | Percent | float |\n","| Pressure9am | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal | float |\n","| Pressure3pm | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal | float |\n","| Cloud9am | Fraction of the sky obscured by cloud at 9am | Eights | float |\n","| Cloud3pm | Fraction of the sky obscured by cloud at 3pm | Eights | float |\n","| Temp9am | Temperature at 9am | Celsius | float |\n","| Temp3pm | Temperature at 3pm | Celsius | float |\n","| RainToday | If there was rain today | Yes/No | object |\n","| RainTomorrow | If there is rain tomorrow | Yes/No | float |\n","\n","Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)\n","\n"]},{"cell_type":"markdown","id":"3ad995f0-a174-49a5-aaef-76294021d5d4","metadata":{},"source":["## **Import the required libraries**\n"]},{"cell_type":"code","execution_count":null,"id":"38dca360-78ed-407c-9f48-26b405bf8695","metadata":{},"outputs":[],"source":["# All Libraries required for this lab are listed below. 
The libraries pre-installed on Skills Network Labs are commented.\n","# !mamba install -qy pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 scikit-learn==0.20.1\n","# Note: If your environment doesn't support \"!mamba install\", use \"!pip install\""]},{"cell_type":"code","execution_count":null,"id":"ece29267-503d-4de0-8c69-f905815d57a3","metadata":{},"outputs":[],"source":["# Surpress warnings:\n","def warn(*args, **kwargs):\n"," pass\n","import warnings\n","warnings.warn = warn"]},{"cell_type":"code","execution_count":null,"id":"2344f678-d444-4a13-bd7f-d730f954116f","metadata":{},"outputs":[],"source":["import pandas as pd\n","from sklearn.linear_model import LogisticRegression\n","from sklearn.linear_model import LinearRegression\n","from sklearn import preprocessing\n","import numpy as np\n","from sklearn.neighbors import KNeighborsClassifier\n","from sklearn.model_selection import train_test_split\n","from sklearn.neighbors import KNeighborsClassifier\n","from sklearn.tree import DecisionTreeClassifier\n","from sklearn import svm\n","from sklearn.metrics import jaccard_score\n","from sklearn.metrics import f1_score\n","from sklearn.metrics import log_loss\n","from sklearn.metrics import confusion_matrix, accuracy_score\n","import sklearn.metrics as metrics"]},{"cell_type":"markdown","metadata":{},"source":[]},{"cell_type":"markdown","id":"2bdad242-edb6-4a5b-8471-f5918f3ecab7","metadata":{},"source":["### Importing the Dataset\n"]},{"cell_type":"code","execution_count":null,"id":"fa02651f-de37-4666-8ad4-b4ddc8d0b0ac","metadata":{},"outputs":[],"source":["from pyodide.http import pyfetch\n","\n","async def download(url, filename):\n"," response = await pyfetch(url)\n"," if response.status == 200:\n"," with open(filename, \"wb\") as f:\n"," f.write(await response.bytes())"]},{"cell_type":"code","execution_count":null,"id":"4d127aff-9339-42f9-bdfc-f2e17a19dd4c","metadata":{},"outputs":[],"source":["path='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv'"]},{"cell_type":"code","execution_count":null,"id":"5f770f7d-f967-4ae4-a96a-a6c8c0a02bcc","metadata":{},"outputs":[],"source":["await download(path, \"Weather_Data.csv\")\n","filename =\"Weather_Data.csv\""]},{"cell_type":"code","execution_count":null,"id":"f87df4cd-160e-4878-bff0-73f1289f5b90","metadata":{},"outputs":[],"source":["df = pd.read_csv(\"Weather_Data.csv\")"]},{"cell_type":"markdown","id":"6a7acb6d-0e0e-4321-acd2-8096b272c39d","metadata":{},"source":["> Note: This version of the lab is designed for JupyterLite, which necessitates downloading the dataset to the interface. However, when working with the downloaded version of this notebook on your local machines (Jupyter Anaconda), you can simply **skip the steps above of \"Importing the Dataset\"** and use the URL directly in the `pandas.read_csv()` function. 
You can uncomment and run the statements in the cell below.\n"]},{"cell_type":"code","execution_count":null,"id":"f9c77ad8-8b85-4e82-af47-f63163f889c3","metadata":{},"outputs":[],"source":["#filepath = \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv\"\n","#df = pd.read_csv(filepath)"]},{"cell_type":"code","execution_count":null,"id":"2fd31b01-a2bf-4263-8ff8-6fdce1990047","metadata":{},"outputs":[],"source":["df.head()"]},{"cell_type":"markdown","id":"eb2f4134-ab8b-48d8-aaab-85ce530aee65","metadata":{},"source":["### Data Preprocessing\n"]},{"cell_type":"markdown","id":"c70975f9-cae8-4cc6-be94-dbc1a880d3c9","metadata":{},"source":["#### One Hot Encoding\n"]},{"cell_type":"markdown","id":"cfadd018-a23d-4985-9eb3-8ed0d30abd52","metadata":{},"source":["First, we need to perform one hot encoding to convert categorical variables to binary variables.\n"]},{"cell_type":"code","execution_count":null,"id":"55968fd3-0422-4766-98fd-9397e0006e3e","metadata":{},"outputs":[],"source":["df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])"]},{"cell_type":"markdown","id":"e354a6fc-8c8b-499d-8444-5011a5146b1a","metadata":{},"source":["Next, we replace the values of the 'RainTomorrow' column changing them from a categorical column to a binary column. We do not use the `get_dummies` method because we would end up with two columns for 'RainTomorrow' and we do not want, since 'RainTomorrow' is our target.\n"]},{"cell_type":"code","execution_count":null,"id":"77f75277-a3ca-4ccc-a5b7-f95b2491cc9b","metadata":{},"outputs":[],"source":["df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)"]},{"cell_type":"markdown","id":"88ab6c18-f36a-408c-8510-fdf831abc53b","metadata":{},"source":["### Training Data and Test Data\n"]},{"cell_type":"markdown","id":"2a25156c-4080-4b15-a45c-9d4884ddce06","metadata":{},"source":["Now, we set our 'features' or x values and our Y or target variable.\n"]},{"cell_type":"code","execution_count":null,"id":"3077604d-a2f3-4e24-88ff-64b5ce69d6f0","metadata":{},"outputs":[],"source":["df_sydney_processed.drop('Date',axis=1,inplace=True)"]},{"cell_type":"code","execution_count":null,"id":"e1b66dd7-5bb7-4739-96b8-374d8e89269e","metadata":{},"outputs":[],"source":["df_sydney_processed = df_sydney_processed.astype(float)"]},{"cell_type":"code","execution_count":null,"id":"29857426-177a-4c87-8982-4fdca4ff4d78","metadata":{},"outputs":[],"source":["features = df_sydney_processed.drop(columns='RainTomorrow', axis=1)\n","Y = df_sydney_processed['RainTomorrow']"]},{"cell_type":"markdown","id":"1be81f61-64c3-43d0-89a8-22ff1a9339cf","metadata":{},"source":["### Linear Regression\n"]},{"cell_type":"markdown","id":"60256d7f-4d49-4aed-b2df-dc44a5b0791c","metadata":{},"source":["#### Q1) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `10`.\n"]},{"cell_type":"code","execution_count":null,"id":"843e32d6-7bdd-4d12-b10a-de70bed0e974","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute\n","\n"]},{"cell_type":"code","execution_count":null,"id":"c38f2196-bb38-4322-a012-9a27e6a9d8d8","metadata":{},"outputs":[{"ename":"","evalue":"","output_type":"error","traceback":["\u001b[1;31mRunning cells with 'Python 3.11.4' requires the ipykernel package.\n","\u001b[1;31mRun the following command to install 'ipykernel' into the Python environment. 
\n","\u001b[1;31mCommand: 'c:/Users/sid41/AppData/Local/Programs/Python/Python311/python.exe -m pip install ipykernel -U --user --force-reinstall'"]}],"source":["x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=10)"]},{"cell_type":"markdown","id":"b144e3a9-6bd2-4e75-8cae-50e1e3b3fb17","metadata":{},"source":["#### Q2) Create and train a Linear Regression model called LinearReg using the training data (`x_train`, `y_train`).\n"]},{"cell_type":"code","execution_count":null,"id":"22b77c93-ecf2-4c54-8ad4-496ef5396695","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"a4293ee5-572a-45cb-a090-5f30af826630","metadata":{},"outputs":[],"source":["LinearReg = LinearRegression()\n","LinearReg.fit(x_train, y_train)"]},{"cell_type":"markdown","id":"1aa0f086-fefc-4e44-8aa2-822aa7828ad6","metadata":{},"source":["#### Q3) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.\n"]},{"cell_type":"code","execution_count":null,"id":"a63dfa3d-a957-48dc-a067-93e9b3d11431","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"d6bdf734-7705-4590-aa54-6b93e20a9da7","metadata":{},"outputs":[],"source":["predictions = LinearReg.predict(x_test)"]},{"cell_type":"markdown","id":"ea13e307-0ac9-4c7c-874e-440cee795d95","metadata":{},"source":["#### Q4) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n"]},{"cell_type":"code","execution_count":null,"id":"cf2408d4-4932-487d-85f2-913c2efce09f","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"aba34a58-a974-467b-8bad-10a0fd0e88d3","metadata":{},"outputs":[],"source":["from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score\n","\n","LinearRegression_MAE = mean_absolute_error(y_test, predictions)\n","LinearRegression_MSE = mean_squared_error(y_test, predictions)\n","LinearRegression_R2 = r2_score(y_test, predictions)\n"]},{"cell_type":"markdown","id":"4552ab70-ec8a-4455-8c5f-f75f6e2e2771","metadata":{},"source":["#### Q5) Show the MAE, MSE, and R2 in a tabular format using data frame for the linear model.\n"]},{"cell_type":"code","execution_count":null,"id":"cc932bbd-9528-45b5-b85a-8c9a0b9560ed","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"edd964f0-d1aa-4e3b-9a52-577656477438","metadata":{},"outputs":[],"source":["report_data = {\n"," 'Model': ['Linear Regression', 'KNN', 'Decision Tree', 'Logistic Regression', 'SVM'],\n"," 'Accuracy': [LinearReg_Accuracy, KNN_Accuracy, Tree_Accuracy, LR_Accuracy, SVM_Accuracy],\n"," 'Jaccard Index': [LinearReg_Jaccard, KNN_Jaccard, Tree_Jaccard, LR_Jaccard, SVM_Jaccard],\n"," 'F1 Score': [LinearReg_F1, KNN_F1, Tree_F1, LR_F1, SVM_F1],\n"," 'Log Loss': [None, None, None, LR_LogLoss, None] # Only for Logistic Regression\n","}\n","\n","report_df = pd.DataFrame(report_data)\n","print(report_df)"]},{"cell_type":"markdown","id":"55351393-abee-4af8-8944-f0079265cd56","metadata":{},"source":["### KNN\n"]},{"cell_type":"markdown","id":"b7e38ebb-7442-4980-b883-e89c6de0d351","metadata":{},"source":["#### Q6) Create and train a KNN model called KNN using the training data (`x_train`, `y_train`) with the `n_neighbors` parameter set to 
`4`.\n"]},{"cell_type":"code","execution_count":null,"id":"213be9cb-8c88-4099-b023-def7c0e31c6a","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"f03fe0e5-ac9f-42cd-a107-d5d46b16ae4d","metadata":{},"outputs":[],"source":["from sklearn.neighbors import KNeighborsClassifier\n","\n","KNN = KNeighborsClassifier(n_neighbors=4)\n","KNN.fit(x_train, y_train)\n"]},{"cell_type":"markdown","id":"0ef93a31-0d67-4fa5-809b-a765b13f5888","metadata":{},"source":["#### Q7) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.\n"]},{"cell_type":"code","execution_count":null,"id":"cf386d08-521b-418b-a5d6-601318bd8e93","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"8f88815e-92fd-4920-983c-326502b8bc29","metadata":{},"outputs":[],"source":["predictions = KNN.predict(x_test)"]},{"cell_type":"markdown","id":"9913f102-99a8-4af3-858d-1a55a2d49259","metadata":{},"source":["#### Q8) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n"]},{"cell_type":"code","execution_count":null,"id":"02eb3dd1-0c03-407a-9042-d446bd5ce557","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"47ddb040-001c-4c7f-907a-e0fe69d2bfb7","metadata":{},"outputs":[],"source":["from sklearn.metrics import accuracy_score, jaccard_score, f1_score\n","\n","KNN_Accuracy_Score = accuracy_score(y_test, predictions)\n","KNN_JaccardIndex = jaccard_score(y_test, predictions)\n","KNN_F1_Score = f1_score(y_test, predictions)\n","\n","print(f\"KNN Accuracy Score: {KNN_Accuracy_Score}\")\n","print(f\"KNN Jaccard Index: {KNN_JaccardIndex}\")\n","print(f\"KNN F1 Score: {KNN_F1_Score}\")\n"]},{"cell_type":"markdown","id":"b1b49d21-0f4b-4737-a15c-88574afb6dc5","metadata":{},"source":["### Decision Tree\n"]},{"cell_type":"markdown","id":"07aedd58-1090-48ed-8fe9-38c64f4492ad","metadata":{},"source":["#### Q9) Create and train a Decision Tree model called Tree using the training data (`x_train`, `y_train`).\n"]},{"cell_type":"code","execution_count":null,"id":"b61b238f-7880-4cd9-9764-b69b5c63f352","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"f4a9ea81-7336-4cab-aa98-22cb6c6e79ca","metadata":{},"outputs":[],"source":["from sklearn.tree import DecisionTreeClassifier\n","\n","Tree = DecisionTreeClassifier()\n","Tree.fit(x_train, y_train)\n"]},{"cell_type":"markdown","id":"d79279ba-1e22-45c9-a6a7-e1a39a61c512","metadata":{},"source":["#### Q10) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.\n"]},{"cell_type":"code","execution_count":null,"id":"5c99f9a8-2fa2-48f3-83e9-8781dc1b0d61","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":2,"id":"4f6ea989-4778-47c8-9bc0-f6fae679f3c3","metadata":{},"outputs":[{"ename":"NameError","evalue":"name 'Tree' is not defined","output_type":"error","traceback":["\u001b[1;31m---------------------------------------------------------------------------\u001b[0m","\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)","Cell \u001b[1;32mIn[2], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m predictions \u001b[38;5;241m=\u001b[39m \u001b[43mTree\u001b[49m\u001b[38;5;241m.\u001b[39mpredict(x_test)\n","\u001b[1;31mNameError\u001b[0m: name 
'Tree' is not defined"]}],"source":["predictions = Tree.predict(x_test)"]},{"cell_type":"markdown","id":"f19bae36-1072-4baf-bcf8-3c098eeb1915","metadata":{},"source":["#### Q11) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n"]},{"cell_type":"code","execution_count":null,"id":"0c47c2e2-14c7-4e5a-aaea-86212a0fa032","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"edb09a5f-c3e2-4c4f-924b-85f6ffe96729","metadata":{},"outputs":[],"source":["Tree_Accuracy_Score = accuracy_score(y_test, predictions)\n","Tree_JaccardIndex = jaccard_score(y_test, predictions)\n","Tree_F1_Score = f1_score(y_test, predictions)\n","\n","print(f\"Decision Tree Accuracy Score: {Tree_Accuracy_Score}\")\n","print(f\"Decision Tree Jaccard Index: {Tree_JaccardIndex}\")\n","print(f\"Decision Tree F1 Score: {Tree_F1_Score}\")"]},{"cell_type":"markdown","id":"f2905933-3b27-4ece-a80a-b35744f54b5f","metadata":{},"source":["### Logistic Regression\n"]},{"cell_type":"markdown","id":"490cdcfd-14aa-417d-b3ed-29577d13f7da","metadata":{},"source":["#### Q12) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `1`.\n"]},{"cell_type":"code","execution_count":null,"id":"31bb6aa1-c399-4aed-89eb-4800b1854f0f","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"f7f13536-93e4-4edd-9c75-49a6d342f446","metadata":{},"outputs":[],"source":["from sklearn.model_selection import train_test_split\n","\n","x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=1)\n"]},{"cell_type":"markdown","id":"cd2f53d8-3983-4581-8363-f11950b80b85","metadata":{},"source":["#### Q13) Create and train a LogisticRegression model called LR using the training data (`x_train`, `y_train`) with the `solver` parameter set to `liblinear`.\n"]},{"cell_type":"code","execution_count":null,"id":"d8bf2bf5-8c9b-4250-beca-591e1a62dd1d","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"57ef08e5-9b64-4337-92a6-9af7f98a81a4","metadata":{},"outputs":[],"source":["from sklearn.linear_model import LogisticRegression\n","\n","LR = LogisticRegression(solver='liblinear')\n","LR.fit(x_train, y_train)\n"]},{"cell_type":"markdown","id":"cdaf1cdd-61de-46e4-a252-6641aa13998b","metadata":{},"source":["#### Q14) Now, use the `predict` and `predict_proba` methods on the testing data (`x_test`) and save it as 2 arrays `predictions` and `predict_proba`.\n"]},{"cell_type":"code","execution_count":null,"id":"421725a3-8a77-4239-b6ee-3f7053ad6807","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"b2aad2f1-d2b7-4267-ab2e-1913d6b681a8","metadata":{},"outputs":[],"source":["predictions = LR.predict(x_test)"]},{"cell_type":"code","execution_count":null,"id":"855f934b-4098-4e23-b5dd-b1ebff6c02b8","metadata":{},"outputs":[],"source":["predict_proba = LR.predict_proba(x_test)\n"]},{"cell_type":"markdown","id":"08f3dc13-8be2-4d1e-9866-ff97141ae692","metadata":{},"source":["#### Q15) Using the `predictions`, `predict_proba` and the `y_test` dataframe calculate the value for each metric using the appropriate 
function.\n"]},{"cell_type":"code","execution_count":null,"id":"0be54747-7fa3-46e6-a4cf-bd5e820ac616","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"6dc992b9-a3f8-49b5-a988-90118971013f","metadata":{},"outputs":[],"source":["from sklearn.metrics import log_loss\n","\n","LR_Accuracy_Score = accuracy_score(y_test, predictions)\n","LR_JaccardIndex = jaccard_score(y_test, predictions)\n","LR_F1_Score = f1_score(y_test, predictions)\n","LR_Log_Loss = log_loss(y_test, predict_proba)\n","\n","print(f\"Logistic Regression Accuracy Score: {LR_Accuracy_Score}\")\n","print(f\"Logistic Regression Jaccard Index: {LR_JaccardIndex}\")\n","print(f\"Logistic Regression F1 Score: {LR_F1_Score}\")\n","print(f\"Logistic Regression Log Loss: {LR_Log_Loss}\")\n"]},{"cell_type":"markdown","id":"0c7326ae-5aa6-4666-b4d6-0705e5bcb771","metadata":{},"source":["### SVM\n"]},{"cell_type":"markdown","id":"920bae21-8886-4705-b6b1-85c1ca4506ee","metadata":{},"source":["#### Q16) Create and train a SVM model called SVM using the training data (`x_train`, `y_train`).\n"]},{"cell_type":"code","execution_count":null,"id":"4ed2651e-3dd8-46bd-8a31-e7efa095a5dc","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"55d94ee3-60bb-4307-8fae-8d87dfd0f5ad","metadata":{},"outputs":[],"source":["from sklearn.svm import SVC\n","\n","SVM = SVC()\n","SVM.fit(x_train, y_train)\n"]},{"cell_type":"markdown","id":"755cb519-2721-4674-9d21-85a154fde994","metadata":{},"source":["#### Q17) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.\n"]},{"cell_type":"code","execution_count":null,"id":"de56e316-aaca-4ed9-89eb-1d69140ff04c","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"cb98d313-75b6-4bea-b79c-efec4b9e412c","metadata":{},"outputs":[],"source":["predictions = SVM.predict(x_test)"]},{"cell_type":"markdown","id":"961ccca3-1fac-476a-93d2-39d6c8b3905b","metadata":{},"source":["#### Q18) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n"]},{"cell_type":"code","execution_count":null,"id":"34922618-6a7d-494c-a1b6-5f515f29a801","metadata":{},"outputs":[],"source":["SVM_Accuracy_Score = accuracy_score(y_test, predictions)\n","SVM_JaccardIndex = jaccard_score(y_test, predictions)\n","SVM_F1_Score = f1_score(y_test, predictions)\n","\n","print(f\"SVM Accuracy Score: {SVM_Accuracy_Score}\")\n","print(f\"SVM Jaccard Index: {SVM_JaccardIndex}\")\n","print(f\"SVM F1 Score: {SVM_F1_Score}\")\n"]},{"cell_type":"markdown","id":"4e02f921-2696-4a0b-b9b6-cfc89a55f77d","metadata":{},"source":["### Report\n"]},{"cell_type":"markdown","id":"1f696bf7-a40a-404b-af35-b9a66f6304d6","metadata":{},"source":["#### Q19) Show the Accuracy,Jaccard Index,F1-Score and LogLoss in a tabular format using data frame for all of the above models.\n","\n","\\*LogLoss is only for Logistic Regression Model\n"]},{"cell_type":"code","execution_count":null,"id":"f7cc9f99-9da8-48e1-916e-fd642e28b773","metadata":{},"outputs":[],"source":["import pandas as pd\n","\n","Report = pd.DataFrame({\n"," 'Model': ['KNN', 'Decision Tree', 'Logistic Regression', 'SVM'],\n"," 'Accuracy': [KNN_Accuracy_Score, Tree_Accuracy_Score, LR_Accuracy_Score, SVM_Accuracy_Score],\n"," 'Jaccard Index': [KNN_JaccardIndex, Tree_JaccardIndex, LR_JaccardIndex, SVM_JaccardIndex],\n"," 
'F1-Score': [KNN_F1_Score, Tree_F1_Score, LR_F1_Score, SVM_F1_Score],\n"," 'LogLoss': [None, None, LR_Log_Loss, None] # Only Logistic Regression uses LogLoss\n","})\n","\n","print(Report)\n"]},{"cell_type":"markdown","id":"d7463336-6b5d-4e9e-97a2-86fdf095a9f0","metadata":{},"source":["
How to submit
\n","\n","
Once you complete your notebook you will have to share it. You can download the notebook by navigating to \"File\" and clicking on \"Download\" button.\n","\n","
This will save the (.ipynb) file on your computer. Once saved, you can upload this file in the \"My Submission\" tab, of the \"Peer-graded Assignment\" section. \n"]},{"cell_type":"markdown","id":"b7708c87-cdca-4b2c-9edb-829d8ea8a477","metadata":{},"source":["
About the Authors:
\n","\n","Joseph Santarcangelo has a PhD in Electrical Engineering, his research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.\n","\n","### Other Contributors\n","\n","[Svitlana Kramar](https://www.linkedin.com/in/svitlana-kramar/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0232ENSkillsNetwork30654641-2022-01-01)\n"]},{"cell_type":"markdown","id":"a993db4f-58c6-4192-a296-294459698ae3","metadata":{},"source":["## Change Log\n","\n","| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n","| ----------------- | ------- | ------------- | --------------------------- |\n","| 2022-06-22 | 2.0 | Svitlana K. | Deleted GridSearch and Mock |\n","\n","##
"
878 | {
879 | "cell_type": "markdown",
880 | "source": [
881 | "Finally lets split our data into a training and testing dataset using train_test_split from sklearn.model_selection"
882 | ],
883 | "metadata": {
884 | "id": "MOgldE4Cbzca"
885 | }
886 | },
887 | {
888 | "cell_type": "code",
889 | "source": [
890 | "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2, random_state=1)"
891 | ],
892 | "metadata": {
893 | "id": "ac3sCRS7b0zZ"
894 | },
895 | "execution_count": 10,
896 | "outputs": []
897 | },
898 | {
899 | "cell_type": "markdown",
900 | "source": [
901 | "**Create Regression Tree**\n",
902 | "\n",
903 | "---\n",
904 | "\n",
905 | "* Regression Trees are implemented using DecisionTreeRegressor from sklearn.tree\n",
906 | "\n",
907 | "The important parameters of DecisionTreeRegressor are\n",
908 | "\n",
909 | "* criterion: {\"mse\", \"friedman_mse\", \"mae\", \"poisson\"} - The function used to measure error\n",
910 | "\n",
911 | "* max_depth - The max depth the tree can be\n",
912 | "\n",
913 | "* min_samples_split - The minimum number of samples required to split a node\n",
914 | "\n",
915 | "* min_samples_leaf - The minimum number of samples that a leaf can contain\n",
916 | "\n",
917 | "* max_features: {\"auto\", \"sqrt\", \"log2\"} - The number of feature we examine looking for the best one, used to speed up training"
918 | ],
919 | "metadata": {
920 | "id": "e5HuJdOZb3d4"
921 | }
922 | },
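{
"cell_type": "markdown",
"source": [
"As an illustration (not required for this lab), the sketch below sets several of the parameters listed above explicitly; the values are arbitrary examples, not tuned choices."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Illustrative only: a DecisionTreeRegressor configured with the\n",
"# parameters described above (values chosen purely as examples)\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"\n",
"example_tree = DecisionTreeRegressor(\n",
"    criterion='squared_error',  # error measure used to evaluate splits\n",
"    max_depth=8,                # limit how deep the tree can grow\n",
"    min_samples_split=10,       # a node needs at least 10 samples to be split\n",
"    min_samples_leaf=4,         # every leaf must keep at least 4 samples\n",
"    max_features='sqrt'         # examine sqrt(n_features) features per split\n",
")"
],
"metadata": {},
"execution_count": null,
"outputs": []
},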
923 | {
924 | "cell_type": "markdown",
925 | "source": [
926 | "First lets start by creating a DecisionTreeRegressor object, setting the criterion parameter to mse for Mean Squared Error"
927 | ],
928 | "metadata": {
929 | "id": "Go5P-LqvcF0q"
930 | }
931 | },
932 | {
933 | "cell_type": "code",
934 | "source": [
935 | "# regression_tree = DecisionTreeRegressor(criterion = 'mse')\n",
936 | "regression_tree = DecisionTreeRegressor(criterion = 'squared_error')"
937 | ],
938 | "metadata": {
939 | "id": "9iuZaWyxcGuL"
940 | },
941 | "execution_count": 21,
942 | "outputs": []
943 | },
951 | {
952 | "cell_type": "markdown",
953 | "source": [
954 | "**Training**\n",
955 | "\n",
956 | "---\n",
957 | "\n"
958 | ],
959 | "metadata": {
960 | "id": "kr8OIoVFcKJo"
961 | }
962 | },
963 | {
964 | "cell_type": "markdown",
965 | "source": [
966 | "Now lets train our model using the fit method on the DecisionTreeRegressor object providing our training data"
967 | ],
968 | "metadata": {
969 | "id": "7s6YlYl8cM6u"
970 | }
971 | },
972 | {
973 | "cell_type": "code",
974 | "source": [
975 | "regression_tree.fit(X_train, Y_train)"
976 | ],
977 | "metadata": {
978 | "colab": {
979 | "base_uri": "https://localhost:8080/",
980 | "height": 74
981 | },
982 | "id": "zdbto3UOcO7P",
983 | "outputId": "ca433c9c-53e3-41a9-df90-0054433d6fa2"
984 | },
985 | "execution_count": 22,
986 | "outputs": [
987 | {
988 | "output_type": "execute_result",
989 | "data": {
990 | "text/plain": [
991 | "DecisionTreeRegressor()"
992 | ],
993 | "text/html": [
994 | "
DecisionTreeRegressor()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.