├── .gitignore
├── Classification
│   ├── Decision-Trees-drug.ipynb
│   ├── Faster_Taxi_Tip_Prediction.ipynb
│   ├── K-Nearest-neighbors-CustCat.ipynb
│   ├── Regression_Trees.ipynb
│   └── classification_tree_svm.ipynb
├── Clustering
│   └── Clus-K-Means-Customer-Seg.ipynb
├── Linear Classification
│   ├── Clas-SVM-cancer-project.ipynb
│   ├── GridSearchCV.html
│   ├── Logistic-Regression-churn.ipynb
│   └── Multi-class_Classification.ipynb
├── README.md
└── Regression
    ├── Mulitple-Linear-Regression-Co2.ipynb
    └── Simple-Linear-Regression-Co2.ipynb

/.gitignore:
--------------------------------------------------------------------------------
1 | practice/
2 |
--------------------------------------------------------------------------------
/Classification/Decision-Trees-drug.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"cell_type":"markdown","id":"f076aa51-8721-4294-852f-dc57e0faf672","metadata":{},"outputs":[],"source":["

\n"," \n"," \"Skills\n"," \n","

\n","\n","# Decision Trees\n","\n","Estimated time needed: **15** minutes\n","\n","## Objectives\n","\n","After completing this lab you will be able to:\n","\n","* Develop a classification model using Decision Tree Algorithm\n"]},{"cell_type":"markdown","id":"9988b362-7147-46f3-a9ef-cee28326050d","metadata":{},"outputs":[],"source":["In this lab exercise, you will learn a popular machine learning algorithm, Decision Trees. You will use this classification algorithm to build a model from the historical data of patients, and their response to different medications. Then you will use the trained decision tree to predict the class of an unknown patient, or to find a proper drug for a new patient.\n"]},{"cell_type":"markdown","id":"c7cfbef6-4631-468f-9faa-878afac837da","metadata":{},"outputs":[],"source":["

Table of contents

\n","\n","
\n","
    \n","
  1. About the dataset\n","
  2. Downloading the Data\n","
  3. Pre-processing\n","
  4. Setting up the Decision Tree\n","
  5. Modeling\n","
  6. Prediction\n","
  7. Evaluation\n","
  8. Visualization\n","
\n","
\n","
\n","
\n"]},{"cell_type":"markdown","id":"289ad36e-04ea-422f-a363-ba73031205a6","metadata":{},"outputs":[],"source":["Import the Following Libraries:\n","\n","\n"]},{"cell_type":"markdown","id":"44a2ca3a-0e0e-4a60-9036-f7999a3b6b50","metadata":{},"outputs":[],"source":["if you uisng you own version comment out\n"]},{"cell_type":"code","id":"35a957ad-2210-40d8-b122-dbad4bc2d703","metadata":{},"outputs":[],"source":["# Surpress warnings:\ndef warn(*args, **kwargs):\n pass\nimport warnings\nwarnings.warn = warn"]},{"cell_type":"code","id":"94166aa2-9f99-4210-b56e-01119f654594","metadata":{},"outputs":[],"source":["import sys\nimport numpy as np \nimport pandas as pd\nfrom sklearn.tree import DecisionTreeClassifier\nimport sklearn.tree as tree"]},{"cell_type":"markdown","id":"9ac3b5a6-8ec3-404a-a6f5-e614822d5e53","metadata":{},"outputs":[],"source":["
\n","

About the dataset

\n"," Imagine that you are a medical researcher compiling data for a study. You have collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of 5 medications, Drug A, Drug B, Drug c, Drug x and y. \n","
\n","
\n"," Part of your job is to build a model to find out which drug might be appropriate for a future patient with the same illness. The features of this dataset are Age, Sex, Blood Pressure, and the Cholesterol of the patients, and the target is the drug that each patient responded to.\n","
\n","
\n"," It is a sample of multiclass classifier, and you can use the training part of the dataset \n"," to build a decision tree, and then use it to predict the class of an unknown patient, or to prescribe a drug to a new patient.\n","
\n"]},{"cell_type":"markdown","id":"04386aa4-c572-439f-a9ae-a8922d1b9117","metadata":{},"outputs":[],"source":["
\n","

Downloading the Data

\n"," To download the data, we will use pandas library to read itdirectly into a dataframe from IBM Object Storage.\n","
\n"]},{"cell_type":"code","id":"eb55716d-bbe7-4438-a67e-e155689500a6","metadata":{},"outputs":[],"source":["my_data = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/drug200.csv', delimiter=\",\")\nmy_data.head()"]},{"cell_type":"markdown","id":"d34e503f-791f-4b8e-8768-9071114a35f7","metadata":{},"outputs":[],"source":["**Did you know?** When it comes to Machine Learning, you will likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](http://cocl.us/ML0101EN-IBM-Offer-CC)\n"]},{"cell_type":"markdown","id":"abd637ad-3fbc-4cd6-b204-ff36a450f35e","metadata":{},"outputs":[],"source":["
\n","

Practice

\n"," What is the size of data? \n","
\n"]},{"cell_type":"code","id":"8d1ed758-dfb4-47f5-99bd-8e86527cb5e4","metadata":{},"outputs":[],"source":["# write your code here\n\n\n"]},{"cell_type":"markdown","id":"892dd3c3-2127-4d2a-8249-81db0348996a","metadata":{},"outputs":[],"source":["
Click here for the solution\n","\n","```python\n","my_data.shape\n","\n","```\n","\n","
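Beyond the shape, it can also help to look at how balanced the target classes are. A minimal sketch, assuming `my_data` has been loaded as shown above:

```python
# Count how many patients responded to each drug (the target classes).
my_data['Drug'].value_counts()
```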
\n"]},{"cell_type":"markdown","id":"0100583b-6523-48eb-a674-1bc3ec153b52","metadata":{},"outputs":[],"source":["
\n","

Pre-processing

\n","
\n"]},{"cell_type":"markdown","id":"ce559a26-7dec-4d31-8cf4-672ffe89bc0c","metadata":{},"outputs":[],"source":["Using my_data as the Drug.csv data read by pandas, declare the following variables:
\n","\n","\n"]},{"cell_type":"markdown","id":"fa1dd16e-313f-450c-8fe9-de3be740e4f9","metadata":{},"outputs":[],"source":["Remove the column containing the target name since it doesn't contain numeric values.\n"]},{"cell_type":"code","id":"5344e3dc-7b2b-4d4d-874e-9fe72b1eacef","metadata":{},"outputs":[],"source":["X = my_data[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values\nX[0:5]\n"]},{"cell_type":"markdown","id":"4cbd957e-cd50-41fe-9c98-5256dd5171fc","metadata":{},"outputs":[],"source":["As you may figure out, some features in this dataset are categorical, such as **Sex** or **BP**. Unfortunately, Sklearn Decision Trees does not handle categorical variables. We can still convert these features to numerical values using the **LabelEncoder() method**\n","to convert the categorical variable into dummy/indicator variables.\n"]},{"cell_type":"code","id":"4bd9880f-93a8-4c50-8da8-862a6e52b377","metadata":{},"outputs":[],"source":["from sklearn import preprocessing\nle_sex = preprocessing.LabelEncoder()\nle_sex.fit(['F','M'])\nX[:,1] = le_sex.transform(X[:,1]) \n\n\nle_BP = preprocessing.LabelEncoder()\nle_BP.fit([ 'LOW', 'NORMAL', 'HIGH'])\nX[:,2] = le_BP.transform(X[:,2])\n\n\nle_Chol = preprocessing.LabelEncoder()\nle_Chol.fit([ 'NORMAL', 'HIGH'])\nX[:,3] = le_Chol.transform(X[:,3]) \n\nX[0:5]\n"]},{"cell_type":"markdown","id":"db6d82bc-6878-49ce-b6c1-e6ce1e408692","metadata":{},"outputs":[],"source":["Now we can fill the target variable.\n"]},{"cell_type":"code","id":"2ff65941-1a6b-4548-b605-701ba873b9f1","metadata":{},"outputs":[],"source":["y = my_data[\"Drug\"]\ny[0:5]"]},{"cell_type":"markdown","id":"54dfc234-6476-486c-9a69-5c9651bcb48f","metadata":{},"outputs":[],"source":["
\n","\n","
\n","

Setting up the Decision Tree

\n"," We will be using train/test split on our decision tree. Let's import train_test_split from sklearn.cross_validation.\n","
\n"]},{"cell_type":"code","id":"fa69ddb6-3c32-4090-87fd-bd19ea0952f5","metadata":{},"outputs":[],"source":["from sklearn.model_selection import train_test_split"]},{"cell_type":"markdown","id":"94161ff1-f396-4866-b424-0c4a35bf7a17","metadata":{},"outputs":[],"source":["Now train_test_split will return 4 different parameters. We will name them:
\n","X_trainset, X_testset, y_trainset, y_testset

\n","The train_test_split will need the parameters:
\n","X, y, test_size=0.3, and random_state=3.

\n","The X and y are the arrays required before the split, the test_size represents the ratio of the testing dataset, and the random_state ensures that we obtain the same splits.\n"]},{"cell_type":"code","id":"07fc6d56-56e2-49b8-b454-a780ae1e24ff","metadata":{},"outputs":[],"source":["X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=3)"]},{"cell_type":"markdown","id":"41f1a713-7e46-4dfa-8e56-f4d99499dcc5","metadata":{},"outputs":[],"source":["

Practice

\n","Print the shape of X_trainset and y_trainset. Ensure that the dimensions match.\n"]},{"cell_type":"code","id":"1128e3ab-2532-479a-bcd8-44ef4fc779b8","metadata":{},"outputs":[],"source":["# your code\n\n"]},{"cell_type":"markdown","id":"48b12a31-ad3e-47bb-a303-8948fa376d9e","metadata":{},"outputs":[],"source":["
Click here for the solution\n","\n","```python\n","print('Shape of X training set {}'.format(X_trainset.shape),'&',' Size of Y training set {}'.format(y_trainset.shape))\n","\n","```\n","\n","
\n"]},{"cell_type":"markdown","id":"e2cc60a8-5a95-4519-9e95-52019cb0bb0d","metadata":{},"outputs":[],"source":["Print the shape of X_testset and y_testset. Ensure that the dimensions match.\n"]},{"cell_type":"code","id":"b2e5c903-9f91-40a9-a06b-407305040c7f","metadata":{},"outputs":[],"source":["# your code\n\n"]},{"cell_type":"markdown","id":"54e8f9a7-431c-43a7-9497-f7d80f7fb60a","metadata":{},"outputs":[],"source":["
Click here for the solution\n","\n","```python\n","print('Shape of X testing set {}'.format(X_testset.shape),'&',' Size of Y testing set {}'.format(y_testset.shape))\n","\n","```\n","\n","
\n"]},{"cell_type":"markdown","id":"7db98bdf-d418-4a52-af48-8c1b5fc6e220","metadata":{},"outputs":[],"source":["
\n","\n","
\n","

Modeling

\n"," We will first create an instance of the DecisionTreeClassifier called drugTree.
\n"," Inside of the classifier, specify criterion=\"entropy\" so we can see the information gain of each node.\n","
\n"]},{"cell_type":"code","id":"e0f7c2bf-8cfb-4719-960f-7c4047d2c907","metadata":{},"outputs":[],"source":["drugTree = DecisionTreeClassifier(criterion=\"entropy\", max_depth = 4)\ndrugTree # it shows the default parameters"]},{"cell_type":"markdown","id":"912f9dee-3c09-425e-a2d8-a17f9e80fc7f","metadata":{},"outputs":[],"source":["Next, we will fit the data with the training feature matrix X_trainset and training response vector y_trainset \n"]},{"cell_type":"code","id":"ce6d1b4c-fb38-4883-90e5-4479079cdccd","metadata":{},"outputs":[],"source":["drugTree.fit(X_trainset,y_trainset)"]},{"cell_type":"markdown","id":"95565da6-b489-4079-91d9-6951decc920b","metadata":{},"outputs":[],"source":["
\n","\n","
\n","

Prediction

\n"," Let's make some predictions on the testing dataset and store it into a variable called predTree.\n","
\n"]},{"cell_type":"code","id":"0075aacc-8614-4687-a212-2accdb97cfbd","metadata":{},"outputs":[],"source":["predTree = drugTree.predict(X_testset)"]},{"cell_type":"markdown","id":"44fbcfa3-178f-409c-9503-431a08aa1668","metadata":{},"outputs":[],"source":["You can print out predTree and y_testset if you want to visually compare the predictions to the actual values.\n"]},{"cell_type":"code","id":"edfa7c65-4cb8-4628-a178-7eda96b79a14","metadata":{},"outputs":[],"source":["print (predTree [0:5])\nprint (y_testset [0:5])\n"]},{"cell_type":"markdown","id":"e3b70361-e2e2-4f39-bde2-50a2cbbc7623","metadata":{},"outputs":[],"source":["
\n","\n","
\n","

Evaluation

\n"," Next, let's import metrics from sklearn and check the accuracy of our model.\n","
\n"]},{"cell_type":"code","id":"7830517c-4c14-4371-954b-61136f0852ad","metadata":{},"outputs":[],"source":["from sklearn import metrics\nimport matplotlib.pyplot as plt\nprint(\"DecisionTrees's Accuracy: \", metrics.accuracy_score(y_testset, predTree))"]},{"cell_type":"markdown","id":"98965c4a-ca91-46d3-8577-1d297dbd6ec1","metadata":{},"outputs":[],"source":["**Accuracy classification score** computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.\n","\n","In multilabel classification, the function returns the subset accuracy. If the entire set of predicted labels for a sample strictly matches with the true set of labels, then the subset accuracy is 1.0; otherwise it is 0.0.\n"]},{"cell_type":"markdown","id":"6e52b91b-8aa2-4af0-bb9f-2e9ec06bc4f1","metadata":{},"outputs":[],"source":["
\n","\n","
\n","

Visualization

\n","\n","Let's visualize the tree\n","\n","
\n"]},{"cell_type":"code","id":"d03eb273-2934-4b1c-a66b-6cce952be3a5","metadata":{},"outputs":[],"source":["# Notice: You might need to uncomment and install the pydotplus and graphviz libraries if you have not installed these before\n#!conda install -c conda-forge pydotplus -y\n#!conda install -c conda-forge python-graphviz -y\n\n#After executing the code below, a file named 'tree.png' would be generated which contains the decision tree image."]},{"cell_type":"code","id":"06636890-7ee5-40d2-bbd4-27d22f3c9162","metadata":{},"outputs":[],"source":["from sklearn.tree import export_graphviz\nexport_graphviz(drugTree, out_file='tree.dot', filled=True, feature_names=['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K'])\n!dot -Tpng tree.dot -o tree.png\n"]},{"cell_type":"markdown","id":"ce9c5f0d-48fd-4d87-9649-eaadf3108ff5","metadata":{},"outputs":[],"source":["

Want to learn more?

\n","\n","IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n","\n","Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n"]},{"cell_type":"markdown","id":"0bd1756a-d2ae-4930-93fa-e8de74723000","metadata":{},"outputs":[],"source":["### Thank you for completing this lab!\n","\n","## Author\n","\n","Saeed Aghabozorgi\n","\n","### Other Contributors\n","\n","Joseph Santarcangelo\n","\n","Richard Ye\n","\n","## Change Log\n","\n","| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n","| ----------------- | ------- | ---------- | ------------------------------------------------ |\n","| 2022-05-24 | 2.3 | Richard Ye | Fixed ability to work in JupyterLite and locally |\n","| 2020-11-20 | 2.2 | Lakshmi | Changed import statement of StringIO |\n","| 2020-11-03 | 2.1 | Lakshmi | Changed URL of the csv |\n","| 2020-08-27 | 2.0 | Lavanya | Moved lab to course repo in GitLab |\n","| | | | |\n","| | | | |\n","\n","##

© IBM Corporation 2020. All rights reserved.

\n"]}],"metadata":{"kernelspec":{"display_name":"Python","language":"python","name":"conda-env-python-py"},"language_info":{"name":""}},"nbformat":4,"nbformat_minor":4} -------------------------------------------------------------------------------- /Classification/Faster_Taxi_Tip_Prediction.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","id":"754df58e-9522-4366-bbdd-07f1f270c53a","metadata":{},"outputs":[],"source":["

\n"," \n"," \"Skills\n"," \n","

\n"]},{"cell_type":"markdown","id":"fcc96a9c-98be-4cf5-a2dc-b1bc459e130e","metadata":{},"outputs":[],"source":["# **Taxi Tip Prediction using Scikit-Learn and Snap ML**\n"]},{"cell_type":"markdown","id":"7b99875b-1dc1-4d11-828a-9515d82fe4c6","metadata":{},"outputs":[],"source":["Estimated time needed: **30** minutes\n"]},{"cell_type":"markdown","id":"9878c9d4-e6d1-4747-813b-93050837b322","metadata":{},"outputs":[],"source":["In this exercise session you will consolidate your machine learning (ML) modeling skills by using a popular regression model: Decision Tree. You will use a real dataset to train such a model. The dataset includes information about taxi tip and was collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP). You will use the trained model to predict the amount of tip paid. \n","\n","In the current exercise session, you will practice not only the Scikit-Learn Python interface, but also the Python API offered by the Snap Machine Learning (Snap ML) library. Snap ML is a high-performance IBM library for ML modeling. It provides highly-efficient CPU/GPU implementations of linear models and tree-based models. Snap ML not only accelerates ML algorithms through system awareness, but it also offers novel ML algorithms with best-in-class accuracy. For more information, please visit https://www.zurich.ibm.com/snapml/.\n"]},{"cell_type":"markdown","id":"a9bd7a14-bb27-4666-acb4-9f9309355bdd","metadata":{},"outputs":[],"source":["## Objectives\n"]},{"cell_type":"markdown","id":"358d0c1c-2cee-439c-abb5-051c898fcb38","metadata":{},"outputs":[],"source":["After completing this lab you will be able to:\n"]},{"cell_type":"markdown","id":"e6238628-3700-4b60-be9c-dd38bba88032","metadata":{},"outputs":[],"source":["* Perform basic data preprocessing using Scikit-Learn\n","* Model a regression task using the Scikit-Learn and Snap ML Python APIs\n","* Train a Decision Tree Regressor model using Scikit-Learn and Snap ML\n","* Run inference and assess the quality of the trained models\n"]},{"cell_type":"markdown","id":"a68ad47a-2b88-4507-9bbd-7222e6c04363","metadata":{},"outputs":[],"source":["## Table of Contents\n"]},{"cell_type":"markdown","id":"bd322df2-2d92-4279-ae89-6e18190d522b","metadata":{},"outputs":[],"source":["
\n","
    \n","
  1. Introduction\n","
  2. Import Libraries\n","
  3. Dataset Analysis\n","
  4. Dataset Preprocessing\n","
  5. Dataset Train/Test Split\n","
  6. Build a Decision Tree Regressor model with Scikit-Learn\n","
  7. Build a Decision Tree Regressor model with Snap ML\n","
  8. Evaluate the Scikit-Learn and Snap ML Decision Tree Regressors\n","
\n","
\n","
\n","
\n"]},{"cell_type":"markdown","id":"90813eb3-7995-4ca4-b2d4-c14dd9420849","metadata":{},"outputs":[],"source":["
\n","

Introduction

\n","
The dataset used in this exercise session is publicly available here: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page (all rights reserved by Taxi & Limousine Commission(TLC), City of New York). The TLC Yellow Taxi Trip Records of June, 2019 are used in this notebook. The prediction of the tip amount can be modeled as a regression problem. To train the model you can use part of the input dataset and the remaining data can be used to assess the quality of the trained model. First, let's download the dataset.\n","
\n","
\n"]},{"cell_type":"code","id":"f5232e30-c688-412d-acc3-b0e949e0cbf6","metadata":{},"outputs":[],"source":["# download June 2020 TLC Yellow Taxi Trip records\n!wget -nc https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/yellow_tripdata_2019-06.csv "]},{"cell_type":"code","id":"5a70fb83-e1ad-44c4-b1fa-3e7400ab1cd0","metadata":{},"outputs":[],"source":["# download June 2020 TLC Yellow Taxi Trip records\n# Uncomment the next line, if working locally\n#!curl https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/yellow_tripdata_2019-06.csv "]},{"cell_type":"markdown","id":"bcb8605f-0e25-4c15-8e20-cc4b5c06134b","metadata":{},"outputs":[],"source":["__Did you know?__ When it comes to Machine Learning, you will most likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](http://cocl.us/ML0101EN-IBM-Offer-CC)\n"]},{"cell_type":"markdown","id":"48ff0f93-0aa5-44fa-8425-fae89f164d1b","metadata":{},"outputs":[],"source":["
\n","

Import Libraries

\n","
\n"]},{"cell_type":"code","id":"4037734a-e643-49d6-8b12-846861b47a27","metadata":{},"outputs":[],"source":["# Snap ML is available on PyPI. To install it simply run the pip command below.\n!pip install snapml==1.8.2"]},{"cell_type":"code","id":"d1c47463-ec81-4817-9792-78564a0f8104","metadata":{},"outputs":[],"source":["# Import the libraries we need to use in this lab\nfrom __future__ import print_function\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\n%matplotlib inline\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import normalize, StandardScaler, MinMaxScaler\nfrom sklearn.utils.class_weight import compute_sample_weight\nfrom sklearn.metrics import mean_squared_error\nimport time\nimport warnings\nimport gc, sys\nwarnings.filterwarnings('ignore')"]},{"cell_type":"markdown","id":"84d3ba10-da07-4ce6-ba49-96ec0010360e","metadata":{},"outputs":[],"source":["
\n","

Dataset Analysis

\n","
\n"]},{"cell_type":"markdown","id":"a3c25570-5897-4776-99e5-df22f4d88adb","metadata":{},"outputs":[],"source":["In this section you will read the dataset in a Pandas dataframe and visualize its content. You will also look at some data statistics.\n","\n","Note: A Pandas dataframe is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure. For more information: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html.\n"]},{"cell_type":"code","id":"a6cd86f6-e1ca-4aa3-a7e7-c37a6482c5ff","metadata":{},"outputs":[],"source":["# read the input data\nraw_data = pd.read_csv('yellow_tripdata_2019-06.csv')\nprint(\"There are \" + str(len(raw_data)) + \" observations in the dataset.\")\nprint(\"There are \" + str(len(raw_data.columns)) + \" variables in the dataset.\")\n\n# display first rows in the dataset\nraw_data.head()"]},{"cell_type":"markdown","id":"d936e4ec-08d8-4af9-a46f-ec9cbae4c56e","metadata":{},"outputs":[],"source":["Each row in the dataset represents a taxi trip. As shown above, each row has 18 variables. One variable is called tip_amount and represents the target variable. Your objective will be to train a model that uses the other variables to predict the value of the tip_amount variable. Let's first clean the dataset and retrieve basic statistics about the target variable.\n"]},{"cell_type":"code","id":"c9b46639-f373-4fee-b3f0-224b80075bc0","metadata":{},"outputs":[],"source":["# some trips report 0 tip. it is assumed that these tips were paid in cash.\n# for this study we drop all these rows\nraw_data = raw_data[raw_data['tip_amount'] > 0]\n\n# we also remove some outliers, namely those where the tip was larger than the fare cost\nraw_data = raw_data[(raw_data['tip_amount'] <= raw_data['fare_amount'])]\n\n# we remove trips with very large fare cost\nraw_data = raw_data[((raw_data['fare_amount'] >=2) & (raw_data['fare_amount'] < 200))]\n\n# we drop variables that include the target variable in it, namely the total_amount\nclean_data = raw_data.drop(['total_amount'], axis=1)\n\n# release memory occupied by raw_data as we do not need it anymore\n# we are dealing with a large dataset, thus we need to make sure we do not run out of memory\ndel raw_data\ngc.collect()\n\n# print the number of trips left in the dataset\nprint(\"There are \" + str(len(clean_data)) + \" observations in the dataset.\")\nprint(\"There are \" + str(len(clean_data.columns)) + \" variables in the dataset.\")\n\nplt.hist(clean_data.tip_amount.values, 16, histtype='bar', facecolor='g')\nplt.show()\n\nprint(\"Minimum amount value is \", np.min(clean_data.tip_amount.values))\nprint(\"Maximum amount value is \", np.max(clean_data.tip_amount.values))\nprint(\"90% of the trips have a tip amount less or equal than \", np.percentile(clean_data.tip_amount.values, 90))"]},{"cell_type":"code","id":"dd9d4b72-7d8e-408f-acc1-f4b537332f08","metadata":{},"outputs":[],"source":["# display first rows in the dataset\nclean_data.head()"]},{"cell_type":"markdown","id":"875ea433-4dee-4a08-9fe0-6f3c382c12a8","metadata":{},"outputs":[],"source":["By looking at the dataset in more detail, we see that it contains information such as pick-up and drop-off dates/times, pick-up and drop-off locations, payment types, driver-reported passenger counts etc. Before actually training a ML model, we will need to preprocess the data. We need to transform the data in a format that will be correctly handled by the models. 
For instance, we need to encode the categorical features.\n"]},{"cell_type":"markdown","id":"1a031178-6028-411e-9249-db470c2ad4c5","metadata":{},"outputs":[],"source":["
\n","

Dataset Preprocessing

\n","
\n"]},{"cell_type":"markdown","id":"48bfdd20-8ce2-4b1f-b7de-5d19db94aca5","metadata":{},"outputs":[],"source":["In this subsection you will prepare the data for training. \n"]},{"cell_type":"code","id":"42549b60-367a-4fce-8fd4-51350b600b0a","metadata":{},"outputs":[],"source":["\n# Convert 'tpep_dropoff_datetime' and 'tpep_pickup_datetime' columns to datetime objects\nclean_data['tpep_dropoff_datetime'] = pd.to_datetime(clean_data['tpep_dropoff_datetime'])\nclean_data['tpep_pickup_datetime'] = pd.to_datetime(clean_data['tpep_pickup_datetime'])\n\n# Extract pickup and dropoff hour\nclean_data['pickup_hour'] = clean_data['tpep_pickup_datetime'].dt.hour\nclean_data['dropoff_hour'] = clean_data['tpep_dropoff_datetime'].dt.hour\n\n# Extract pickup and dropoff day of the week (0 = Monday, 6 = Sunday)\nclean_data['pickup_day'] = clean_data['tpep_pickup_datetime'].dt.weekday\nclean_data['dropoff_day'] = clean_data['tpep_dropoff_datetime'].dt.weekday\n\n# Calculate trip time in seconds\nclean_data['trip_time'] = (clean_data['tpep_dropoff_datetime'] - clean_data['tpep_pickup_datetime']).dt.total_seconds()\n\n# Ideally use the full dataset for this exercise.\n# However, if you run into out-of-memory issues due to the data size, reduce it.\n# For instance, in this example, we use only the first 200,000 samples.\nfirst_n_rows = 200000\nclean_data = clean_data.head(first_n_rows)"]},{"cell_type":"code","id":"40ed6661-c397-4565-9bfb-8d0c00bca8f1","metadata":{},"outputs":[],"source":["# drop the pickup and dropoff datetimes\nclean_data = clean_data.drop(['tpep_pickup_datetime', 'tpep_dropoff_datetime'], axis=1)\n\n# some features are categorical, we need to encode them\n# to encode them we use one-hot encoding from the Pandas package\nget_dummy_col = [\"VendorID\",\"RatecodeID\",\"store_and_fwd_flag\",\"PULocationID\", \"DOLocationID\",\"payment_type\", \"pickup_hour\", \"dropoff_hour\", \"pickup_day\", \"dropoff_day\"]\nproc_data = pd.get_dummies(clean_data, columns = get_dummy_col)\n\n# release memory occupied by clean_data as we do not need it anymore\n# we are dealing with a large dataset, thus we need to make sure we do not run out of memory\ndel clean_data\ngc.collect()"]},{"cell_type":"code","id":"a2dd57d3-1e29-47d7-8d75-73368cae8031","metadata":{},"outputs":[],"source":["# extract the labels from the dataframe\ny = proc_data[['tip_amount']].values.astype('float32')\n\n# drop the target variable from the feature matrix\nproc_data = proc_data.drop(['tip_amount'], axis=1)\n\n# get the feature matrix used for training\nX = proc_data.values\n\n# normalize the feature matrix\nX = normalize(X, axis=1, norm='l1', copy=False)\n\n# print the shape of the features matrix and the labels vector\nprint('X.shape=', X.shape, 'y.shape=', y.shape)"]},{"cell_type":"markdown","id":"4d9fc3c3-4955-48ba-83aa-4fefe74c335c","metadata":{},"outputs":[],"source":["
\n","

Dataset Train/Test Split

\n","
\n"]},{"cell_type":"markdown","id":"f1d090f7-f4e4-4593-9b6c-30710329c4ed","metadata":{},"outputs":[],"source":["Now that the dataset is ready for building the classification models, you need to first divide the pre-processed dataset into a subset to be used for training the model (the train set) and a subset to be used for evaluating the quality of the model (the test set).\n"]},{"cell_type":"code","id":"a9f71096-5710-4f07-9678-badeffce6d85","metadata":{},"outputs":[],"source":["X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)\nprint('X_train.shape=', X_train.shape, 'Y_train.shape=', y_train.shape)\nprint('X_test.shape=', X_test.shape, 'Y_test.shape=', y_test.shape)"]},{"cell_type":"markdown","id":"fb26b884-ca67-4a0f-a8f6-61a4b2d87469","metadata":{},"outputs":[],"source":["
\n","

Build a Decision Tree Regressor model with Scikit-Learn

\n","
\n"]},{"cell_type":"code","id":"d1d7794e-6478-4b6b-9f71-944e4f568a39","metadata":{},"outputs":[],"source":["# import the Decision Tree Regression Model from scikit-learn\nfrom sklearn.tree import DecisionTreeRegressor\n\n# for reproducible output across multiple function calls, set random_state to a given integer value\nsklearn_dt = DecisionTreeRegressor(max_depth=8, random_state=35)\n\n# train a Decision Tree Regressor using scikit-learn\nt0 = time.time()\nsklearn_dt.fit(X_train, y_train)\nsklearn_time = time.time()-t0\nprint(\"[Scikit-Learn] Training time (s): {0:.5f}\".format(sklearn_time))"]},{"cell_type":"markdown","id":"01201666-9a1a-4299-a21b-994c16da06c6","metadata":{},"outputs":[],"source":["
\n","

Build a Decision Tree Regressor model with Snap ML

\n","
\n"]},{"cell_type":"code","id":"e405dcbe-e831-47b6-8b5e-8f0ea6bb8fda","metadata":{},"outputs":[],"source":["# import the Decision Tree Regressor Model from Snap ML\nfrom snapml import DecisionTreeRegressor\n\n# in contrast to sklearn's Decision Tree, Snap ML offers multi-threaded CPU/GPU training \n# to use the GPU, one needs to set the use_gpu parameter to True\n# snapml_dt = DecisionTreeRegressor(max_depth=4, random_state=45, use_gpu=True)\n\n# to set the number of CPU threads used at training time, one needs to set the n_jobs parameter\n# for reproducible output across multiple function calls, set random_state to a given integer value\nsnapml_dt = DecisionTreeRegressor(max_depth=8, random_state=45, n_jobs=4)\n\n# train a Decision Tree Regressor model using Snap ML\nt0 = time.time()\nsnapml_dt.fit(X_train, y_train)\nsnapml_time = time.time()-t0\nprint(\"[Snap ML] Training time (s): {0:.5f}\".format(snapml_time))"]},{"cell_type":"markdown","id":"f5e2fabb-f4a1-4a3a-b647-ad6fbf4eb5a4","metadata":{},"outputs":[],"source":["
\n","

Evaluate the Scikit-Learn and Snap ML Decision Tree Regressor Models

\n","
\n"]},{"cell_type":"code","id":"f0c9c9b4-90c2-4316-ab48-5f44717fb816","metadata":{},"outputs":[],"source":["# Snap ML vs Scikit-Learn training speedup\ntraining_speedup = sklearn_time/snapml_time\nprint('[Decision Tree Regressor] Snap ML vs. Scikit-Learn speedup : {0:.2f}x '.format(training_speedup))\n\n# run inference using the sklearn model\nsklearn_pred = sklearn_dt.predict(X_test)\n\n# evaluate mean squared error on the test dataset\nsklearn_mse = mean_squared_error(y_test, sklearn_pred)\nprint('[Scikit-Learn] MSE score : {0:.3f}'.format(sklearn_mse))\n\n# run inference using the Snap ML model\nsnapml_pred = snapml_dt.predict(X_test)\n\n# evaluate mean squared error on the test dataset\nsnapml_mse = mean_squared_error(y_test, snapml_pred)\nprint('[Snap ML] MSE score : {0:.3f}'.format(snapml_mse))"]},{"cell_type":"markdown","id":"6bde2e0b-a0c9-4f86-b82b-142af1448ae7","metadata":{},"outputs":[],"source":["As shown above both decision tree models provide the same score on the test dataset. However Snap ML runs the training routine faster than Scikit-Learn. This is one of the advantages of using Snap ML: acceleration of training of classical machine learning models, such as linear and tree-based models. For more Snap ML examples, please visit https://github.com/IBM/snapml-examples. Moreover, as shown above, not only is Snap ML seemlessly accelerating scikit-learn applications, but the library's Python API is also compatible with scikit-learn metrics and data preprocessors.\n"]},{"cell_type":"markdown","id":"46c658c1-7ddd-4909-8806-9111702e9f6f","metadata":{},"outputs":[],"source":["## Practice\n"]},{"cell_type":"markdown","id":"fb72a990-5486-4919-8e63-df70177b7736","metadata":{},"outputs":[],"source":["Lets train a `SnapML` `Decision Tree Regressor` with the `max_depth` parameter set to `12`, `random_state` set to `45`, and `n_jobs` set to `4` and compare its Mean Squared Error to the decision tree regressor we trained previously\n"]},{"cell_type":"markdown","id":"16de0b7b-f15a-4d45-9f6d-db6f248baee5","metadata":{},"outputs":[],"source":["Start by creating and training the decision tree\n"]},{"cell_type":"code","id":"025eeb4f-669d-4d51-b9e4-e15672389809","metadata":{},"outputs":[],"source":[""]},{"cell_type":"markdown","id":"7343e6a7-27f4-4672-a484-1b3e210462ed","metadata":{},"outputs":[],"source":["
Click here for the solution\n","\n","```python \n","tree = DecisionTreeRegressor(max_depth=12, random_state=45, n_jobs=4)\n","\n","tree.fit(X_train, y_train)\n","\n","```\n","\n","
\n"]},{"cell_type":"markdown","id":"9fdf13f9-44cb-4c4d-aad1-5927d8be7a78","metadata":{},"outputs":[],"source":["Now calculate the Mean Squared Error on the test data\n"]},{"cell_type":"code","id":"2088afe7-9e64-41ac-8a59-86d46885d024","metadata":{},"outputs":[],"source":[""]},{"cell_type":"markdown","id":"fb844bdb-5794-4582-a035-67da6b7bcff7","metadata":{},"outputs":[],"source":["
Click here for the solution\n","\n","```python \n","pred = tree.predict(X_test)\n","\n","print(\"MSE: \", mean_squared_error(y_test, pred))\n","\n","```\n","\n","
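To see the effect of tree depth more systematically, a small optional sketch (assuming the Snap ML DecisionTreeRegressor import and the train/test split from above) that compares the test MSE across several depths:

```python
from snapml import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Deeper trees fit the training data more closely but can overfit,
# which shows up as a higher MSE on the held-out test set.
for depth in [4, 8, 12]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=45, n_jobs=4)
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print("max_depth={:2d}  test MSE={:.3f}".format(depth, mse))
```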
\n"]},{"cell_type":"markdown","id":"860e331b-7609-4455-be18-5b3b98cb8626","metadata":{},"outputs":[],"source":["We learned that increasing the `max_depth` parameter to `12` increases the MSE\n"]},{"cell_type":"markdown","id":"13f4a818-28b9-4ec8-b182-9a338c62a796","metadata":{},"outputs":[],"source":["## Authors\n"]},{"cell_type":"markdown","id":"7b2c099b-c841-464e-b97a-d0d0536fa592","metadata":{},"outputs":[],"source":["Andreea Anghel\n"]},{"cell_type":"markdown","id":"e2152d4f-e7a4-4668-b038-1b28f9215bbc","metadata":{},"outputs":[],"source":["### Other Contributors\n"]},{"cell_type":"markdown","id":"507d7ccc-02f5-435b-871a-0ea8e0a34cec","metadata":{},"outputs":[],"source":["Sangeeth Keeriyadath \n","\n","Joseph Santarcangelo\n","\n","Azim Hirjani\n"]},{"cell_type":"markdown","id":"f4392bd5-72e2-4e31-bb7d-de821d7c99b1","metadata":{},"outputs":[],"source":["## Change Log\n"]},{"cell_type":"markdown","id":"49429e6a-42b1-4470-836d-13ca20a32755","metadata":{},"outputs":[],"source":["| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n","|---|---|---|---|\n","| 2021-08-31 | 0.1 | AAN | Created Lab Content |\n"]},{"cell_type":"markdown","id":"104853c8-3aef-4e66-b7d6-4e228d820a89","metadata":{},"outputs":[],"source":[" Copyright © 2021 IBM Corporation. This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license/).\n"]}],"metadata":{"kernelspec":{"display_name":"Python","language":"python","name":"conda-env-python-py"},"language_info":{"name":""}},"nbformat":4,"nbformat_minor":4} -------------------------------------------------------------------------------- /Classification/K-Nearest-neighbors-CustCat.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["

\n"," \n"," \"Skills\n"," \n","

\n","\n","# K-Nearest Neighbors\n","\n","Estimated time needed: **25** minutes\n","\n","## Objectives\n","\n","After completing this lab you will be able to:\n","\n","* Use K Nearest neighbors to classify data\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["In this Lab you will load a customer dataset, fit the data, and use K-Nearest Neighbors to predict a data point. But what is **K-Nearest Neighbors**?\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["**K-Nearest Neighbors** is a supervised learning algorithm. Where the data is 'trained' with data points corresponding to their classification. To predict the class of a given data point, it takes into account the classes of the 'K' nearest data points and chooses the class in which the majority of the 'K' nearest data points belong to as the predicted class.\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["### Here's an visualization of the K-Nearest Neighbors algorithm.\n","\n","\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["In this case, we have data points of Class A and B. We want to predict what the star (test data point) is. If we consider a k value of 3 (3 nearest data points), we will obtain a prediction of Class B. Yet if we consider a k value of 6, we will obtain a prediction of Class A.\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["In this sense, it is important to consider the value of k. Hopefully from this diagram, you should get a sense of what the K-Nearest Neighbors algorithm is. It considers the 'K' Nearest Neighbors (data points) when it predicts the classification of the test point.\n"]},{"cell_type":"markdown","metadata":{},"source":["

Table of contents

\n","\n","
\n","
    \n","
  1. About the dataset\n","
  2. Data Visualization and Analysis\n","
  3. Classification\n","
\n","
\n","
\n","
\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["!pip install scikit-learn==0.23.1"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Let's load required libraries\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["import numpy as np\n","import matplotlib.pyplot as plt\n","import pandas as pd\n","import numpy as np\n","from sklearn import preprocessing\n","%matplotlib inline"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["
\n","

About the dataset

\n","
\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Imagine a telecommunications provider has segmented its customer base by service usage patterns, categorizing the customers into four groups. If demographic data can be used to predict group membership, the company can customize offers for individual prospective customers. It is a classification problem. That is, given the dataset, with predefined labels, we need to build a model to be used to predict class of a new or unknown case.\n","\n","The example focuses on using demographic data, such as region, age, and marital, to predict usage patterns.\n","\n","The target field, called **custcat**, has four possible values that correspond to the four customer groups, as follows:\n","1- Basic Service\n","2- E-Service\n","3- Plus Service\n","4- Total Service\n","\n","Our objective is to build a classifier, to predict the class of unknown cases. We will use a specific type of classification called K nearest neighbour.\n"]},{"cell_type":"markdown","metadata":{},"source":["**Did you know?** When it comes to Machine Learning, you will likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](http://cocl.us/ML0101EN-IBM-Offer-CC)\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["### Load Data \n"]},{"cell_type":"markdown","metadata":{},"source":["Let's read the data using pandas library and print the first five rows.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/teleCust1000t.csv')\n","df.head()"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["
\n","

Data Visualization and Analysis

\n","
\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["#### Let’s see how many of each class is in our data set\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["df['custcat'].value_counts()"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["#### 281 Plus Service, 266 Basic-service, 236 Total Service, and 217 E-Service customers\n"]},{"cell_type":"markdown","metadata":{},"source":["You can easily explore your data using visualization techniques:\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["df.hist(column='income', bins=50)"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["### Feature set\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Let's define feature sets, X:\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["df.columns"]},{"cell_type":"markdown","metadata":{},"source":["To use scikit-learn library, we have to convert the Pandas data frame to a Numpy array:\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["X = df[['region', 'tenure','age', 'marital', 'address', 'income', 'ed', 'employ','retire', 'gender', 'reside']] .values #.astype(float)\n","X[0:5]\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["What are our labels?\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["y = df['custcat'].values\n","y[0:5]"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["## Normalize Data\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Data Standardization gives the data zero mean and unit variance, it is good practice, especially for algorithms such as KNN which is based on the distance of data points:\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))\n","X[0:5]"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["### Train Test Split\n","\n","Out of Sample Accuracy is the percentage of correct predictions that the model makes on data that the model has NOT been trained on. Doing a train and test on the same dataset will most likely have low out-of-sample accuracy, due to the likelihood of our model overfitting.\n","\n","It is important that our models have a high, out-of-sample accuracy, because the purpose of any model, of course, is to make correct predictions on unknown data. So how can we improve out-of-sample accuracy? One way is to use an evaluation approach called Train/Test Split.\n","Train/Test Split involves splitting the dataset into training and testing sets respectively, which are mutually exclusive. 
After which, you train with the training set and test with the testing set.\n","\n","This will provide a more accurate evaluation on out-of-sample accuracy because the testing dataset is not part of the dataset that has been used to train the model. It is more realistic for the real world problems.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["from sklearn.model_selection import train_test_split\n","X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)\n","print ('Train set:', X_train.shape, y_train.shape)\n","print ('Test set:', X_test.shape, y_test.shape)"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["
\n","

Classification

\n","
\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["

K nearest neighbor (KNN)

\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["#### Import library\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Classifier implementing the k-nearest neighbors vote.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["from sklearn.neighbors import KNeighborsClassifier"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["### Training\n","\n","Let's start the algorithm with k=4 for now:\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["k = 4\n","#Train Model and Predict \n","neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)\n","neigh"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["### Predicting\n","\n","We can use the model to make predictions on the test set:\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["yhat = neigh.predict(X_test)\n","yhat[0:5]"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["### Accuracy evaluation\n","\n","In multilabel classification, **accuracy classification score** is a function that computes subset accuracy. This function is equal to the jaccard_score function. Essentially, it calculates how closely the actual labels and predicted labels are matched in the test set.\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["from sklearn import metrics\n","print(\"Train set Accuracy: \", metrics.accuracy_score(y_train, neigh.predict(X_train)))\n","print(\"Test set Accuracy: \", metrics.accuracy_score(y_test, yhat))"]},{"cell_type":"markdown","metadata":{},"source":["## Practice\n","\n","Can you build the model again, but this time with k=6?\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# write your code here\n","\n","\n"]},{"cell_type":"markdown","metadata":{},"source":["
Click here for the solution\n","\n","```python\n","k = 6\n","neigh6 = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)\n","yhat6 = neigh6.predict(X_test)\n","print(\"Train set Accuracy: \", metrics.accuracy_score(y_train, neigh6.predict(X_train)))\n","print(\"Test set Accuracy: \", metrics.accuracy_score(y_test, yhat6))\n","\n","```\n","\n","
\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["#### What about other K?\n","\n","K in KNN, is the number of nearest neighbors to examine. It is supposed to be specified by the user. So, how can we choose right value for K?\n","The general solution is to reserve a part of your data for testing the accuracy of the model. Then choose k =1, use the training part for modeling, and calculate the accuracy of prediction using all samples in your test set. Repeat this process, increasing the k, and see which k is the best for your model.\n","\n","We can calculate the accuracy of KNN for different values of k.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["Ks = 10\n","mean_acc = np.zeros((Ks-1))\n","std_acc = np.zeros((Ks-1))\n","\n","for n in range(1,Ks):\n"," \n"," #Train Model and Predict \n"," neigh = KNeighborsClassifier(n_neighbors = n).fit(X_train,y_train)\n"," yhat=neigh.predict(X_test)\n"," mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)\n","\n"," \n"," std_acc[n-1]=np.std(yhat==y_test)/np.sqrt(yhat.shape[0])\n","\n","mean_acc"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["#### Plot the model accuracy for a different number of neighbors.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["plt.plot(range(1,Ks),mean_acc,'g')\n","plt.fill_between(range(1,Ks),mean_acc - 1 * std_acc,mean_acc + 1 * std_acc, alpha=0.10)\n","plt.fill_between(range(1,Ks),mean_acc - 3 * std_acc,mean_acc + 3 * std_acc, alpha=0.10,color=\"green\")\n","plt.legend(('Accuracy ', '+/- 1xstd','+/- 3xstd'))\n","plt.ylabel('Accuracy ')\n","plt.xlabel('Number of Neighbors (K)')\n","plt.tight_layout()\n","plt.show()"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["print( \"The best accuracy was with\", mean_acc.max(), \"with k=\", mean_acc.argmax()+1) "]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["

Want to learn more?

\n","\n","IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n","\n","Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n"]},{"cell_type":"markdown","metadata":{},"source":["### Thank you for completing this lab!\n","\n","## Author\n","\n","Saeed Aghabozorgi\n","\n","### Other Contributors\n","\n","Joseph Santarcangelo\n","\n","## Change Log\n","\n","| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n","| ----------------- | ------- | ---------- | ---------------------------------- |\n","| 2021-01-21 | 2.4 | Lakshmi | Updated sklearn library |\n","| 2020-11-20 | 2.3 | Lakshmi | Removed unused imports |\n","| 2020-11-17 | 2.2 | Lakshmi | Changed plot function of KNN |\n","| 2020-11-03 | 2.1 | Lakshmi | Changed URL of csv |\n","| 2020-08-27 | 2.0 | Lavanya | Moved lab to course repo in GitLab |\n","| | | | |\n","| | | | |\n","\n","##

© IBM Corporation 2020. All rights reserved.

\n"]}],"metadata":{"kernelspec":{"display_name":"Python","language":"python","name":"conda-env-python-py"},"language_info":{"name":""}},"nbformat":4,"nbformat_minor":4} -------------------------------------------------------------------------------- /Classification/Regression_Trees.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","id":"c0ad8755-8b97-4ba2-8077-feaab4e67120","metadata":{},"outputs":[],"source":["

\n"," \n"," \"Skills\n"," \n","

\n"]},{"cell_type":"markdown","id":"ed575bab-02c5-4413-a2c5-32f19b057fd0","metadata":{},"outputs":[],"source":["# **Regression Trees**\n"]},{"cell_type":"markdown","id":"8499c478-5cf0-48c1-9682-75b6202a6f92","metadata":{},"outputs":[],"source":["Estimated time needed: **20** minutes\n"]},{"cell_type":"markdown","id":"833a3b9c-19d3-4f1f-8b92-144e630473a4","metadata":{},"outputs":[],"source":["In this lab you will learn how to implement regression trees using ScikitLearn. We will show what parameters are important, how to train a regression tree, and finally how to determine our regression trees accuracy.\n"]},{"cell_type":"markdown","id":"5aeef315-a125-4fd2-8c2c-6dada7823042","metadata":{},"outputs":[],"source":["## Objectives\n"]},{"cell_type":"markdown","id":"154e23dc-d519-4036-b761-a2dee369e7be","metadata":{},"outputs":[],"source":["After completing this lab you will be able to:\n"]},{"cell_type":"markdown","id":"a82124ea-309b-4eaf-8b7e-e1ddb9d14080","metadata":{},"outputs":[],"source":["* Train a Regression Tree\n","* Evaluate a Regression Trees Performance\n"]},{"cell_type":"markdown","id":"cf0184a5-3fef-466f-9379-9d1b7d49075b","metadata":{},"outputs":[],"source":["----\n"]},{"cell_type":"markdown","id":"5595e267-c0cb-4b2b-80c4-a5a9773de994","metadata":{},"outputs":[],"source":["## Setup\n"]},{"cell_type":"markdown","id":"9fb5180c-fb19-4067-8c6d-fc341d14b68f","metadata":{},"outputs":[],"source":["For this lab, we are going to be using Python and several Python libraries. Some of these libraries might be installed in your lab environment or in SN Labs. Others may need to be installed by you. The cells below will install these libraries when executed.\n"]},{"cell_type":"code","id":"f867f02c-e05f-4a4a-84e6-2af7b6c62dd9","metadata":{},"outputs":[],"source":["# Install libraries not already in the environment using pip\n#!pip install pandas==1.3.4\n#!pip install sklearn==0.20.1"]},{"cell_type":"code","id":"42e748ae-a5c5-4fb3-8bb9-e98ff2eab62c","metadata":{},"outputs":[],"source":["# Pandas will allow us to create a dataframe of the data so it can be used and manipulated\nimport pandas as pd\n# Regression Tree Algorithm\nfrom sklearn.tree import DecisionTreeRegressor\n# Split our data into a training and testing data\nfrom sklearn.model_selection import train_test_split"]},{"cell_type":"markdown","id":"f0bb43f4-b155-42e9-901d-5dc3f46aaf8e","metadata":{},"outputs":[],"source":["## About the Dataset\n"]},{"cell_type":"markdown","id":"f1bed413-9dbd-466e-87b1-9a501559c152","metadata":{},"outputs":[],"source":["Imagine you are a data scientist working for a real estate company that is planning to invest in Boston real estate. 
You have collected information about various areas of Boston and are tasked with created a model that can predict the median price of houses for that area so it can be used to make offers.\n","\n","The dataset had information on areas/towns not individual houses, the features are\n","\n","CRIM: Crime per capita\n","\n","ZN: Proportion of residential land zoned for lots over 25,000 sq.ft.\n","\n","INDUS: Proportion of non-retail business acres per town\n","\n","CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n","\n","NOX: Nitric oxides concentration (parts per 10 million)\n","\n","RM: Average number of rooms per dwelling\n","\n","AGE: Proportion of owner-occupied units built prior to 1940\n","\n","DIS: Weighted distances to five Boston employment centers\n","\n","RAD: Index of accessibility to radial highways\n","\n","TAX: Full-value property-tax rate per $10,000\n","\n","PTRAIO: Pupil-teacher ratio by town\n","\n","LSTAT: Percent lower status of the population\n","\n","MEDV: Median value of owner-occupied homes in $1000s\n"]},{"cell_type":"markdown","id":"ebf41f58-6465-44c8-a975-2ffd1c2eb73a","metadata":{},"outputs":[],"source":["## Read the Data\n"]},{"cell_type":"markdown","id":"39549fe4-5d5d-40b9-a879-df93660d224c","metadata":{},"outputs":[],"source":["Lets read in the data we have downloaded\n"]},{"cell_type":"code","id":"906b2cb2-6509-4360-8559-6bb86fb5f1c9","metadata":{},"outputs":[],"source":["data = pd.read_csv(\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/real_estate_data.csv\")"]},{"cell_type":"code","id":"7f29b4cc-962d-4540-9560-fe504c47deda","metadata":{},"outputs":[],"source":["data.head()"]},{"cell_type":"markdown","id":"c6188f8a-fce7-4123-8038-551cfa701450","metadata":{},"outputs":[],"source":["Now lets learn about the size of our data, there are 506 rows and 13 columns\n"]},{"cell_type":"code","id":"b049b9f5-b0b9-4d3c-9ca4-d47151690d96","metadata":{},"outputs":[],"source":["data.shape"]},{"cell_type":"markdown","id":"bf668c7a-31ab-494c-8b3d-7196aa7d7db5","metadata":{},"outputs":[],"source":["Most of the data is valid, there are rows with missing values which we will deal with in pre-processing\n"]},{"cell_type":"code","id":"190d749b-fd37-4da5-9e47-17896dc58448","metadata":{},"outputs":[],"source":["data.isna().sum()"]},{"cell_type":"markdown","id":"8ac246a4-2bf2-41ef-ae20-c8fb94afa75a","metadata":{},"outputs":[],"source":["## Data Pre-Processing\n"]},{"cell_type":"markdown","id":"d0c70b68-6744-44ae-b488-63d7d60cb2cd","metadata":{},"outputs":[],"source":["First lets drop the rows with missing values because we have enough data in our dataset\n"]},{"cell_type":"code","id":"bd28e2c0-5b0b-411f-9012-226426b5c9e1","metadata":{},"outputs":[],"source":["data.dropna(inplace=True)"]},{"cell_type":"markdown","id":"7a79790b-8e9f-46cf-8b93-79008c0ff44c","metadata":{},"outputs":[],"source":["Now we can see our dataset has no missing values\n"]},{"cell_type":"code","id":"879c5378-6fb2-45ed-992a-7447634e1f4e","metadata":{},"outputs":[],"source":["data.isna().sum()"]},{"cell_type":"markdown","id":"226398cb-146e-4f0c-9244-c31b974e6d7d","metadata":{},"outputs":[],"source":["Lets split the dataset into our features and what we are predicting (target)\n"]},{"cell_type":"code","id":"39ca2cd0-99e5-420c-8cdb-ec0755a05267","metadata":{},"outputs":[],"source":["X = data.drop(columns=[\"MEDV\"])\nY = 
data[\"MEDV\"]"]},{"cell_type":"code","id":"82688cb1-3bd3-4ec3-8707-b647beed6ed0","metadata":{},"outputs":[],"source":["X.head()"]},{"cell_type":"code","id":"0d4ee0af-d37f-4e79-b09e-eea46f0fe598","metadata":{},"outputs":[],"source":["Y.head()"]},{"cell_type":"markdown","id":"75ec2742-888e-499a-aea2-ac255ac1b58d","metadata":{},"outputs":[],"source":["Finally lets split our data into a training and testing dataset using `train_test_split` from `sklearn.model_selection`\n"]},{"cell_type":"code","id":"6d3ce35e-e2a0-41ad-a506-bd99068338a6","metadata":{},"outputs":[],"source":["X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2, random_state=1)"]},{"cell_type":"markdown","id":"bcc9dfe5-68c9-4e98-bcfb-bc9dcfaedb84","metadata":{},"outputs":[],"source":["## Create Regression Tree\n"]},{"cell_type":"markdown","id":"62bd6dcf-3762-4b68-9861-5c3f9aef85dd","metadata":{},"outputs":[],"source":["Regression Trees are implemented using `DecisionTreeRegressor` from `sklearn.tree`\n","\n","The important parameters of `DecisionTreeRegressor` are\n","\n","`criterion`: {\"mse\", \"friedman_mse\", \"mae\", \"poisson\"} - The function used to measure error\n","\n","`max_depth` - The max depth the tree can be\n","\n","`min_samples_split` - The minimum number of samples required to split a node\n","\n","`min_samples_leaf` - The minimum number of samples that a leaf can contain\n","\n","`max_features`: {\"auto\", \"sqrt\", \"log2\"} - The number of feature we examine looking for the best one, used to speed up training\n"]},{"cell_type":"markdown","id":"4f6fd6ac-e155-4f6f-8144-83db47accb86","metadata":{},"outputs":[],"source":["First lets start by creating a `DecisionTreeRegressor` object, setting the `criterion` parameter to `mse` for Mean Squared Error\n"]},{"cell_type":"code","id":"7c41b5fc-cc6a-4617-8c97-6a668ae686e9","metadata":{},"outputs":[],"source":["regression_tree = DecisionTreeRegressor(criterion = 'mse')"]},{"cell_type":"markdown","id":"bed5ee57-d52c-4c72-b30e-7afb1e3ea793","metadata":{},"outputs":[],"source":["## Training\n"]},{"cell_type":"markdown","id":"f47f87cb-d4a3-4094-9960-cf62ed2f7911","metadata":{},"outputs":[],"source":["Now lets train our model using the `fit` method on the `DecisionTreeRegressor` object providing our training data\n"]},{"cell_type":"code","id":"4d57ac88-08f6-41b7-acab-41a5ca5b5792","metadata":{},"outputs":[],"source":["regression_tree.fit(X_train, Y_train)"]},{"cell_type":"markdown","id":"309d352e-1185-46ab-8cbc-96b5f14e3f6e","metadata":{},"outputs":[],"source":["## Evaluation\n"]},{"cell_type":"markdown","id":"6b32ac69-16c7-4a6c-b038-5ec72b58ffab","metadata":{},"outputs":[],"source":["To evaluate our dataset we will use the `score` method of the `DecisionTreeRegressor` object providing our testing data, this number is the $R^2$ value which indicates the coefficient of determination\n"]},{"cell_type":"code","id":"5db8c08b-bb6d-40e7-b40e-6554c8f5cb13","metadata":{},"outputs":[],"source":["regression_tree.score(X_test, Y_test)"]},{"cell_type":"markdown","id":"31274527-03bc-4b0a-bfe1-016c2f27963c","metadata":{},"outputs":[],"source":["We can also find the average error in our testing set which is the average error in median home value prediction\n"]},{"cell_type":"code","id":"29d19c03-bac9-4bf7-b602-a007c6cbedcc","metadata":{},"outputs":[],"source":["prediction = regression_tree.predict(X_test)\n\nprint(\"$\",(prediction - Y_test).abs().mean()*1000)"]},{"cell_type":"markdown","id":"15c3e467-a253-4532-9dc7-a73a53bfdca9","metadata":{},"outputs":[],"source":["## 
Exercise\n"]},{"cell_type":"markdown","id":"a41af9c9-2d60-47f7-810b-7a2001cecb81","metadata":{},"outputs":[],"source":["Train a regression tree using the `criterion` `mae`, then report its $R^2$ value and average error.\n"]},{"cell_type":"code","id":"9ca0b569-36bf-40ec-b8ae-9b509c14eb75","metadata":{},"outputs":[],"source":[""]},{"cell_type":"markdown","id":"06a58777-9654-4c0d-b125-17338ad2bb26","metadata":{},"outputs":[],"source":["
Click here for the solution\n","\n","```python\n","regression_tree = DecisionTreeRegressor(criterion = \"mae\")\n","\n","regression_tree.fit(X_train, Y_train)\n","\n","print(regression_tree.score(X_test, Y_test))\n","\n","prediction = regression_tree.predict(X_test)\n","\n","print(\"$\",(prediction - Y_test).abs().mean()*1000)\n","\n","```\n","\n","
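As a follow-up to the exercise, the parameters listed earlier (for example `max_depth`) can be compared with a short loop, and the value returned by `score` can be reproduced by hand, since it is simply the coefficient of determination $R^2 = 1 - SS_{res}/SS_{tot}$. This is a minimal sketch that assumes the `X_train`, `X_test`, `Y_train`, `Y_test` variables from the split above (note that on scikit-learn 1.2+ the criteria are named `squared_error`/`absolute_error` rather than `mse`/`mae`):

```python
# Compare a few tree depths on the held-out test set
for depth in [2, 4, 6, None]:
    tree_model = DecisionTreeRegressor(max_depth=depth, random_state=1)
    tree_model.fit(X_train, Y_train)
    print("max_depth =", depth, "-> R^2 =", round(tree_model.score(X_test, Y_test), 3))

# score() returns the coefficient of determination: R^2 = 1 - SS_res / SS_tot
prediction = tree_model.predict(X_test)
ss_res = ((Y_test - prediction) ** 2).sum()
ss_tot = ((Y_test - Y_test.mean()) ** 2).sum()
print("Manual R^2 for the last tree:", round(1 - ss_res / ss_tot, 3))
```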
\n"]},{"cell_type":"markdown","id":"5dd5011c-7d00-418d-a42c-c0ff3f6bdec3","metadata":{},"outputs":[],"source":["## Authors\n"]},{"cell_type":"markdown","id":"1772ad6a-d234-4101-bd32-161aac675ce8","metadata":{},"outputs":[],"source":["Azim Hirjani\n"]},{"cell_type":"markdown","id":"aed4efe8-1d3f-47f7-8982-cc43fa3f8308","metadata":{},"outputs":[],"source":["## Change Log\n"]},{"cell_type":"markdown","id":"46d17e22-4fba-4f3c-b912-0180a78fe52b","metadata":{},"outputs":[],"source":["|Date (YYYY-MM-DD)|Version|Changed By|Change Description|\n","|-|-|-|-|\n","|2020-07-20|0.2|Azim|Modified Multiple Areas|\n","|2020-07-17|0.1|Azim|Created Lab Template|\n"]},{"cell_type":"markdown","id":"c51c0ff1-980c-475a-87ed-2d32a9e1a209","metadata":{},"outputs":[],"source":["Copyright © 2020 IBM Corporation. All rights reserved.\n"]}],"metadata":{"kernelspec":{"display_name":"Python","language":"python","name":"conda-env-python-py"},"language_info":{"name":""}},"nbformat":4,"nbformat_minor":4} -------------------------------------------------------------------------------- /Classification/classification_tree_svm.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","id":"0855f111-4f67-4e0b-8446-0efc131b5801","metadata":{},"outputs":[],"source":["
\n"," \"cognitiveclass.ai\n","
\n"]},{"cell_type":"markdown","id":"1d21d6d9-bb1e-44f8-bd26-82857cfb4bf2","metadata":{},"outputs":[],"source":["# **Credit Card Fraud Detection using Scikit-Learn and Snap ML**\n"]},{"cell_type":"markdown","id":"14bce88c-f3b2-4f42-b58c-9548741e5864","metadata":{},"outputs":[],"source":["Estimated time needed: **30** minutes\n"]},{"cell_type":"markdown","id":"71744302-bc96-4a6a-833d-0d24786c25b0","metadata":{},"outputs":[],"source":["In this exercise session you will consolidate your machine learning (ML) modeling skills by using two popular classification models to recognize fraudulent credit card transactions. These models are: Decision Tree and Support Vector Machine. You will use a real dataset to train each of these models. The dataset includes information about \n","transactions made by credit cards in September 2013 by European cardholders. You will use the trained model to assess if a credit card transaction is legitimate or not.\n","\n","In the current exercise session, you will practice not only the Scikit-Learn Python interface, but also the Python API offered by the Snap Machine Learning (Snap ML) library. Snap ML is a high-performance IBM library for ML modeling. It provides highly-efficient CPU/GPU implementations of linear models and tree-based models. Snap ML not only accelerates ML algorithms through system awareness, but it also offers novel ML algorithms with best-in-class accuracy. For more information, please visit [snapml](https://ibm.biz/BdPfxy) information page.\n"]},{"cell_type":"markdown","id":"307a1271-c240-445c-bfc0-4b45343045fb","metadata":{},"outputs":[],"source":["## Objectives\n"]},{"cell_type":"markdown","id":"478f0e09-219a-45c7-98d0-a92cce9325c4","metadata":{},"outputs":[],"source":["After completing this lab you will be able to:\n"]},{"cell_type":"markdown","id":"a5f254a3-2b05-46b1-b18e-fe7677c0d989","metadata":{},"outputs":[],"source":["* Perform basic data preprocessing in Python\n","* Model a classification task using the Scikit-Learn and Snap ML Python APIs\n","* Train Suppport Vector Machine and Decision Tree models using Scikit-Learn and Snap ML\n","* Run inference and assess the quality of the trained models\n"]},{"cell_type":"markdown","id":"566afc59-8782-482b-b138-c88b57e206ce","metadata":{},"outputs":[],"source":["## Table of Contents\n"]},{"cell_type":"markdown","id":"435af5cb-9ef4-4f5b-a364-4b9a89a85643","metadata":{},"outputs":[],"source":["
\n","
    \n","
  1. Introduction\n","
  2. Import Libraries\n","
  3. Dataset Analysis\n","
  4. Dataset Preprocessing\n","
  5. Dataset Train/Test Split\n","
  6. Build a Decision Tree Classifier model with Scikit-Learn\n","
  7. Build a Decision Tree Classifier model with Snap ML\n","
  8. Evaluate the Scikit-Learn and Snap ML Decision Tree Classifiers\n","
  9. Build a Support Vector Machine model with Scikit-Learn\n","
  10. Build a Support Vector Machine model with Snap ML\n","
  11. Evaluate the Scikit-Learn and Snap ML Support Vector Machine Models\n","
\n","
\n","
\n","
\n"]},{"cell_type":"markdown","id":"4db0f36a-acb5-48dc-9f0e-7198f4a9d243","metadata":{},"outputs":[],"source":["
\n","

Introduction

\n","
Imagine that you work for a financial institution and part of your job is to build a model that predicts if a credit card transaction is fraudulent or not. You can model the problem as a binary classification problem. A transaction belongs to the positive class (1) if it is a fraud, otherwise it belongs to the negative class (0).\n","
\n","
You have access to transactions that occurred over a certain period of time. The vast majority of transactions are legitimate, and only a small fraction are fraudulent. Thus, you typically have access to a dataset that is highly unbalanced. This is also the case for the current dataset: only 492 of the 284,807 transactions are fraudulent (the positive class - the frauds - accounts for 0.172% of all transactions).\n","
\n","
This is a Kaggle dataset. You can find the \"Credit Card Fraud Detection\" dataset at the following link: Credit Card Fraud Detection.\n","
\n","
To train the model, you can use part of the input dataset, while the remaining data can be utilized to assess the quality of the trained model. First, let's import the necessary libraries and download the dataset.\n","
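Because the dataset is so unbalanced, the training sections below pass sample weights computed with `compute_sample_weight('balanced', ...)` so that the rare fraudulent class is not ignored. As a quick illustration of what 'balanced' weighting does, here is a tiny sketch on a made-up label vector (not the real data):

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# toy label vector: 8 legitimate (0) and 2 fraudulent (1) transactions
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# 'balanced' assigns each sample the weight n_samples / (n_classes * class_count),
# so minority-class samples receive proportionally larger weights
weights = compute_sample_weight('balanced', y_toy)
print(weights)  # 0.625 for each legitimate sample, 2.5 for each fraudulent one
```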
\n","
\n"]},{"cell_type":"markdown","id":"e147e6ce-2942-480d-be46-d5d9ed3119e8","metadata":{},"outputs":[],"source":["
\n","

Import Libraries

\n","
\n"]},{"cell_type":"code","id":"65c0add3-3391-4c42-92a0-a9a013d972d8","metadata":{},"outputs":[],"source":["# Install scikit-learn using pip\n!pip install scikit-learn==1.0.2\n\n# Snap ML is available on PyPI. To install it simply run the pip command below.\n!pip install snapml"]},{"cell_type":"code","id":"a2dfdbbd-fd5e-4840-a0df-34e5bd67307d","metadata":{},"outputs":[],"source":["# Import the libraries we need to use in this lab\nfrom __future__ import print_function\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\n%matplotlib inline\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import normalize, StandardScaler\nfrom sklearn.utils.class_weight import compute_sample_weight\nfrom sklearn.metrics import roc_auc_score\nimport time\nimport warnings\nwarnings.filterwarnings('ignore')"]},{"cell_type":"code","id":"ed34b094-8e81-443c-b601-04860684ee88","metadata":{},"outputs":[],"source":["# download the dataset\nurl= \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/creditcard.csv\"\n\n# read the input data\nraw_data=pd.read_csv(url)\nprint(\"There are \" + str(len(raw_data)) + \" observations in the credit card fraud dataset.\")\nprint(\"There are \" + str(len(raw_data.columns)) + \" variables in the dataset.\")"]},{"cell_type":"markdown","id":"a4217311-02d5-412f-b120-0f739a116a64","metadata":{},"outputs":[],"source":["__Did you know?__ When it comes to Machine Learning, you will most likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](https://ibm.biz/BdPfxf)\n"]},{"cell_type":"markdown","id":"4bba494b-f521-45eb-87cf-f28374a05414","metadata":{},"outputs":[],"source":["
\n","

Dataset Analysis

\n","
\n"]},{"cell_type":"markdown","id":"7a201c8c-0b82-4dc3-9f54-9e73ace555e2","metadata":{},"outputs":[],"source":["In this section you will read the dataset in a Pandas dataframe and visualize its content. You will also look at some data statistics. \n","\n","Note: A Pandas dataframe is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure. For more information: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html. \n"]},{"cell_type":"code","id":"e317f9f0-34e5-43e2-9309-d0e6da0507e7","metadata":{},"outputs":[],"source":["# display the first rows in the dataset\nraw_data.head()"]},{"cell_type":"markdown","id":"a408d951-9957-4892-bf6f-eb84e8c4b70d","metadata":{},"outputs":[],"source":["In practice, a financial institution may have access to a much larger dataset of transactions. To simulate such a case, we will inflate the original one 10 times.\n"]},{"cell_type":"code","id":"44e1f3cd-0ece-467c-89e1-50c360e4f5f3","metadata":{},"outputs":[],"source":["n_replicas = 10\n\n# inflate the original dataset\nbig_raw_data = pd.DataFrame(np.repeat(raw_data.values, n_replicas, axis=0), columns=raw_data.columns)\n\nprint(\"There are \" + str(len(big_raw_data)) + \" observations in the inflated credit card fraud dataset.\")\nprint(\"There are \" + str(len(big_raw_data.columns)) + \" variables in the dataset.\")\n\n# display first rows in the new dataset\nbig_raw_data.head()"]},{"cell_type":"markdown","id":"7285dcea-42ba-4353-8ca9-1aabcffca3f0","metadata":{},"outputs":[],"source":["Each row in the dataset represents a credit card transaction. As shown above, each row has 31 variables. One variable (the last variable in the table above) is called Class and represents the target variable. Your objective will be to train a model that uses the other variables to predict the value of the Class variable. Let's first retrieve basic statistics about the target variable.\n","\n","Note: For confidentiality reasons, the original names of most features are anonymized V1, V2 .. V28. The values of these features are the result of a PCA transformation and are numerical. The feature 'Class' is the target variable and it takes two values: 1 in case of fraud and 0 otherwise. For more information about the dataset please visit this webpage: https://www.kaggle.com/mlg-ulb/creditcardfraud.\n"]},{"cell_type":"code","id":"b9ece505-6fb6-42e8-8180-5bb8a5cdc25f","metadata":{},"outputs":[],"source":["# get the set of distinct classes\nlabels = big_raw_data.Class.unique()\n\n# get the count of each class\nsizes = big_raw_data.Class.value_counts().values\n\n# plot the class value counts\nfig, ax = plt.subplots()\nax.pie(sizes, labels=labels, autopct='%1.3f%%')\nax.set_title('Target Variable Value Counts')\nplt.show()"]},{"cell_type":"markdown","id":"77ed6d90-0625-4f7f-b523-2ec4eeeb8705","metadata":{},"outputs":[],"source":["As shown above, the Class variable has two values: 0 (the credit card transaction is legitimate) and 1 (the credit card transaction is fraudulent). Thus, you need to model a binary classification problem. Moreover, the dataset is highly unbalanced, the target variable classes are not represented equally. This case requires special attention when training or when evaluating the quality of a model. One way of handing this case at train time is to bias the model to pay more attention to the samples in the minority class. 
The models under the current study will be configured to take into account the class weights of the samples at train/fit time.\n"]},{"cell_type":"markdown","id":"7c8fa86e-85a3-40b6-a8e8-f506f8f095be","metadata":{},"outputs":[],"source":["### Practice\n"]},{"cell_type":"markdown","id":"96ce0d38-b610-4707-8141-863a5c19da86","metadata":{},"outputs":[],"source":["The credit card transactions have different amounts. Could you plot a histogram that shows the distribution of these amounts? What is the range of these amounts (min/max)? Could you print the 90th percentile of the amount values?\n"]},{"cell_type":"code","id":"b5f347c9-bb21-4e85-a4a2-5737f8e04a5c","metadata":{},"outputs":[],"source":["# your code here"]},{"cell_type":"code","id":"ada68834-282a-4c0e-ba35-78db318bbbc8","metadata":{},"outputs":[],"source":["# we provide our solution here\nplt.hist(big_raw_data.Amount.values, 6, histtype='bar', facecolor='g')\nplt.show()\n\nprint(\"Minimum amount value is \", np.min(big_raw_data.Amount.values))\nprint(\"Maximum amount value is \", np.max(big_raw_data.Amount.values))\nprint(\"90% of the transactions have an amount less or equal than \", np.percentile(raw_data.Amount.values, 90))"]},{"cell_type":"markdown","id":"0d6fb553-3ffe-42a0-b79c-d39e9a663695","metadata":{},"outputs":[],"source":["
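Transaction amounts are typically very right-skewed, so a histogram with a handful of linear bins can hide most of the structure. As an optional variation on the solution above (reusing `big_raw_data` from the cells above), logarithmically spaced bins usually give a clearer picture:

```python
import numpy as np
import matplotlib.pyplot as plt

amounts = big_raw_data.Amount.values

# log-spaced bins; start slightly above zero because some transaction amounts are 0
bins = np.logspace(np.log10(max(amounts.min(), 0.01)), np.log10(amounts.max()), 50)
plt.hist(amounts, bins=bins, facecolor='g')
plt.xscale('log')
plt.xlabel('Transaction amount')
plt.ylabel('Number of transactions')
plt.show()
```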
\n","

Dataset Preprocessing

\n","
\n"]},{"cell_type":"markdown","id":"4ba758f5-0fa2-4b67-8371-044169212f31","metadata":{},"outputs":[],"source":["In this subsection you will prepare the data for training. \n"]},{"cell_type":"code","id":"a8db3395-2e9a-4806-add8-27448c45ab8c","metadata":{},"outputs":[],"source":["# data preprocessing such as scaling/normalization is typically useful for \n# linear models to accelerate the training convergence\n\n# standardize features by removing the mean and scaling to unit variance\nbig_raw_data.iloc[:, 1:30] = StandardScaler().fit_transform(big_raw_data.iloc[:, 1:30])\ndata_matrix = big_raw_data.values\n\n# X: feature matrix (for this analysis, we exclude the Time variable from the dataset)\nX = data_matrix[:, 1:30]\n\n# y: labels vector\ny = data_matrix[:, 30]\n\n# data normalization\nX = normalize(X, norm=\"l1\")\n\n# print the shape of the features matrix and the labels vector\nprint('X.shape=', X.shape, 'y.shape=', y.shape)"]},{"cell_type":"markdown","id":"640d3c6a-d10d-4f81-bf08-03df099319dc","metadata":{},"outputs":[],"source":["
\n","

Dataset Train/Test Split

\n","
\n"]},{"cell_type":"markdown","id":"fa821365-6f87-446a-9a8c-20605ae545a8","metadata":{},"outputs":[],"source":["Now that the dataset is ready for building the classification models, you need to first divide the pre-processed dataset into a subset to be used for training the model (the train set) and a subset to be used for evaluating the quality of the model (the test set).\n"]},{"cell_type":"code","id":"e68844bf-99d1-4366-b674-ef2a41d12975","metadata":{},"outputs":[],"source":["X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y) \nprint('X_train.shape=', X_train.shape, 'Y_train.shape=', y_train.shape)\nprint('X_test.shape=', X_test.shape, 'Y_test.shape=', y_test.shape)"]},{"cell_type":"markdown","id":"8bedfac8-ae4e-48fa-889a-4dd6e572c179","metadata":{},"outputs":[],"source":["
\n","

Build a Decision Tree Classifier model with Scikit-Learn

\n","
\n"]},{"cell_type":"code","id":"d2408892-20bf-43f3-8084-055aa4484af4","metadata":{},"outputs":[],"source":["# compute the sample weights to be used as input to the train routine so that \n# it takes into account the class imbalance present in this dataset\nw_train = compute_sample_weight('balanced', y_train)\n\n# import the Decision Tree Classifier Model from scikit-learn\nfrom sklearn.tree import DecisionTreeClassifier\n\n# for reproducible output across multiple function calls, set random_state to a given integer value\nsklearn_dt = DecisionTreeClassifier(max_depth=4, random_state=35)\n\n# train a Decision Tree Classifier using scikit-learn\nt0 = time.time()\nsklearn_dt.fit(X_train, y_train, sample_weight=w_train)\nsklearn_time = time.time()-t0\nprint(\"[Scikit-Learn] Training time (s): {0:.5f}\".format(sklearn_time))"]},{"cell_type":"markdown","id":"b9546d6d-aaef-477c-baf3-19c298b05be8","metadata":{},"outputs":[],"source":["
\n","

Build a Decision Tree Classifier model with Snap ML

\n","
\n"]},{"cell_type":"code","id":"0c8b4e9f-de3f-48cd-969b-3f58f4357808","metadata":{},"outputs":[],"source":["# if not already computed, \n# compute the sample weights to be used as input to the train routine so that \n# it takes into account the class imbalance present in this dataset\n# w_train = compute_sample_weight('balanced', y_train)\n\n# import the Decision Tree Classifier Model from Snap ML\nfrom snapml import DecisionTreeClassifier\n\n# Snap ML offers multi-threaded CPU/GPU training of decision trees, unlike scikit-learn\n# to use the GPU, set the use_gpu parameter to True\n# snapml_dt = DecisionTreeClassifier(max_depth=4, random_state=45, use_gpu=True)\n\n# to set the number of CPU threads used at training time, set the n_jobs parameter\n# for reproducible output across multiple function calls, set random_state to a given integer value\nsnapml_dt = DecisionTreeClassifier(max_depth=4, random_state=45, n_jobs=4)\n\n# train a Decision Tree Classifier model using Snap ML\nt0 = time.time()\nsnapml_dt.fit(X_train, y_train, sample_weight=w_train)\nsnapml_time = time.time()-t0\nprint(\"[Snap ML] Training time (s): {0:.5f}\".format(snapml_time))"]},{"cell_type":"markdown","id":"5fda600e-bf93-47b2-b443-c6183f493aff","metadata":{},"outputs":[],"source":["
\n","

Evaluate the Scikit-Learn and Snap ML Decision Tree Classifier Models

\n","
\n"]},{"cell_type":"code","id":"eb39b543-d7aa-43cf-aebc-54a0a5d97eb7","metadata":{},"outputs":[],"source":["# Snap ML vs Scikit-Learn training speedup\ntraining_speedup = sklearn_time/snapml_time\nprint('[Decision Tree Classifier] Snap ML vs. Scikit-Learn speedup : {0:.2f}x '.format(training_speedup))\n\n# run inference and compute the probabilities of the test samples \n# to belong to the class of fraudulent transactions\nsklearn_pred = sklearn_dt.predict_proba(X_test)[:,1]\n\n# evaluate the Compute Area Under the Receiver Operating Characteristic \n# Curve (ROC-AUC) score from the predictions\nsklearn_roc_auc = roc_auc_score(y_test, sklearn_pred)\nprint('[Scikit-Learn] ROC-AUC score : {0:.3f}'.format(sklearn_roc_auc))\n\n# run inference and compute the probabilities of the test samples\n# to belong to the class of fraudulent transactions\nsnapml_pred = snapml_dt.predict_proba(X_test)[:,1]\n\n# evaluate the Compute Area Under the Receiver Operating Characteristic\n# Curve (ROC-AUC) score from the prediction scores\nsnapml_roc_auc = roc_auc_score(y_test, snapml_pred) \nprint('[Snap ML] ROC-AUC score : {0:.3f}'.format(snapml_roc_auc))"]},{"cell_type":"markdown","id":"58db7916-12e9-4d6d-902a-8c2027aaee3b","metadata":{},"outputs":[],"source":["As shown above both decision tree models provide the same score on the test dataset. However Snap ML runs the training routine 12x faster than Scikit-Learn. This is one of the advantages of using Snap ML: acceleration of training of classical machine learning models, such as linear and tree-based models. For more Snap ML examples, please visit [snapml-examples](https://ibm.biz/BdPfxP).\n"]},{"cell_type":"markdown","id":"63a81e13-6eae-41bf-8c84-2d970be5530f","metadata":{},"outputs":[],"source":["
\n","

Build a Support Vector Machine model with Scikit-Learn

\n","
\n"]},{"cell_type":"code","id":"5a989794-7f41-421f-8eaf-ee8c5a6821d0","metadata":{},"outputs":[],"source":["# import the linear Support Vector Machine (SVM) model from Scikit-Learn\nfrom sklearn.svm import LinearSVC\n\n# instatiate a scikit-learn SVM model\n# to indicate the class imbalance at fit time, set class_weight='balanced'\n# for reproducible output across multiple function calls, set random_state to a given integer value\nsklearn_svm = LinearSVC(class_weight='balanced', random_state=31, loss=\"hinge\", fit_intercept=False)\n\n# train a linear Support Vector Machine model using Scikit-Learn\nt0 = time.time()\nsklearn_svm.fit(X_train, y_train)\nsklearn_time = time.time() - t0\nprint(\"[Scikit-Learn] Training time (s): {0:.2f}\".format(sklearn_time))"]},{"cell_type":"markdown","id":"afae2db5-321d-4329-a9f2-e4b915be0a13","metadata":{},"outputs":[],"source":["
\n","

Build a Support Vector Machine model with Snap ML

\n","
\n"]},{"cell_type":"code","id":"ce44ff4a-c72a-473b-9be6-84cfe8d34b52","metadata":{},"outputs":[],"source":["# import the Support Vector Machine model (SVM) from Snap ML\nfrom snapml import SupportVectorMachine\n\n# in contrast to scikit-learn's LinearSVC, Snap ML offers multi-threaded CPU/GPU training of SVMs\n# to use the GPU, set the use_gpu parameter to True\n# snapml_svm = SupportVectorMachine(class_weight='balanced', random_state=25, use_gpu=True, fit_intercept=False)\n\n# to set the number of threads used at training time, one needs to set the n_jobs parameter\nsnapml_svm = SupportVectorMachine(class_weight='balanced', random_state=25, n_jobs=4, fit_intercept=False)\n# print(snapml_svm.get_params())\n\n# train an SVM model using Snap ML\nt0 = time.time()\nmodel = snapml_svm.fit(X_train, y_train)\nsnapml_time = time.time() - t0\nprint(\"[Snap ML] Training time (s): {0:.2f}\".format(snapml_time))"]},{"cell_type":"markdown","id":"7c5a358e-3d1b-44b6-aaef-7df8db3f0b49","metadata":{},"outputs":[],"source":["
\n","

Evaluate the Scikit-Learn and Snap ML Support Vector Machine Models

\n","
\n"]},{"cell_type":"code","id":"c5722774-64c4-49f8-809d-4c1232faf553","metadata":{},"outputs":[],"source":["# compute the Snap ML vs Scikit-Learn training speedup\ntraining_speedup = sklearn_time/snapml_time\nprint('[Support Vector Machine] Snap ML vs. Scikit-Learn training speedup : {0:.2f}x '.format(training_speedup))\n\n# run inference using the Scikit-Learn model\n# get the confidence scores for the test samples\nsklearn_pred = sklearn_svm.decision_function(X_test)\n\n# evaluate accuracy on test set\nacc_sklearn = roc_auc_score(y_test, sklearn_pred)\nprint(\"[Scikit-Learn] ROC-AUC score: {0:.3f}\".format(acc_sklearn))\n\n# run inference using the Snap ML model\n# get the confidence scores for the test samples\nsnapml_pred = snapml_svm.decision_function(X_test)\n\n# evaluate accuracy on test set\nacc_snapml = roc_auc_score(y_test, snapml_pred)\nprint(\"[Snap ML] ROC-AUC score: {0:.3f}\".format(acc_snapml))"]},{"cell_type":"markdown","id":"73fd83d6-1f1a-4e4d-aa83-e5da5470d7b5","metadata":{},"outputs":[],"source":["As shown above both SVM models provide the same score on the test dataset. However, as in the case of decision trees, Snap ML runs the training routine faster than Scikit-Learn. For more Snap ML examples, please visit [snapml-examples](https://ibm.biz/BdPfxP). Moreover, as shown above, not only is Snap ML seemlessly accelerating scikit-learn applications, but the library's Python API is also compatible with scikit-learn metrics and data preprocessors.\n"]},{"cell_type":"markdown","id":"c94e718a-adcb-4acd-85e3-b1dca0442c2b","metadata":{},"outputs":[],"source":["### Practice\n"]},{"cell_type":"markdown","id":"e2eef4eb-95fc-4dcb-ba6b-1f66c49e87ff","metadata":{},"outputs":[],"source":["In this section you will evaluate the quality of the SVM models trained above using the hinge loss metric (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.hinge_loss.html). Run inference on the test set using both Scikit-Learn and Snap ML models. Compute the hinge loss metric for both sets of predictions. 
Print the hinge losses of Scikit-Learn and Snap ML.\n"]},{"cell_type":"code","id":"34da00de-f3d4-44a4-9d69-ea14e1c65e59","metadata":{},"outputs":[],"source":["# your code goes here"]},{"cell_type":"code","id":"cf060d93-acfb-401a-b659-1eda59036b49","metadata":{},"outputs":[],"source":["# get the confidence scores for the test samples\nsklearn_pred = sklearn_svm.decision_function(X_test)\nsnapml_pred = snapml_svm.decision_function(X_test)\n\n# import the hinge_loss metric from scikit-learn\nfrom sklearn.metrics import hinge_loss\n\n# evaluate the hinge loss from the predictions\nloss_snapml = hinge_loss(y_test, snapml_pred)\nprint(\"[Snap ML] Hinge loss: {0:.3f}\".format(loss_snapml))\n\n# evaluate the hinge loss metric from the predictions\nloss_sklearn = hinge_loss(y_test, sklearn_pred)\nprint(\"[Scikit-Learn] Hinge loss: {0:.3f}\".format(loss_snapml))\n\n# the two models should give the same Hinge loss"]},{"cell_type":"markdown","id":"03ac75c1-97c3-400d-9701-53d5dc820418","metadata":{},"outputs":[],"source":["## Authors\n"]},{"cell_type":"markdown","id":"c400d2e3-a8d6-4059-898f-160d5172e4bf","metadata":{},"outputs":[],"source":["Andreea Anghel\n"]},{"cell_type":"markdown","id":"437d605d-cbd1-468a-bfed-4f9843eb47e8","metadata":{},"outputs":[],"source":["### Other Contributors\n"]},{"cell_type":"markdown","id":"a3849cb9-9ecf-4c26-9ee4-58a9b1a583bf","metadata":{},"outputs":[],"source":["Joseph Santarcangelo\n"]},{"cell_type":"markdown","id":"2d73311d-163c-4368-94e9-d7dfa2b318b8","metadata":{},"outputs":[],"source":["## Change Log\n"]},{"cell_type":"markdown","id":"da21c7b1-5fae-4ede-94ae-3355066a0d08","metadata":{},"outputs":[],"source":["| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n","|---|---|---|---|\n","| 2021-08-31 | 0.1 | AAN | Created Lab Content |\n"]},{"cell_type":"markdown","id":"acc18f53-626d-4013-b632-88e32da97dba","metadata":{},"outputs":[],"source":[" Copyright © 2021 IBM Corporation. This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license/).\n"]}],"metadata":{"kernelspec":{"display_name":"Python","language":"python","name":"conda-env-python-py"},"language_info":{"name":"python","version":"3.7.12","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"prev_pub_hash":"f2a517f24c6b065305a2bf78857fae552c3754af3502212b3301150886d0a5b2"},"nbformat":4,"nbformat_minor":4} -------------------------------------------------------------------------------- /Clustering/Clus-K-Means-Customer-Seg.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["\n","

\n"," \n"," \"Skills\n"," \n","

\n","\n","\n","# K-Means Clustering\n","\n","\n","Estimated time needed: **25** minutes\n"," \n","\n","## Objectives\n","\n","After completing this lab you will be able to:\n","\n","* Use scikit-learn's K-Means Clustering to cluster data\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["## Introduction\n","\n","There are many models for **clustering** out there. In this notebook, we will be presenting the model that is considered one of the simplest models amongst them. Despite its simplicity, the **K-means** is vastly used for clustering in many data science applications, it is especially useful if you need to quickly discover insights from **unlabeled data**. In this notebook, you will learn how to use k-Means for customer segmentation.\n","\n","Some real-world applications of k-means:\n","- Customer segmentation\n","- Understand what the visitors of a website are trying to accomplish\n","- Pattern recognition\n","- Machine learning\n","- Data compression\n","\n","\n","In this notebook we practice k-means clustering with 2 examples:\n","- k-means on a random generated dataset\n","- Using k-means for customer segmentation\n"]},{"cell_type":"markdown","metadata":{},"source":["

Table of contents

\n","\n","
\n"," \n","
\n","
\n","
\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["### Import libraries\n","Let's first import the required libraries.\n","Also run %matplotlib inline since we will be plotting in this section.\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Surpress warnings:\n","def warn(*args, **kwargs):\n"," pass\n","import warnings\n","warnings.warn = warn"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["import random \n","import numpy as np \n","import matplotlib.pyplot as plt \n","from sklearn.cluster import KMeans \n","from sklearn.datasets import make_blobs \n","%matplotlib inline"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["

k-Means on a randomly generated dataset

\n","\n","Let's create our own dataset for this lab!\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["First we need to set a random seed. Use numpy's random.seed() function, where the seed will be set to 0.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["np.random.seed(0)"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Next we will be making random clusters of points by using the make_blobs class. The make_blobs class can take in many inputs, but we will be using these specific ones.

\n"," Input \n","\n","
\n"," Output \n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["X, y = make_blobs(n_samples=5000, centers=[[4,4], [-2, -1], [2, -3], [1, 1]], cluster_std=0.9)"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Display the scatter plot of the randomly generated data.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["plt.scatter(X[:, 0], X[:, 1], marker='.')"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["

Setting up K-Means

\n","Now that we have our random data, let's set up our K-Means Clustering.\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["The KMeans class has many parameters that can be used, but we will be using these three:\n","\n","\n","Initialize KMeans with these parameters, where the output parameter is called k_means.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["k_means = KMeans(init = \"k-means++\", n_clusters = 4, n_init = 12)"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Now let's fit the KMeans model with the feature matrix we created above, X .\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["k_means.fit(X)"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Now let's grab the labels for each point in the model using KMeans' .labels\\_ attribute and save it as k_means_labels .\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["k_means_labels = k_means.labels_\n","k_means_labels"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["We will also get the coordinates of the cluster centers using KMeans' .cluster_centers_ and save it as k_means_cluster_centers .\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["k_means_cluster_centers = k_means.cluster_centers_\n","k_means_cluster_centers"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["

Creating the Visual Plot

\n","\n","So now that we have the random data generated and the KMeans model initialized, let's plot them and see what it looks like!\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Please read through the code and comments to understand how to plot the model.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["# Initialize the plot with the specified dimensions.\n","fig = plt.figure(figsize=(6, 4))\n","\n","# Colors uses a color map, which will produce an array of colors based on\n","# the number of labels there are. We use set(k_means_labels) to get the\n","# unique labels.\n","colors = plt.cm.Spectral(np.linspace(0, 1, len(set(k_means_labels))))\n","\n","# Create a plot\n","ax = fig.add_subplot(1, 1, 1)\n","\n","# For loop that plots the data points and centroids.\n","# k will range from 0-3, which will match the possible clusters that each\n","# data point is in.\n","for k, col in zip(range(len([[4,4], [-2, -1], [2, -3], [1, 1]])), colors):\n","\n"," # Create a list of all data points, where the data points that are \n"," # in the cluster (ex. cluster 0) are labeled as true, else they are\n"," # labeled as false.\n"," my_members = (k_means_labels == k)\n"," \n"," # Define the centroid, or cluster center.\n"," cluster_center = k_means_cluster_centers[k]\n"," \n"," # Plots the datapoints with color col.\n"," ax.plot(X[my_members, 0], X[my_members, 1], 'w', markerfacecolor=col, marker='.')\n"," \n"," # Plots the centroids with specified color, but with a darker outline\n"," ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col, markeredgecolor='k', markersize=6)\n","\n","# Title of the plot\n","ax.set_title('KMeans')\n","\n","# Remove x-axis ticks\n","ax.set_xticks(())\n","\n","# Remove y-axis ticks\n","ax.set_yticks(())\n","\n","# Show the plot\n","plt.show()\n"]},{"cell_type":"markdown","metadata":{},"source":["## Practice\n","Try to cluster the above dataset into 3 clusters. \n","Notice: do not generate the data again, use the same dataset as above.\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# write your code here\n","\n"]},{"cell_type":"markdown","metadata":{},"source":["
Click here for the solution\n","\n","```python\n","k_means3 = KMeans(init = \"k-means++\", n_clusters = 3, n_init = 12)\n","k_means3.fit(X)\n","fig = plt.figure(figsize=(6, 4))\n","colors = plt.cm.Spectral(np.linspace(0, 1, len(set(k_means3.labels_))))\n","ax = fig.add_subplot(1, 1, 1)\n","for k, col in zip(range(len(k_means3.cluster_centers_)), colors):\n"," my_members = (k_means3.labels_ == k)\n"," cluster_center = k_means3.cluster_centers_[k]\n"," ax.plot(X[my_members, 0], X[my_members, 1], 'w', markerfacecolor=col, marker='.')\n"," ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col, markeredgecolor='k', markersize=6)\n","plt.show()\n","\n","```\n","\n","
\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["

Customer Segmentation with K-Means

\n","\n","Imagine that you have a customer dataset, and you need to apply customer segmentation on this historical data.\n","Customer segmentation is the practice of partitioning a customer base into groups of individuals that have similar characteristics. It is a significant strategy as a business can target these specific groups of customers and effectively allocate marketing resources. For example, one group might contain customers who are high-profit and low-risk, that is, more likely to purchase products, or subscribe for a service. A business task is to retain those customers. Another group might include customers from non-profit organizations and so on.\n","\n","__Did you know?__ When it comes to Machine Learning, you will likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](http://cocl.us/ML0101EN-IBM-Offer-CC)\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["### Load Data From CSV File \n","Before you can work with the data, let's use pandas to read the dataset from IBM Object Storage.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["import pandas as pd\n","cust_df = pd.read_csv(\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%204/data/Cust_Segmentation.csv\")\n","cust_df.head()"]},{"cell_type":"markdown","metadata":{},"source":["

Pre-processing and Modeling
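The modeling cells below refer to a dataframe `df` and a feature matrix `X` built from the customer data. Here is a minimal pre-processing sketch that would provide them (assuming, as in the CSV loaded above, that `cust_df` has a categorical `Address` column and that its first column is a customer identifier):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Address is a categorical field, so drop it before clustering
df = cust_df.drop('Address', axis=1)

# numeric feature matrix (skipping the customer id column), with NaNs replaced by 0
X = np.nan_to_num(df.values[:, 1:].astype(float))

# a standardized copy is also worth keeping, since features on very different scales
# would otherwise dominate the Euclidean distances used by k-means
Clus_dataSet = StandardScaler().fit_transform(X)
X[0:5]
```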

\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["In our example (if we didn't have access to the k-means algorithm), it would be the same as guessing that each customer group would have certain age, income, education, etc, with multiple tests and experiments. However, using the K-means clustering we can do all this process much easier.\n","\n","Let's apply k-means on our dataset, and take a look at cluster labels.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["clusterNum = 3\n","k_means = KMeans(init = \"k-means++\", n_clusters = clusterNum, n_init = 12)\n","k_means.fit(X)\n","labels = k_means.labels_\n","print(labels)"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["

Insights

\n","\n","We assign the labels to each row in the dataframe.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["df[\"Clus_km\"] = labels\n","df.head(5)"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["We can easily check the centroid values by averaging the features in each cluster.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["df.groupby('Clus_km').mean()"]},{"cell_type":"markdown","metadata":{},"source":["Now, let's look at the distribution of customers based on their age and income:\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["area = np.pi * ( X[:, 1])**2 \n","plt.scatter(X[:, 0], X[:, 3], s=area, c=labels.astype(np.float), alpha=0.5)\n","plt.xlabel('Age', fontsize=18)\n","plt.ylabel('Income', fontsize=16)\n","\n","plt.show()\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["from mpl_toolkits.mplot3d import Axes3D \n","fig = plt.figure(1, figsize=(8, 6))\n","plt.clf()\n","ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)\n","\n","plt.cla()\n","# plt.ylabel('Age', fontsize=18)\n","# plt.xlabel('Income', fontsize=16)\n","# plt.zlabel('Education', fontsize=16)\n","ax.set_xlabel('Education')\n","ax.set_ylabel('Age')\n","ax.set_zlabel('Income')\n","\n","ax.scatter(X[:, 1], X[:, 0], X[:, 3], c= labels.astype(np.float))\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["k-means will partition your customers into mutually exclusive groups, for example, into 3 clusters. The customers in each cluster are similar to each other demographically.\n","Now we can create a profile for each group, considering the common characteristics of each cluster. \n","For example, the 3 clusters can be:\n","\n","- AFFLUENT, EDUCATED AND OLD AGED\n","- MIDDLE AGED AND MIDDLE INCOME\n","- YOUNG AND LOW INCOME\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["

Want to learn more?

\n","\n","IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n","\n","Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n","\n"]},{"cell_type":"markdown","metadata":{},"source":["### Thank you for completing this lab!\n","\n","\n","## Author\n","\n","Saeed Aghabozorgi\n","\n","\n","### Other Contributors\n","\n","Joseph Santarcangelo\n","\n","\n","\n","\n","## Change Log\n","\n","\n","| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n","|---|---|---|---|\n","| 2020-11-03 | 2.1 | Lakshmi | Updated URL of csv |\n","| 2020-08-27 | 2.0 | Lavanya | Moved lab to course repo in GitLab |\n","| | | | |\n","| | | | |\n","\n","\n","##

© IBM Corporation 2020. All rights reserved.

\n"]}],"metadata":{"kernelspec":{"display_name":"Python","language":"python","name":"conda-env-python-py"},"language_info":{"name":""}},"nbformat":4,"nbformat_minor":4} -------------------------------------------------------------------------------- /Linear Classification/Clas-SVM-cancer-project.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","id":"5f9f987e-58e5-400b-9b90-8ca43b01a07f","metadata":{},"outputs":[],"source":["

\n"," \n"," \"Skills\n"," \n","

\n","\n","\n","# SVM (Support Vector Machines)\n","\n","\n","Estimated time needed: **15** minutes\n"," \n","\n","## Objectives\n","\n","After completing this lab you will be able to:\n","\n","* Use scikit-learn to Support Vector Machine to classify\n"]},{"cell_type":"markdown","id":"ae964f7c-ca56-4d6b-b521-5e221a746164","metadata":{},"outputs":[],"source":["In this notebook, you will use SVM (Support Vector Machines) to build and train a model using human cell records, and classify cells to whether the samples are benign or malignant.\n","\n","SVM works by mapping data to a high-dimensional feature space so that data points can be categorized, even when the data are not otherwise linearly separable. A separator between the categories is found, then the data is transformed in such a way that the separator could be drawn as a hyperplane. Following this, characteristics of new data can be used to predict the group to which a new record should belong.\n"]},{"cell_type":"markdown","id":"219ee2e7-c558-40c8-8265-a5e4a577ab15","metadata":{},"outputs":[],"source":["
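As a small illustration of the idea described above (a toy example, separate from the cancer dataset used in this lab): a one-dimensional dataset that is not linearly separable becomes separable once it is mapped into a higher-dimensional feature space, here simply by adding $x^2$ as a second feature.

```python
import numpy as np
from sklearn.svm import SVC

# a 1-D toy problem that is NOT linearly separable:
# class 1 sits in the middle, class 0 on both sides
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]).reshape(-1, 1)
y = np.array([0, 0, 1, 1, 1, 0, 0])

# map each point to (x, x^2); in this 2-D space a straight line separates the classes
x_mapped = np.hstack([x, x ** 2])

clf = SVC(kernel='linear').fit(x_mapped, y)
print(clf.score(x_mapped, y))  # should print 1.0
```

Kernel functions such as the RBF kernel used later in this notebook perform this kind of mapping implicitly, without constructing the extra features by hand.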

Table of contents

\n","\n","
\n","
    \n","
  1. Load the Cancer data\n","
  2. Modeling\n","
  3. Evaluation\n","
  4. Practice\n","
\n","
\n","
\n","
\n"]},{"cell_type":"code","id":"9b6afb91-66d0-4b30-9892-34e99fa557ef","metadata":{},"outputs":[],"source":["!pip install scikit-learn==0.23.1"]},{"cell_type":"code","id":"345e1afc-3d1b-46e5-adea-ee05324a833b","metadata":{},"outputs":[],"source":["import pandas as pd\nimport pylab as pl\nimport numpy as np\nimport scipy.optimize as opt\nfrom sklearn import preprocessing\nfrom sklearn.model_selection import train_test_split\n%matplotlib inline \nimport matplotlib.pyplot as plt"]},{"cell_type":"markdown","id":"b869ce64-28d1-4cab-b3b7-b3849756bfb2","metadata":{},"outputs":[],"source":["

Load the Cancer data

\n","The example is based on a dataset that is publicly available from the UCI Machine Learning Repository (Asuncion and Newman, 2007)[http://mlearn.ics.uci.edu/MLRepository.html]. The dataset consists of several hundred human cell sample records, each of which contains the values of a set of cell characteristics. The fields in each record are:\n","\n","|Field name|Description|\n","|--- |--- |\n","|ID|Clump thickness|\n","|Clump|Clump thickness|\n","|UnifSize|Uniformity of cell size|\n","|UnifShape|Uniformity of cell shape|\n","|MargAdh|Marginal adhesion|\n","|SingEpiSize|Single epithelial cell size|\n","|BareNuc|Bare nuclei|\n","|BlandChrom|Bland chromatin|\n","|NormNucl|Normal nucleoli|\n","|Mit|Mitoses|\n","|Class|Benign or malignant|\n","\n","
\n","
\n","\n","For the purposes of this example, we're using a dataset that has a relatively small number of predictors in each record. To download the data, we will use `!wget` to download it from IBM Object Storage. \n","\n","__Did you know?__ When it comes to Machine Learning, you will likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](http://cocl.us/ML0101EN-IBM-Offer-CC)\n"]},{"cell_type":"code","id":"4acb9e1a-ddb2-4f87-8c16-437cca952058","metadata":{},"outputs":[],"source":["#Click here and press Shift+Enter\n!wget -O cell_samples.csv https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/cell_samples.csv"]},{"cell_type":"markdown","id":"2c7488af-938c-493e-8284-71b0c5ad0a2b","metadata":{},"outputs":[],"source":["## Load Data From CSV File \n"]},{"cell_type":"code","id":"7afded80-ea99-4c72-a247-7af084055b76","metadata":{},"outputs":[],"source":["cell_df = pd.read_csv(\"cell_samples.csv\")\ncell_df.head()"]},{"cell_type":"markdown","id":"d6ad8036-7ce2-43c6-bff0-53e88835ee49","metadata":{},"outputs":[],"source":["The ID field contains the patient identifiers. The characteristics of the cell samples from each patient are contained in fields Clump to Mit. The values are graded from 1 to 10, with 1 being the closest to benign.\n","\n","The Class field contains the diagnosis, as confirmed by separate medical procedures, as to whether the samples are benign (value = 2) or malignant (value = 4).\n","\n","Let's look at the distribution of the classes based on Clump thickness and Uniformity of cell size:\n"]},{"cell_type":"code","id":"386c99a6-aae4-4879-afa1-cff8d58d926e","metadata":{},"outputs":[],"source":["ax = cell_df[cell_df['Class'] == 4][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='DarkBlue', label='malignant');\ncell_df[cell_df['Class'] == 2][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='Yellow', label='benign', ax=ax);\nplt.show()"]},{"cell_type":"markdown","id":"e55434b7-e290-4cc0-96a6-829aa78f1645","metadata":{},"outputs":[],"source":["## Data pre-processing and selection\n"]},{"cell_type":"markdown","id":"38f41ec2-92ea-433d-900e-365308d9761c","metadata":{},"outputs":[],"source":["Let's first look at columns data types:\n"]},{"cell_type":"code","id":"865ecaf9-d4fa-49de-a5c0-14aed34a014f","metadata":{},"outputs":[],"source":["cell_df.dtypes"]},{"cell_type":"markdown","id":"ac07bf24-40cd-4491-8952-6b8517c4e29b","metadata":{},"outputs":[],"source":["It looks like the __BareNuc__ column includes some values that are not numerical. 
We can drop those rows:\n"]},{"cell_type":"code","id":"a9ddb41c-f9c1-4123-aa05-425fed1a0675","metadata":{},"outputs":[],"source":["cell_df = cell_df[pd.to_numeric(cell_df['BareNuc'], errors='coerce').notnull()]\ncell_df['BareNuc'] = cell_df['BareNuc'].astype('int')\ncell_df.dtypes"]},{"cell_type":"code","id":"b26fd903-8366-45ea-bc92-90d92e761ac1","metadata":{},"outputs":[],"source":["feature_df = cell_df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]\nX = np.asarray(feature_df)\nX[0:5]"]},{"cell_type":"markdown","id":"f15a4069-c0c1-4a3d-84b7-4dd9b15d328b","metadata":{},"outputs":[],"source":["We want the model to predict the value of Class (that is, benign (=2) or malignant (=4)).\n"]},{"cell_type":"code","id":"2e9ac65c-72f3-4e8e-8cb0-8b01fa5da84f","metadata":{},"outputs":[],"source":["y = np.asarray(cell_df['Class'])\ny [0:5]"]},{"cell_type":"markdown","id":"300f169f-505b-4da0-af90-26b1d56c8a09","metadata":{},"outputs":[],"source":["## Train/Test dataset\n"]},{"cell_type":"markdown","id":"d62262ad-adbc-411a-a992-9c34c8d51659","metadata":{},"outputs":[],"source":["We split our dataset into train and test set:\n"]},{"cell_type":"code","id":"a1ce942a-9e8e-4c49-9c37-0c8cf9343ff0","metadata":{},"outputs":[],"source":["X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)\nprint ('Train set:', X_train.shape, y_train.shape)\nprint ('Test set:', X_test.shape, y_test.shape)"]},{"cell_type":"markdown","id":"92b24ae6-47a3-4792-a9e2-d5d27bc80898","metadata":{},"outputs":[],"source":["

Modeling (SVM with Scikit-learn)

\n"]},{"cell_type":"markdown","id":"3638436a-f840-4d33-add8-2a9af5aa8cf5","metadata":{},"outputs":[],"source":["The SVM algorithm offers a choice of kernel functions for performing its processing. Basically, mapping data into a higher dimensional space is called kernelling. The mathematical function used for the transformation is known as the kernel function, and can be of different types, such as:\n","\n"," 1.Linear\n"," 2.Polynomial\n"," 3.Radial basis function (RBF)\n"," 4.Sigmoid\n","Each of these functions has its characteristics, its pros and cons, and its equation, but as there's no easy way of knowing which function performs best with any given dataset. We usually choose different functions in turn and compare the results. Let's just use the default, RBF (Radial Basis Function) for this lab.\n"]},{"cell_type":"code","id":"5ad98b8f-fae0-4c96-b2b1-2a03ca9f6385","metadata":{},"outputs":[],"source":["from sklearn import svm\nclf = svm.SVC(kernel='rbf')\nclf.fit(X_train, y_train) "]},{"cell_type":"markdown","id":"98456360-256e-48da-b211-97d1523deaee","metadata":{},"outputs":[],"source":["After being fitted, the model can then be used to predict new values:\n"]},{"cell_type":"code","id":"0c7131cb-0f95-43df-ac6b-61f91194cf06","metadata":{},"outputs":[],"source":["yhat = clf.predict(X_test)\nyhat [0:5]"]},{"cell_type":"markdown","id":"d176d905-4dc0-41c2-a96a-4a40b0b7e9a9","metadata":{},"outputs":[],"source":["
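The remark above, that we usually try different kernel functions in turn and compare the results, can be made concrete with a small loop. This is a sketch that assumes the `X_train`, `X_test`, `y_train`, `y_test` variables from the split above and uses the weighted F1-score (introduced formally in the Evaluation section below):

```python
from sklearn import svm
from sklearn.metrics import f1_score

# fit one SVM per kernel and compare weighted F1-scores on the held-out test set
for kernel in ('linear', 'poly', 'rbf', 'sigmoid'):
    model = svm.SVC(kernel=kernel)
    model.fit(X_train, y_train)
    score = f1_score(y_test, model.predict(X_test), average='weighted')
    print("{:>8} kernel -> F1-score: {:.4f}".format(kernel, score))
```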

Evaluation

\n"]},{"cell_type":"code","id":"659b285a-5a0b-4a7a-b58b-870d668efa8d","metadata":{},"outputs":[],"source":["from sklearn.metrics import classification_report, confusion_matrix\nimport itertools"]},{"cell_type":"code","id":"24c59f55-6c17-4a47-bf9e-bafe7f5a4032","metadata":{},"outputs":[],"source":["def plot_confusion_matrix(cm, classes,\n normalize=False,\n title='Confusion matrix',\n cmap=plt.cm.Blues):\n \"\"\"\n This function prints and plots the confusion matrix.\n Normalization can be applied by setting `normalize=True`.\n \"\"\"\n if normalize:\n cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n print(\"Normalized confusion matrix\")\n else:\n print('Confusion matrix, without normalization')\n\n print(cm)\n\n plt.imshow(cm, interpolation='nearest', cmap=cmap)\n plt.title(title)\n plt.colorbar()\n tick_marks = np.arange(len(classes))\n plt.xticks(tick_marks, classes, rotation=45)\n plt.yticks(tick_marks, classes)\n\n fmt = '.2f' if normalize else 'd'\n thresh = cm.max() / 2.\n for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n plt.text(j, i, format(cm[i, j], fmt),\n horizontalalignment=\"center\",\n color=\"white\" if cm[i, j] > thresh else \"black\")\n\n plt.tight_layout()\n plt.ylabel('True label')\n plt.xlabel('Predicted label')"]},{"cell_type":"code","id":"7db59216-f5fd-481f-a615-8b3b52a3895b","metadata":{},"outputs":[],"source":["# Compute confusion matrix\ncnf_matrix = confusion_matrix(y_test, yhat, labels=[2,4])\nnp.set_printoptions(precision=2)\n\nprint (classification_report(y_test, yhat))\n\n# Plot non-normalized confusion matrix\nplt.figure()\nplot_confusion_matrix(cnf_matrix, classes=['Benign(2)','Malignant(4)'],normalize= False, title='Confusion matrix')"]},{"cell_type":"markdown","id":"544c97c0-2708-4e36-ba2b-50dd5a50fe09","metadata":{},"outputs":[],"source":["You can also easily use the __f1_score__ from sklearn library:\n"]},{"cell_type":"code","id":"ba790b89-428b-47fc-b676-9f7f1bfe4334","metadata":{},"outputs":[],"source":["from sklearn.metrics import f1_score\nf1_score(y_test, yhat, average='weighted') "]},{"cell_type":"markdown","id":"159c3ffb-d0b5-46f7-aa9f-1be3010fb29f","metadata":{},"outputs":[],"source":["Let's try the jaccard index for accuracy:\n"]},{"cell_type":"code","id":"495ba481-e55e-430e-ba88-65e30cb32836","metadata":{},"outputs":[],"source":["from sklearn.metrics import jaccard_score\njaccard_score(y_test, yhat,pos_label=2)"]},{"cell_type":"markdown","id":"46685f67-1ab0-4e70-9225-5d2bd85758fe","metadata":{},"outputs":[],"source":["

Practice

\n","Can you rebuild the model, but this time with a __linear__ kernel? You can use __kernel='linear'__ option, when you define the svm. How the accuracy changes with the new kernel function?\n"]},{"cell_type":"code","id":"58bc674f-d525-4e88-9db0-1cfa97b70436","metadata":{},"outputs":[],"source":["# write your code here\n"]},{"cell_type":"markdown","id":"6e8bfeb3-bab0-4d3d-9780-8234648c6a00","metadata":{},"outputs":[],"source":["
Click here for the solution\n","\n","```python\n","clf2 = svm.SVC(kernel='linear')\n","clf2.fit(X_train, y_train) \n","yhat2 = clf2.predict(X_test)\n","print(\"Avg F1-score: %.4f\" % f1_score(y_test, yhat2, average='weighted'))\n","print(\"Jaccard score: %.4f\" % jaccard_score(y_test, yhat2,pos_label=2))\n","\n","```\n","\n","
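Building on the solution above, one way to act on the earlier note that we usually "choose different functions in turn and compare the results" is to loop over several kernels and report one score per kernel. This is only a sketch (it assumes the `X_train`, `X_test`, `y_train`, and `y_test` variables from the cells above; the names `clf_k` and `yhat_k` are illustrative):

```python
from sklearn import svm
from sklearn.metrics import f1_score

# Fit one SVC per kernel on the same train/test split and compare weighted F1-scores.
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    clf_k = svm.SVC(kernel=kernel)
    clf_k.fit(X_train, y_train)
    yhat_k = clf_k.predict(X_test)
    print("%-8s Avg F1-score: %.4f" % (kernel, f1_score(y_test, yhat_k, average='weighted')))
```

Whichever kernel scores best on this particular split is not guaranteed to be best in general, which is why a cross-validated search such as GridSearchCV is the usual next step.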
\n","\n"]},{"cell_type":"markdown","id":"268f206e-99e9-4deb-b4cc-f080b14c86c6","metadata":{},"outputs":[],"source":["

Want to learn more?

\n","\n","IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n","\n","Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n","\n"]},{"cell_type":"markdown","id":"98ce26a4-29dd-457b-82af-ac84d9dabe4f","metadata":{},"outputs":[],"source":["### Thank you for completing this lab!\n","\n","\n","## Author\n","\n","Saeed Aghabozorgi\n","\n","\n","### Other Contributors\n","\n","Joseph Santarcangelo\n","\n","\n","\n","\n","## Change Log\n","\n","\n","| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n","|---|---|---|---|\n","| 2021-01-21 | 2.2 | Lakshmi | Updated sklearn library |\n","| 2020-11-03 | 2.1 | Lakshmi | Updated URL of csv |\n","| 2020-08-27 | 2.0 | Lavanya | Moved lab to course repo in GitLab |\n","| | | | |\n","| | | | |\n","\n","\n","##

© IBM Corporation 2020. All rights reserved.

\n"]}],"metadata":{"kernelspec":{"display_name":"Python","language":"python","name":"conda-env-python-py"},"language_info":{"name":""}},"nbformat":4,"nbformat_minor":4} -------------------------------------------------------------------------------- /Linear Classification/Logistic-Regression-churn.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["

\n"," \n"," \"Skills\n"," \n","

\n","\n","\n","# Logistic Regression with Python\n","\n","\n","Estimated time needed: **25** minutes\n"," \n","\n","## Objectives\n","\n","After completing this lab you will be able to:\n","\n","* Use scikit Logistic Regression to classify\n","* Understand confusion matrix\n"]},{"cell_type":"markdown","metadata":{},"source":["In this notebook, you will learn Logistic Regression, and then, you'll create a model for a telecommunication company, to predict when its customers will leave for a competitor, so that they can take some action to retain the customers.\n"]},{"cell_type":"markdown","metadata":{},"source":["

Table of contents

\n","\n","
\n","
    \n","
  1. About the dataset
  2. \n","
  3. Data pre-processing and selection
  4. \n","
  5. Modeling (Logistic Regression with Scikit-learn)
  6. \n","
  7. Evaluation
  8. \n","
  9. Practice
  10. \n","
\n","
\n","
\n","
\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["\n","## What is the difference between Linear and Logistic Regression?\n","\n","While Linear Regression is suited for estimating continuous values (e.g. estimating house price), it is not the best tool for predicting the class of an observed data point. In order to estimate the class of a data point, we need some sort of guidance on what would be the most probable class for that data point. For this, we use Logistic Regression.\n","\n","
\n","Recall linear regression:\n","
\n","
\n"," As you know, Linear regression finds a function that relates a continuous dependent variable, y, to some predictors (independent variables $x_1$, $x_2$, etc.). For example, simple linear regression assumes a function of the form:\n","

\n","$$\n","y = \\theta_0 + \\theta_1 x_1 + \\theta_2 x_2 + \\cdots\n","$$\n","
\n","and finds the values of parameters $\\theta_0, \\theta_1, \\theta_2$, etc, where the term $\\theta_0$ is the \"intercept\". It can be generally shown as:\n","

\n","$$\n","ℎ_\\theta(𝑥) = \\theta^TX\n","$$\n","

\n","\n","
\n","\n","Logistic Regression is a variation of Linear Regression, used when the observed dependent variable, y, is categorical. It produces a formula that predicts the probability of the class label as a function of the independent variables.\n","\n","Logistic regression fits a special s-shaped curve by taking the linear regression function and transforming the numeric estimate into a probability with the following function, which is called the sigmoid function 𝜎:\n","\n","$$\n","ℎ_\\theta(𝑥) = \\sigma({\\theta^TX}) = \\frac {e^{(\\theta_0 + \\theta_1 x_1 + \\theta_2 x_2 +...)}}{1 + e^{(\\theta_0 + \\theta_1 x_1 + \\theta_2 x_2 +\\cdots)}}\n","$$\n","Or:\n","$$\n","ProbabilityOfaClass_1 = P(Y=1|X) = \\sigma({\\theta^TX}) = \\frac{e^{\\theta^TX}}{1+e^{\\theta^TX}} \n","$$\n","\n","In this equation, ${\\theta^TX}$ is the regression result (the sum of the variables weighted by the coefficients), `exp` is the exponential function and $\\sigma(\\theta^TX)$ is the sigmoid or [logistic function](http://en.wikipedia.org/wiki/Logistic_function?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork1047-2023-01-01), also called logistic curve. It is a common \"S\" shape (sigmoid curve).\n","\n","So, briefly, Logistic Regression passes the input through the logistic/sigmoid but then treats the result as a probability:\n","\n","\n","\n","\n","The objective of the __Logistic Regression__ algorithm, is to find the best parameters θ, for $ℎ_\\theta(𝑥)$ = $\\sigma({\\theta^TX})$, in such a way that the model best predicts the class of each case.\n"]},{"cell_type":"markdown","metadata":{},"source":["### Customer churn with Logistic Regression\n","A telecommunications company is concerned about the number of customers leaving their land-line business for cable competitors. They need to understand who is leaving. Imagine that you are an analyst at this company and you have to find out who is leaving and why.\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["!pip install scikit-learn==0.23.1"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Let's first import required libraries:\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["import pandas as pd\n","import pylab as pl\n","import numpy as np\n","import scipy.optimize as opt\n","from sklearn import preprocessing\n","%matplotlib inline \n","import matplotlib.pyplot as plt"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["

About the dataset

\n","We will use a telecommunications dataset for predicting customer churn. This is a historical customer dataset where each row represents one customer. The data is relatively easy to understand, and you may uncover insights you can use immediately. Typically it is less expensive to keep customers than acquire new ones, so the focus of this analysis is to predict the customers who will stay with the company. \n","\n","\n","This data set provides information to help you predict what behavior will help you to retain customers. You can analyze all relevant customer data and develop focused customer retention programs.\n","\n","\n","\n","The dataset includes information about:\n","\n","- Customers who left within the last month – the column is called Churn\n","- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies\n","- Customer account information – how long they had been a customer, contract, payment method, paperless billing, monthly charges, and total charges\n","- Demographic info about customers – gender, age range, and if they have partners and dependents\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["### Load the Telco Churn data \n","Telco Churn is a hypothetical data file that concerns a telecommunications company's efforts to reduce turnover in its customer base. Each case corresponds to a separate customer and it records various demographic and service usage information. Before you can work with the data, you must use the URL to get the ChurnData.csv.\n","\n","To download the data, we will use `!wget` to download it from IBM Object Storage.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["#Click here and press Shift+Enter\n","!wget -O ChurnData.csv https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/ChurnData.csv"]},{"cell_type":"markdown","metadata":{},"source":["__Did you know?__ When it comes to Machine Learning, you will likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](http://cocl.us/ML0101EN-IBM-Offer-CC)\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["## Load Data From CSV File \n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["churn_df = pd.read_csv(\"ChurnData.csv\")\n","churn_df.head()"]},{"cell_type":"markdown","metadata":{},"source":["

Data pre-processing and selection

\n"]},{"cell_type":"markdown","metadata":{},"source":["Let's select some features for the modeling. Also, we change the target data type to be an integer, as it is a requirement by the skitlearn algorithm:\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["churn_df = churn_df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip', 'callcard', 'wireless','churn']]\n","churn_df['churn'] = churn_df['churn'].astype('int')\n","churn_df.head()"]},{"cell_type":"markdown","metadata":{"button":true,"new_sheet":true,"run_control":{"read_only":false}},"source":["## Practice\n","How many rows and columns are in this dataset in total? What are the names of columns?\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["# write your code here\n"]},{"cell_type":"markdown","metadata":{},"source":["
Click here for the solution\n","\n","```python\n","churn_df.shape\n","\n","```\n","\n","
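The solution above answers the first part of the practice question; since it also asks for the column names, a small follow-up (assuming the `churn_df` defined in the cells above) could be:

```python
print(churn_df.shape)             # (number of rows, number of columns)
print(churn_df.columns.tolist())  # names of the columns
```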
\n","\n"]},{"cell_type":"markdown","metadata":{},"source":["Let's define X, and y for our dataset:\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["X = np.asarray(churn_df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']])\n","X[0:5]"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["y = np.asarray(churn_df['churn'])\n","y [0:5]"]},{"cell_type":"markdown","metadata":{},"source":["Also, we normalize the dataset:\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["from sklearn import preprocessing\n","X = preprocessing.StandardScaler().fit(X).transform(X)\n","X[0:5]"]},{"cell_type":"markdown","metadata":{},"source":["## Train/Test dataset\n"]},{"cell_type":"markdown","metadata":{},"source":["We split our dataset into train and test set:\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["from sklearn.model_selection import train_test_split\n","X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)\n","print ('Train set:', X_train.shape, y_train.shape)\n","print ('Test set:', X_test.shape, y_test.shape)"]},{"cell_type":"markdown","metadata":{},"source":["

Modeling (Logistic Regression with Scikit-learn)

\n"]},{"cell_type":"markdown","metadata":{},"source":["Let's build our model using __LogisticRegression__ from the Scikit-learn package. This function implements logistic regression and can use different numerical optimizers to find parameters, including ‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’ solvers. You can find extensive information about the pros and cons of these optimizers if you search it in the internet.\n","\n","The version of Logistic Regression in Scikit-learn, support regularization. Regularization is a technique used to solve the overfitting problem of machine learning models.\n","__C__ parameter indicates __inverse of regularization strength__ which must be a positive float. Smaller values specify stronger regularization. \n","Now let's fit our model with train set:\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["from sklearn.linear_model import LogisticRegression\n","from sklearn.metrics import confusion_matrix\n","LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)\n","LR"]},{"cell_type":"markdown","metadata":{},"source":["Now we can predict using our test set:\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["yhat = LR.predict(X_test)\n","yhat"]},{"cell_type":"markdown","metadata":{},"source":["__predict_proba__ returns estimates for all classes, ordered by the label of classes. So, the first column is the probability of class 0, P(Y=0|X), and second column is probability of class 1, P(Y=1|X):\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["yhat_prob = LR.predict_proba(X_test)\n","yhat_prob"]},{"cell_type":"markdown","metadata":{},"source":["

Evaluation

\n"]},{"cell_type":"markdown","metadata":{},"source":["### jaccard index\n","Let's try the jaccard index for accuracy evaluation. we can define jaccard as the size of the intersection divided by the size of the union of the two label sets. If the entire set of predicted labels for a sample strictly matches with the true set of labels, then the subset accuracy is 1.0; otherwise it is 0.0.\n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["from sklearn.metrics import jaccard_score\n","jaccard_score(y_test, yhat,pos_label=0)"]},{"cell_type":"markdown","metadata":{},"source":["### confusion matrix\n","Another way of looking at the accuracy of the classifier is to look at __confusion matrix__.\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["from sklearn.metrics import classification_report, confusion_matrix\n","import itertools\n","def plot_confusion_matrix(cm, classes,\n"," normalize=False,\n"," title='Confusion matrix',\n"," cmap=plt.cm.Blues):\n"," \"\"\"\n"," This function prints and plots the confusion matrix.\n"," Normalization can be applied by setting `normalize=True`.\n"," \"\"\"\n"," if normalize:\n"," cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n"," print(\"Normalized confusion matrix\")\n"," else:\n"," print('Confusion matrix, without normalization')\n","\n"," print(cm)\n","\n"," plt.imshow(cm, interpolation='nearest', cmap=cmap)\n"," plt.title(title)\n"," plt.colorbar()\n"," tick_marks = np.arange(len(classes))\n"," plt.xticks(tick_marks, classes, rotation=45)\n"," plt.yticks(tick_marks, classes)\n","\n"," fmt = '.2f' if normalize else 'd'\n"," thresh = cm.max() / 2.\n"," for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n"," plt.text(j, i, format(cm[i, j], fmt),\n"," horizontalalignment=\"center\",\n"," color=\"white\" if cm[i, j] > thresh else \"black\")\n","\n"," plt.tight_layout()\n"," plt.ylabel('True label')\n"," plt.xlabel('Predicted label')\n","print(confusion_matrix(y_test, yhat, labels=[1,0]))"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Compute confusion matrix\n","cnf_matrix = confusion_matrix(y_test, yhat, labels=[1,0])\n","np.set_printoptions(precision=2)\n","\n","\n","# Plot non-normalized confusion matrix\n","plt.figure()\n","plot_confusion_matrix(cnf_matrix, classes=['churn=1','churn=0'],normalize= False, title='Confusion matrix')"]},{"cell_type":"markdown","metadata":{},"source":["Let's look at first row. The first row is for customers whose actual churn value in the test set is 1.\n","As you can calculate, out of 40 customers, the churn value of 15 of them is 1. \n","Out of these 15 cases, the classifier correctly predicted 6 of them as 1, and 9 of them as 0. \n","\n","This means, for 6 customers, the actual churn value was 1 in test set and classifier also correctly predicted those as 1. However, while the actual label of 9 customers was 1, the classifier predicted those as 0, which is not very good. We can consider it as the error of the model for first row.\n","\n","What about the customers with churn value 0? Lets look at the second row.\n","It looks like there were 25 customers whom their churn value were 0. \n","\n","\n","The classifier correctly predicted 24 of them as 0, and one of them wrongly as 1. So, it has done a good job in predicting the customers with churn value 0. A good thing about the confusion matrix is that it shows the model’s ability to correctly predict or separate the classes. 
In a specific case of the binary classifier, such as this example, we can interpret these numbers as the count of true positives, false positives, true negatives, and false negatives. \n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["print (classification_report(y_test, yhat))\n"]},{"cell_type":"markdown","metadata":{},"source":["Based on the count of each section, we can calculate precision and recall of each label:\n","\n","\n","- __Precision__ is a measure of the accuracy provided that a class label has been predicted. It is defined by: precision = TP / (TP + FP)\n","\n","- __Recall__ is the true positive rate. It is defined as: Recall =  TP / (TP + FN)\n","\n"," \n","So, we can calculate the precision and recall of each class.\n","\n","__F1 score:__\n","Now we are in the position to calculate the F1 scores for each label based on the precision and recall of that label. \n","\n","The F1 score is the harmonic average of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0. It is a good way to show that a classifer has a good value for both recall and precision.\n","\n","\n","Finally, we can tell the average accuracy for this classifier is the average of the F1-score for both labels, which is 0.72 in our case.\n"]},{"cell_type":"markdown","metadata":{},"source":["### log loss\n","Now, let's try __log loss__ for evaluation. In logistic regression, the output can be the probability of customer churn is yes (or equals to 1). This probability is a value between 0 and 1.\n","Log loss( Logarithmic loss) measures the performance of a classifier where the predicted output is a probability value between 0 and 1. \n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["from sklearn.metrics import log_loss\n","log_loss(y_test, yhat_prob)"]},{"cell_type":"markdown","metadata":{},"source":["
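To make the precision, recall, and F1 definitions above concrete, here is a small worked sketch using the confusion-matrix counts discussed earlier (6 true positives, 9 false negatives, 1 false positive, 24 true negatives). The support-weighted average lands at roughly the 0.72 reported by `classification_report`:

```python
# Counts taken from the confusion-matrix discussion above (churn=1 is the positive class).
TP, FN, FP, TN = 6, 9, 1, 24

precision_1 = TP / (TP + FP)                                   # 6/7  ~ 0.86
recall_1    = TP / (TP + FN)                                   # 6/15 = 0.40
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)   # ~ 0.55

precision_0 = TN / (TN + FN)                                   # 24/33 ~ 0.73
recall_0    = TN / (TN + FP)                                   # 24/25 = 0.96
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)   # ~ 0.83

# Weighted by support: 15 samples with churn=1, 25 samples with churn=0.
weighted_f1 = (15 * f1_1 + 25 * f1_0) / 40
print(round(f1_1, 2), round(f1_0, 2), round(weighted_f1, 2))   # ~ 0.55 0.83 0.72
```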

Practice

\n","Try to build Logistic Regression model again for the same dataset, but this time, use different __solver__ and __regularization__ values? What is new __logLoss__ value?\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# write your code here\n","\n"]},{"cell_type":"markdown","metadata":{},"source":["
Click here for the solution\n","\n","```python\n","LR2 = LogisticRegression(C=0.01, solver='sag').fit(X_train,y_train)\n","yhat_prob2 = LR2.predict_proba(X_test)\n","print (\"LogLoss: : %.2f\" % log_loss(y_test, yhat_prob2))\n","\n","```\n","\n","
\n","\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["

Want to learn more?

\n","\n","IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n","\n","Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n","\n"]},{"cell_type":"markdown","metadata":{},"source":["### Thank you for completing this lab!\n","\n","\n","## Author\n","\n","Saeed Aghabozorgi\n","\n","\n","### Other Contributors\n","\n","Joseph Santarcangelo\n","\n","\n","\n","\n","## Change Log\n","\n","\n","| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n","|---|---|---|---|\n","| 2021-01-21 | 2.2 | Lakshmi | Updated sklearn library|\n","| 2020-11-03 | 2.1 | Lakshmi | Updated URL of csv |\n","| 2020-08-27 | 2.0 | Lavanya | Moved lab to course repo in GitLab |\n","| | | | |\n","| | | | |\n","\n","\n","##

© IBM Corporation 2020. All rights reserved.

\n"]}],"metadata":{"kernelspec":{"display_name":"Python","language":"python","name":"conda-env-python-py"},"language_info":{"name":""}},"nbformat":4,"nbformat_minor":2} -------------------------------------------------------------------------------- /Linear Classification/Multi-class_Classification.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","id":"cb2edafe-1800-4ce1-a138-8170af21ab4e","metadata":{},"outputs":[],"source":["

\n"," \n"," \"Skills\n"," \n","

\n"]},{"cell_type":"markdown","id":"9b0db70b-0a27-4507-86e2-131e13687d42","metadata":{},"outputs":[],"source":["# **Softmax Regression, One-vs-All and One-vs-One for Multi-class Classification**\n"]},{"cell_type":"markdown","id":"1393bd45-c2ad-4976-9e42-eaf062b58b7e","metadata":{},"outputs":[],"source":["Estimated time needed: **1** hour\n"]},{"cell_type":"markdown","id":"0cbdd1f3-4fea-4cd3-96d7-8c3ea82d471e","metadata":{},"outputs":[],"source":[" In this lab, we will study how to convert a linear classifier into a multi-class classifier, including multinomial logistic regression or softmax regression, One vs. All (One-vs-Rest) and One vs. One.\n"]},{"cell_type":"markdown","id":"3ba954cc-9490-4d1b-a2cf-d61c3319a0fc","metadata":{},"outputs":[],"source":["## **Objectives**\n"]},{"cell_type":"markdown","id":"e52fc731-4309-4c61-a0ee-5d697af79f82","metadata":{},"outputs":[],"source":["After completing this lab you will be able to:\n"]},{"cell_type":"markdown","id":"a8fb850e-44db-43fb-9357-d24f0fc5cd88","metadata":{},"outputs":[],"source":["* Understand and apply some theory behind:\n"," * Softmax regression\n"," * One vs. All (One-vs-Rest)\n"," * One vs. One\n"]},{"cell_type":"markdown","id":"04e4b39b-d7c8-49c9-9c19-59107ed24031","metadata":{},"outputs":[],"source":["## **Introduction**\n"]},{"cell_type":"markdown","id":"5bcd84c2-f024-4e41-b061-630d2141ee92","metadata":{},"outputs":[],"source":["In Multi-class classification, we classify data into multiple class labels. Unlike classification trees and k-nearest neighbor, the concept of multi-class classification for linear classifiers is not as straightforward. We can convert logistic regression to multi-class classification using multinomial logistic regression or softmax regression; this is a generalization of logistic regression, this will not work for support vector machines. One vs. All (One-vs-Rest) and One vs. One are two other multi-class classification techniques can convert any two-class classifier into a multi-class classifier.\n"]},{"cell_type":"markdown","id":"18f1c8ac-99fe-4ceb-87f4-cec2e1dd962d","metadata":{},"outputs":[],"source":["***\n"]},{"cell_type":"markdown","id":"05b8410e-d3b8-43d1-9be8-d8777ecb1728","metadata":{},"outputs":[],"source":["## **Install and Import the required libraries**\n"]},{"cell_type":"markdown","id":"1db192c4-1beb-4afb-a0dd-1beea9625530","metadata":{},"outputs":[],"source":["For this lab, we are going to be using several Python libraries such as scit-learn, numpy, and matplotlib for visualizations. Some of these libraries might be installed in your lab environment, and others may need to be installed by you by removing the hash signs. The cells below will install these libraries when executed.\n"]},{"cell_type":"code","id":"33b69894-9cb0-4f38-a1d5-5338c3c76534","metadata":{},"outputs":[],"source":["!pip install scikit-learn==1.0.2"]},{"cell_type":"code","id":"0528a1c7-d667-4b9e-b788-363c87838cd2","metadata":{},"outputs":[],"source":["import numpy as np\nimport matplotlib.pyplot as plt\nfrom sklearn import datasets\nfrom sklearn.svm import SVC\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.metrics import accuracy_score\nimport pandas as pd"]},{"cell_type":"markdown","id":"01b9b16d-025b-48a2-9115-730e440c63f0","metadata":{},"outputs":[],"source":["## Utility Function\n"]},{"cell_type":"markdown","id":"37bb99bd-d95e-4319-8646-6040d97dabe3","metadata":{},"outputs":[],"source":["This function plots a different decision boundary. 
\n"]},{"cell_type":"code","id":"69b6101a-f1c3-4f16-8799-d6231edd9380","metadata":{},"outputs":[],"source":["plot_colors = \"ryb\"\nplot_step = 0.02\n\ndef decision_boundary (X,y,model,iris, two=None):\n x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1\n y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1\n xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),\n np.arange(y_min, y_max, plot_step))\n plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.5)\n \n Z = model.predict(np.c_[xx.ravel(), yy.ravel()])\n Z = Z.reshape(xx.shape)\n cs = plt.contourf(xx, yy, Z,cmap=plt.cm.RdYlBu)\n \n if two:\n cs = plt.contourf(xx, yy, Z,cmap=plt.cm.RdYlBu)\n for i, color in zip(np.unique(y), plot_colors):\n \n idx = np.where( y== i)\n plt.scatter(X[idx, 0], X[idx, 1], label=y,cmap=plt.cm.RdYlBu, s=15)\n plt.show()\n \n else:\n set_={0,1,2}\n print(set_)\n for i, color in zip(range(3), plot_colors):\n idx = np.where( y== i)\n if np.any(idx):\n\n set_.remove(i)\n\n plt.scatter(X[idx, 0], X[idx, 1], label=y,cmap=plt.cm.RdYlBu, edgecolor='black', s=15)\n\n\n for i in set_:\n idx = np.where( iris.target== i)\n plt.scatter(X[idx, 0], X[idx, 1], marker='x',color='black')\n\n plt.show()\n"]},{"cell_type":"markdown","id":"f496831e-1d7a-445d-9c59-8496a9ac139d","metadata":{},"outputs":[],"source":["This function will plot the probability of belonging to each class; each column is the probability of belonging to a class and the row number is the sample number.\n"]},{"cell_type":"code","id":"df66c234-bfcb-4278-98fc-64e3d6ced1a2","metadata":{},"outputs":[],"source":["def plot_probability_array(X,probability_array):\n\n plot_array=np.zeros((X.shape[0],30))\n col_start=0\n ones=np.ones((X.shape[0],30))\n for class_,col_end in enumerate([10,20,30]):\n plot_array[:,col_start:col_end]= np.repeat(probability_array[:,class_].reshape(-1,1), 10,axis=1)\n col_start=col_end\n plt.imshow(plot_array)\n plt.xticks([])\n plt.ylabel(\"samples\")\n plt.xlabel(\"probability of 3 classes\")\n plt.colorbar()\n plt.show()"]},{"cell_type":"markdown","id":"55be7a5a-f4fc-4ab5-9726-5283b043be5a","metadata":{},"outputs":[],"source":["In ths lab we will use the iris dataset, it consists of three different types of irises’ (Setosa y=0, Versicolour y=1, and Virginica y=2), petal and sepal length, stored in a 150x4 numpy.ndarray.\n","\n","The rows being the samples and the columns: Sepal Length, Sepal Width, Petal Length and Petal Width.\n","\n","The following plot uses the second two features:\n"]},{"cell_type":"code","id":"ca1ca1b1-5802-41b3-8f63-379cfb6416b2","metadata":{},"outputs":[],"source":["pair=[1, 3]\niris = datasets.load_iris()\nX = iris.data[:, pair]\ny = iris.target\nnp.unique(y)"]},{"cell_type":"code","id":"86267a91-476e-4a89-9a10-2c41ecff4bb3","metadata":{},"outputs":[],"source":["plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdYlBu)\nplt.xlabel(\"sepal width (cm)\")\nplt.ylabel(\"petal width\")"]},{"cell_type":"markdown","id":"8b513e4d-eeb1-41fa-8d36-3cbc96d2ef83","metadata":{},"outputs":[],"source":["## **Softmax Regression** \n"]},{"cell_type":"markdown","id":"e3786b51-8e7d-4470-8940-b479841ca406","metadata":{},"outputs":[],"source":["SoftMax regression is similar to logistic regression, and the softmax function converts the actual distances, that is, dot products of $x$ with each of the parameters $\\theta_i$ for the $K$ classes. 
This is converted to probabilities using the following: \n"]},{"cell_type":"markdown","id":"891af090-f534-4a92-89e3-984681f97cad","metadata":{},"outputs":[],"source":["$softmax(x,i) = \\frac{e^{ \\theta_i^T \\bf x}}{\\sum_{j=1}^K e^{\\theta_j^T x}} $\n"]},{"cell_type":"markdown","id":"58a23616-8fd2-4d71-aa66-273b68df836d","metadata":{},"outputs":[],"source":["The training procedure is almost identical to logistic regression. Consider the three-class example where $y \\in \\{0,1,2\\}$ we would like to classify $x_1$. We can use the softmax function to generate a probability of how likely the sample belongs to each class:\n"]},{"cell_type":"markdown","id":"4d6de2d0-94db-4380-8ddc-41c6f414462c","metadata":{},"outputs":[],"source":["$[softmax(x_1,0),softmax(x_1,1),softmax(x_1,2)]=[0.97,0.2,0.1]$\n"]},{"cell_type":"markdown","id":"3925482d-fb4f-452c-9f11-3ecbc3d1cca1","metadata":{},"outputs":[],"source":["The index of each probability is the same as the class. We can make a prediction using the argmax function:\n"]},{"cell_type":"markdown","id":"feecf03d-b218-4820-bf2f-4b320aa337f6","metadata":{},"outputs":[],"source":["$\\hat{y}=argmax_i \\{softmax(x,i)\\}$\n"]},{"cell_type":"markdown","id":"fecca6b5-777e-47a5-9977-bd7ca18c44b4","metadata":{},"outputs":[],"source":["For the previous example, we can make a prediction as follows:\n"]},{"cell_type":"markdown","id":"6efe2bf2-70e8-49b0-8695-ddc022ffb470","metadata":{},"outputs":[],"source":["$\\hat{y}=argmax_i \\{[0.97,0.2,0.1]\\}=0$\n"]},{"cell_type":"markdown","id":"08b3cc0c-4f34-477d-85fd-1e3b7db7505f","metadata":{},"outputs":[],"source":["The sklearn does this automatically, but we can verify the prediction step, as we fit the model:\n"]},{"cell_type":"code","id":"19101303-1bea-45fe-9224-c049e8a113ce","metadata":{},"outputs":[],"source":["lr = LogisticRegression(random_state=0).fit(X, y)"]},{"cell_type":"markdown","id":"47a6ccce-9838-420f-a011-eaeb301a3570","metadata":{},"outputs":[],"source":["We generate the probability using the method predict_proba:\n"]},{"cell_type":"code","id":"bf8ad304-cf33-4f27-9e1e-e19f0b89a96d","metadata":{},"outputs":[],"source":["probability=lr.predict_proba(X)\n"]},{"cell_type":"markdown","id":"d9f35993-6c5c-4fb0-a7ea-175491f322da","metadata":{},"outputs":[],"source":["We can plot the probability of belonging to each class; each column is the probability of belonging to a class and the row number is the sample number.\n"]},{"cell_type":"code","id":"a7abbd99-2a09-46b6-aa87-a85bb0d866f0","metadata":{},"outputs":[],"source":["plot_probability_array(X,probability)"]},{"cell_type":"markdown","id":"33b68549-ba00-42a4-96f5-e97a320a056c","metadata":{},"outputs":[],"source":["Here, is the output for the first sample:\n"]},{"cell_type":"code","id":"991fde21-21f2-41e7-8ace-b7fa20d48953","metadata":{},"outputs":[],"source":["probability[0,:]"]},{"cell_type":"markdown","id":"f0a28809-7c66-403c-a06f-b4947963377c","metadata":{},"outputs":[],"source":["We see it sums to one.\n"]},{"cell_type":"code","id":"d7173763-e7c5-4efd-958d-5487d3402b86","metadata":{},"outputs":[],"source":["probability[0,:].sum()"]},{"cell_type":"markdown","id":"a0f04132-4283-403e-a8df-e9a7a424c63b","metadata":{},"outputs":[],"source":["We can apply the $argmax$ function.\n"]},{"cell_type":"code","id":"c86421b0-ff03-4c73-b46b-0239f9266e5d","metadata":{},"outputs":[],"source":["np.argmax(probability[0,:])"]},{"cell_type":"markdown","id":"160383cf-06b6-429e-9f1b-a29c942b3c10","metadata":{},"outputs":[],"source":["We can apply the $argmax$ function to each 
sample.\n"]},{"cell_type":"code","id":"24a19060-6adf-41f4-8ca8-d38bfbf2aea8","metadata":{},"outputs":[],"source":["softmax_prediction=np.argmax(probability,axis=1)\nsoftmax_prediction"]},{"cell_type":"markdown","id":"d1c34145-2524-466b-92a5-671a662b39d7","metadata":{},"outputs":[],"source":["We can verify that sklearn does this under the hood by comparing it to the output of the method predict .\n"]},{"cell_type":"code","id":"942b4561-f700-4fe7-b4a4-8d854c85c86a","metadata":{},"outputs":[],"source":["yhat =lr.predict(X)\naccuracy_score(yhat,softmax_prediction)"]},{"cell_type":"markdown","id":"ad19b50f-ac08-49c9-a1e6-a16fcc68f79e","metadata":{},"outputs":[],"source":["We can't use Softmax regression for SVMs, Let's explore two methods of Multi-class Classification that we can apply to SVM.\n"]},{"cell_type":"markdown","id":"c06da5ae-ac5c-4c85-89ab-24a536d84145","metadata":{},"outputs":[],"source":["## SVM \n"]},{"cell_type":"markdown","id":"d5186b95-8255-4483-a268-f913a80955d5","metadata":{},"outputs":[],"source":["Sklean performs Multi-class Classification automatically, we can apply the method and calculate the accuracy. Train a SVM classifier with the `kernel` set to `linear`, `gamma` set to `0.5`, and the `probability` paramter set to `True`, then train the model using the `X` and `y` data.\n"]},{"cell_type":"code","id":"9b07a71c-8338-48f6-bf30-3fe174466ac1","metadata":{},"outputs":[],"source":["model = #ADD CODE\n\n#ADD CODE\n"]},{"cell_type":"markdown","id":"cfed6275-048f-4ebd-9ceb-38ef5a8318ac","metadata":{},"outputs":[],"source":["
Click here for the solution\n","\n","```python \n","model = SVC(kernel='linear', gamma=.5, probability=True)\n","\n","model.fit(X,y)\n","\n","```\n","\n","
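A brief aside on the `probability=True` argument in the solution above: it is what makes `predict_proba` available on the fitted SVC, which the One-vs-All implementation further below relies on. (scikit-learn estimates these probabilities with an extra calibration step, so they can be slightly inconsistent with `predict`.) A minimal check, assuming the fitted `model`:

```python
print(model.classes_)              # column order used by predict_proba
print(model.predict_proba(X)[:5])  # one row per sample, one probability per class
```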
\n"]},{"cell_type":"markdown","id":"9d487972-74eb-4f68-a073-c0e366ef3bd5","metadata":{},"outputs":[],"source":["Find the `accuracy_score` on the training data.\n"]},{"cell_type":"code","id":"c8bdfde9-e2a5-4fb0-add3-b034bf3970c3","metadata":{},"outputs":[],"source":[""]},{"cell_type":"markdown","id":"9908430b-a02e-418f-bb37-8e6ecb2472bd","metadata":{},"outputs":[],"source":["
Click here for the solution\n","\n","```python \n","yhat = model.predict(X)\n","\n","accuracy_score(y,yhat)\n","\n","```\n","\n","
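The rest of the lab builds One-vs-All and One-vs-One by hand to show the mechanics. For reference only, scikit-learn also provides ready-made wrappers that do the same splitting automatically; a minimal sketch using the same data and SVC settings as above might look like this:

```python
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

ovr = OneVsRestClassifier(SVC(kernel='linear', gamma=.5)).fit(X, y)
ovo = OneVsOneClassifier(SVC(kernel='linear', gamma=.5)).fit(X, y)

print("One-vs-Rest accuracy:", accuracy_score(y, ovr.predict(X)))
print("One-vs-One accuracy: ", accuracy_score(y, ovo.predict(X)))
```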
\n"]},{"cell_type":"markdown","id":"fca47c11-e1bf-4891-848e-7d38d7bee336","metadata":{},"outputs":[],"source":["We can plot the decision_boundary.\n"]},{"cell_type":"code","id":"5e4d91e3-34d7-4bcc-a6ba-82bf367b0d03","metadata":{},"outputs":[],"source":["decision_boundary (X,y,model,iris)"]},{"cell_type":"markdown","id":"d13613cc-86e3-4bf0-ad48-6c3b160f121f","metadata":{},"outputs":[],"source":["Let's implement One vs. All and One vs One:\n"]},{"cell_type":"markdown","id":"056062e6-a925-40a9-b217-f5ab9e1b8351","metadata":{},"outputs":[],"source":["## One vs. All (One-vs-Rest) \n"]},{"cell_type":"markdown","id":"d0a08806-3b42-45b3-b3d6-c806b07748e0","metadata":{},"outputs":[],"source":["For one-vs-all classification, if we have K classes, we use K two-class classifier models. The number of class labels present in the dataset is equal to the number of generated classifiers. First, we create an artificial class we will call this \"dummy\" class. For each classifier, we split the data into two classes. We take the class samples we would like to classify, the rest of the samples will be labelled as a dummy class. We repeat the process for each class. To make a classification, we use the classifier with the highest probability, disregarding the dummy class.\n"]},{"cell_type":"markdown","id":"a13c8030-ce49-428b-92d0-b5e368a72213","metadata":{},"outputs":[],"source":["### Train Each Classifier\n"]},{"cell_type":"markdown","id":"5b0720d0-4547-4580-abfb-b216facf682e","metadata":{},"outputs":[],"source":["Here, we train three classifiers and place them in the list my_models. For each class we take the class samples we would like to classify, and the rest will be labelled as a dummy class. We repeat the process for each class. For each classifier, we plot the decision regions. The class we are interested in is in red, and the dummy class is in blue. Similarly, the class samples are marked in blue, and the dummy samples are marked with a black x. 
\n"]},{"cell_type":"code","id":"9a0375f6-c82e-4386-8344-62a0487b85c1","metadata":{},"outputs":[],"source":["#dummy class\ndummy_class=y.max()+1\n#list used for classifiers \nmy_models=[]\n#iterate through each class\nfor class_ in np.unique(y):\n #select the index of our class\n select=(y==class_)\n temp_y=np.zeros(y.shape)\n #class, we are trying to classify \n temp_y[y==class_]=class_\n #set other samples to a dummy class \n temp_y[y!=class_]=dummy_class\n #Train model and add to list \n model=SVC(kernel='linear', gamma=.5, probability=True) \n my_models.append(model.fit(X,temp_y))\n #plot decision boundary \n decision_boundary (X,temp_y,model,iris)\n"]},{"cell_type":"markdown","id":"83b546ee-5845-48d9-acaf-be730234feb2","metadata":{},"outputs":[],"source":[" For each sample we calculate the probability of belonging to each class, not including the dummy class.\n"]},{"cell_type":"code","id":"ce2ab43b-94d5-4bcd-901c-158c8fb77c72","metadata":{},"outputs":[],"source":["probability_array=np.zeros((X.shape[0],3))\nfor j,model in enumerate(my_models):\n\n real_class=np.where(np.array(model.classes_)!=3)[0]\n\n probability_array[:,j]=model.predict_proba(X)[:,real_class][:,0]"]},{"cell_type":"markdown","id":"4dfa5118-9641-42aa-bc90-b267ac431ea0","metadata":{},"outputs":[],"source":["Here, is the probability of belonging to each class for the first sample.\n"]},{"cell_type":"code","id":"22fa4115-bc52-4381-990f-5a1e2532648b","metadata":{},"outputs":[],"source":["probability_array[0,:]"]},{"cell_type":"markdown","id":"b25cffc1-1194-4f18-806d-9a13626f9458","metadata":{},"outputs":[],"source":["As each is the probability of belonging to the actual class and not the dummy class, it does not sum to one. \n"]},{"cell_type":"code","id":"e11d8bef-6978-4449-91ed-c4f03e290426","metadata":{},"outputs":[],"source":["probability_array[0,:].sum()"]},{"cell_type":"markdown","id":"473d012e-90b7-4249-8d22-1d61f3f4436e","metadata":{},"outputs":[],"source":["We can plot the probability of belonging to the class. The row number is the sample number.\n"]},{"cell_type":"code","id":"4d553a0a-3e2d-4cdb-bde7-300d5b1ea868","metadata":{},"outputs":[],"source":["plot_probability_array(X,probability_array)"]},{"cell_type":"markdown","id":"861d431b-9a93-4ea6-ad94-6bcbc1458d6c","metadata":{},"outputs":[],"source":["We can apply the $argmax$ function to each sample to find the class.\n"]},{"cell_type":"code","id":"8e80edc1-f0b4-48ef-8789-88b127411ec8","metadata":{},"outputs":[],"source":["one_vs_all=np.argmax(probability_array,axis=1)\none_vs_all"]},{"cell_type":"markdown","id":"e6fb1031-d204-4e70-9054-a93197052e3d","metadata":{},"outputs":[],"source":["We can calculate the accuracy. \n"]},{"cell_type":"code","id":"8eab0fda-1a59-440e-a975-7a003d8bcff6","metadata":{},"outputs":[],"source":["accuracy_score(y,one_vs_all)"]},{"cell_type":"markdown","id":"acf7a88c-49fd-4af3-8ea1-2aef723de06d","metadata":{},"outputs":[],"source":["We see the accuracy is less than the one obtained by sklearn, and this is because for SVM, sklearn uses one vs one; let's verify it by comparing the outputs. 
\n"]},{"cell_type":"code","id":"5ad51df1-19a1-4e68-9c69-1b13acaf19dc","metadata":{},"outputs":[],"source":["accuracy_score(one_vs_all,yhat)"]},{"cell_type":"markdown","id":"befb9f53-4db3-420f-b176-5fd288e58a52","metadata":{},"outputs":[],"source":["We see that the outputs are different, now lets implement one vs one.\n"]},{"cell_type":"markdown","id":"21b7e5a7-ec7a-434d-8620-9805c93d7a62","metadata":{},"outputs":[],"source":["## One vs One \n"]},{"cell_type":"markdown","id":"8e2a630c-c2e7-4ce1-ae57-9b8aa75609e0","metadata":{},"outputs":[],"source":["\n","In One-vs-One classification, we split up the data into each class, and then train a two-class classifier on each pair of classes. For example, if we have class 0,1,2, we would train one classifier on the samples that are class 0 and class 1, a second classifier on samples that are of class 0 and class 2, and a final classifier on samples of class 1 and class 2.\n","\n","For $K$ classes, we have to train $K(K-1)/2$ classifiers. So, if $K=3$, we have $(3x2)/2=3 $classes.\n","\n","To perform classification on a sample, we perform a majority vote and select the class with the most predictions. \n"]},{"cell_type":"markdown","id":"b0f8884c-f863-4e87-a298-d5bf52701c44","metadata":{},"outputs":[],"source":["Here, we list each class.\n"]},{"cell_type":"code","id":"ccabe18c-852a-4f40-9225-2dd448544412","metadata":{},"outputs":[],"source":["classes_=set(np.unique(y))\nclasses_\n "]},{"cell_type":"markdown","id":"dd8422be-6c67-4340-a20f-51d4b0b3219c","metadata":{},"outputs":[],"source":["Determine the number of classifiers:\n"]},{"cell_type":"code","id":"ef15efbd-828a-4329-bc12-c635e121ab03","metadata":{},"outputs":[],"source":["K=len(classes_)\nK*(K-1)/2"]},{"cell_type":"markdown","id":"92195f6b-6e39-4a9a-8628-4759d49ed01d","metadata":{},"outputs":[],"source":["We then train a two-class classifier on each pair of classes. We plot the different training points for each of the two classes. \n"]},{"cell_type":"code","id":"0e8f22bb-10d9-4ebd-a0d5-0bae56aaba4c","metadata":{},"outputs":[],"source":["pairs=[]\nleft_overs=classes_.copy()\n#list used for classifiers \nmy_models=[]\n#iterate through each class\nfor class_ in classes_:\n #remove class we have seen before \n left_overs.remove(class_)\n #the second class in the pair\n for second_class in left_overs:\n pairs.append(str(class_)+' and '+str(second_class))\n print(\"class {} vs class {} \".format(class_,second_class) )\n temp_y=np.zeros(y.shape)\n #find classes in pair \n select=np.logical_or(y==class_ , y==second_class)\n #train model \n model=SVC(kernel='linear', gamma=.5, probability=True) \n model.fit(X[select,:],y[select])\n my_models.append(model)\n #Plot decision boundary for each pair and corresponding Training samples. \n decision_boundary (X[select,:],y[select],model,iris,two=True)\n \n \n "]},{"cell_type":"code","id":"18036ff0-7562-4032-ac5d-a581ab5d4dbb","metadata":{},"outputs":[],"source":["pairs"]},{"cell_type":"markdown","id":"73b143c0-9849-4831-97ed-22b1028364c1","metadata":{},"outputs":[],"source":["As we can see, our data is left-skewed, containing more \"5\" star reviews. 
\n"]},{"cell_type":"markdown","id":"d55b3297-6531-463e-a1e7-bf3261d89c5f","metadata":{},"outputs":[],"source":["Here, we are plotting the distribution of text length.\n"]},{"cell_type":"code","id":"05e48538-8893-449c-bd13-d7e93b240fc8","metadata":{},"outputs":[],"source":["pairs\nmajority_vote_array=np.zeros((X.shape[0],3))\nmajority_vote_dict={}\nfor j,(model,pair) in enumerate(zip(my_models,pairs)):\n\n majority_vote_dict[pair]=model.predict(X)\n majority_vote_array[:,j]=model.predict(X)"]},{"cell_type":"markdown","id":"8af3ba6b-51d1-46a7-bd69-be4fdc75c4ac","metadata":{},"outputs":[],"source":["In the following table, each column is the output of a classifier for each pair of classes and the output is the prediction:\n"]},{"cell_type":"code","id":"ec18cdb3-e1e7-4723-bf31-059a253ab8d3","metadata":{},"outputs":[],"source":["pd.DataFrame(majority_vote_dict).head(10)"]},{"cell_type":"markdown","id":"687866a0-2cee-4962-8b1b-73d89bed5ce4","metadata":{},"outputs":[],"source":["To perform classification on a sample, we perform a majority vote, that is, select the class with the most predictions. We repeat the process for each sample. \n"]},{"cell_type":"code","id":"2f72a96f-f6ab-4949-be60-58e73c9ede2e","metadata":{},"outputs":[],"source":["one_vs_one=np.array([np.bincount(sample.astype(int)).argmax() for sample in majority_vote_array]) \none_vs_one\n "]},{"cell_type":"markdown","id":"7a291230-5b48-45a7-8533-3dacac1bfbfa","metadata":{},"outputs":[],"source":["We calculate the accuracy:\n"]},{"cell_type":"code","id":"84842c9b-7e71-4e87-9572-a51dd85ae1e0","metadata":{},"outputs":[],"source":["accuracy_score(y,one_vs_one)"]},{"cell_type":"markdown","id":"dbba5d0d-673c-4004-89a7-7ea8cd962f11","metadata":{},"outputs":[],"source":["If we compare it to `sklearn`, it's the same! \n"]},{"cell_type":"code","id":"28cd31ae-09c0-4dc3-b81c-d4e00d2373ca","metadata":{},"outputs":[],"source":["accuracy_score(yhat,one_vs_one)"]},{"cell_type":"markdown","id":"7ae52c2b-a857-4273-83cb-8dee9c6363df","metadata":{},"outputs":[],"source":["***\n"]},{"cell_type":"markdown","id":"ea40dca5-a199-4ed2-bc3c-fd1ddd6a9146","metadata":{},"outputs":[],"source":["## Author\n"]},{"cell_type":"markdown","id":"1ce4c652-5abc-4da9-b6a4-c70e10e8f065","metadata":{},"outputs":[],"source":["Joseph Santarcangelo\n"]},{"cell_type":"markdown","id":"be47f36a-b882-4d25-9e42-fe65c6d7b73c","metadata":{},"outputs":[],"source":["### Other Contributors\n"]},{"cell_type":"markdown","id":"dab0bf01-5a2f-47c5-b220-e13b2afb3ef1","metadata":{},"outputs":[],"source":["Azim Hirjani\n"]},{"cell_type":"markdown","id":"32cae4b7-3f0c-4435-8896-912ebba2b2c8","metadata":{},"outputs":[],"source":["## Change Log\n"]},{"cell_type":"markdown","id":"adeb4f2a-4805-421d-a828-8bf9cbada588","metadata":{},"outputs":[],"source":["| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n","| ----------------- | ------- | ---------- | ----------------------- |\n","| 2020-07-20 | 0.2 | Azim | Modified Multiple Areas |\n","| 2020-07-17 | 0.1 | Azim | Created Lab Template |\n","| 2022-08-31 | 0.3 | Steve Hord | QA pass edits |\n"]},{"cell_type":"markdown","id":"104a4321-2567-4b90-814a-680ef86945d6","metadata":{},"outputs":[],"source":["Copyright © 2020 IBM Corporation. 
All rights reserved.\n"]}],"metadata":{"kernelspec":{"display_name":"Python","language":"python","name":"conda-env-python-py"},"language_info":{"name":"python","version":"3.7.12","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"prev_pub_hash":"3f4da738eb2cfeb8e584b87ded5f63ffa2837c92f82feee6cafa83bbe261b045"},"nbformat":4,"nbformat_minor":4} -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # IBM Machine Learning with Python - Jupyter Notebook Labs 2 | This repository contains Jupyter Notebook labs for the "Machine Learning with Python" course by IBM on Coursera. Each notebook corresponds to a specific module in the course and provides hands-on exercises to reinforce the theoretical concepts covered in the lectures. 3 | 4 | ## How to Use 5 | 6 | ### Local Setup 7 | 8 | 1. Clone the repository: 9 | ```bash 10 | git clone https://github.com/Oussama1403/Machine-Learning-with-Python 11 | ``` 12 | 13 | 2. Navigate to the repository directory: 14 | ```bash 15 | cd Machine-Learning-with-Python 16 | ``` 17 | 18 | 3. Launch Jupyter Notebook: 19 | ```bash 20 | jupyter notebook 21 | ``` 22 | 23 | 4. Open the desired notebook to explore and run the code cells. 24 | 25 | #### Requirements 26 | 27 | - Python 3.x 28 | - Jupyter Notebook 29 | - Required Python libraries. 30 | 31 | ### Google Colab 32 | 33 | 1. Open Google Colab in your browser: [Google Colab](https://colab.research.google.com/). 34 | 35 | 2. Select "GitHub" from the pop-up menu or click on the "GitHub" tab. 36 | 37 | 3. Enter the repository URL: 38 | ``` 39 | https://github.com/Oussama1403/Machine-Learning-with-Python 40 | ``` 41 | 42 | 4. Choose the notebook you want to open from the list that appears. 43 | 44 | 5. Click on the notebook file to open it in Colab, where you can run and modify the code cells. 45 | 46 | 47 | ## License 48 | 49 | This project is licensed under the MIT License. 50 | -------------------------------------------------------------------------------- /Regression/Mulitple-Linear-Regression-Co2.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","id":"ce0a4e0b-8c73-473b-866a-37179e596ce6","metadata":{},"outputs":[],"source":["

\n"," \n"," \"Skills\n"," \n","

\n","\n","\n","# Multiple Linear Regression\n","\n","\n","Estimated time needed: **15** minutes\n"," \n","\n","## Objectives\n","\n","After completing this lab you will be able to:\n","\n","* Use scikit-learn to implement Multiple Linear Regression\n","* Create a model, train it, test it and use the model\n"]},{"cell_type":"markdown","id":"812b810d-0ec5-48bb-86ea-1fe50311bd6a","metadata":{},"outputs":[],"source":["

Table of contents

\n","\n","\n","
\n","
\n"]},{"cell_type":"markdown","id":"5e2c4f6d-d83c-4c12-8ba2-40bfde44874e","metadata":{},"outputs":[],"source":["### Importing Needed packages\n"]},{"cell_type":"code","id":"3ffa2549-728d-459b-b3ea-2b090fbc2fbe","metadata":{},"outputs":[],"source":["import matplotlib.pyplot as plt\nimport pandas as pd\nimport pylab as pl\nimport numpy as np\n%matplotlib inline"]},{"cell_type":"markdown","id":"8909deaf-634e-4624-a144-b8e3b22555b0","metadata":{},"outputs":[],"source":["### Downloading Data\n","To download the data, we will use !wget to download it from IBM Object Storage.\n"]},{"cell_type":"code","id":"f9b8a948-6e73-4847-a5cb-84ae159704ae","metadata":{},"outputs":[],"source":["!wget -O FuelConsumption.csv https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%202/data/FuelConsumptionCo2.csv"]},{"cell_type":"markdown","id":"94435d94-ad54-45f9-9239-dd5d6105a8a3","metadata":{},"outputs":[],"source":["__Did you know?__ When it comes to Machine Learning, you will likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](http://cocl.us/ML0101EN-IBM-Offer-CC)\n"]},{"cell_type":"markdown","id":"8ebe5a66-1b47-4baf-8105-2cf02ff08b01","metadata":{},"outputs":[],"source":["\n","

Understanding the Data

\n","\n","### `FuelConsumption.csv`:\n","We have downloaded a fuel consumption dataset, **`FuelConsumption.csv`**, which contains model-specific fuel consumption ratings and estimated carbon dioxide emissions for new light-duty vehicles for retail sale in Canada. [Dataset source](http://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64)\n","\n","- **MODELYEAR** e.g. 2014\n","- **MAKE** e.g. Acura\n","- **MODEL** e.g. ILX\n","- **VEHICLE CLASS** e.g. SUV\n","- **ENGINE SIZE** e.g. 4.7\n","- **CYLINDERS** e.g 6\n","- **TRANSMISSION** e.g. A6\n","- **FUELTYPE** e.g. z\n","- **FUEL CONSUMPTION in CITY(L/100 km)** e.g. 9.9\n","- **FUEL CONSUMPTION in HWY (L/100 km)** e.g. 8.9\n","- **FUEL CONSUMPTION COMB (L/100 km)** e.g. 9.2\n","- **CO2 EMISSIONS (g/km)** e.g. 182 --> low --> 0\n"]},{"cell_type":"markdown","id":"c32ad4a3-9a70-420e-a1e8-9bdbf06804f8","metadata":{},"outputs":[],"source":["

Reading the data in

\n"]},{"cell_type":"code","id":"09827aa2-e24a-4a19-932e-7b6471554163","metadata":{},"outputs":[],"source":["df = pd.read_csv(\"FuelConsumption.csv\")\n\n# take a look at the dataset\ndf.head()"]},{"cell_type":"markdown","id":"9fd1aef5-9336-4232-acab-9e8de081aa36","metadata":{},"outputs":[],"source":["Let's select some features that we want to use for regression.\n"]},{"cell_type":"code","id":"4e40a007-f8c7-4dd6-8afa-8ed085a925a9","metadata":{},"outputs":[],"source":["cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_CITY','FUELCONSUMPTION_HWY','FUELCONSUMPTION_COMB','CO2EMISSIONS']]\ncdf.head(9)"]},{"cell_type":"markdown","id":"b96e8ecc-c47e-4f5e-9827-4147a6b26276","metadata":{},"outputs":[],"source":["Let's plot Emission values with respect to Engine size:\n"]},{"cell_type":"code","id":"b2f0037d-832d-42a3-8e4f-b0aac322c998","metadata":{},"outputs":[],"source":["plt.scatter(cdf.ENGINESIZE, cdf.CO2EMISSIONS, color='blue')\nplt.xlabel(\"Engine size\")\nplt.ylabel(\"Emission\")\nplt.show()"]},{"cell_type":"markdown","id":"4366f395-57ab-4f70-8082-a6127cc3769c","metadata":{},"outputs":[],"source":["#### Creating train and test dataset\n","Train/Test Split involves splitting the dataset into training and testing sets respectively, which are mutually exclusive. After which, you train with the training set and test with the testing set. \n","This will provide a more accurate evaluation on out-of-sample accuracy because the testing dataset is not part of the dataset that have been used to train the model. Therefore, it gives us a better understanding of how well our model generalizes on new data.\n","\n","We know the outcome of each data point in the testing dataset, making it great to test with! Since this data has not been used to train the model, the model has no knowledge of the outcome of these data points. So, in essence, it is truly an out-of-sample testing.\n","\n","Let's split our dataset into train and test sets. Around 80% of the entire dataset will be used for training and 20% for testing. We create a mask to select random rows using the __np.random.rand()__ function: \n"]},{"cell_type":"code","id":"9969ebff-379b-4d82-8db8-df209423acb6","metadata":{},"outputs":[],"source":["msk = np.random.rand(len(df)) < 0.8\ntrain = cdf[msk]\ntest = cdf[~msk]"]},{"cell_type":"markdown","id":"33b15783-861f-47ef-805e-d56f711be271","metadata":{},"outputs":[],"source":["#### Train data distribution\n"]},{"cell_type":"code","id":"4b38bf44-1a99-41da-ab4d-b5213d188c2e","metadata":{},"outputs":[],"source":["plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color='blue')\nplt.xlabel(\"Engine size\")\nplt.ylabel(\"Emission\")\nplt.show()"]},{"cell_type":"markdown","id":"ffe480cf-e14a-401b-a169-007ad433a9df","metadata":{},"outputs":[],"source":["

Multiple Regression Model

\n"]},{"cell_type":"markdown","id":"07ce4ec1-5987-4a5b-9349-1d38af2e6ccd","metadata":{},"outputs":[],"source":["In reality, there are multiple variables that impact the co2emission. When more than one independent variable is present, the process is called multiple linear regression. An example of multiple linear regression is predicting co2emission using the features FUELCONSUMPTION_COMB, EngineSize and Cylinders of cars. The good thing here is that multiple linear regression model is the extension of the simple linear regression model.\n"]},{"cell_type":"code","id":"5cf1e3f2-5ce8-486b-8c91-cc6d16ddb7f5","metadata":{},"outputs":[],"source":["from sklearn import linear_model\nregr = linear_model.LinearRegression()\nx = np.asanyarray(train[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']])\ny = np.asanyarray(train[['CO2EMISSIONS']])\nregr.fit (x, y)\n# The coefficients\nprint ('Coefficients: ', regr.coef_)"]},{"cell_type":"markdown","id":"a62c9582-6dbe-4bf7-946f-ae0abaf32e8c","metadata":{},"outputs":[],"source":["As mentioned before, __Coefficient__ and __Intercept__ are the parameters of the fitted line. \n","Given that it is a multiple linear regression model with 3 parameters and that the parameters are the intercept and coefficients of the hyperplane, sklearn can estimate them from our data. Scikit-learn uses plain Ordinary Least Squares method to solve this problem.\n","\n","#### Ordinary Least Squares (OLS)\n","OLS is a method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function of a set of explanatory variables by minimizing the sum of the squares of the differences between the target dependent variable and those predicted by the linear function. In other words, it tries to minimizes the sum of squared errors (SSE) or mean squared error (MSE) between the target variable (y) and our predicted output ($\\hat{y}$) over all samples in the dataset.\n","\n","OLS can find the best parameters using of the following methods:\n","* Solving the model parameters analytically using closed-form equations\n","* Using an optimization algorithm (Gradient Descent, Stochastic Gradient Descent, Newton’s Method, etc.)\n"]},{"cell_type":"markdown","id":"e20aad04-83b2-46fa-8390-d105970fff48","metadata":{},"outputs":[],"source":["

Prediction

\n"]},{"cell_type":"code","id":"c344c753-8d8a-4961-84cd-80d87645cbfd","metadata":{},"outputs":[],"source":["y_hat= regr.predict(test[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']])\nx = np.asanyarray(test[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']])\ny = np.asanyarray(test[['CO2EMISSIONS']])\nprint(\"Mean Squared Error (MSE) : %.2f\"\n % np.mean((y_hat - y) ** 2))\n\n# Explained variance score: 1 is perfect prediction\nprint('Variance score: %.2f' % regr.score(x, y))"]},{"cell_type":"markdown","id":"1272fb5e-6540-4bd1-8cb8-b4e869d0012d","metadata":{},"outputs":[],"source":["__Explained variance regression score:__ \n","Let $\\hat{y}$ be the estimated target output, y the corresponding (correct) target output, and Var be the Variance (the square of the standard deviation). Then the explained variance is estimated as follows:\n","\n","$\\texttt{explainedVariance}(y, \\hat{y}) = 1 - \\frac{Var\\{ y - \\hat{y}\\}}{Var\\{y\\}}$ \n","The best possible score is 1.0, the lower values are worse.\n"]},{"cell_type":"markdown","id":"ca7e52f6-dd0d-4e26-9de3-87b96f3eeecb","metadata":{},"outputs":[],"source":["

Practice

\n","Try to use a multiple linear regression with the same dataset, but this time use FUELCONSUMPTION_CITY and FUELCONSUMPTION_HWY instead of FUELCONSUMPTION_COMB. Does it result in better accuracy?\n"]},{"cell_type":"code","id":"df9bf0b1-4653-46db-867e-837abc3c8c86","metadata":{},"outputs":[],"source":["# write your code here\n\n"]},{"cell_type":"markdown","id":"4663aadc-b327-49fd-b5f0-a7214c50e565","metadata":{},"outputs":[],"source":["
Click here for the solution\n","\n","```python\n","regr = linear_model.LinearRegression()\n","x = np.asanyarray(train[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_CITY','FUELCONSUMPTION_HWY']])\n","y = np.asanyarray(train[['CO2EMISSIONS']])\n","regr.fit(x, y)\n","print('Coefficients: ', regr.coef_)\n","y_ = regr.predict(test[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_CITY','FUELCONSUMPTION_HWY']])\n","x = np.asanyarray(test[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_CITY','FUELCONSUMPTION_HWY']])\n","y = np.asanyarray(test[['CO2EMISSIONS']])\n","print(\"Mean Squared Error (MSE): %.2f\" % np.mean((y_ - y) ** 2))\n","print('Variance score: %.2f' % regr.score(x, y))\n","\n","```\n","\n","
\n"]},{"cell_type":"markdown","id":"abf19628-dd0e-4cd8-9917-8def90613788","metadata":{},"outputs":[],"source":["

Want to learn more?

\n","\n","IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n","\n","Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n","\n"]},{"cell_type":"markdown","id":"17c9e262-f365-4abc-a3f7-0ba0120864ba","metadata":{},"outputs":[],"source":["### Thank you for completing this lab!\n","\n","\n","## Author\n","\n","Saeed Aghabozorgi\n","\n","\n","### Other Contributors\n","\n","Joseph Santarcangelo\n","\n","\n","\n","\n","## Change Log\n","\n","\n","| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n","|---|---|---|---|\n","| 2020-11-03 | 2.1 | Lakshmi | Made changes in URL |\n","| 2020-08-27 | 2.0 | Lavanya | Moved lab to course repo in GitLab |\n","| | | | |\n","| | | | |\n","\n","\n","##

© IBM Corporation 2020. All rights reserved.

\n"]}],"metadata":{"kernelspec":{"display_name":"Python","language":"python","name":"conda-env-python-py"},"language_info":{"name":""}},"nbformat":4,"nbformat_minor":4} -------------------------------------------------------------------------------- /Regression/Simple-Linear-Regression-Co2.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","id":"401a5f59-caa1-4bdf-8351-c691b1afd05c","metadata":{},"outputs":[],"source":["

\n"," \n"," \"Skills\n"," \n","

\n","\n","\n","# Simple Linear Regression\n","\n","\n","Estimated time needed: **15** minutes\n"," \n","\n","## Objectives\n","\n","After completing this lab you will be able to:\n","\n","* Use scikit-learn to implement simple Linear Regression\n","* Create a model, train it, test it and use the model\n"]},{"cell_type":"markdown","id":"54629347-6400-44ba-888d-898edbe39254","metadata":{},"outputs":[],"source":["### Importing Needed packages\n"]},{"cell_type":"code","id":"b2e6aeb7-00d0-4e3b-bdd3-891b23a38f1a","metadata":{},"outputs":[],"source":["import matplotlib.pyplot as plt\nimport pandas as pd\nimport pylab as pl\nimport numpy as np\n%matplotlib inline"]},{"cell_type":"markdown","id":"b3d40467-d77c-4098-ad44-b5e68725e390","metadata":{},"outputs":[],"source":["### Downloading Data\n","To download the data, we will use !wget to download it from IBM Object Storage.\n"]},{"cell_type":"code","id":"56bb472f-a552-4b7f-bcf8-d182f1293357","metadata":{},"outputs":[],"source":["!wget -O FuelConsumption.csv https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%202/data/FuelConsumptionCo2.csv"]},{"cell_type":"markdown","id":"4a1c7158-2e35-4ec7-9dec-d34cefb11cc7","metadata":{},"outputs":[],"source":["In case you're working **locally** uncomment the below line. \n"]},{"cell_type":"code","id":"f6d2a8f8-eefc-4255-a5f5-2fe900ce4c71","metadata":{},"outputs":[],"source":["#!curl https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%202/data/FuelConsumptionCo2.csv -o FuelConsumptionCo2.csv"]},{"cell_type":"markdown","id":"c3dadff0-fce7-4f73-ac86-11817384d260","metadata":{},"outputs":[],"source":["__Did you know?__ When it comes to Machine Learning, you will likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](http://cocl.us/ML0101EN-IBM-Offer-CC)\n"]},{"cell_type":"markdown","id":"77ec68fd-a0c7-4584-bab3-4530ab5c3f5b","metadata":{},"outputs":[],"source":["\n","## Understanding the Data\n","\n","### `FuelConsumption.csv`:\n","We have downloaded a fuel consumption dataset, **`FuelConsumption.csv`**, which contains model-specific fuel consumption ratings and estimated carbon dioxide emissions for new light-duty vehicles for retail sale in Canada. [Dataset source](http://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64)\n","\n","- **MODELYEAR** e.g. 2014\n","- **MAKE** e.g. Acura\n","- **MODEL** e.g. ILX\n","- **VEHICLE CLASS** e.g. SUV\n","- **ENGINE SIZE** e.g. 4.7\n","- **CYLINDERS** e.g 6\n","- **TRANSMISSION** e.g. A6\n","- **FUEL CONSUMPTION in CITY(L/100 km)** e.g. 9.9\n","- **FUEL CONSUMPTION in HWY (L/100 km)** e.g. 8.9\n","- **FUEL CONSUMPTION COMB (L/100 km)** e.g. 9.2\n","- **CO2 EMISSIONS (g/km)** e.g. 
182 --> low --> 0\n"]},{"cell_type":"markdown","id":"52fc502d-f917-45b3-a742-4ec87b356e55","metadata":{},"outputs":[],"source":["## Reading the data in\n"]},{"cell_type":"code","id":"2031dcf0-bb81-4a35-90b2-25d45f877add","metadata":{},"outputs":[],"source":["df = pd.read_csv(\"FuelConsumption.csv\")\n\n# take a look at the dataset\ndf.head()\n\n"]},{"cell_type":"markdown","id":"e3eb3af5-150e-46c9-b8a6-2805c97a6ae2","metadata":{},"outputs":[],"source":["### Data Exploration\n","Let's first have a descriptive exploration on our data.\n"]},{"cell_type":"code","id":"7d378b8a-b95a-46bc-940a-dc11d6382d8c","metadata":{},"outputs":[],"source":["# summarize the data\ndf.describe()"]},{"cell_type":"markdown","id":"4ecc4039-f8e8-4fb0-b889-61a0fafbac20","metadata":{},"outputs":[],"source":["Let's select some features to explore more.\n"]},{"cell_type":"code","id":"9bf19c92-9bb0-46e8-9447-136af26119aa","metadata":{},"outputs":[],"source":["cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]\ncdf.head(9)"]},{"cell_type":"markdown","id":"a3e1bbde-b33f-4938-b1fa-fcbf11bce946","metadata":{},"outputs":[],"source":["We can plot each of these features:\n"]},{"cell_type":"code","id":"036d0901-9696-42cd-9a44-3ee60b7147b1","metadata":{},"outputs":[],"source":["viz = cdf[['CYLINDERS','ENGINESIZE','CO2EMISSIONS','FUELCONSUMPTION_COMB']]\nviz.hist()\nplt.show()"]},{"cell_type":"markdown","id":"cceb0b51-fd90-4c2d-bee5-3465793f0db3","metadata":{},"outputs":[],"source":["Now, let's plot each of these features against the Emission, to see how linear their relationship is:\n"]},{"cell_type":"code","id":"fa706db3-dbb0-40e8-ba75-f738d1d280a8","metadata":{},"outputs":[],"source":["plt.scatter(cdf.FUELCONSUMPTION_COMB, cdf.CO2EMISSIONS, color='blue')\nplt.xlabel(\"FUELCONSUMPTION_COMB\")\nplt.ylabel(\"Emission\")\nplt.show()"]},{"cell_type":"code","id":"a8ac3824-c752-4758-97f6-1f41894c0e58","metadata":{},"outputs":[],"source":["plt.scatter(cdf.ENGINESIZE, cdf.CO2EMISSIONS, color='blue')\nplt.xlabel(\"Engine size\")\nplt.ylabel(\"Emission\")\nplt.show()"]},{"cell_type":"markdown","id":"0ee90b3f-f99c-4059-b04a-16e44087fa24","metadata":{},"outputs":[],"source":["## Practice\n","Plot __CYLINDER__ vs the Emission, to see how linear is their relationship is:\n"]},{"cell_type":"code","id":"55b27ee2-e96b-423f-80b5-5e0b237127f1","metadata":{},"outputs":[],"source":["# write your code here\n\n\n"]},{"cell_type":"markdown","id":"3ddb1d55-9f38-41a6-8cfa-be798db4ae76","metadata":{},"outputs":[],"source":["
Click here for the solution\n","\n","```python \n","plt.scatter(cdf.CYLINDERS, cdf.CO2EMISSIONS, color='blue')\n","plt.xlabel(\"Cylinders\")\n","plt.ylabel(\"Emission\")\n","plt.show()\n","\n","```\n","\n","
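The scatter plots above give a visual sense of how linear each relationship is. As an optional addition (not part of the original lab), a Pearson correlation matrix on the same `cdf` dataframe puts a number on it; values close to 1 indicate a strong linear relationship with CO2EMISSIONS and therefore a promising feature for simple linear regression:

```python
# Pearson correlations between the selected columns, assuming `cdf` is the
# four-column dataframe created above.
print(cdf.corr())

# Correlation of each feature with the target, sorted from strongest to weakest.
print(cdf.corr()['CO2EMISSIONS'].sort_values(ascending=False))
```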
\n"]},{"cell_type":"markdown","id":"58307ce0-a6d5-47e6-8052-591f2096e73f","metadata":{},"outputs":[],"source":["#### Creating train and test dataset\n","Train/Test Split involves splitting the dataset into training and testing sets that are mutually exclusive. After which, you train with the training set and test with the testing set. \n","This will provide a more accurate evaluation on out-of-sample accuracy because the testing dataset is not part of the dataset that have been used to train the model. Therefore, it gives us a better understanding of how well our model generalizes on new data.\n","\n","This means that we know the outcome of each data point in the testing dataset, making it great to test with! Since this data has not been used to train the model, the model has no knowledge of the outcome of these data points. So, in essence, it is truly an out-of-sample testing.\n","\n","Let's split our dataset into train and test sets. 80% of the entire dataset will be used for training and 20% for testing. We create a mask to select random rows using __np.random.rand()__ function: \n"]},{"cell_type":"code","id":"3d13f767-ed9c-43dc-8002-fef65751194f","metadata":{},"outputs":[],"source":["msk = np.random.rand(len(df)) < 0.8\ntrain = cdf[msk]\ntest = cdf[~msk]"]},{"cell_type":"markdown","id":"869153f7-3060-4e8c-bd22-50e5adfe7a8f","metadata":{},"outputs":[],"source":["### Simple Regression Model\n","Linear Regression fits a linear model with coefficients B = (B1, ..., Bn) to minimize the 'residual sum of squares' between the actual value y in the dataset, and the predicted value yhat using linear approximation. \n"]},{"cell_type":"markdown","id":"4057490d-c1b3-412e-bbe0-03c80bda9fc9","metadata":{},"outputs":[],"source":["#### Train data distribution\n"]},{"cell_type":"code","id":"9892f279-00fb-4130-970b-a490ce384e92","metadata":{},"outputs":[],"source":["plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color='blue')\nplt.xlabel(\"Engine size\")\nplt.ylabel(\"Emission\")\nplt.show()"]},{"cell_type":"markdown","id":"0ff2eb9d-7d4d-45fa-9b3b-5c143b5597f5","metadata":{},"outputs":[],"source":["#### Modeling\n","Using sklearn package to model data.\n"]},{"cell_type":"code","id":"9a74e335-9915-4293-befd-190aef9b2bdd","metadata":{},"outputs":[],"source":["from sklearn import linear_model\nregr = linear_model.LinearRegression()\ntrain_x = np.asanyarray(train[['ENGINESIZE']])\ntrain_y = np.asanyarray(train[['CO2EMISSIONS']])\nregr.fit(train_x, train_y)\n# The coefficients\nprint ('Coefficients: ', regr.coef_)\nprint ('Intercept: ',regr.intercept_)"]},{"cell_type":"markdown","id":"9b0824e4-fe06-4da6-9260-aee6d8975152","metadata":{},"outputs":[],"source":["As mentioned before, __Coefficient__ and __Intercept__ in the simple linear regression, are the parameters of the fit line. \n","Given that it is a simple linear regression, with only 2 parameters, and knowing that the parameters are the intercept and slope of the line, sklearn can estimate them directly from our data. 
\n","Notice that all of the data must be available to traverse and calculate the parameters.\n"]},{"cell_type":"markdown","id":"54463569-5b61-4c06-9dbb-7dd627d6cd22","metadata":{},"outputs":[],"source":["#### Plot outputs\n"]},{"cell_type":"markdown","id":"8f657b86-4a28-492d-a7f6-dff03062e393","metadata":{},"outputs":[],"source":["We can plot the fit line over the data:\n"]},{"cell_type":"code","id":"3870b0f2-dd64-4d3d-b263-f7fd00b56458","metadata":{},"outputs":[],"source":["plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color='blue')\nplt.plot(train_x, regr.coef_[0][0]*train_x + regr.intercept_[0], '-r')\nplt.xlabel(\"Engine size\")\nplt.ylabel(\"Emission\")"]},{"cell_type":"markdown","id":"e0ec0b01-efe0-48b8-9d02-7f0595daa6fd","metadata":{},"outputs":[],"source":["#### Evaluation\n","We compare the actual values and predicted values to calculate the accuracy of a regression model. Evaluation metrics provide a key role in the development of a model, as it provides insight to areas that require improvement.\n","\n","There are different model evaluation metrics, lets use MSE here to calculate the accuracy of our model based on the test set: \n","* Mean Absolute Error: It is the mean of the absolute value of the errors. This is the easiest of the metrics to understand since it’s just average error.\n","\n","* Mean Squared Error (MSE): Mean Squared Error (MSE) is the mean of the squared error. It’s more popular than Mean Absolute Error because the focus is geared more towards large errors. This is due to the squared term exponentially increasing larger errors in comparison to smaller ones.\n","\n","* Root Mean Squared Error (RMSE). \n","\n","* R-squared is not an error, but rather a popular metric to measure the performance of your regression model. It represents how close the data points are to the fitted regression line. The higher the R-squared value, the better the model fits your data. The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse).\n"]},{"cell_type":"code","id":"27d900ef-d415-46a7-91b0-f036504a8fe2","metadata":{},"outputs":[],"source":["from sklearn.metrics import r2_score\n\ntest_x = np.asanyarray(test[['ENGINESIZE']])\ntest_y = np.asanyarray(test[['CO2EMISSIONS']])\ntest_y_ = regr.predict(test_x)\n\nprint(\"Mean absolute error: %.2f\" % np.mean(np.absolute(test_y_ - test_y)))\nprint(\"Residual sum of squares (MSE): %.2f\" % np.mean((test_y_ - test_y) ** 2))\nprint(\"R2-score: %.2f\" % r2_score(test_y , test_y_) )"]},{"cell_type":"markdown","id":"bb624cef-35bf-4156-a530-c209ef8d8451","metadata":{},"outputs":[],"source":["## Exercise\n"]},{"cell_type":"markdown","id":"ef19173f-065c-4f5c-88de-af0b738c657f","metadata":{},"outputs":[],"source":["Lets see what the evaluation metrics are if we trained a regression model using the `FUELCONSUMPTION_COMB` feature.\n","\n","Start by selecting `FUELCONSUMPTION_COMB` as the train_x data from the `train` dataframe, then select `FUELCONSUMPTION_COMB` as the test_x data from the `test` dataframe\n"]},{"cell_type":"code","id":"cacbc9b6-22ba-46ed-9be3-111310ca2eab","metadata":{},"outputs":[],"source":["train_x = #ADD CODE\n\ntest_x = #ADD CODE"]},{"cell_type":"markdown","id":"7877d637-35c8-44b4-adc6-a09caa5f9e44","metadata":{},"outputs":[],"source":["
Click here for the solution\n","\n","```python \n","train_x = train[[\"FUELCONSUMPTION_COMB\"]]\n","\n","test_x = test[[\"FUELCONSUMPTION_COMB\"]]\n","\n","```\n","\n","
\n"]},{"cell_type":"markdown","id":"0d18fa3f-74bc-4231-9d56-4a54e5dffc62","metadata":{},"outputs":[],"source":["Now train a Linear Regression Model using the `train_x` you created and the `train_y` created previously\n"]},{"cell_type":"code","id":"990cf0ce-21a2-4355-bf08-9beecb319903","metadata":{},"outputs":[],"source":["regr = linear_model.LinearRegression()\n\n#ADD CODE\n"]},{"cell_type":"markdown","id":"edc5f917-1b5f-4f16-8843-ab34dd19b0aa","metadata":{},"outputs":[],"source":["
Click here for the solution\n","\n","```python \n","regr = linear_model.LinearRegression()\n","\n","regr.fit(train_x, train_y)\n","\n","```\n","\n","
\n"]},{"cell_type":"markdown","id":"656b64a9-01c2-468f-a5a9-79a4acb326fd","metadata":{},"outputs":[],"source":["Find the predictions using the model's `predict` function and the `test_x` data\n"]},{"cell_type":"code","id":"13d5e7ec-2acf-421e-9f1b-69e6238a6882","metadata":{},"outputs":[],"source":["predictions = #ADD CODE"]},{"cell_type":"markdown","id":"cd2cbf2e-9c81-480a-a207-a123979cae6c","metadata":{},"outputs":[],"source":["
Click here for the solution\n","\n","```python \n","predictions = regr.predict(test_x)\n","\n","```\n","\n","
\n"]},{"cell_type":"markdown","id":"6ad5857d-6d28-417d-b947-61a9bc2fb997","metadata":{},"outputs":[],"source":["Finally use the `predictions` and the `test_y` data and find the Mean Absolute Error value using the `np.absolute` and `np.mean` function like done previously\n"]},{"cell_type":"code","id":"1ba79065-be15-4e57-a6f7-c3b0c93641bc","metadata":{},"outputs":[],"source":["#ADD CODE\n"]},{"cell_type":"markdown","id":"b9a6db85-03e1-4b74-a8b2-516e03e347b0","metadata":{},"outputs":[],"source":["
Click here for the solution\n","\n","```python \n","print(\"Mean Absolute Error: %.2f\" % np.mean(np.absolute(predictions - test_y)))\n","\n","```\n","\n","
\n"]},{"cell_type":"markdown","id":"7f940e59-2dd2-4e54-b6c5-0137b9f422d3","metadata":{},"outputs":[],"source":["We can see that the MAE is much worse when we train using `ENGINESIZE` than `FUELCONSUMPTION_COMB`\n"]},{"cell_type":"markdown","id":"fa4d5fe2-f543-47b1-b31f-26ed0ff3ab8b","metadata":{},"outputs":[],"source":["

Want to learn more?

\n","\n","IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n","\n","Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n","\n"]},{"cell_type":"markdown","id":"d10aa32a-e01a-4e0a-84b4-9c4a9903baec","metadata":{},"outputs":[],"source":["### Thank you for completing this lab!\n","\n","\n","## Author\n","\n","Saeed Aghabozorgi\n","\n","\n","### Other Contributors\n","\n","Joseph Santarcangelo\n","\n","Azim Hirjani\n","\n","\n","## Change Log\n","\n","\n","| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n","|---|---|---|---|\n","| 2020-11-03 | 2.1 | Lakshmi Holla | Changed URL of the csv |\n","| 2020-08-27 | 2.0 | Lavanya | Moved lab to course repo in GitLab |\n","| | | | |\n","| | | | |\n","\n","\n","##

© IBM Corporation 2020. All rights reserved.

\n"]}],"metadata":{"kernelspec":{"display_name":"Python","language":"python","name":"conda-env-python-py"},"language_info":{"name":""}},"nbformat":4,"nbformat_minor":4} --------------------------------------------------------------------------------