├── assets
│   ├── stacking.png
│   ├── DoubleStacking.png
│   └── datacamp.svg
├── notebooks
│   ├── python_live_session_template_spark.ipynb
│   ├── Applied_Machine_Learning_Ensemble_Modeling_Learners.ipynb
│   └── Applied_Machine_Learning_Ensemble_Modeling_Solution.ipynb
├── README.md
└── data
    └── pima-indians-diabetes.csv
/assets/stacking.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacamp/Applied-Machine-Learning-Ensemble-Modeling-live-training/master/assets/stacking.png
--------------------------------------------------------------------------------
/assets/DoubleStacking.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacamp/Applied-Machine-Learning-Ensemble-Modeling-live-training/master/assets/DoubleStacking.png
--------------------------------------------------------------------------------
/notebooks/python_live_session_template_spark.ipynb:
--------------------------------------------------------------------------------
1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "colab_type": "text", 7 | "id": "6Ijg5wUCTQYG" 8 | }, 9 | "source": [ 10 | "
<center>\n", 11 | "<img src=\"https://raw.githubusercontent.com/datacamp/Applied-Machine-Learning-Ensemble-Modeling-live-training/master/assets/datacamp.svg\" alt=\"DataCamp icon\" width=\"50%\">\n", 12 | "</center>\n", 13 | "<br>
\n", 14 | "\n", 15 | "## **Python PySpark Live Training Template**\n", 16 | "\n", 17 | "_Enter a brief description of your session; here's an example below:_\n", 18 | "\n", 19 | "Welcome to this hands-on training where you will immerse yourself in data visualization in Python. Using both `matplotlib` and `seaborn`, we'll learn how to create visualizations that are presentation-ready.\n", 20 | "\n", 21 | "The ability to present and discuss your data is a key data science skill. In this session, you will learn how to:\n", 22 | "\n", 23 | "* Create various types of plots, including bar plots, distribution plots, box plots, and more using Seaborn and Matplotlib.\n", 24 | "* Format and stylize your visualizations to make them report-ready.\n", 25 | "* Create sub-plots to create clearer visualizations and supercharge your workflow.\n", 26 | "\n", 27 | "## **The Dataset**\n", 28 | "\n", 29 | "_Enter a brief description of your dataset and its columns; here's an example below:_\n", 30 | "\n", 31 | "\n", 32 | "The dataset to be used in this webinar is a CSV file named `airbnb.csv`, which contains data on Airbnb listings in the state of New York. It contains the following columns:\n", 33 | "\n", 34 | "- `listing_id`: The unique identifier for a listing\n", 35 | "- `description`: The description used on the listing\n", 36 | "- `host_id`: Unique identifier for a host\n", 37 | "- `host_name`: Name of host\n", 38 | "- `neighbourhood_full`: Name of boroughs and neighbourhoods\n", 39 | "- `coordinates`: Coordinates of listing _(latitude, longitude)_\n", 40 | "- `Listing added`: Date the listing was added\n", 41 | "- `room_type`: Type of room \n", 42 | "- `rating`: Rating from 0 to 5\n", 43 | "- `price`: Price per night for listing\n", 44 | "- `number_of_reviews`: Number of reviews received \n", 45 | "- `last_review`: Date of last review\n", 46 | "- `reviews_per_month`: Number of reviews per month\n", 47 | "- `availability_365`: Number of days available per year\n", 48 | "- `Number of stays`: Total number of stays thus far\n" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "## **Setting up a PySpark session**\n", 56 | "\n", 57 | "This set of code cells enables a PySpark session in Google Colab; make sure to run them before continuing."
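,
"\n",
"Once the setup cells below have run, you can sanity-check the session with a short snippet like this (an illustrative addition, not part of the original template):\n",
"\n",
"```python\n",
"# Assumes `spark` has been created by the setup cells below\n",
"df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'label'])\n",
"df.show()\n",
"```"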
58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "# Just run this code\n", 67 | "!apt-get install openjdk-8-jdk-headless -qq > /dev/null\n", 68 | "!wget -q https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz\n", 69 | "!tar xf spark-2.4.5-bin-hadoop2.7.tgz\n", 70 | "!pip install -q findspark" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": null, 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "# Just run this code too!\n", 80 | "import os\n", 81 | "os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n", 82 | "os.environ[\"SPARK_HOME\"] = \"/content/spark-2.4.5-bin-hadoop2.7\"" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": null, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "# Set up a Spark session\n", 92 | "import findspark\n", 93 | "findspark.init()\n", 94 | "from pyspark.sql import SparkSession\n", 95 | "spark = SparkSession.builder.master(\"local[*]\").getOrCreate()" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": { 101 | "colab_type": "text", 102 | "id": "BMYfcKeDY85K" 103 | }, 104 | "source": [ 105 | "## **Getting started**" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": 2, 111 | "metadata": { 112 | "colab": {}, 113 | "colab_type": "code", 114 | "id": "EMQfyC7GUNhT" 115 | }, 116 | "outputs": [], 117 | "source": [ 118 | "# Import other relevant libraries\n", 119 | "from pyspark.ml.feature import VectorAssembler\n", 120 | "from pyspark.ml.regression import LinearRegression" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 0, 126 | "metadata": { 127 | "colab": {}, 128 | "colab_type": "code", 129 | "id": "IAfz_jiu0NjN" 130 | }, 131 | "outputs": [], 132 | "source": [ 133 | "# Get dataset into local environment\n", 134 | "!wget -O /tmp/airbnb.csv 'https://github.com/datacamp/python-live-training-template/blob/master/data/airbnb.csv?raw=True'\n", 135 | "airbnb = spark.read.csv('/tmp/airbnb.csv', inferSchema=True, header=True)" 136 | ] 137 | } 138 | ], 139 | "metadata": { 140 | "colab": { 141 | "name": "Cleaning Data in Python live session.ipynb", 142 | "provenance": [] 143 | }, 144 | "kernelspec": { 145 | "display_name": "Python 3", 146 | "language": "python", 147 | "name": "python3" 148 | }, 149 | "language_info": { 150 | "codemirror_mode": { 151 | "name": "ipython", 152 | "version": 3 153 | }, 154 | "file_extension": ".py", 155 | "mimetype": "text/x-python", 156 | "name": "python", 157 | "nbconvert_exporter": "python", 158 | "pygments_lexer": "ipython3", 159 | "version": "3.7.1" 160 | } 161 | }, 162 | "nbformat": 4, 163 | "nbformat_minor": 1 164 | } 165 | 
--------------------------------------------------------------------------------
/assets/datacamp.svg:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # **Applied Machine Learning: Ensemble Modeling**
by **Lisa Stuart** 2 | 3 | Live training sessions are designed to mimic the flow of how a real data scientist would address a problem or a task. As such, a session needs to have some “narrative” where learners are achieving stated learning objectives in the form of a real-life data science task or project. For example, a data visualization live session could be built around analyzing a dataset and creating a report with a specific business objective in mind _(ex: analyzing and visualizing churn)_, while a data cleaning live session could be about preparing a dataset for analysis, etc. 4 | 5 | As part of the 'Live training Spec' process, you will need to complete the following tasks: 6 | 7 | Edit this README by filling in the information for steps 1-4. 8 | 9 | ## Step 1: Foundations 10 | 11 | This part of the 'Live training Spec' process is designed to help guide you through session design by having you think through several key questions. Please make sure to delete the examples provided here for you. 12 | 13 | ### A. What problem(s) will students learn how to solve? (minimum of 5 problems) 14 | 15 | > _Here's an example from the Python for Spreadsheet Users live session_ 16 | > 17 | > - Key considerations to take into account when transitioning from spreadsheets to Python. 18 | > - The Data Scientist mindset and keys to success in transitioning to Python. 19 | > - How to import `.xlsx` and `.csv` files into Python using `pandas`. 20 | > - How to filter a DataFrame using `pandas`. 21 | > - How to create new columns out of your DataFrame for more interesting features. 22 | > - Perform exploratory analysis of a DataFrame in `pandas`. 23 | > - How to clean a DataFrame using `pandas` to make it ready for analysis. 24 | > - Create simple, interesting visualizations using `matplotlib`. 25 | 26 | > - Key considerations to take into account when transitioning from a single-layer model to stacked layers. 27 | > - The Data Scientist mindset and keys to success in transitioning from baseline models to stacking models. 28 | > - How to select a baseline Machine Learning algorithm. 29 | > - Discuss alternative stacking methods. 30 | > - Create simple, two-layer regressor and classifier stacked models. 31 | > - How to tune hyperparameters using K-fold cross-validation. 32 | 33 | 34 | 35 | ### B. What technologies, packages, or functions will students use? Please be exhaustive. 36 | 37 | > - pandas 38 | > - matplotlib 39 | > - seaborn 40 | > - scikit-learn 41 | > - mlxtend.classifier.StackingClassifier 42 | > - vecstack 43 | > - sklearn.ensemble.StackingClassifier 44 | > - sklearn.ensemble.StackingRegressor 45 | 46 | ### C. What terms or jargon will you define? 47 | 48 | _Whether during your opening and closing talk or your live training, you might have to define some terms and jargon to walk students through a problem you’re solving. Intuitive explanations using analogies are encouraged._ 49 | 50 | > _Here's an example from the [Python for Spreadsheet Users live session](https://www.datacamp.com/resources/webinars/live-training-python-for-spreadsheet-users)._ 51 | > 52 | > - Packages: Packages are pieces of software we can import into Python. Similar to how we download and install Excel on macOS, we import pandas in Python. (You can find it at minute 6:30) 53 | 54 | > - What is considered a 'weak' learner? 55 | > - Ensemble: In machine learning, a collection of multiple base models combined to create a single model that has better predictive performance than any of the base models used to produce it. 
For example, the Random Forest algorithm is an ensemble method that constructs a collection of Decision Trees to output a single trained Random Forest model. 56 | > - Stacking: In machine learning, a collection of multiple base models that use different algorithms from one another and are arranged in layers. The predictions from the layers are used as input to the final layer to produce a final trained model that has better predictive performance than any of the base models used to produce it. Stacking is an ensemble method. 57 | 58 | ### D. What mistakes or misconceptions do you expect? 59 | 60 | _To help minimize the number of Q&As and make your live training re-usable, list out some mistakes and misconceptions you think students might encounter along the way._ 61 | 62 | > _Here's an example from the [Data Visualization in Python live session](https://www.datacamp.com/resources/webinars/data-visualization-in-python)_ 63 | > 64 | > - Anatomy of a matplotlib figure: When calling a matplotlib plot, a figure, axes, and plot are being created behind the scenes. (You can find it at minute 11) 65 | > - As long as you understand how plots work behind the scenes, you don't need to memorize syntax to customize your plot. 66 | 67 | > - Ensuring the layers are composed of 'weak' learners. 68 | > - The concept of leakage: avoid leaking information between layers, which leads to overfitting and poor generalization. 69 | > - As long as you understand how base models work behind the scenes, you don't need to memorize arguments to customize your stacking model. 70 | 71 | ### E. What datasets will you use? 72 | 73 | Live training sessions are designed to walk students through something closer to a real-life data science workflow. Accordingly, the dataset needs to accommodate that user experience. 74 | As a rule of thumb, your dataset should always answer yes to the following question: 75 | > Is the dataset/problem I’m working on something an industry data scientist/analyst could work on? 76 | 77 | Check our [datasets to avoid](https://instructor-support.datacamp.com/en/articles/2360699-datasets-to-avoid) list. 78 | 79 | > - [Abalone Age](https://archive.ics.uci.edu/ml/datasets/abalone) - Regression 80 | > - [Pima Indians Diabetes](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv) - Binary Classification 81 | 82 | ## Step 2: Who is this session for? 83 | 84 | Terms like "beginner" and "expert" mean different things to different people, so we use personas to help instructors clarify a live training's audience. When designing a specific live training, instructors should explain how it will or won't help these people, and what extra skills or prerequisite knowledge they are assuming their students have above and beyond what's included in the persona. 85 | 86 | - [ ] Please select the roles and industries that align with your live training. 87 | - [ ] Include an explanation describing your reasoning and any other relevant information. 88 | 89 | ### What roles would this live training be suitable for? 90 | 91 | *Check all that apply.* 92 | 93 | - [ ] Data Consumer 94 | - [ ] Leader 95 | - [X] Data Analyst 96 | - [X] Citizen Data Scientist 97 | - [X] Data Scientist 98 | - [X] Data Engineer 99 | - [ ] Database Administrator 100 | - [ ] Statistician 101 | - [X] Machine Learning Scientist 102 | - [ ] Programmer 103 | - [ ] Other (please describe) 104 | 105 | ### What industries would this apply to? 
106 | 107 | *List one or more industries that the content would be appropriate for.* 108 | Industry Agnostic 109 | 110 | 111 | ### What level of expertise should learners have before beginning the live training? 112 | 113 | *List three or more examples of skills that you expect learners to have before beginning the live training* 114 | 115 | > - Can draw common plot types (scatter, bar, histogram) using matplotlib and interpret them 116 | > - Can run a linear regression, use it to make predictions, and interpret the coefficients. 117 | > - Can calculate grouped summary statistics using SELECT queries with GROUP BY clauses. 118 | 119 | > - Can run a linear regression, use it to make predictions, and calculate performance metrics. 120 | > - Can run a logistic regression, use it to make predictions, and calculate performance metrics. 121 | > - Can run a decision tree classifier/regressor, use it to make predictions, and calculate performance metrics. 122 | 123 | 124 | ## Step 3: Prerequisites 125 | 126 | List any prerequisite courses you think your live training could use from. This could be the live session’s companion course or a course you think students should take before the session. Prerequisites act as a guiding principle for your session and will set the topic framework, but you do not have to limit yourself in the live session to the syntax used in the prerequisite courses. 127 | 128 | > - [Supervised Learning with scikit-learn](https://learn.datacamp.com/courses/supervised-learning-with-scikit-learn) 129 | > - [Ensemble Methods in Python](https://learn.datacamp.com/courses/ensemble-methods-in-python) 130 | 131 | 132 | 133 | ## Step 4: Session Outline 134 | 135 | A live training session usually begins with an introductory presentation, followed by the live training itself, and an ending presentation. Your live session is expected to be around 2h30m-3h long (including Q&A) with a hard-limit at 3h30m. You can check out our live training content guidelines [here](_LINK_). 
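For orientation, here is a minimal sketch of the two-layer stacking pattern the example outline below builds toward (an illustrative sketch on toy data; the session itself uses the datasets listed in Step 1, and its model choices may differ):

```python
# Two-layer stacking: heterogeneous base learners + a logistic regression meta-learner
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=1)
layer1 = [('knn', KNeighborsClassifier()), ('nb', GaussianNB())]
stack = StackingClassifier(estimators=layer1, final_estimator=LogisticRegression(), cv=5)
print('Stacked accuracy: %.3f' % cross_val_score(stack, X, y, scoring='accuracy', cv=5).mean())
```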
136 | 137 | 138 | > _Example from [Python for Spreadsheet Users](https://www.datacamp.com/resources/webinars/live-training-python-for-spreadsheet-users)_ 139 | > 140 | > ### Introduction Slides 141 | > - Introduction to the webinar and instructor (led by DataCamp TA) 142 | > - Introduction to the topics 143 | > - Discuss the need to become familiar with baseline machine learning algorithms 144 | > - Define what a 'weak' learner is 145 | > - Discuss learning ensemble methods and go over the session outline 146 | > - Set expectations about Q&A 147 | > 148 | > ### Live Training 149 | > #### Ensemble Technique #1 - `Classifier` 150 | > - Import the Diabetes classification dataset and print the head of the DataFrame using `pd.read_csv()` and `.head()` 151 | > - Glimpse at the data using `.dtypes`, `.describe()`, and `.info()` to understand the data 152 | > - Build baseline models 153 | > - Build layers and the first stacked model classifier 154 | > - Compare baseline and stacked models using `seaborn.boxplot` 155 | > #### Ensemble Technique #2 - `Regressor` 156 | > - Import and briefly explore the Abalone regression dataset 157 | > - Build baseline models 158 | > - Build layers and the first stacked regressor 159 | > - Compare baseline and stacked models using `seaborn.boxplot` 160 | > #### Ensemble Technique #3 - `Regressor` 161 | > - Discuss multiple-layer stacking 162 | > - Build an additional layer and stacked regressor model 163 | > - Compare baseline and stacked models using `seaborn.boxplot` 164 | > - Discuss how to apply the same approach to `sklearn.ensemble.StackingClassifier` 165 | > - **Q&A** 166 | > 167 | > ### Ending slides 168 | > - Recap of what we learned 169 | > - The model stacking mindset 170 | > - Call to action and course recommendations 171 | 172 | ## Authoring your session 173 | 174 | To get yourself started with setting up your live session, follow the steps below: 175 | 176 | 1. Download and install the "Open in Colab" extension from [here](https://chrome.google.com/webstore/detail/open-in-colab/iogfkhleblhcpcekbiedikdehleodpjo?hl=en). This will let you take any Jupyter notebook you see in a GitHub repository and open it as a **temporary** Colab link. 177 | 2. Upload your dataset(s) to the `data` folder. 178 | 3. Upload your images, gifs, or any other assets you want to use in the notebook to the `assets` folder. 179 | 4. Check out the notebook templates in the `notebooks` folder, and keep the template you want for your session while deleting all remaining ones. 180 | 5. Preview your desired notebook, press the "Open in Colab" extension button, and start developing your content in Colab _(which will act as the solution code to the session)_. :warning: **Important** :warning: Your progress will **not** be saved on Google Colab since it's a temporary link. To save your progress, make sure to press `File`, then `Save a copy in GitHub` and follow the remaining prompts. You can also download the notebook locally and develop the content there, as long as you test that the syntax works on Colab as well. 181 | 6. Once your notebook is ready to go, give it the name `session_name_solution.ipynb`, then create an empty version of the notebook to be filled out by you and learners during the session and name it `session_name_learners.ipynb`. 182 | 7. 
Create Colabs links for both sessions and save them in notebooks :tada: 183 | -------------------------------------------------------------------------------- /data/pima-indians-diabetes.csv: -------------------------------------------------------------------------------- 1 | n_preg,pl_glucose,dia_bp,tri_thick,serum_ins,bmi,diab_ped,age,class 2 | 6,148,72,35,0,33.6,0.627,50,1 3 | 1,85,66,29,0,26.6,0.351,31,0 4 | 8,183,64,0,0,23.3,0.672,32,1 5 | 1,89,66,23,94,28.1,0.167,21,0 6 | 0,137,40,35,168,43.1,2.288,33,1 7 | 5,116,74,0,0,25.6,0.201,30,0 8 | 3,78,50,32,88,31,0.248,26,1 9 | 10,115,0,0,0,35.3,0.134,29,0 10 | 2,197,70,45,543,30.5,0.158,53,1 11 | 8,125,96,0,0,0,0.232,54,1 12 | 4,110,92,0,0,37.6,0.191,30,0 13 | 10,168,74,0,0,38,0.537,34,1 14 | 10,139,80,0,0,27.1,1.441,57,0 15 | 1,189,60,23,846,30.1,0.398,59,1 16 | 5,166,72,19,175,25.8,0.587,51,1 17 | 7,100,0,0,0,30,0.484,32,1 18 | 0,118,84,47,230,45.8,0.551,31,1 19 | 7,107,74,0,0,29.6,0.254,31,1 20 | 1,103,30,38,83,43.3,0.183,33,0 21 | 1,115,70,30,96,34.6,0.529,32,1 22 | 3,126,88,41,235,39.3,0.704,27,0 23 | 8,99,84,0,0,35.4,0.388,50,0 24 | 7,196,90,0,0,39.8,0.451,41,1 25 | 9,119,80,35,0,29,0.263,29,1 26 | 11,143,94,33,146,36.6,0.254,51,1 27 | 10,125,70,26,115,31.1,0.205,41,1 28 | 7,147,76,0,0,39.4,0.257,43,1 29 | 1,97,66,15,140,23.2,0.487,22,0 30 | 13,145,82,19,110,22.2,0.245,57,0 31 | 5,117,92,0,0,34.1,0.337,38,0 32 | 5,109,75,26,0,36,0.546,60,0 33 | 3,158,76,36,245,31.6,0.851,28,1 34 | 3,88,58,11,54,24.8,0.267,22,0 35 | 6,92,92,0,0,19.9,0.188,28,0 36 | 10,122,78,31,0,27.6,0.512,45,0 37 | 4,103,60,33,192,24,0.966,33,0 38 | 11,138,76,0,0,33.2,0.42,35,0 39 | 9,102,76,37,0,32.9,0.665,46,1 40 | 2,90,68,42,0,38.2,0.503,27,1 41 | 4,111,72,47,207,37.1,1.39,56,1 42 | 3,180,64,25,70,34,0.271,26,0 43 | 7,133,84,0,0,40.2,0.696,37,0 44 | 7,106,92,18,0,22.7,0.235,48,0 45 | 9,171,110,24,240,45.4,0.721,54,1 46 | 7,159,64,0,0,27.4,0.294,40,0 47 | 0,180,66,39,0,42,1.893,25,1 48 | 1,146,56,0,0,29.7,0.564,29,0 49 | 2,71,70,27,0,28,0.586,22,0 50 | 7,103,66,32,0,39.1,0.344,31,1 51 | 7,105,0,0,0,0,0.305,24,0 52 | 1,103,80,11,82,19.4,0.491,22,0 53 | 1,101,50,15,36,24.2,0.526,26,0 54 | 5,88,66,21,23,24.4,0.342,30,0 55 | 8,176,90,34,300,33.7,0.467,58,1 56 | 7,150,66,42,342,34.7,0.718,42,0 57 | 1,73,50,10,0,23,0.248,21,0 58 | 7,187,68,39,304,37.7,0.254,41,1 59 | 0,100,88,60,110,46.8,0.962,31,0 60 | 0,146,82,0,0,40.5,1.781,44,0 61 | 0,105,64,41,142,41.5,0.173,22,0 62 | 2,84,0,0,0,0,0.304,21,0 63 | 8,133,72,0,0,32.9,0.27,39,1 64 | 5,44,62,0,0,25,0.587,36,0 65 | 2,141,58,34,128,25.4,0.699,24,0 66 | 7,114,66,0,0,32.8,0.258,42,1 67 | 5,99,74,27,0,29,0.203,32,0 68 | 0,109,88,30,0,32.5,0.855,38,1 69 | 2,109,92,0,0,42.7,0.845,54,0 70 | 1,95,66,13,38,19.6,0.334,25,0 71 | 4,146,85,27,100,28.9,0.189,27,0 72 | 2,100,66,20,90,32.9,0.867,28,1 73 | 5,139,64,35,140,28.6,0.411,26,0 74 | 13,126,90,0,0,43.4,0.583,42,1 75 | 4,129,86,20,270,35.1,0.231,23,0 76 | 1,79,75,30,0,32,0.396,22,0 77 | 1,0,48,20,0,24.7,0.14,22,0 78 | 7,62,78,0,0,32.6,0.391,41,0 79 | 5,95,72,33,0,37.7,0.37,27,0 80 | 0,131,0,0,0,43.2,0.27,26,1 81 | 2,112,66,22,0,25,0.307,24,0 82 | 3,113,44,13,0,22.4,0.14,22,0 83 | 2,74,0,0,0,0,0.102,22,0 84 | 7,83,78,26,71,29.3,0.767,36,0 85 | 0,101,65,28,0,24.6,0.237,22,0 86 | 5,137,108,0,0,48.8,0.227,37,1 87 | 2,110,74,29,125,32.4,0.698,27,0 88 | 13,106,72,54,0,36.6,0.178,45,0 89 | 2,100,68,25,71,38.5,0.324,26,0 90 | 15,136,70,32,110,37.1,0.153,43,1 91 | 1,107,68,19,0,26.5,0.165,24,0 92 | 1,80,55,0,0,19.1,0.258,21,0 93 | 4,123,80,15,176,32,0.443,34,0 94 | 
7,81,78,40,48,46.7,0.261,42,0 95 | 4,134,72,0,0,23.8,0.277,60,1 96 | 2,142,82,18,64,24.7,0.761,21,0 97 | 6,144,72,27,228,33.9,0.255,40,0 98 | 2,92,62,28,0,31.6,0.13,24,0 99 | 1,71,48,18,76,20.4,0.323,22,0 100 | 6,93,50,30,64,28.7,0.356,23,0 101 | 1,122,90,51,220,49.7,0.325,31,1 102 | 1,163,72,0,0,39,1.222,33,1 103 | 1,151,60,0,0,26.1,0.179,22,0 104 | 0,125,96,0,0,22.5,0.262,21,0 105 | 1,81,72,18,40,26.6,0.283,24,0 106 | 2,85,65,0,0,39.6,0.93,27,0 107 | 1,126,56,29,152,28.7,0.801,21,0 108 | 1,96,122,0,0,22.4,0.207,27,0 109 | 4,144,58,28,140,29.5,0.287,37,0 110 | 3,83,58,31,18,34.3,0.336,25,0 111 | 0,95,85,25,36,37.4,0.247,24,1 112 | 3,171,72,33,135,33.3,0.199,24,1 113 | 8,155,62,26,495,34,0.543,46,1 114 | 1,89,76,34,37,31.2,0.192,23,0 115 | 4,76,62,0,0,34,0.391,25,0 116 | 7,160,54,32,175,30.5,0.588,39,1 117 | 4,146,92,0,0,31.2,0.539,61,1 118 | 5,124,74,0,0,34,0.22,38,1 119 | 5,78,48,0,0,33.7,0.654,25,0 120 | 4,97,60,23,0,28.2,0.443,22,0 121 | 4,99,76,15,51,23.2,0.223,21,0 122 | 0,162,76,56,100,53.2,0.759,25,1 123 | 6,111,64,39,0,34.2,0.26,24,0 124 | 2,107,74,30,100,33.6,0.404,23,0 125 | 5,132,80,0,0,26.8,0.186,69,0 126 | 0,113,76,0,0,33.3,0.278,23,1 127 | 1,88,30,42,99,55,0.496,26,1 128 | 3,120,70,30,135,42.9,0.452,30,0 129 | 1,118,58,36,94,33.3,0.261,23,0 130 | 1,117,88,24,145,34.5,0.403,40,1 131 | 0,105,84,0,0,27.9,0.741,62,1 132 | 4,173,70,14,168,29.7,0.361,33,1 133 | 9,122,56,0,0,33.3,1.114,33,1 134 | 3,170,64,37,225,34.5,0.356,30,1 135 | 8,84,74,31,0,38.3,0.457,39,0 136 | 2,96,68,13,49,21.1,0.647,26,0 137 | 2,125,60,20,140,33.8,0.088,31,0 138 | 0,100,70,26,50,30.8,0.597,21,0 139 | 0,93,60,25,92,28.7,0.532,22,0 140 | 0,129,80,0,0,31.2,0.703,29,0 141 | 5,105,72,29,325,36.9,0.159,28,0 142 | 3,128,78,0,0,21.1,0.268,55,0 143 | 5,106,82,30,0,39.5,0.286,38,0 144 | 2,108,52,26,63,32.5,0.318,22,0 145 | 10,108,66,0,0,32.4,0.272,42,1 146 | 4,154,62,31,284,32.8,0.237,23,0 147 | 0,102,75,23,0,0,0.572,21,0 148 | 9,57,80,37,0,32.8,0.096,41,0 149 | 2,106,64,35,119,30.5,1.4,34,0 150 | 5,147,78,0,0,33.7,0.218,65,0 151 | 2,90,70,17,0,27.3,0.085,22,0 152 | 1,136,74,50,204,37.4,0.399,24,0 153 | 4,114,65,0,0,21.9,0.432,37,0 154 | 9,156,86,28,155,34.3,1.189,42,1 155 | 1,153,82,42,485,40.6,0.687,23,0 156 | 8,188,78,0,0,47.9,0.137,43,1 157 | 7,152,88,44,0,50,0.337,36,1 158 | 2,99,52,15,94,24.6,0.637,21,0 159 | 1,109,56,21,135,25.2,0.833,23,0 160 | 2,88,74,19,53,29,0.229,22,0 161 | 17,163,72,41,114,40.9,0.817,47,1 162 | 4,151,90,38,0,29.7,0.294,36,0 163 | 7,102,74,40,105,37.2,0.204,45,0 164 | 0,114,80,34,285,44.2,0.167,27,0 165 | 2,100,64,23,0,29.7,0.368,21,0 166 | 0,131,88,0,0,31.6,0.743,32,1 167 | 6,104,74,18,156,29.9,0.722,41,1 168 | 3,148,66,25,0,32.5,0.256,22,0 169 | 4,120,68,0,0,29.6,0.709,34,0 170 | 4,110,66,0,0,31.9,0.471,29,0 171 | 3,111,90,12,78,28.4,0.495,29,0 172 | 6,102,82,0,0,30.8,0.18,36,1 173 | 6,134,70,23,130,35.4,0.542,29,1 174 | 2,87,0,23,0,28.9,0.773,25,0 175 | 1,79,60,42,48,43.5,0.678,23,0 176 | 2,75,64,24,55,29.7,0.37,33,0 177 | 8,179,72,42,130,32.7,0.719,36,1 178 | 6,85,78,0,0,31.2,0.382,42,0 179 | 0,129,110,46,130,67.1,0.319,26,1 180 | 5,143,78,0,0,45,0.19,47,0 181 | 5,130,82,0,0,39.1,0.956,37,1 182 | 6,87,80,0,0,23.2,0.084,32,0 183 | 0,119,64,18,92,34.9,0.725,23,0 184 | 1,0,74,20,23,27.7,0.299,21,0 185 | 5,73,60,0,0,26.8,0.268,27,0 186 | 4,141,74,0,0,27.6,0.244,40,0 187 | 7,194,68,28,0,35.9,0.745,41,1 188 | 8,181,68,36,495,30.1,0.615,60,1 189 | 1,128,98,41,58,32,1.321,33,1 190 | 8,109,76,39,114,27.9,0.64,31,1 191 | 5,139,80,35,160,31.6,0.361,25,1 192 | 3,111,62,0,0,22.6,0.142,21,0 193 | 
9,123,70,44,94,33.1,0.374,40,0 194 | 7,159,66,0,0,30.4,0.383,36,1 195 | 11,135,0,0,0,52.3,0.578,40,1 196 | 8,85,55,20,0,24.4,0.136,42,0 197 | 5,158,84,41,210,39.4,0.395,29,1 198 | 1,105,58,0,0,24.3,0.187,21,0 199 | 3,107,62,13,48,22.9,0.678,23,1 200 | 4,109,64,44,99,34.8,0.905,26,1 201 | 4,148,60,27,318,30.9,0.15,29,1 202 | 0,113,80,16,0,31,0.874,21,0 203 | 1,138,82,0,0,40.1,0.236,28,0 204 | 0,108,68,20,0,27.3,0.787,32,0 205 | 2,99,70,16,44,20.4,0.235,27,0 206 | 6,103,72,32,190,37.7,0.324,55,0 207 | 5,111,72,28,0,23.9,0.407,27,0 208 | 8,196,76,29,280,37.5,0.605,57,1 209 | 5,162,104,0,0,37.7,0.151,52,1 210 | 1,96,64,27,87,33.2,0.289,21,0 211 | 7,184,84,33,0,35.5,0.355,41,1 212 | 2,81,60,22,0,27.7,0.29,25,0 213 | 0,147,85,54,0,42.8,0.375,24,0 214 | 7,179,95,31,0,34.2,0.164,60,0 215 | 0,140,65,26,130,42.6,0.431,24,1 216 | 9,112,82,32,175,34.2,0.26,36,1 217 | 12,151,70,40,271,41.8,0.742,38,1 218 | 5,109,62,41,129,35.8,0.514,25,1 219 | 6,125,68,30,120,30,0.464,32,0 220 | 5,85,74,22,0,29,1.224,32,1 221 | 5,112,66,0,0,37.8,0.261,41,1 222 | 0,177,60,29,478,34.6,1.072,21,1 223 | 2,158,90,0,0,31.6,0.805,66,1 224 | 7,119,0,0,0,25.2,0.209,37,0 225 | 7,142,60,33,190,28.8,0.687,61,0 226 | 1,100,66,15,56,23.6,0.666,26,0 227 | 1,87,78,27,32,34.6,0.101,22,0 228 | 0,101,76,0,0,35.7,0.198,26,0 229 | 3,162,52,38,0,37.2,0.652,24,1 230 | 4,197,70,39,744,36.7,2.329,31,0 231 | 0,117,80,31,53,45.2,0.089,24,0 232 | 4,142,86,0,0,44,0.645,22,1 233 | 6,134,80,37,370,46.2,0.238,46,1 234 | 1,79,80,25,37,25.4,0.583,22,0 235 | 4,122,68,0,0,35,0.394,29,0 236 | 3,74,68,28,45,29.7,0.293,23,0 237 | 4,171,72,0,0,43.6,0.479,26,1 238 | 7,181,84,21,192,35.9,0.586,51,1 239 | 0,179,90,27,0,44.1,0.686,23,1 240 | 9,164,84,21,0,30.8,0.831,32,1 241 | 0,104,76,0,0,18.4,0.582,27,0 242 | 1,91,64,24,0,29.2,0.192,21,0 243 | 4,91,70,32,88,33.1,0.446,22,0 244 | 3,139,54,0,0,25.6,0.402,22,1 245 | 6,119,50,22,176,27.1,1.318,33,1 246 | 2,146,76,35,194,38.2,0.329,29,0 247 | 9,184,85,15,0,30,1.213,49,1 248 | 10,122,68,0,0,31.2,0.258,41,0 249 | 0,165,90,33,680,52.3,0.427,23,0 250 | 9,124,70,33,402,35.4,0.282,34,0 251 | 1,111,86,19,0,30.1,0.143,23,0 252 | 9,106,52,0,0,31.2,0.38,42,0 253 | 2,129,84,0,0,28,0.284,27,0 254 | 2,90,80,14,55,24.4,0.249,24,0 255 | 0,86,68,32,0,35.8,0.238,25,0 256 | 12,92,62,7,258,27.6,0.926,44,1 257 | 1,113,64,35,0,33.6,0.543,21,1 258 | 3,111,56,39,0,30.1,0.557,30,0 259 | 2,114,68,22,0,28.7,0.092,25,0 260 | 1,193,50,16,375,25.9,0.655,24,0 261 | 11,155,76,28,150,33.3,1.353,51,1 262 | 3,191,68,15,130,30.9,0.299,34,0 263 | 3,141,0,0,0,30,0.761,27,1 264 | 4,95,70,32,0,32.1,0.612,24,0 265 | 3,142,80,15,0,32.4,0.2,63,0 266 | 4,123,62,0,0,32,0.226,35,1 267 | 5,96,74,18,67,33.6,0.997,43,0 268 | 0,138,0,0,0,36.3,0.933,25,1 269 | 2,128,64,42,0,40,1.101,24,0 270 | 0,102,52,0,0,25.1,0.078,21,0 271 | 2,146,0,0,0,27.5,0.24,28,1 272 | 10,101,86,37,0,45.6,1.136,38,1 273 | 2,108,62,32,56,25.2,0.128,21,0 274 | 3,122,78,0,0,23,0.254,40,0 275 | 1,71,78,50,45,33.2,0.422,21,0 276 | 13,106,70,0,0,34.2,0.251,52,0 277 | 2,100,70,52,57,40.5,0.677,25,0 278 | 7,106,60,24,0,26.5,0.296,29,1 279 | 0,104,64,23,116,27.8,0.454,23,0 280 | 5,114,74,0,0,24.9,0.744,57,0 281 | 2,108,62,10,278,25.3,0.881,22,0 282 | 0,146,70,0,0,37.9,0.334,28,1 283 | 10,129,76,28,122,35.9,0.28,39,0 284 | 7,133,88,15,155,32.4,0.262,37,0 285 | 7,161,86,0,0,30.4,0.165,47,1 286 | 2,108,80,0,0,27,0.259,52,1 287 | 7,136,74,26,135,26,0.647,51,0 288 | 5,155,84,44,545,38.7,0.619,34,0 289 | 1,119,86,39,220,45.6,0.808,29,1 290 | 4,96,56,17,49,20.8,0.34,26,0 291 | 
5,108,72,43,75,36.1,0.263,33,0 292 | 0,78,88,29,40,36.9,0.434,21,0 293 | 0,107,62,30,74,36.6,0.757,25,1 294 | 2,128,78,37,182,43.3,1.224,31,1 295 | 1,128,48,45,194,40.5,0.613,24,1 296 | 0,161,50,0,0,21.9,0.254,65,0 297 | 6,151,62,31,120,35.5,0.692,28,0 298 | 2,146,70,38,360,28,0.337,29,1 299 | 0,126,84,29,215,30.7,0.52,24,0 300 | 14,100,78,25,184,36.6,0.412,46,1 301 | 8,112,72,0,0,23.6,0.84,58,0 302 | 0,167,0,0,0,32.3,0.839,30,1 303 | 2,144,58,33,135,31.6,0.422,25,1 304 | 5,77,82,41,42,35.8,0.156,35,0 305 | 5,115,98,0,0,52.9,0.209,28,1 306 | 3,150,76,0,0,21,0.207,37,0 307 | 2,120,76,37,105,39.7,0.215,29,0 308 | 10,161,68,23,132,25.5,0.326,47,1 309 | 0,137,68,14,148,24.8,0.143,21,0 310 | 0,128,68,19,180,30.5,1.391,25,1 311 | 2,124,68,28,205,32.9,0.875,30,1 312 | 6,80,66,30,0,26.2,0.313,41,0 313 | 0,106,70,37,148,39.4,0.605,22,0 314 | 2,155,74,17,96,26.6,0.433,27,1 315 | 3,113,50,10,85,29.5,0.626,25,0 316 | 7,109,80,31,0,35.9,1.127,43,1 317 | 2,112,68,22,94,34.1,0.315,26,0 318 | 3,99,80,11,64,19.3,0.284,30,0 319 | 3,182,74,0,0,30.5,0.345,29,1 320 | 3,115,66,39,140,38.1,0.15,28,0 321 | 6,194,78,0,0,23.5,0.129,59,1 322 | 4,129,60,12,231,27.5,0.527,31,0 323 | 3,112,74,30,0,31.6,0.197,25,1 324 | 0,124,70,20,0,27.4,0.254,36,1 325 | 13,152,90,33,29,26.8,0.731,43,1 326 | 2,112,75,32,0,35.7,0.148,21,0 327 | 1,157,72,21,168,25.6,0.123,24,0 328 | 1,122,64,32,156,35.1,0.692,30,1 329 | 10,179,70,0,0,35.1,0.2,37,0 330 | 2,102,86,36,120,45.5,0.127,23,1 331 | 6,105,70,32,68,30.8,0.122,37,0 332 | 8,118,72,19,0,23.1,1.476,46,0 333 | 2,87,58,16,52,32.7,0.166,25,0 334 | 1,180,0,0,0,43.3,0.282,41,1 335 | 12,106,80,0,0,23.6,0.137,44,0 336 | 1,95,60,18,58,23.9,0.26,22,0 337 | 0,165,76,43,255,47.9,0.259,26,0 338 | 0,117,0,0,0,33.8,0.932,44,0 339 | 5,115,76,0,0,31.2,0.343,44,1 340 | 9,152,78,34,171,34.2,0.893,33,1 341 | 7,178,84,0,0,39.9,0.331,41,1 342 | 1,130,70,13,105,25.9,0.472,22,0 343 | 1,95,74,21,73,25.9,0.673,36,0 344 | 1,0,68,35,0,32,0.389,22,0 345 | 5,122,86,0,0,34.7,0.29,33,0 346 | 8,95,72,0,0,36.8,0.485,57,0 347 | 8,126,88,36,108,38.5,0.349,49,0 348 | 1,139,46,19,83,28.7,0.654,22,0 349 | 3,116,0,0,0,23.5,0.187,23,0 350 | 3,99,62,19,74,21.8,0.279,26,0 351 | 5,0,80,32,0,41,0.346,37,1 352 | 4,92,80,0,0,42.2,0.237,29,0 353 | 4,137,84,0,0,31.2,0.252,30,0 354 | 3,61,82,28,0,34.4,0.243,46,0 355 | 1,90,62,12,43,27.2,0.58,24,0 356 | 3,90,78,0,0,42.7,0.559,21,0 357 | 9,165,88,0,0,30.4,0.302,49,1 358 | 1,125,50,40,167,33.3,0.962,28,1 359 | 13,129,0,30,0,39.9,0.569,44,1 360 | 12,88,74,40,54,35.3,0.378,48,0 361 | 1,196,76,36,249,36.5,0.875,29,1 362 | 5,189,64,33,325,31.2,0.583,29,1 363 | 5,158,70,0,0,29.8,0.207,63,0 364 | 5,103,108,37,0,39.2,0.305,65,0 365 | 4,146,78,0,0,38.5,0.52,67,1 366 | 4,147,74,25,293,34.9,0.385,30,0 367 | 5,99,54,28,83,34,0.499,30,0 368 | 6,124,72,0,0,27.6,0.368,29,1 369 | 0,101,64,17,0,21,0.252,21,0 370 | 3,81,86,16,66,27.5,0.306,22,0 371 | 1,133,102,28,140,32.8,0.234,45,1 372 | 3,173,82,48,465,38.4,2.137,25,1 373 | 0,118,64,23,89,0,1.731,21,0 374 | 0,84,64,22,66,35.8,0.545,21,0 375 | 2,105,58,40,94,34.9,0.225,25,0 376 | 2,122,52,43,158,36.2,0.816,28,0 377 | 12,140,82,43,325,39.2,0.528,58,1 378 | 0,98,82,15,84,25.2,0.299,22,0 379 | 1,87,60,37,75,37.2,0.509,22,0 380 | 4,156,75,0,0,48.3,0.238,32,1 381 | 0,93,100,39,72,43.4,1.021,35,0 382 | 1,107,72,30,82,30.8,0.821,24,0 383 | 0,105,68,22,0,20,0.236,22,0 384 | 1,109,60,8,182,25.4,0.947,21,0 385 | 1,90,62,18,59,25.1,1.268,25,0 386 | 1,125,70,24,110,24.3,0.221,25,0 387 | 1,119,54,13,50,22.3,0.205,24,0 388 | 5,116,74,29,0,32.3,0.66,35,1 389 | 
8,105,100,36,0,43.3,0.239,45,1 390 | 5,144,82,26,285,32,0.452,58,1 391 | 3,100,68,23,81,31.6,0.949,28,0 392 | 1,100,66,29,196,32,0.444,42,0 393 | 5,166,76,0,0,45.7,0.34,27,1 394 | 1,131,64,14,415,23.7,0.389,21,0 395 | 4,116,72,12,87,22.1,0.463,37,0 396 | 4,158,78,0,0,32.9,0.803,31,1 397 | 2,127,58,24,275,27.7,1.6,25,0 398 | 3,96,56,34,115,24.7,0.944,39,0 399 | 0,131,66,40,0,34.3,0.196,22,1 400 | 3,82,70,0,0,21.1,0.389,25,0 401 | 3,193,70,31,0,34.9,0.241,25,1 402 | 4,95,64,0,0,32,0.161,31,1 403 | 6,137,61,0,0,24.2,0.151,55,0 404 | 5,136,84,41,88,35,0.286,35,1 405 | 9,72,78,25,0,31.6,0.28,38,0 406 | 5,168,64,0,0,32.9,0.135,41,1 407 | 2,123,48,32,165,42.1,0.52,26,0 408 | 4,115,72,0,0,28.9,0.376,46,1 409 | 0,101,62,0,0,21.9,0.336,25,0 410 | 8,197,74,0,0,25.9,1.191,39,1 411 | 1,172,68,49,579,42.4,0.702,28,1 412 | 6,102,90,39,0,35.7,0.674,28,0 413 | 1,112,72,30,176,34.4,0.528,25,0 414 | 1,143,84,23,310,42.4,1.076,22,0 415 | 1,143,74,22,61,26.2,0.256,21,0 416 | 0,138,60,35,167,34.6,0.534,21,1 417 | 3,173,84,33,474,35.7,0.258,22,1 418 | 1,97,68,21,0,27.2,1.095,22,0 419 | 4,144,82,32,0,38.5,0.554,37,1 420 | 1,83,68,0,0,18.2,0.624,27,0 421 | 3,129,64,29,115,26.4,0.219,28,1 422 | 1,119,88,41,170,45.3,0.507,26,0 423 | 2,94,68,18,76,26,0.561,21,0 424 | 0,102,64,46,78,40.6,0.496,21,0 425 | 2,115,64,22,0,30.8,0.421,21,0 426 | 8,151,78,32,210,42.9,0.516,36,1 427 | 4,184,78,39,277,37,0.264,31,1 428 | 0,94,0,0,0,0,0.256,25,0 429 | 1,181,64,30,180,34.1,0.328,38,1 430 | 0,135,94,46,145,40.6,0.284,26,0 431 | 1,95,82,25,180,35,0.233,43,1 432 | 2,99,0,0,0,22.2,0.108,23,0 433 | 3,89,74,16,85,30.4,0.551,38,0 434 | 1,80,74,11,60,30,0.527,22,0 435 | 2,139,75,0,0,25.6,0.167,29,0 436 | 1,90,68,8,0,24.5,1.138,36,0 437 | 0,141,0,0,0,42.4,0.205,29,1 438 | 12,140,85,33,0,37.4,0.244,41,0 439 | 5,147,75,0,0,29.9,0.434,28,0 440 | 1,97,70,15,0,18.2,0.147,21,0 441 | 6,107,88,0,0,36.8,0.727,31,0 442 | 0,189,104,25,0,34.3,0.435,41,1 443 | 2,83,66,23,50,32.2,0.497,22,0 444 | 4,117,64,27,120,33.2,0.23,24,0 445 | 8,108,70,0,0,30.5,0.955,33,1 446 | 4,117,62,12,0,29.7,0.38,30,1 447 | 0,180,78,63,14,59.4,2.42,25,1 448 | 1,100,72,12,70,25.3,0.658,28,0 449 | 0,95,80,45,92,36.5,0.33,26,0 450 | 0,104,64,37,64,33.6,0.51,22,1 451 | 0,120,74,18,63,30.5,0.285,26,0 452 | 1,82,64,13,95,21.2,0.415,23,0 453 | 2,134,70,0,0,28.9,0.542,23,1 454 | 0,91,68,32,210,39.9,0.381,25,0 455 | 2,119,0,0,0,19.6,0.832,72,0 456 | 2,100,54,28,105,37.8,0.498,24,0 457 | 14,175,62,30,0,33.6,0.212,38,1 458 | 1,135,54,0,0,26.7,0.687,62,0 459 | 5,86,68,28,71,30.2,0.364,24,0 460 | 10,148,84,48,237,37.6,1.001,51,1 461 | 9,134,74,33,60,25.9,0.46,81,0 462 | 9,120,72,22,56,20.8,0.733,48,0 463 | 1,71,62,0,0,21.8,0.416,26,0 464 | 8,74,70,40,49,35.3,0.705,39,0 465 | 5,88,78,30,0,27.6,0.258,37,0 466 | 10,115,98,0,0,24,1.022,34,0 467 | 0,124,56,13,105,21.8,0.452,21,0 468 | 0,74,52,10,36,27.8,0.269,22,0 469 | 0,97,64,36,100,36.8,0.6,25,0 470 | 8,120,0,0,0,30,0.183,38,1 471 | 6,154,78,41,140,46.1,0.571,27,0 472 | 1,144,82,40,0,41.3,0.607,28,0 473 | 0,137,70,38,0,33.2,0.17,22,0 474 | 0,119,66,27,0,38.8,0.259,22,0 475 | 7,136,90,0,0,29.9,0.21,50,0 476 | 4,114,64,0,0,28.9,0.126,24,0 477 | 0,137,84,27,0,27.3,0.231,59,0 478 | 2,105,80,45,191,33.7,0.711,29,1 479 | 7,114,76,17,110,23.8,0.466,31,0 480 | 8,126,74,38,75,25.9,0.162,39,0 481 | 4,132,86,31,0,28,0.419,63,0 482 | 3,158,70,30,328,35.5,0.344,35,1 483 | 0,123,88,37,0,35.2,0.197,29,0 484 | 4,85,58,22,49,27.8,0.306,28,0 485 | 0,84,82,31,125,38.2,0.233,23,0 486 | 0,145,0,0,0,44.2,0.63,31,1 487 | 0,135,68,42,250,42.3,0.365,24,1 488 | 
1,139,62,41,480,40.7,0.536,21,0 489 | 0,173,78,32,265,46.5,1.159,58,0 490 | 4,99,72,17,0,25.6,0.294,28,0 491 | 8,194,80,0,0,26.1,0.551,67,0 492 | 2,83,65,28,66,36.8,0.629,24,0 493 | 2,89,90,30,0,33.5,0.292,42,0 494 | 4,99,68,38,0,32.8,0.145,33,0 495 | 4,125,70,18,122,28.9,1.144,45,1 496 | 3,80,0,0,0,0,0.174,22,0 497 | 6,166,74,0,0,26.6,0.304,66,0 498 | 5,110,68,0,0,26,0.292,30,0 499 | 2,81,72,15,76,30.1,0.547,25,0 500 | 7,195,70,33,145,25.1,0.163,55,1 501 | 6,154,74,32,193,29.3,0.839,39,0 502 | 2,117,90,19,71,25.2,0.313,21,0 503 | 3,84,72,32,0,37.2,0.267,28,0 504 | 6,0,68,41,0,39,0.727,41,1 505 | 7,94,64,25,79,33.3,0.738,41,0 506 | 3,96,78,39,0,37.3,0.238,40,0 507 | 10,75,82,0,0,33.3,0.263,38,0 508 | 0,180,90,26,90,36.5,0.314,35,1 509 | 1,130,60,23,170,28.6,0.692,21,0 510 | 2,84,50,23,76,30.4,0.968,21,0 511 | 8,120,78,0,0,25,0.409,64,0 512 | 12,84,72,31,0,29.7,0.297,46,1 513 | 0,139,62,17,210,22.1,0.207,21,0 514 | 9,91,68,0,0,24.2,0.2,58,0 515 | 2,91,62,0,0,27.3,0.525,22,0 516 | 3,99,54,19,86,25.6,0.154,24,0 517 | 3,163,70,18,105,31.6,0.268,28,1 518 | 9,145,88,34,165,30.3,0.771,53,1 519 | 7,125,86,0,0,37.6,0.304,51,0 520 | 13,76,60,0,0,32.8,0.18,41,0 521 | 6,129,90,7,326,19.6,0.582,60,0 522 | 2,68,70,32,66,25,0.187,25,0 523 | 3,124,80,33,130,33.2,0.305,26,0 524 | 6,114,0,0,0,0,0.189,26,0 525 | 9,130,70,0,0,34.2,0.652,45,1 526 | 3,125,58,0,0,31.6,0.151,24,0 527 | 3,87,60,18,0,21.8,0.444,21,0 528 | 1,97,64,19,82,18.2,0.299,21,0 529 | 3,116,74,15,105,26.3,0.107,24,0 530 | 0,117,66,31,188,30.8,0.493,22,0 531 | 0,111,65,0,0,24.6,0.66,31,0 532 | 2,122,60,18,106,29.8,0.717,22,0 533 | 0,107,76,0,0,45.3,0.686,24,0 534 | 1,86,66,52,65,41.3,0.917,29,0 535 | 6,91,0,0,0,29.8,0.501,31,0 536 | 1,77,56,30,56,33.3,1.251,24,0 537 | 4,132,0,0,0,32.9,0.302,23,1 538 | 0,105,90,0,0,29.6,0.197,46,0 539 | 0,57,60,0,0,21.7,0.735,67,0 540 | 0,127,80,37,210,36.3,0.804,23,0 541 | 3,129,92,49,155,36.4,0.968,32,1 542 | 8,100,74,40,215,39.4,0.661,43,1 543 | 3,128,72,25,190,32.4,0.549,27,1 544 | 10,90,85,32,0,34.9,0.825,56,1 545 | 4,84,90,23,56,39.5,0.159,25,0 546 | 1,88,78,29,76,32,0.365,29,0 547 | 8,186,90,35,225,34.5,0.423,37,1 548 | 5,187,76,27,207,43.6,1.034,53,1 549 | 4,131,68,21,166,33.1,0.16,28,0 550 | 1,164,82,43,67,32.8,0.341,50,0 551 | 4,189,110,31,0,28.5,0.68,37,0 552 | 1,116,70,28,0,27.4,0.204,21,0 553 | 3,84,68,30,106,31.9,0.591,25,0 554 | 6,114,88,0,0,27.8,0.247,66,0 555 | 1,88,62,24,44,29.9,0.422,23,0 556 | 1,84,64,23,115,36.9,0.471,28,0 557 | 7,124,70,33,215,25.5,0.161,37,0 558 | 1,97,70,40,0,38.1,0.218,30,0 559 | 8,110,76,0,0,27.8,0.237,58,0 560 | 11,103,68,40,0,46.2,0.126,42,0 561 | 11,85,74,0,0,30.1,0.3,35,0 562 | 6,125,76,0,0,33.8,0.121,54,1 563 | 0,198,66,32,274,41.3,0.502,28,1 564 | 1,87,68,34,77,37.6,0.401,24,0 565 | 6,99,60,19,54,26.9,0.497,32,0 566 | 0,91,80,0,0,32.4,0.601,27,0 567 | 2,95,54,14,88,26.1,0.748,22,0 568 | 1,99,72,30,18,38.6,0.412,21,0 569 | 6,92,62,32,126,32,0.085,46,0 570 | 4,154,72,29,126,31.3,0.338,37,0 571 | 0,121,66,30,165,34.3,0.203,33,1 572 | 3,78,70,0,0,32.5,0.27,39,0 573 | 2,130,96,0,0,22.6,0.268,21,0 574 | 3,111,58,31,44,29.5,0.43,22,0 575 | 2,98,60,17,120,34.7,0.198,22,0 576 | 1,143,86,30,330,30.1,0.892,23,0 577 | 1,119,44,47,63,35.5,0.28,25,0 578 | 6,108,44,20,130,24,0.813,35,0 579 | 2,118,80,0,0,42.9,0.693,21,1 580 | 10,133,68,0,0,27,0.245,36,0 581 | 2,197,70,99,0,34.7,0.575,62,1 582 | 0,151,90,46,0,42.1,0.371,21,1 583 | 6,109,60,27,0,25,0.206,27,0 584 | 12,121,78,17,0,26.5,0.259,62,0 585 | 8,100,76,0,0,38.7,0.19,42,0 586 | 8,124,76,24,600,28.7,0.687,52,1 587 | 
1,93,56,11,0,22.5,0.417,22,0 588 | 8,143,66,0,0,34.9,0.129,41,1 589 | 6,103,66,0,0,24.3,0.249,29,0 590 | 3,176,86,27,156,33.3,1.154,52,1 591 | 0,73,0,0,0,21.1,0.342,25,0 592 | 11,111,84,40,0,46.8,0.925,45,1 593 | 2,112,78,50,140,39.4,0.175,24,0 594 | 3,132,80,0,0,34.4,0.402,44,1 595 | 2,82,52,22,115,28.5,1.699,25,0 596 | 6,123,72,45,230,33.6,0.733,34,0 597 | 0,188,82,14,185,32,0.682,22,1 598 | 0,67,76,0,0,45.3,0.194,46,0 599 | 1,89,24,19,25,27.8,0.559,21,0 600 | 1,173,74,0,0,36.8,0.088,38,1 601 | 1,109,38,18,120,23.1,0.407,26,0 602 | 1,108,88,19,0,27.1,0.4,24,0 603 | 6,96,0,0,0,23.7,0.19,28,0 604 | 1,124,74,36,0,27.8,0.1,30,0 605 | 7,150,78,29,126,35.2,0.692,54,1 606 | 4,183,0,0,0,28.4,0.212,36,1 607 | 1,124,60,32,0,35.8,0.514,21,0 608 | 1,181,78,42,293,40,1.258,22,1 609 | 1,92,62,25,41,19.5,0.482,25,0 610 | 0,152,82,39,272,41.5,0.27,27,0 611 | 1,111,62,13,182,24,0.138,23,0 612 | 3,106,54,21,158,30.9,0.292,24,0 613 | 3,174,58,22,194,32.9,0.593,36,1 614 | 7,168,88,42,321,38.2,0.787,40,1 615 | 6,105,80,28,0,32.5,0.878,26,0 616 | 11,138,74,26,144,36.1,0.557,50,1 617 | 3,106,72,0,0,25.8,0.207,27,0 618 | 6,117,96,0,0,28.7,0.157,30,0 619 | 2,68,62,13,15,20.1,0.257,23,0 620 | 9,112,82,24,0,28.2,1.282,50,1 621 | 0,119,0,0,0,32.4,0.141,24,1 622 | 2,112,86,42,160,38.4,0.246,28,0 623 | 2,92,76,20,0,24.2,1.698,28,0 624 | 6,183,94,0,0,40.8,1.461,45,0 625 | 0,94,70,27,115,43.5,0.347,21,0 626 | 2,108,64,0,0,30.8,0.158,21,0 627 | 4,90,88,47,54,37.7,0.362,29,0 628 | 0,125,68,0,0,24.7,0.206,21,0 629 | 0,132,78,0,0,32.4,0.393,21,0 630 | 5,128,80,0,0,34.6,0.144,45,0 631 | 4,94,65,22,0,24.7,0.148,21,0 632 | 7,114,64,0,0,27.4,0.732,34,1 633 | 0,102,78,40,90,34.5,0.238,24,0 634 | 2,111,60,0,0,26.2,0.343,23,0 635 | 1,128,82,17,183,27.5,0.115,22,0 636 | 10,92,62,0,0,25.9,0.167,31,0 637 | 13,104,72,0,0,31.2,0.465,38,1 638 | 5,104,74,0,0,28.8,0.153,48,0 639 | 2,94,76,18,66,31.6,0.649,23,0 640 | 7,97,76,32,91,40.9,0.871,32,1 641 | 1,100,74,12,46,19.5,0.149,28,0 642 | 0,102,86,17,105,29.3,0.695,27,0 643 | 4,128,70,0,0,34.3,0.303,24,0 644 | 6,147,80,0,0,29.5,0.178,50,1 645 | 4,90,0,0,0,28,0.61,31,0 646 | 3,103,72,30,152,27.6,0.73,27,0 647 | 2,157,74,35,440,39.4,0.134,30,0 648 | 1,167,74,17,144,23.4,0.447,33,1 649 | 0,179,50,36,159,37.8,0.455,22,1 650 | 11,136,84,35,130,28.3,0.26,42,1 651 | 0,107,60,25,0,26.4,0.133,23,0 652 | 1,91,54,25,100,25.2,0.234,23,0 653 | 1,117,60,23,106,33.8,0.466,27,0 654 | 5,123,74,40,77,34.1,0.269,28,0 655 | 2,120,54,0,0,26.8,0.455,27,0 656 | 1,106,70,28,135,34.2,0.142,22,0 657 | 2,155,52,27,540,38.7,0.24,25,1 658 | 2,101,58,35,90,21.8,0.155,22,0 659 | 1,120,80,48,200,38.9,1.162,41,0 660 | 11,127,106,0,0,39,0.19,51,0 661 | 3,80,82,31,70,34.2,1.292,27,1 662 | 10,162,84,0,0,27.7,0.182,54,0 663 | 1,199,76,43,0,42.9,1.394,22,1 664 | 8,167,106,46,231,37.6,0.165,43,1 665 | 9,145,80,46,130,37.9,0.637,40,1 666 | 6,115,60,39,0,33.7,0.245,40,1 667 | 1,112,80,45,132,34.8,0.217,24,0 668 | 4,145,82,18,0,32.5,0.235,70,1 669 | 10,111,70,27,0,27.5,0.141,40,1 670 | 6,98,58,33,190,34,0.43,43,0 671 | 9,154,78,30,100,30.9,0.164,45,0 672 | 6,165,68,26,168,33.6,0.631,49,0 673 | 1,99,58,10,0,25.4,0.551,21,0 674 | 10,68,106,23,49,35.5,0.285,47,0 675 | 3,123,100,35,240,57.3,0.88,22,0 676 | 8,91,82,0,0,35.6,0.587,68,0 677 | 6,195,70,0,0,30.9,0.328,31,1 678 | 9,156,86,0,0,24.8,0.23,53,1 679 | 0,93,60,0,0,35.3,0.263,25,0 680 | 3,121,52,0,0,36,0.127,25,1 681 | 2,101,58,17,265,24.2,0.614,23,0 682 | 2,56,56,28,45,24.2,0.332,22,0 683 | 0,162,76,36,0,49.6,0.364,26,1 684 | 0,95,64,39,105,44.6,0.366,22,0 685 | 
4,125,80,0,0,32.3,0.536,27,1 686 | 5,136,82,0,0,0,0.64,69,0 687 | 2,129,74,26,205,33.2,0.591,25,0 688 | 3,130,64,0,0,23.1,0.314,22,0 689 | 1,107,50,19,0,28.3,0.181,29,0 690 | 1,140,74,26,180,24.1,0.828,23,0 691 | 1,144,82,46,180,46.1,0.335,46,1 692 | 8,107,80,0,0,24.6,0.856,34,0 693 | 13,158,114,0,0,42.3,0.257,44,1 694 | 2,121,70,32,95,39.1,0.886,23,0 695 | 7,129,68,49,125,38.5,0.439,43,1 696 | 2,90,60,0,0,23.5,0.191,25,0 697 | 7,142,90,24,480,30.4,0.128,43,1 698 | 3,169,74,19,125,29.9,0.268,31,1 699 | 0,99,0,0,0,25,0.253,22,0 700 | 4,127,88,11,155,34.5,0.598,28,0 701 | 4,118,70,0,0,44.5,0.904,26,0 702 | 2,122,76,27,200,35.9,0.483,26,0 703 | 6,125,78,31,0,27.6,0.565,49,1 704 | 1,168,88,29,0,35,0.905,52,1 705 | 2,129,0,0,0,38.5,0.304,41,0 706 | 4,110,76,20,100,28.4,0.118,27,0 707 | 6,80,80,36,0,39.8,0.177,28,0 708 | 10,115,0,0,0,0,0.261,30,1 709 | 2,127,46,21,335,34.4,0.176,22,0 710 | 9,164,78,0,0,32.8,0.148,45,1 711 | 2,93,64,32,160,38,0.674,23,1 712 | 3,158,64,13,387,31.2,0.295,24,0 713 | 5,126,78,27,22,29.6,0.439,40,0 714 | 10,129,62,36,0,41.2,0.441,38,1 715 | 0,134,58,20,291,26.4,0.352,21,0 716 | 3,102,74,0,0,29.5,0.121,32,0 717 | 7,187,50,33,392,33.9,0.826,34,1 718 | 3,173,78,39,185,33.8,0.97,31,1 719 | 10,94,72,18,0,23.1,0.595,56,0 720 | 1,108,60,46,178,35.5,0.415,24,0 721 | 5,97,76,27,0,35.6,0.378,52,1 722 | 4,83,86,19,0,29.3,0.317,34,0 723 | 1,114,66,36,200,38.1,0.289,21,0 724 | 1,149,68,29,127,29.3,0.349,42,1 725 | 5,117,86,30,105,39.1,0.251,42,0 726 | 1,111,94,0,0,32.8,0.265,45,0 727 | 4,112,78,40,0,39.4,0.236,38,0 728 | 1,116,78,29,180,36.1,0.496,25,0 729 | 0,141,84,26,0,32.4,0.433,22,0 730 | 2,175,88,0,0,22.9,0.326,22,0 731 | 2,92,52,0,0,30.1,0.141,22,0 732 | 3,130,78,23,79,28.4,0.323,34,1 733 | 8,120,86,0,0,28.4,0.259,22,1 734 | 2,174,88,37,120,44.5,0.646,24,1 735 | 2,106,56,27,165,29,0.426,22,0 736 | 2,105,75,0,0,23.3,0.56,53,0 737 | 4,95,60,32,0,35.4,0.284,28,0 738 | 0,126,86,27,120,27.4,0.515,21,0 739 | 8,65,72,23,0,32,0.6,42,0 740 | 2,99,60,17,160,36.6,0.453,21,0 741 | 1,102,74,0,0,39.5,0.293,42,1 742 | 11,120,80,37,150,42.3,0.785,48,1 743 | 3,102,44,20,94,30.8,0.4,26,0 744 | 1,109,58,18,116,28.5,0.219,22,0 745 | 9,140,94,0,0,32.7,0.734,45,1 746 | 13,153,88,37,140,40.6,1.174,39,0 747 | 12,100,84,33,105,30,0.488,46,0 748 | 1,147,94,41,0,49.3,0.358,27,1 749 | 1,81,74,41,57,46.3,1.096,32,0 750 | 3,187,70,22,200,36.4,0.408,36,1 751 | 6,162,62,0,0,24.3,0.178,50,1 752 | 4,136,70,0,0,31.2,1.182,22,1 753 | 1,121,78,39,74,39,0.261,28,0 754 | 3,108,62,24,0,26,0.223,25,0 755 | 0,181,88,44,510,43.3,0.222,26,1 756 | 8,154,78,32,0,32.4,0.443,45,1 757 | 1,128,88,39,110,36.5,1.057,37,1 758 | 7,137,90,41,0,32,0.391,39,0 759 | 0,123,72,0,0,36.3,0.258,52,1 760 | 1,106,76,0,0,37.5,0.197,26,0 761 | 6,190,92,0,0,35.5,0.278,66,1 762 | 2,88,58,26,16,28.4,0.766,22,0 763 | 9,170,74,31,0,44,0.403,43,1 764 | 9,89,62,0,0,22.5,0.142,33,0 765 | 10,101,76,48,180,32.9,0.171,63,0 766 | 2,122,70,27,0,36.8,0.34,27,0 767 | 5,121,72,23,112,26.2,0.245,30,0 768 | 1,126,60,0,0,30.1,0.349,47,1 769 | 1,93,70,31,0,30.4,0.315,23,0 770 | -------------------------------------------------------------------------------- /notebooks/Applied_Machine_Learning_Ensemble_Modeling_Learners.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "python_live_session_template.ipynb", 7 | "provenance": [] 8 | }, 9 | "kernelspec": { 10 | "display_name": "Python 3", 11 | "language": "python", 12 | "name": 
"python3" 13 | }, 14 | "language_info": { 15 | "codemirror_mode": { 16 | "name": "ipython", 17 | "version": 3 18 | }, 19 | "file_extension": ".py", 20 | "mimetype": "text/x-python", 21 | "name": "python", 22 | "nbconvert_exporter": "python", 23 | "pygments_lexer": "ipython3", 24 | "version": "3.7.1" 25 | } 26 | }, 27 | "cells": [ 28 | { 29 | "cell_type": "markdown", 30 | "metadata": { 31 | "colab_type": "text", 32 | "id": "6Ijg5wUCTQYG" 33 | }, 34 | "source": [ 35 | "
<center>\n", 36 | "<img src=\"https://raw.githubusercontent.com/datacamp/Applied-Machine-Learning-Ensemble-Modeling-live-training/master/assets/datacamp.svg\" alt=\"DataCamp icon\" width=\"50%\">\n", 37 | "</center>\n", 38 | "<br>
\n", 39 | "\n", 40 | "\n", 41 | "## **Applied Machine Learning - Ensemble Modeling Live Training**\n", 42 | "\n", 43 | "Welcome to this hands-on training where you will immerse yourself in applied machine learning in Python where we'll explore model stacking. Using `sklearn.ensemble`, we'll learn how to create layers that are stacking-ready.\n", 44 | "\n", 45 | "The foundations of model stacking:\n", 46 | "\n", 47 | "* Create various types of baseline models, including linear and logistic regression using Scikit-Learn, for comparison to ensemble methods.\n", 48 | "* Build layers, then stack them up.\n", 49 | "* Calculate and visualize performance metrics.\n", 50 | "\n", 51 | "\n", 52 | "\n", 53 | "---\n", 54 | "\n", 55 | "\n", 56 | "\n", 57 | "## **1st Dataset**\n", 58 | "\n", 59 | "\n", 60 | "The first dataset we'll use is a CSV file named `pima-indians-diabetes.csv`, which contains data on females of Pima Indian heritage that are at least 21 years old. It contains the following columns:\n", 61 | "\n", 62 | "- `n_preg`: Number of pregnancies\n", 63 | "- `pl_glucose`: Plasma glucose concentration 2 hours after an oral glucose tolerance test\n", 64 | "- `dia_bp`: Diastolic blood pressure (mm Hg)\n", 65 | "- `tri_thick`: Triceps skin fold thickness (mm)\n", 66 | "- `serum_ins`: 2-Hour serum insulin (mu U/ml)\n", 67 | "- `bmi`: Body mass index (weight in kg/(height in m)^2)\n", 68 | "- `diab_ped`: Diabetes pedigree function\n", 69 | "- `age`: Age (years)\n", 70 | "- `class`: Class variable (0 or 1)\n" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "metadata": { 76 | "colab_type": "code", 77 | "id": "EMQfyC7GUNhT", 78 | "colab": {} 79 | }, 80 | "source": [ 81 | "# Import libraries\n", 82 | "import pandas as pd\n", 83 | "import numpy as np\n", 84 | "from numpy import mean\n", 85 | "from numpy import std\n", 86 | "import matplotlib.pyplot as plt\n", 87 | "import seaborn as sns\n", 88 | "from collections import Counter\n", 89 | "from sklearn.preprocessing import LabelEncoder\n", 90 | "from sklearn.model_selection import cross_val_score\n", 91 | "from sklearn.model_selection import RepeatedStratifiedKFold\n", 92 | "from sklearn.dummy import DummyClassifier\n", 93 | "from sklearn.tree import DecisionTreeClassifier" 94 | ], 95 | "execution_count": null, 96 | "outputs": [] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "metadata": { 101 | "colab_type": "code", 102 | "id": "l8t_EwRNZPLB", 103 | "colab": {} 104 | }, 105 | "source": [ 106 | "# Read in the dataset as Pandas DataFrame\n", 107 | "diabetes = pd.read_csv('https://github.com/datacamp/Applied-Machine-Learning-Ensemble-Modeling-live-training/blob/master/data/pima-indians-diabetes.csv?raw=true')" 108 | ], 109 | "execution_count": null, 110 | "outputs": [] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "metadata": { 115 | "id": "PRJPuinPZpGA", 116 | "colab_type": "code", 117 | "colab": {} 118 | }, 119 | "source": [ 120 | "# Look at data using the info() function\n" 121 | ], 122 | "execution_count": null, 123 | "outputs": [] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": { 128 | "id": "C6OVOkU80oKP", 129 | "colab_type": "text" 130 | }, 131 | "source": [ 132 | "## **Observations:** \n", 133 | "- The `info()` function is critical to beginning to understand your data. Here, there are no missing values. However, that is not typical.\n", 134 | "- There is a mixture of integers and floats with the first 5 columns being `int64`, the next 2 `float64` and the last 2 'int64`." 
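,
"\n",
"A quick, illustrative check of the dtype mix described above (not part of the original notebook; `diabetes` is the DataFrame loaded earlier):\n",
"\n",
"```python\n",
"print(diabetes.dtypes)\n",
"```"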
135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": { 140 | "id": "v3hAsYrhVi4L", 141 | "colab_type": "text" 142 | }, 143 | "source": [ 144 | "---\n", 145 | "\n", 146 | "## Q&A\n", 147 | "\n", 148 | "--- " 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "metadata": { 154 | "id": "E6UtlpG_Zo50", 155 | "colab_type": "code", 156 | "colab": {} 157 | }, 158 | "source": [ 159 | "# Look at data using the describe() function\n" 160 | ], 161 | "execution_count": null, 162 | "outputs": [] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": { 167 | "id": "bCK9W_gk1HG8", 168 | "colab_type": "text" 169 | }, 170 | "source": [ 171 | "\n", 172 | "## **Observations:** \n", 173 | "- The `.describe()` function gives the summary statistics of the data. Notice that the min of the 1st six columns is zero. Even though there are no missing values, this is indicative of the measurements for those features having not been captured.\n", 174 | "- Although we previously saw there is a mixture of integer and float data types (as seen with `.info()`), the printout makes it appear as if all values are float. " 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "metadata": { 180 | "id": "UE5F_JUQ2X-0", 181 | "colab_type": "code", 182 | "colab": {} 183 | }, 184 | "source": [ 185 | "# Print the first 5 rows of the data using the head() function\n" 186 | ], 187 | "execution_count": null, 188 | "outputs": [] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": { 193 | "id": "A2VCIx0K2bT1", 194 | "colab_type": "text" 195 | }, 196 | "source": [ 197 | "\n", 198 | "## **Observation:**\n", 199 | "- Printing out the first 5 rows, we see that the data types of the columns are indeed as stated previously." 200 | ] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "metadata": { 205 | "id": "ajAzhMDc2b1D", 206 | "colab_type": "text" 207 | }, 208 | "source": [ 209 | "## Let's check the number in each class:\n", 210 | "\n", 211 | "This avoids getting surprised by great results that are actually a side effect of class imbalance. This happens when the majority class far outweighs the minority class." 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "metadata": { 217 | "id": "MKeXN3441-9W", 218 | "colab_type": "code", 219 | "colab": {} 220 | }, 221 | "source": [ 222 | "# Summarize class distribution\n", 223 | "target = diabetes['class']\n", 224 | "counter = Counter(target)\n", 225 | "print(counter)" 226 | ], 227 | "execution_count": null, 228 | "outputs": [] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": { 233 | "id": "FOpbGyQw55v3", 234 | "colab_type": "text" 235 | }, 236 | "source": [ 237 | "## **Observation:** For every two negative cases there is one positive case, not enough of a difference to be considered class imbalance. \n", 238 | "- Class imbalance tends to exist when the majority class is > 90% although there is no hard and fast rule about this threshold." 
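,
"\n",
"The class proportions can also be read off directly (an illustrative addition, not in the original notebook):\n",
"\n",
"```python\n",
"print(diabetes['class'].value_counts(normalize=True))  # roughly 0.65 negative vs 0.35 positive\n",
"```"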
239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "metadata": { 244 | "id": "n5XaYl9ZZ8B5", 245 | "colab_type": "code", 246 | "colab": {} 247 | }, 248 | "source": [ 249 | "# Convert Pandas DataFrame to numpy array - Return only the values of the DataFrame with DataFrame.to_numpy()\n" 250 | ], 251 | "execution_count": null, 252 | "outputs": [] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "metadata": { 257 | "id": "MlGa9IBc7Gsr", 258 | "colab_type": "text" 259 | }, 260 | "source": [ 261 | "### Always verify that your X matrix and target array have the same number of rows to avoid errors during model training." 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "metadata": { 267 | "id": "9FEvD6Ab6InP", 268 | "colab_type": "code", 269 | "colab": {} 270 | }, 271 | "source": [ 272 | "# Create X matrix and y (target) array using slicing [row_start:row_end, col_start:target_col],[row_start:row_end, target_col]\n", 273 | "\n", 274 | "\n", 275 | "# Print X matrix and y (target) array dimensions using .shape \n" 276 | ], 277 | "execution_count": null, 278 | "outputs": [] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "metadata": { 283 | "id": "hoI7t4U-Z8LU", 284 | "colab_type": "code", 285 | "colab": {} 286 | }, 287 | "source": [ 288 | "# Convert X matrix data types to 'float32' for consistency using .astype()\n", 289 | "\n", 290 | "\n", 291 | "# Convert y (target) array to 'str' using .astype()\n", 292 | "\n", 293 | "\n", 294 | "# Encode class labels in y array using dot notation with LabelEncoder().fit_transform()\n", 295 | "# Hint: y goes in the fit_transform function call\n" 296 | ], 297 | "execution_count": null, 298 | "outputs": [] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "metadata": { 303 | "id": "djXWv2xp9v1q", 304 | "colab_type": "text" 305 | }, 306 | "source": [ 307 | "### Don't let the `.astype('str')` throw you! This is simply taking the class labels and label encoding them – regardless of their original format.\n", 308 | "\n", 309 | "\n" 310 | ] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "metadata": { 315 | "id": "OHHu8uz7_yVa", 316 | "colab_type": "text" 317 | }, 318 | "source": [ 319 | "## **Creating a Naive Classifier**\n", 320 | "Here we'll use the `DummyClassifier` from `sklearn`. This creates a so-called 'naive' classifier and is simply a model that predicts a single class for all of the rows, regardless of their original class. \n", 321 | "\n", 322 | "1. `DummyClassifier()` arguments:\n", 323 | " - `strategy`: Strategy to use to generate predictions.\n", 324 | "\n", 325 | "2. `RepeatedStratifiedKFold()` arguments:\n", 326 | " - `n_splits`: Number of folds.\n", 327 | " - `n_repeats`: Number of times cross-validator needs to be repeated.\n", 328 | " - `random_state`: Controls the generation of the random states for each repetition. Pass an int for reproducible output across multiple function calls. (This is an equivalent argument to np.random.seed above, but will be specific to this naive model.)\n", 329 | "\n", 330 | "3. `cross_val_score()` arguments:\n", 331 | " - The model to use.\n", 332 | " - The data to fit. (X)\n", 333 | " - The target variable to try to predict. (y)\n", 334 | " - `scoring`: A single string scorer callable object/function such as 'accuracy' or 'roc_auc'. See https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter for more options.\n", 335 | " - `cv`: Cross-validation splitting strategy (default is 5)\n", 336 | " - `n_jobs`: Number of CPU cores used when parallelizing. Setting it to -1 uses all available cores.\n", 337 | " - `error_score`: Value to assign to the score if an error occurs in estimator fitting. If set to ‘raise’, the error is raised. If a numeric value is given, FitFailedWarning is raised." 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "metadata": { 343 | "id": "BL4huFGPZ8RA", 344 | "colab_type": "code", 345 | "colab": {} 346 | }, 347 | "source": [ 348 | "# Evaluate naive\n", 349 | "\n", 350 | "# Instantiate a DummyClassifier with 'most_frequent' strategy\n", 351 | "naive = \n", 352 | "\n", 353 | "# Create RepeatedStratifiedKFold cross-validator with 10 folds, 3 repeats and a seed of 1.\n", 354 | "cv = \n", 355 | "\n", 356 | "# Calculate accuracy using `cross_val_score()` with model instantiated, data to fit, target variable, 'accuracy' scoring, cross validator, n_jobs=-1, and error_score set to 'raise'\n", 357 | "n_scores = \n", 358 | "\n", 359 | "# Print mean and standard deviation of n_scores: \n", 360 | "print('Naive score: %.3f (%.3f)' % (mean(), std()))\n" 361 | ], 362 | "execution_count": null, 363 | "outputs": [] 364 | }, 365 | { 366 | "cell_type": "markdown", 367 | "metadata": { 368 | "id": "2tEgwsOfsoB6", 369 | "colab_type": "text" 370 | }, 371 | "source": [ 372 | "## **Observation** \n", 373 | "- We want to do better than 65% accuracy to consider any other models as an improvement to a totally naive model." 374 | ] 375 | }, 376 | { 377 | "cell_type": "markdown", 378 | "metadata": { 379 | "id": "l8QZOyg8s1eQ", 380 | "colab_type": "text" 381 | }, 382 | "source": [ 383 | "## **Creating a Baseline Classifier**\n", 384 | "Now we'll create a baseline classifier, one that seeks to correctly predict the class that each observation belongs to. Since the target variable is binary, we'll instantiate a `DecisionTreeClassifier` model. " 385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "metadata": { 390 | "id": "QczFUGSfbQvl", 391 | "colab_type": "code", 392 | "colab": {} 393 | }, 394 | "source": [ 395 | "# Evaluate baseline model\n", 396 | "\n", 397 | "# Instantiate a DecisionTreeClassifier\n", 398 | "model = \n", 399 | "\n", 400 | "# Calculate accuracy using `cross_val_score()` with model instantiated, data to fit, target variable, 'accuracy' scoring, cross validator 'cv', and error_score set to 'raise'\n", 401 | "m_scores = \n", 402 | "\n", 403 | "# Print mean and standard deviation of m_scores: \n", 404 | "print('Baseline score: %.3f (%.3f)' % (mean(), std()))" 405 | ], 406 | "execution_count": null, 407 | "outputs": [] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": { 412 | "id": "GRUBiqqmtNA6", 413 | "colab_type": "text" 414 | }, 415 | "source": [ 416 | "## **Observation**\n", 417 | "- We want to do better than 70% with a Stacking Classifier to consider it an improvement over this baseline Decision Tree model." 418 | ] 419 | }, 420 | { 421 | "cell_type": "markdown", 422 | "metadata": { 423 | "colab_type": "text", 424 | "id": "BMYfcKeDY85K" 425 | }, 426 | "source": [ 427 | "## **Getting started with Stacking Classifier**\n", 428 | "\n", 429 | "- We're going to compare several additional baseline classifiers to see if they perform better than the Decision Tree Classifier we just trained previously.\n" 430 | ] 431 | }, 432 | { 433 | "cell_type": "markdown", 434 | "metadata": { 435 | "id": "T2pwEXnQBEFf", 436 | "colab_type": "text" 437 | }, 438 | "source": [ 439 | "<p align=\"center\">\n", 440 | "<img src=\"https://raw.githubusercontent.com/datacamp/Applied-Machine-Learning-Ensemble-Modeling-live-training/master/assets/stacking.png\" alt=\"Stacking\">\n", 441 | "</p>\n", 442 | "
\n", 443 | "\n", 444 | "- We'll start by importing additional packages that we'll need." 445 | ] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "metadata": { 450 | "id": "eHCHmx7k5NeT", 451 | "colab_type": "code", 452 | "colab": {} 453 | }, 454 | "source": [ 455 | "# Import several other classifiers for ensemble\n", 456 | "from sklearn.neighbors import KNeighborsClassifier\n", 457 | "from sklearn.svm import SVC\n", 458 | "from sklearn.naive_bayes import GaussianNB\n", 459 | "from sklearn.linear_model import LogisticRegression\n", 460 | "from sklearn.ensemble import StackingClassifier" 461 | ], 462 | "execution_count": null, 463 | "outputs": [] 464 | }, 465 | { 466 | "cell_type": "markdown", 467 | "metadata": { 468 | "id": "teQMB0aWxhcN", 469 | "colab_type": "text" 470 | }, 471 | "source": [ 472 | "## Create custom functions\n", 473 | "1. get_stacking() - This function will create the layers of our `StackingClassifier()`.\n", 474 | "2. get_models() - This function will create a dictionary of models to be evaluated.\n", 475 | "3. evaluate_model() - This function will evaluate each of the models to be compared." 476 | ] 477 | }, 478 | { 479 | "cell_type": "markdown", 480 | "metadata": { 481 | "id": "wqtHxQFPvMqu", 482 | "colab_type": "text" 483 | }, 484 | "source": [ 485 | "## Custom function # 1: get_stacking()\n", 486 | "1. `StackingClassifier()` arguments:\n", 487 | " - `estimators`: List of baseline classifiers\n", 488 | " - `final_estimator`: Defined meta classifier \n", 489 | " - `cv`: Number of cross validations to perform." 490 | ] 491 | }, 492 | { 493 | "cell_type": "code", 494 | "metadata": { 495 | "id": "YFhBv6jR6FOe", 496 | "colab_type": "code", 497 | "colab": {} 498 | }, 499 | "source": [ 500 | "# Define get_stacking():\n", 501 | "def :\n", 502 | "\n", 503 | "\t# Create an empty list for the base models called layer1\n", 504 | " \n", 505 | "\n", 506 | " # Append tuple with classifier name and instantiations (no arguments) for KNeighborsClassifier, SVC, and GaussianNB base models\n", 507 | " # Hint: layer1.append(('ModelName', Classifier()))\n", 508 | " \n", 509 | "\n", 510 | " # Instantiate Logistic Regression as meta learner model called layer2\n", 511 | " \n", 512 | "\n", 513 | "\t# Define StackingClassifier() called model passing layer1 model list and meta learner with 5 cross-validations\n", 514 | " \n", 515 | "\n", 516 | " # return model\n", 517 | " " 518 | ], 519 | "execution_count": null, 520 | "outputs": [] 521 | }, 522 | { 523 | "cell_type": "markdown", 524 | "metadata": { 525 | "id": "d5szw9liyaxp", 526 | "colab_type": "text" 527 | }, 528 | "source": [ 529 | "## Custom function # 2: get_models()" 530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "metadata": { 535 | "id": "0hEJlDLB4kv5", 536 | "colab_type": "code", 537 | "colab": {} 538 | }, 539 | "source": [ 540 | "# Define get_models():\n", 541 | "def :\n", 542 | "\n", 543 | " # Create empty dictionary called models\n", 544 | " \n", 545 | "\n", 546 | " # Add key:value pairs to dictionary with key as ModelName and value as instantiations (no arguments) for KNeighborsClassifier, SVC, and GaussianNB base models\n", 547 | " # Hint: models['ModelName'] = Classifier()\n", 548 | " \n", 549 | "\n", 550 | " # Add key:value pair to dictionary with key called Stacking and value that calls get_stacking() custom function\n", 551 | " \n", 552 | "\n", 553 | " # return dictionary\n", 554 | " " 555 | ], 556 | "execution_count": null, 557 | "outputs": [] 558 | }, 559 | { 560 | "cell_type": "markdown", 561 | "metadata": { 562 
| "id": "flSG4dH1zCTK", 563 | "colab_type": "text" 564 | }, 565 | "source": [ 566 | "## Custom function # 3: evaluate_model(model)" 567 | ] 568 | }, 569 | { 570 | "cell_type": "code", 571 | "metadata": { 572 | "id": "mGLKRr0j5Nit", 573 | "colab_type": "code", 574 | "colab": {} 575 | }, 576 | "source": [ 577 | "# Define evaluate_model:\n", 578 | "def :\n", 579 | "\n", 580 | " # Create RepeatedStratifiedKFold cross-validator with 10 folds, 3 repeats and a seed of 42.\n", 581 | " cv = \n", 582 | "\n", 583 | " # Calculate accuracy using `cross_val_score()` with model instantiated, data to fit, target variable, 'accuracy' scoring, cross validator 'cv', n_jobs=-1, and error_score set to 'raise'\n", 584 | " scores = \n", 585 | "\n", 586 | " # return scores\n", 587 | " " 588 | ], 589 | "execution_count": null, 590 | "outputs": [] 591 | }, 592 | { 593 | "cell_type": "code", 594 | "metadata": { 595 | "id": "Y5wmC-TH7B7E", 596 | "colab_type": "code", 597 | "colab": {} 598 | }, 599 | "source": [ 600 | "# Assign get_models() to a variable called models\n" 601 | ], 602 | "execution_count": null, 603 | "outputs": [] 604 | }, 605 | { 606 | "cell_type": "markdown", 607 | "metadata": { 608 | "id": "02tyK34l2eh7", 609 | "colab_type": "text" 610 | }, 611 | "source": [ 612 | "## Python Dictionary Review:\n", 613 | "- The items() method is used to return the list with all dictionary keys with values. Parameters: This method takes no parameters. Returns: A view object that displays a list of a given dictionary's (key, value) tuple pair.\n", 614 | "- For our purposes, we'll use the dictionary created when we call the get_models() custom function in a for loop to iterate over each key:value pair and store the results.\n", 615 | "- Then, we will plot the results as a `boxplot` for comparison using `seaborn`.\n", 616 | "\n", 617 | "1. `sns.boxplot()` arguments:\n", 618 | " - `x`: Names of the variables in the data\n", 619 | " - `y`: Names of the variables in the data\n", 620 | " - `showmeans`: Whether or not to show mark at the mean of the data." 
621 | ] 622 | }, 623 | { 624 | "cell_type": "code", 625 | "metadata": { 626 | "id": "QzXmYt1o6FWh", 627 | "colab_type": "code", 628 | "colab": {} 629 | }, 630 | "source": [ 631 | "# Evaluate the models and store results\n", 632 | "# Create an empty list for the results\n", 633 | "results =\n", 634 | "\n", 635 | "# Create an empty list for the model names\n", 636 | "names = \n", 637 | "\n", 638 | "# Create a for loop that iterates over each name, model in models dictionary \n", 639 | "for :\n", 640 | "\n", 641 | "\t# Call evaluate_model(model) and assign it to variable called scores\n", 642 | "\t\n", 643 | " \n", 644 | " # Append output from scores to the results list\n", 645 | "\t\n", 646 | " \n", 647 | " # Append name to the names list\n", 648 | "\t\n", 649 | " \n", 650 | " # Print name, mean and standard deviation of scores:\n", 651 | "\tprint('>%s %.3f (%.3f)' % (, mean(), std()))\n", 652 | " \n", 653 | "# Plot model performance for comparison using names for x and results for y and setting showmeans to True\n", 654 | "sns.boxplot(x=, y=, )" 655 | ], 656 | "execution_count": null, 657 | "outputs": [] 658 | }, 659 | { 660 | "cell_type": "markdown", 661 | "metadata": { 662 | "id": "xUqeWsol5RAt", 663 | "colab_type": "text" 664 | }, 665 | "source": [ 666 | "## **Observation**\n", 667 | "- Recall that we want to do better than 70% with a Stacking Classifier to consider it an improvement over the Decision Tree baseline model and, although we did achieve that, we can probably do even better with this dataset. \n", 668 | "- Let's try some hyperparameter tuning via cross-validation next..." 669 | ] 670 | }, 671 | { 672 | "cell_type": "markdown", 673 | "metadata": { 674 | "id": "xwc_6_Qf4amu", 675 | "colab_type": "text" 676 | }, 677 | "source": [ 678 | "---\n", 679 | "\n", 680 | "## Q&A\n", 681 | "\n", 682 | "--- \n" 683 | ] 684 | }, 685 | { 686 | "cell_type": "code", 687 | "metadata": { 688 | "id": "yMZ8gTb6LGCP", 689 | "colab_type": "code", 690 | "colab": {} 691 | }, 692 | "source": [ 693 | "# Import additional libraries\n", 694 | "from xgboost import XGBClassifier \n", 695 | "from sklearn.ensemble import RandomForestClassifier\n", 696 | "from sklearn.preprocessing import StandardScaler\n", 697 | "from sklearn.pipeline import Pipeline\n", 698 | "from sklearn.model_selection import RandomizedSearchCV, GridSearchCV\n", 699 | "import xgboost as xgb\n", 700 | "from datetime import datetime" 701 | ], 702 | "execution_count": null, 703 | "outputs": [] 704 | }, 705 | { 706 | "cell_type": "markdown", 707 | "metadata": { 708 | "id": "BfctBvrs4ZcQ", 709 | "colab_type": "text" 710 | }, 711 | "source": [ 712 | "## Custom function # 4: best_model(name, model)\n", 713 | "- We're going to create a Pipeline that scales the data before applying the parameter grid via cross-validation.\n", 714 | "- Then it returns the model with the best hyperparameters from the search grid for each model." 
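Before reading the full function below, it may help to see the core pattern in isolation. This is a hedged sketch, not the graded code: the `classifier__` prefix in the grid keys routes each parameter to the pipeline step named `'classifier'`.

```python
# Minimal sketch of the Pipeline + GridSearchCV pattern used in best_model().
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([('scaler', StandardScaler()), ('classifier', SVC())])
# 'classifier__kernel' targets the kernel argument of the 'classifier' step.
param_grid = {'classifier__kernel': ['linear', 'rbf']}
search = GridSearchCV(pipe, param_grid=param_grid, cv=5, n_jobs=-1)
# search.fit(X, y) would scale within each fold before fitting the SVC,
# and search.best_estimator_ would hold the winning pipeline.
```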
715 | ] 716 | }, 717 | { 718 | "cell_type": "code", 719 | "metadata": { 720 | "id": "5RG7lpMY3Bzz", 721 | "colab_type": "code", 722 | "colab": {} 723 | }, 724 | "source": [ 725 | "# Define best_model:\n", 726 | "def best_model(name, model):\n", 727 | "  pipe = Pipeline([('scaler', StandardScaler()), ('classifier', model)]) \n", 728 | "\n", 729 | "  if name == 'SVM':\n", 730 | "    # ('precomputed' is omitted: it requires a precomputed kernel matrix, not raw features)\n", 731 | "    param_grid = {'classifier__kernel' : ['linear', 'poly', 'rbf', 'sigmoid']} \n", 732 | "    # Create grid search object\n", 733 | "    # this uses k-fold cv\n", 734 | "    clf = GridSearchCV(pipe, param_grid = param_grid, cv = 5, n_jobs=-1)\n", 735 | "\n", 736 | "    # Fit on data\n", 737 | "    best_clf = clf.fit(X, y)\n", 738 | "\n", 739 | "    best_hyperparams = best_clf.best_estimator_.get_params()['classifier']\n", 740 | "\n", 741 | "    return name, best_hyperparams \n", 742 | "\n", 743 | "  if name == 'Bayes': \n", 744 | "    param_grid = {'classifier__var_smoothing' : np.array([1e-09, 1e-08])} \n", 745 | "    # Create grid search object\n", 746 | "    # this uses k-fold cv\n", 747 | "\n", 748 | "    clf = GridSearchCV(pipe, param_grid = param_grid, cv = 5, n_jobs=-1)\n", 749 | "\n", 750 | "    # Fit on data\n", 751 | "    best_clf = clf.fit(X, y)\n", 752 | "\n", 753 | "    best_hyperparams = best_clf.best_estimator_.get_params()['classifier']\n", 754 | "\n", 755 | "    return name, best_hyperparams \n", 756 | "\n", 757 | "  if name == 'RF': \n", 758 | "    param_grid = {'classifier__criterion' : np.array(['gini', 'entropy']),\n", 759 | "                  'classifier__max_depth' : np.arange(5,11)} \n", 760 | "    # Create grid search object\n", 761 | "    # this uses k-fold cv\n", 762 | "\n", 763 | "    clf = GridSearchCV(pipe, param_grid = param_grid, cv = 5, n_jobs=-1)\n", 764 | "\n", 765 | "    # Fit on data\n", 766 | "    best_clf = clf.fit(X, y)\n", 767 | "\n", 768 | "    best_hyperparams = best_clf.best_estimator_.get_params()['classifier']\n", 769 | "    \n", 770 | "    return name, best_hyperparams \n", 771 | "\n", 772 | "  if name == 'XGB':\n", 773 | "    param_grid = {'classifier__learning_rate' : np.arange(0.022,0.04,.01),\n", 774 | "                  'classifier__max_depth' : np.arange(5,10)} \n", 775 | "    # Create grid search object\n", 776 | "    # this uses k-fold cv\n", 777 | "    clf = GridSearchCV(pipe, param_grid = param_grid, cv = 5, n_jobs=-1)\n", 778 | "\n", 779 | "    # Fit on data\n", 780 | "    best_clf = clf.fit(X, y)\n", 781 | "    best_hyperparams = best_clf.best_estimator_.get_params()['classifier']\n", 782 | "\n", 783 | "    return name, best_hyperparams " 784 | ], 785 | "execution_count": null, 786 | "outputs": [] 787 | }, 788 | { 789 | "cell_type": "markdown", 790 | "metadata": { 791 | "id": "8Ay2mPQo39mV", 792 | "colab_type": "text" 793 | }, 794 | "source": [ 795 | "## Adding Random Forest and XGBoost to our get_stacking() custom function in layer 1 (and removing the poorest performers DT and KNN):" 796 | ] 797 | }, 798 | { 799 | "cell_type": "code", 800 | "metadata": { 801 | "id": "4ow6Aqaz27GJ", 802 | "colab_type": "code", 803 | "colab": {} 804 | }, 805 | "source": [ 806 | "# Define get_stacking(): \n", 807 | "def :\n", 808 | "\n", 809 | "	# Create an empty list for the base models called layer1\n", 810 | "  \n", 811 | "\n", 812 | "  # Append tuple with classifier name and instantiations (no arguments) for SVC and GaussianNB base models AND call cust fx #4 best_model on each\n", 813 | "  # Hint: layer1.append((best_model('ModelName', Classifier())))\n", 814 | "  \n", 815 | "\n", 816 | "  # Add RandomForestClassifier and xgb.XGBClassifier as base models\n", 817 | "  \n", 818 | "\n", 819 | "  # 
Instantiate Logistic Regression as meta learner model called layer2\n", 820 | "  \n", 821 | "\n", 822 | "	# Define StackingClassifier() called model passing layer1 model list and meta learner with 5 cross-validations\n", 823 | "  model = StackingClassifier(estimators=, final_estimator=, cv=)\n", 824 | "\n", 825 | "  # return model\n", 826 | "  " 827 | ], 828 | "execution_count": null, 829 | "outputs": [] 830 | }, 831 | { 832 | "cell_type": "markdown", 833 | "metadata": { 834 | "id": "dbp5PICC4HEk", 835 | "colab_type": "text" 836 | }, 837 | "source": [ 838 | "## Adding Random Forest and XGBoost to our get_models() custom function:" 839 | ] 840 | }, 841 | { 842 | "cell_type": "code", 843 | "metadata": { 844 | "id": "GQqQUH_P3Bto", 845 | "colab_type": "code", 846 | "colab": {} 847 | }, 848 | "source": [ 849 | "# Define get_models():\n", 850 | "def :\n", 851 | "\n", 852 | "  # Create empty dictionary called models\n", 853 | "  \n", 854 | "\n", 855 | "  # Add key:value pairs to dictionary with key as ModelName and value as instantiations (no arguments) for SVC and GaussianNB base models\n", 856 | "  # Hint: models['ModelName'] = Classifier() \n", 857 | "  \n", 858 | "\n", 859 | "  # We'll add two more classifiers to the mix - RandomForestClassifier and xgb.XGBClassifier\n", 860 | "  \n", 861 | "\n", 862 | "\n", 863 | "  # Add key:value pair to dictionary with key called Stacking and value that calls get_stacking() custom function\n", 864 | "  \n", 865 | "\n", 866 | "  # return dictionary\n", 867 | "  " 868 | ], 869 | "execution_count": null, 870 | "outputs": [] 871 | }, 872 | { 873 | "cell_type": "code", 874 | "metadata": { 875 | "id": "JVTYjSno3B3s", 876 | "colab_type": "code", 877 | "colab": {} 878 | }, 879 | "source": [ 880 | "# Assign get_models() to a variable called models\n", 881 | "models = get_models()" 882 | ], 883 | "execution_count": null, 884 | "outputs": [] 885 | }, 886 | { 887 | "cell_type": "markdown", 888 | "metadata": { 889 | "id": "lNECWtJ74tZh", 890 | "colab_type": "text" 891 | }, 892 | "source": [ 893 | "## Custom function # 3: evaluate_model(model)" 894 | ] 895 | }, 896 | { 897 | "cell_type": "code", 898 | "metadata": { 899 | "id": "TsTJZKNk3XWc", 900 | "colab_type": "code", 901 | "colab": {} 902 | }, 903 | "source": [ 904 | "# Define evaluate_model(model):\n", 905 | "def :\n", 906 | "\n", 907 | "  # Create RepeatedStratifiedKFold cross-validator with 10 folds, 3 repeats and a seed of 1.\n", 908 | "  cv = \n", 909 | "\n", 910 | "  # Calculate accuracy using `cross_val_score()` with model instantiated, data to fit, target variable, 'accuracy' scoring, cross validator 'cv', n_jobs=-1, and error_score set to 'raise'\n", 911 | "  scores = \n", 912 | "\n", 913 | "  # return scores\n", 914 | "  " 915 | ], 916 | "execution_count": null, 917 | "outputs": [] 918 | }, 919 | { 920 | "cell_type": "markdown", 921 | "metadata": { 922 | "id": "3CxRVSe_DGlI", 923 | "colab_type": "text" 924 | }, 925 | "source": [ 926 | "# 10 minute break while the following runs..." 
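If you got stuck on the upgraded `get_stacking()` above, one possible completion is sketched here. This is only a hedged example, not the graded code; the official answers live in the Solution notebook.

```python
# One possible completion (ungraded sketch). best_model() returns a
# (name, fitted_estimator) tuple, which is exactly the format that
# StackingClassifier expects in its estimators list.
def get_stacking():
    layer1 = []
    layer1.append(best_model('SVM', SVC()))
    layer1.append(best_model('Bayes', GaussianNB()))
    layer1.append(best_model('RF', RandomForestClassifier()))
    layer1.append(best_model('XGB', xgb.XGBClassifier()))
    layer2 = LogisticRegression()
    model = StackingClassifier(estimators=layer1, final_estimator=layer2, cv=5)
    return model
```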
926 | ] 927 | }, 928 | { 929 | "cell_type": "code", 930 | "metadata": { 931 | "id": "rXrusVVBAbaJ", 932 | "colab_type": "code", 933 | "colab": {} 934 | }, 935 | "source": [ 936 | "# Evaluate the models and store results\n", 937 | "# Create an empty list for the results\n", 938 | "\n", 939 | "\n", 940 | "# Create an empty list for the model names\n", 941 | "\n", 942 | "\n", 943 | "# Create a for loop that iterates over each name, model in models dictionary \n", 944 | "for :\n", 945 | "\n", 946 | "\t# Call evaluate_model(model) and assign it to variable called scores\n", 947 | "\t\n", 948 | " \n", 949 | " # Append output from scores to the results list\n", 950 | "\t\n", 951 | " \n", 952 | " # Append name to the names list\n", 953 | "\t\n", 954 | " \n", 955 | " # Print name, mean and standard deviation of scores:\n", 956 | "\tprint('>%s %.3f (%.3f)' % (, mean(), std()))\n", 957 | "\n", 958 | "# Plot model performance for comparison using names for x and results for y and setting showmeans to True\n", 959 | "sns.boxplot(x=, y=, )" 960 | ], 961 | "execution_count": null, 962 | "outputs": [] 963 | }, 964 | { 965 | "cell_type": "markdown", 966 | "metadata": { 967 | "id": "uZlAHPaD419_", 968 | "colab_type": "text" 969 | }, 970 | "source": [ 971 | "## **Observation**\n", 972 | "- Before we added XGBoost and hyperparameter tuning, our Stacking Classifier got ~ 76% accuracy. \n", 973 | "- Here, we got just around 77% accuracy, a minor improvement, but an improvement nonetheless.\n", 974 | "- We could continue fiddling with other algorithms in layer 1\n", 975 | "- We could try other algorithms in layer 2.\n", 976 | "- We could add more hyperparameters to our parameter grid.\n", 977 | "- To this last point, keep in mind that the more parameters there are in a grid to search over, the longer it takes to train the Stacking Classifier." 978 | ] 979 | }, 980 | { 981 | "cell_type": "markdown", 982 | "metadata": { 983 | "id": "lj8WeJR__bUo", 984 | "colab_type": "text" 985 | }, 986 | "source": [ 987 | "---\n", 988 | "\n", 989 | "## Q&A\n", 990 | "\n", 991 | "--- " 992 | ] 993 | }, 994 | { 995 | "cell_type": "markdown", 996 | "metadata": { 997 | "id": "FPY3I2BlVxig", 998 | "colab_type": "text" 999 | }, 1000 | "source": [ 1001 | "# **Stacking Regressor**" 1002 | ] 1003 | }, 1004 | { 1005 | "cell_type": "code", 1006 | "metadata": { 1007 | "id": "ftxDhyDq2lrH", 1008 | "colab_type": "code", 1009 | "colab": {} 1010 | }, 1011 | "source": [ 1012 | "# Import libraries\n", 1013 | "from sklearn.model_selection import RepeatedKFold\n", 1014 | "from sklearn.dummy import DummyRegressor\n", 1015 | "from sklearn.svm import SVR" 1016 | ], 1017 | "execution_count": null, 1018 | "outputs": [] 1019 | }, 1020 | { 1021 | "cell_type": "markdown", 1022 | "metadata": { 1023 | "id": "nqDHD8A_nhPB", 1024 | "colab_type": "text" 1025 | }, 1026 | "source": [ 1027 | "## **2nd Dataset**\n", 1028 | "\n", 1029 | "\n", 1030 | "The second dataset we'll use is a CSV file named `abalone.csv`, which contains data on physical measurements of abalone shells used to determine the age of the abalone. 
It contains the following columns:\n", 1031 | "\n", 1032 | "- `Sex`: M, F, and I (infant) - (removed for our purposes)\n", 1033 | "- `Length`: Longest shell measurement (mm)\n", 1034 | "- `Diameter`: Perpendicular to length (mm)\n", 1035 | "- `Height`: Height with meat in shell (mm)\n", 1036 | "- `Whole weight`: Weight of the whole abalone (grams)\n", 1037 | "- `Shucked weight`: Weight of meat (grams)\n", 1038 | "- `Viscera weight`: Gut weight (grams)\n", 1039 | "- `Shell weight`: Weight after being dried (grams)\n", 1040 | "- `Rings`: +1.5 gives the age in years\n", 1041 | "\n", 1042 | "	" 1043 | ] 1044 | }, 1045 | { 1046 | "cell_type": "markdown", 1047 | "metadata": { 1048 | "id": "HwNnn3ZKrh1o", 1049 | "colab_type": "text" 1050 | }, 1051 | "source": [ 1052 | "### **Get the dataset**" 1053 | ] 1054 | }, 1055 | { 1056 | "cell_type": "code", 1057 | "metadata": { 1058 | "id": "K4LeaM4PzyAh", 1059 | "colab_type": "code", 1060 | "colab": {} 1061 | }, 1062 | "source": [ 1063 | "# Read in the dataset as Pandas DataFrame\n", 1064 | "abalone = pd.read_csv('https://github.com/datacamp/Applied-Machine-Learning-Ensemble-Modeling-live-training/blob/master/data/abalone.csv?raw=true')" 1065 | ], 1066 | "execution_count": null, 1067 | "outputs": [] 1068 | }, 1069 | { 1070 | "cell_type": "code", 1071 | "metadata": { 1072 | "id": "KfsmhIBdApVp", 1073 | "colab_type": "code", 1074 | "colab": {} 1075 | }, 1076 | "source": [ 1077 | "# Look at data using the info() function\n" 1078 | ], 1079 | "execution_count": null, 1080 | "outputs": [] 1081 | }, 1082 | { 1083 | "cell_type": "markdown", 1084 | "metadata": { 1085 | "id": "NZAeIFGwBhe6", 1086 | "colab_type": "text" 1087 | }, 1088 | "source": [ 1089 | "## **Observations:** \n", 1090 | "- Here, there are no missing values. Again, that is not typical.\n", 1091 | "- There is a mixture of object, float, and integers with the first column being `object` (categorical), the next 7 `float64` and the last `int64`." 1092 | ] 1093 | }, 1094 | { 1095 | "cell_type": "code", 1096 | "metadata": { 1097 | "id": "8D4Gfh08Avb2", 1098 | "colab_type": "code", 1099 | "colab": {} 1100 | }, 1101 | "source": [ 1102 | "# Look at data using the describe() function\n" 1103 | ], 1104 | "execution_count": null, 1105 | "outputs": [] 1106 | }, 1107 | { 1108 | "cell_type": "markdown", 1109 | "metadata": { 1110 | "id": "WDGc7PPBBkGX", 1111 | "colab_type": "text" 1112 | }, 1113 | "source": [ 1114 | "## **Observations:** \n", 1115 | "- Notice that the min of the `Height` column is zero. Even though there are no missing values, this is indicative of the measurements for that feature having not been captured.\n", 1116 | "- Again, the printout makes it appear as if all numeric values are float. \n", 1117 | "\n" 1118 | ] 1119 | }, 1120 | { 1121 | "cell_type": "code", 1122 | "metadata": { 1123 | "id": "FVGtuWoDAvl2", 1124 | "colab_type": "code", 1125 | "colab": {} 1126 | }, 1127 | "source": [ 1128 | "# Print the first 5 rows of the data using the head() function\n" 1129 | ], 1130 | "execution_count": null, 1131 | "outputs": [] 1132 | }, 1133 | { 1134 | "cell_type": "markdown", 1135 | "metadata": { 1136 | "id": "wnmVoSl8BmMY", 1137 | "colab_type": "text" 1138 | }, 1139 | "source": [ 1140 | "## **Observation:**\n", 1141 | "- Printing out the first 5 rows, we see that the first column is the only non-numeric feature in this dataset and is aligned with the `object` datatype as we saw above when we called `.info()`." 
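Since `Sex` is the lone `object` column and we said it would be removed, the upcoming cell slices it off by starting the column index at 1. A hedged sketch of that idea, using throwaway variable names so the graded cell stays untouched:

```python
# Ungraded sketch: drop the categorical Sex column (index 0) by slicing
# from column 1, and keep Rings (the last column) as the target.
values = abalone.to_numpy()
X_demo, y_demo = values[:, 1:-1], values[:, -1]
print(X_demo.shape, y_demo.shape)  # expect (n_rows, 7) and (n_rows,)
```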
1142 | ] 1143 | }, 1144 | { 1145 | "cell_type": "code", 1146 | "metadata": { 1147 | "id": "xPfVhWzRrm_w", 1148 | "colab_type": "code", 1149 | "colab": {} 1150 | }, 1151 | "source": [ 1152 | "# Convert Pandas DataFrame to numpy array - Return only the values of the DataFrame with DataFrame.to_numpy()\n", 1153 | "abalone = \n", 1154 | "\n", 1155 | "# Create X matrix and y (target) array using slicing [row_start:row_end, 1:target_col],[row_start:row_end, target_col] - Removing 1st column by starting at index 1\n", 1156 | "X, y = \n", 1157 | "\n", 1158 | "# Print X matrix and y (target) array dimensions using .shape\n", 1159 | "print('Shape: %s, %s' % ())" 1160 | ], 1161 | "execution_count": null, 1162 | "outputs": [] 1163 | }, 1164 | { 1165 | "cell_type": "code", 1166 | "metadata": { 1167 | "id": "fZ6CHfsVrpE7", 1168 | "colab_type": "code", 1169 | "colab": {} 1170 | }, 1171 | "source": [ 1172 | "# Convert y (target) array to 'float32' using .astype()\n", 1173 | "y = " 1174 | ], 1175 | "execution_count": null, 1176 | "outputs": [] 1177 | }, 1178 | { 1179 | "cell_type": "markdown", 1180 | "metadata": { 1181 | "id": "7bYvtBfSF7k7", 1182 | "colab_type": "text" 1183 | }, 1184 | "source": [ 1185 | "## **Creating a Naive Regressor**\n", 1186 | "Here we'll use the `DummyRegressor` from `sklearn`. This creates a so-called 'naive' regressor and is simply a model that predicts a single value for all of the rows, regardless of their original value. \n", 1187 | "\n", 1188 | "1. `DummyRegressor()` arguments:\n", 1189 | "  - `strategy`: Strategy to use to generate predictions.\n", 1190 | "\n", 1191 | "2. `RepeatedKFold()` arguments:\n", 1192 | "  - `n_splits`: Number of folds.\n", 1193 | "  - `n_repeats`: Number of times cross-validator needs to be repeated.\n", 1194 | "  - `random_state`: Controls the generation of the random states for each repetition. Pass an int for reproducible output across multiple function calls. (This is an equivalent argument to np.random.seed above, but will be specific to this naive model.)\n", 1195 | "\n", 1196 | "3. `cross_val_score()` arguments:\n", 1197 | "  - The model to use.\n", 1198 | "  - The data to fit. (X)\n", 1199 | "  - The target variable to try to predict. (y)\n", 1200 | "  - `scoring`: A single string scorer callable object/function such as 'neg_mean_absolute_error' for regression. See https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter for more options.\n", 1201 | "  - `cv`: Cross-validation splitting strategy (default is 5)\n", 1202 | "  - `n_jobs`: Number of CPU cores used when parallelizing. Setting it to -1 uses all available cores.\n", 1203 | "  - `error_score`: Value to assign to the score if an error occurs in estimator fitting. If set to ‘raise’, the error is raised. If a numeric value is given, FitFailedWarning is raised." 
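For reference, a hedged sketch of how these three pieces fit together (the graded cell with blanks follows; variable names here are prefixed with `demo_` so they don't collide):

```python
# Ungraded sketch wiring DummyRegressor + RepeatedKFold + cross_val_score.
# Assumes X and y from the cells above.
from numpy import mean, std
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

demo_naive = DummyRegressor(strategy='median')
demo_cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
demo_scores = cross_val_score(demo_naive, X, y, scoring='neg_mean_absolute_error',
                              cv=demo_cv, n_jobs=-1, error_score='raise')
print('Naive MAE (negated): %.3f (%.3f)' % (mean(demo_scores), std(demo_scores)))
```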
1204 | ] 1205 | }, 1206 | { 1207 | "cell_type": "code", 1208 | "metadata": { 1209 | "id": "jAJdcu_Hrrg8", 1210 | "colab_type": "code", 1211 | "colab": {} 1212 | }, 1213 | "source": [ 1214 | "# Evaluate naive\n", 1215 | "\n", 1216 | "# Instantiate a DummyRegressor with 'median' strategy\n", 1217 | "naive = \n", 1218 | "\n", 1219 | "# Create RepeatedKFold cross-validator with 10 folds, 3 repeats and a seed of 1.\n", 1220 | "cv = \n", 1221 | "\n", 1222 | "# Calculate the error using `cross_val_score()` with model instantiated, data to fit, target variable, 'neg_mean_absolute_error' scoring, cross validator, n_jobs=-1, and error_score set to 'raise'\n", 1223 | "n_scores = \n", 1224 | "\n", 1225 | "# Print mean and standard deviation of n_scores:\n", 1226 | "print('Naive score: %.3f (%.3f)' % (mean(), std()))" 1227 | ], 1228 | "execution_count": null, 1229 | "outputs": [] 1230 | }, 1231 | { 1232 | "cell_type": "markdown", 1233 | "metadata": { 1234 | "id": "dlYQmsCQHcdJ", 1235 | "colab_type": "text" 1236 | }, 1237 | "source": [ 1238 | "## **Observation** \n", 1239 | "- We want to do better than -2.37 to consider any other models as an improvement to a totally naive regressor model with the Abalone dataset." 1240 | ] 1241 | }, 1242 | { 1243 | "cell_type": "markdown", 1244 | "metadata": { 1245 | "id": "ZfiEdoUMHo-q", 1246 | "colab_type": "text" 1247 | }, 1248 | "source": [ 1249 | "## **Creating a Baseline Regressor**\n", 1250 | "Now we'll create a baseline regressor, one that seeks to correctly predict the value for each observation. Since the target variable is continuous, we'll instantiate a Support Vector Regression model.\n", 1251 | "\n", 1252 | "1. `SVR()` arguments:\n", 1253 | "  - `kernel`: Specifies the kernel type to be used in the algorithm.\n", 1254 | "  - `gamma`: Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. \n", 1255 | "  - `C`: Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty." 1256 | ] 1257 | }, 1258 | { 1259 | "cell_type": "code", 1260 | "metadata": { 1261 | "id": "cFip40FPrvOn", 1262 | "colab_type": "code", 1263 | "colab": {} 1264 | }, 1265 | "source": [ 1266 | "# Evaluate baseline model\n", 1267 | "\n", 1268 | "# Instantiate a Support Vector Regressor with 'rbf' kernel, gamma set to 'scale', and regularization parameter set to 10\n", 1269 | "model = \n", 1270 | "\n", 1271 | "# Calculate the error using `cross_val_score()` with model instantiated, data to fit, target variable, 'neg_mean_absolute_error' scoring, cross validator 'cv', n_jobs=-1, and error_score set to 'raise'\n", 1272 | "m_scores = \n", 1273 | "\n", 1274 | "# Print mean and standard deviation of m_scores: \n", 1275 | "print('Baseline score: %.3f (%.3f)' % (mean(), std()))" 1276 | ], 1277 | "execution_count": null, 1278 | "outputs": [] 1279 | }, 1280 | { 1281 | "cell_type": "markdown", 1282 | "metadata": { 1283 | "id": "Z_PMtVARKzBX", 1284 | "colab_type": "text" 1285 | }, 1286 | "source": [ 1287 | "## **Observation**\n", 1288 | "- We want to do better than -1.48 with a Stacking Regressor to consider it an improvement over this baseline support vector regression model with the Abalone dataset." 
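A note on reading these numbers: scikit-learn negates the mean absolute error so that "greater is better" holds for every scorer. A score of -1.48 therefore means the model is off by about 1.48 rings on average. Once `m_scores` is filled in above, the raw MAE can be recovered like so:

```python
# Flip the sign of the negated scores to get the plain MAE in rings.
mae = -mean(m_scores)
print('Mean absolute error: %.2f rings' % mae)
```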
1289 | ] 1290 | }, 1291 | { 1292 | "cell_type": "markdown", 1293 | "metadata": { 1294 | "id": "J-OGF_7bupzn", 1295 | "colab_type": "text" 1296 | }, 1297 | "source": [ 1298 | "## **Getting started with Stacking Regressor**\n", 1299 | "- We're going to compare several additional baseline regressors to see if they perform better than the SVR we just trained previously.\n", 1300 | "- We'll start by importing additional packages that we'll need." 1301 | ] 1302 | }, 1303 | { 1304 | "cell_type": "code", 1305 | "metadata": { 1306 | "id": "jxbxTPkPrkNb", 1307 | "colab_type": "code", 1308 | "colab": {} 1309 | }, 1310 | "source": [ 1311 | "# Compare machine learning models for regression\n", 1312 | "from sklearn.linear_model import LinearRegression\n", 1313 | "from sklearn.neighbors import KNeighborsRegressor\n", 1314 | "from sklearn.tree import DecisionTreeRegressor\n", 1315 | "from sklearn.ensemble import StackingRegressor" 1316 | ], 1317 | "execution_count": null, 1318 | "outputs": [] 1319 | }, 1320 | { 1321 | "cell_type": "markdown", 1322 | "metadata": { 1323 | "id": "yixxr2JLN9UP", 1324 | "colab_type": "text" 1325 | }, 1326 | "source": [ 1327 | "## Create custom functions\n", 1328 | "1. get_stacking() - This function will create the layers of our `StackingRegressor()`.\n", 1329 | "2. get_models() - This function will create a dictionary of models to be evaluated.\n", 1330 | "3. evaluate_model() - This function will evaluate each of the models to be compared." 1331 | ] 1332 | }, 1333 | { 1334 | "cell_type": "markdown", 1335 | "metadata": { 1336 | "id": "FdF239ZRN92B", 1337 | "colab_type": "text" 1338 | }, 1339 | "source": [ 1340 | "## Custom function # 1: get_stacking()\n", 1341 | "1. `StackingRegressor()` arguments:\n", 1342 | "  - `estimators`: List of baseline regressors\n", 1343 | "  - `final_estimator`: Defined meta regressor \n", 1344 | "  - `cv`: Number of cross validations to perform."
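As a self-contained reference for the API described above (independent of the graded blanks that follow), a minimal `StackingRegressor` might be wired up like this:

```python
# Hedged sketch of the StackingRegressor API: two base regressors in the
# first layer, a linear meta-learner, and 5-fold internal cross-validation.
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

estimators = [('knn', KNeighborsRegressor()), ('svm', SVR())]
stack = StackingRegressor(estimators=estimators,
                          final_estimator=LinearRegression(), cv=5)
# stack.fit(X, y) trains the base models, generates out-of-fold predictions,
# and fits the meta-learner on those predictions.
```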
1345 | ] 1346 | }, 1347 | { 1348 | "cell_type": "code", 1349 | "metadata": { 1350 | "id": "qoRNxZSj72bZ", 1351 | "colab_type": "code", 1352 | "colab": {} 1353 | }, 1354 | "source": [ 1355 | "# Define get_stacking():\n", 1356 | "def :\n", 1357 | "\n", 1358 | "	# Create an empty list for the base models called layer1\n", 1359 | "  \n", 1360 | "\n", 1361 | "  # Append tuple with regressor name and instantiations (no arguments) for KNeighborsRegressor, DecisionTreeRegressor, and SVR base models\n", 1362 | "  # Hint: layer1.append(('ModelName', Regressor()))\n", 1363 | "  \n", 1364 | "\n", 1365 | "  # Instantiate Linear Regression as meta learner model called layer2\n", 1366 | "  \n", 1367 | "\n", 1368 | "	# Define StackingRegressor() called model passing layer1 model list and meta learner with 5 cross-validations\n", 1369 | "  \n", 1370 | "\n", 1371 | "  # return model\n", 1372 | "  " 1373 | ], 1374 | "execution_count": null, 1375 | "outputs": [] 1376 | }, 1377 | { 1378 | "cell_type": "markdown", 1379 | "metadata": { 1380 | "id": "KClsJExROLAZ", 1381 | "colab_type": "text" 1382 | }, 1383 | "source": [ 1384 | "## Custom function # 2: get_models()" 1385 | ] 1386 | }, 1387 | { 1388 | "cell_type": "code", 1389 | "metadata": { 1390 | "id": "PtYbhE_ps4yo", 1391 | "colab_type": "code", 1392 | "colab": {} 1393 | }, 1394 | "source": [ 1395 | "# Define get_models():\n", 1396 | "def :\n", 1397 | "\n", 1398 | "  # Create empty dictionary called models\n", 1399 | "  \n", 1400 | "\n", 1401 | "  # Add key:value pairs to dictionary with key as ModelName and value as instantiations (no arguments) for KNeighborsRegressor, DecisionTreeRegressor, and SVR base models\n", 1402 | "  # Hint: models['ModelName'] = Regressor()\n", 1403 | "  \n", 1404 | "\n", 1405 | "  # Add key:value pair to dictionary with key called Stacking and value that calls get_stacking() custom function\n", 1406 | "  \n", 1407 | "\n", 1408 | "  # return dictionary\n", 1409 | "  " 1410 | ], 1411 | "execution_count": null, 1412 | "outputs": [] 1413 | }, 1414 | { 1415 | "cell_type": "markdown", 1416 | "metadata": { 1417 | "id": "SYH3KcjcOc56", 1418 | "colab_type": "text" 1419 | }, 1420 | "source": [ 1421 | "## Custom function # 3: evaluate_model(model)" 1422 | ] 1423 | }, 1424 | { 1425 | "cell_type": "code", 1426 | "metadata": { 1427 | "id": "H95M82gks6EL", 1428 | "colab_type": "code", 1429 | "colab": {} 1430 | }, 1431 | "source": [ 1432 | "# Define evaluate_model:\n", 1433 | "def :\n", 1434 | "\n", 1435 | "  # Create RepeatedKFold cross-validator with 10 folds, 3 repeats and a seed of 1.\n", 1436 | "	cv = \n", 1437 | "  \n", 1438 | "  # Calculate the error using `cross_val_score()` with model instantiated, data to fit, target variable, 'neg_mean_absolute_error' scoring, cross validator 'cv', n_jobs=-1, and error_score set to 'raise'\n", 1439 | "	scores = \n", 1440 | "  \n", 1441 | "  # return scores\n", 1442 | "	" 1443 | ], 1444 | "execution_count": null, 1445 | "outputs": [] 1446 | }, 1447 | { 1448 | "cell_type": "code", 1449 | "metadata": { 1450 | "id": "2C6Hw-wj56eK", 1451 | "colab_type": "code", 1452 | "colab": {} 1453 | }, 1454 | "source": [ 1455 | "# Assign get_models() to a variable called models\n" 1456 | ], 1457 | "execution_count": null, 1458 | "outputs": [] 1459 | }, 1460 | { 1461 | "cell_type": "code", 1462 | "metadata": { 1463 | "id": "BZl3DjmU58Lm", 1464 | "colab_type": "code", 1465 | "colab": {} 1466 | }, 1467 | "source": [ 1468 | "# Evaluate the models and store results\n", 1469 | "# Create an empty list for the results\n", 1470 | "\n", 1471 | 
"\n", 1472 | "# Create an empty list for the model names\n", 1473 | "\n", 1474 | "\n", 1475 | "# Create a for loop that iterates over each name, model in models dictionary \n", 1476 | "for :\n", 1477 | "\n", 1478 | "\t# Call evaluate_model(model) and assign it to variable called scores\n", 1479 | "\t\n", 1480 | " \n", 1481 | " # Append output from scores to the results list\n", 1482 | "\t\n", 1483 | " \n", 1484 | " # Append name to the names list\n", 1485 | "\t\n", 1486 | " \n", 1487 | " # Print name, mean and standard deviation of scores:\n", 1488 | "\tprint('>%s %.3f (%.3f)' % (, (), ()))\n", 1489 | " \n", 1490 | "# Plot model performance for comparison using names for x and results for y and setting showmeans to True\n", 1491 | "sns.boxplot(x=, y=, )" 1492 | ], 1493 | "execution_count": null, 1494 | "outputs": [] 1495 | }, 1496 | { 1497 | "cell_type": "markdown", 1498 | "metadata": { 1499 | "id": "d6EKNBV1UOuG", 1500 | "colab_type": "text" 1501 | }, 1502 | "source": [ 1503 | "## **Observation**\n", 1504 | "- Recall that we want to do better than -1.48 with a Stacking Regressor to consider it an improvement over this baseline SVR and, although close, we did not achieve that with this dataset.\n", 1505 | "- So what else can try to improve our results with stacking?\n", 1506 | "\n", 1507 | "### We'll add another layer to the mix..." 1508 | ] 1509 | }, 1510 | { 1511 | "cell_type": "markdown", 1512 | "metadata": { 1513 | "id": "N9DZ7iyZFxXo", 1514 | "colab_type": "text" 1515 | }, 1516 | "source": [ 1517 | "## **Double Stacking - 2 Layers**\n", 1518 | "- Can get a little tricky\n", 1519 | "- Just make sure that you name your layers VERY CLEARLY!\n", 1520 | "- Both the last layer (here it's layer 3) and the stacking model will use a call to `StackingRegressor()`\n", 1521 | "- The last layer will combine the 2nd layer with the final estimator while the model will combine the 1st layer with this last layer.\n", 1522 | "\n", 1523 | "

\n", 1524 | "\"Double\n", 1525 | "

\n", 1526 | "

" 1527 | ] 1528 | }, 1529 | { 1530 | "cell_type": "code", 1531 | "metadata": { 1532 | "id": "fXvUmmQQF6vq", 1533 | "colab_type": "code", 1534 | "colab": {} 1535 | }, 1536 | "source": [ 1537 | "# Define get_stacking() - adding another layer:\n", 1538 | "def :\n", 1539 | "\n", 1540 | "\t# Create an empty list for the 1st layer of base models called layer1\n", 1541 | " \n", 1542 | "\n", 1543 | " # Create an empty list for the 2nd layer of base models called layer2\n", 1544 | " \n", 1545 | "\n", 1546 | " # Append tuple with classifier name and instantiations (no arguments) for KNeighborsRegressor, DecisionTreeRegressor, and SVR base models\n", 1547 | " # Hint: layer1.append(('ModelName', Classifier()))\n", 1548 | " \n", 1549 | "\n", 1550 | " # Append tuple with classifier name and instantiations (no arguments) for KNeighborsRegressor, DecisionTreeRegressor, and SVR base models\n", 1551 | " # Hint: layer2.append(('ModelName', Classifier()))\n", 1552 | " \n", 1553 | "\n", 1554 | "\t# Define meta learner StackingRegressor() called layer3 passing layer2 model list to estimators, LinearRegression() to final_estimator with 5 cross-validations\n", 1555 | " layer3 = \n", 1556 | "\n", 1557 | "\t# Define StackingRegressor() called model passing layer1 model list to estimators and meta learner (layer3) to final_estimator with 5 cross-validations\n", 1558 | " model = \n", 1559 | "\n", 1560 | " # return model\n", 1561 | " " 1562 | ], 1563 | "execution_count": null, 1564 | "outputs": [] 1565 | }, 1566 | { 1567 | "cell_type": "code", 1568 | "metadata": { 1569 | "id": "CnMMqOJ16Bft", 1570 | "colab_type": "code", 1571 | "colab": {} 1572 | }, 1573 | "source": [ 1574 | "# Assign get_models() to a variable called models\n" 1575 | ], 1576 | "execution_count": null, 1577 | "outputs": [] 1578 | }, 1579 | { 1580 | "cell_type": "code", 1581 | "metadata": { 1582 | "id": "kvzSjLOEIKUx", 1583 | "colab_type": "code", 1584 | "colab": {} 1585 | }, 1586 | "source": [ 1587 | "# Evaluate the models and store results\n", 1588 | "# Create an empty list for the results\n", 1589 | "\n", 1590 | "\n", 1591 | "# Create an empty list for the model names\n", 1592 | "\n", 1593 | "\n", 1594 | "# Create a for loop that iterates over each name, model in models dictionary \n", 1595 | "for ):\n", 1596 | "\n", 1597 | "\t# Call evaluate_model(model) and assign it to variable called scores\n", 1598 | "\t\n", 1599 | " \n", 1600 | " # Append output from scores to the results list\n", 1601 | "\t\n", 1602 | " \n", 1603 | " # Append name to the names list\n", 1604 | "\t\n", 1605 | " \n", 1606 | " # Print name, mean and standard deviation of scores:\n", 1607 | "\tprint('>%s %.3f (%.3f)' % (, (), ()))\n", 1608 | " \n", 1609 | "# Plot model performance for comparison using names for x and results for y and setting showmeans to True\n", 1610 | "sns.( , , )" 1611 | ], 1612 | "execution_count": null, 1613 | "outputs": [] 1614 | }, 1615 | { 1616 | "cell_type": "markdown", 1617 | "metadata": { 1618 | "id": "ZMgN44SwcJPG", 1619 | "colab_type": "text" 1620 | }, 1621 | "source": [ 1622 | "## **Final Observation**\n", 1623 | "- Adding a layer did not improve results.\n", 1624 | "- Complexity does not always make a better model\n", 1625 | "- Could try different base models to stack for both of the datasets and that may show improvements over baseline.\n", 1626 | "- Generate polynomial features \n", 1627 | "- Try sklearn feature selection\n", 1628 | "- Try feature engineering - creating new features from existing ones (but remember to remove the original features 
to avoid multicollinearity)\n", 1629 | "- Tune hyperparameters via grid search, as we did previously with the Stacking Classifier\n", 1630 | "- When there is a tie between a baseline model and a stacked model, choose the simpler model!" 1631 | ] 1632 | }, 1633 | { 1634 | "cell_type": "markdown", 1635 | "metadata": { 1636 | "id": "Z4iX02EkDujS", 1637 | "colab_type": "text" 1638 | }, 1639 | "source": [ 1640 | "---\n", 1641 | "\n", 1642 | "# Q&A\n", 1643 | "\n", 1644 | "---" 1645 | ] 1646 | }, 1647 | { 1648 | "cell_type": "markdown", 1649 | "metadata": { 1650 | "id": "kNWB_J4QD0Ad", 1651 | "colab_type": "text" 1652 | }, 1653 | "source": [ 1654 | "# Back to the slides for wrap-up..." 1655 | ] 1656 | } 1657 | ] 1658 | } -------------------------------------------------------------------------------- /notebooks/Applied_Machine_Learning_Ensemble_Modeling_Solution.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "python_live_session_template.ipynb", 7 | "provenance": [] 8 | }, 9 | "kernelspec": { 10 | "display_name": "Python 3", 11 | "language": "python", 12 | "name": "python3" 13 | }, 14 | "language_info": { 15 | "codemirror_mode": { 16 | "name": "ipython", 17 | "version": 3 18 | }, 19 | "file_extension": ".py", 20 | "mimetype": "text/x-python", 21 | "name": "python", 22 | "nbconvert_exporter": "python", 23 | "pygments_lexer": "ipython3", 24 | "version": "3.7.1" 25 | } 26 | }, 27 | "cells": [ 28 | { 29 | "cell_type": "markdown", 30 | "metadata": { 31 | "colab_type": "text", 32 | "id": "6Ijg5wUCTQYG" 33 | }, 34 | "source": [ 35 | "<p align=\"center\">\n", 36 | "<img src=\"https://raw.githubusercontent.com/datacamp/Applied-Machine-Learning-Ensemble-Modeling-live-training/master/assets/datacamp.svg\" alt=\"DataCamp icon\">\n", 37 | "</p>\n", 38 | "
\n", 39 | "\n", 40 | "\n", 41 | "## **Applied Machine Learning - Ensemble Modeling Live Training**\n", 42 | "\n", 43 | "Welcome to this hands-on training where you will immerse yourself in applied machine learning in Python where we'll explore model stacking. Using `sklearn.ensemble`, we'll learn how to create layers that are stacking-ready.\n", 44 | "\n", 45 | "The foundations of model stacking:\n", 46 | "\n", 47 | "* Create various types of baseline models, including linear and logistic regression using Scikit-Learn, for comparison to ensemble methods.\n", 48 | "* Build layers, then stack them up.\n", 49 | "* Calculate and visualize performance metrics.\n", 50 | "\n", 51 | "\n", 52 | "\n", 53 | "---\n", 54 | "\n", 55 | "\n", 56 | "\n", 57 | "## **1st Dataset**\n", 58 | "\n", 59 | "\n", 60 | "The first dataset we'll use is a CSV file named `pima-indians-diabetes.csv`, which contains data on females of Pima Indian heritage that are at least 21 years old. It contains the following columns:\n", 61 | "\n", 62 | "- `n_preg`: Number of pregnancies\n", 63 | "- `pl_glucose`: Plasma glucose concentration 2 hours after an oral glucose tolerance test\n", 64 | "- `dia_bp`: Diastolic blood pressure (mm Hg)\n", 65 | "- `tri_thick`: Triceps skin fold thickness (mm)\n", 66 | "- `serum_ins`: 2-Hour serum insulin (mu U/ml)\n", 67 | "- `bmi`: Body mass index (weight in kg/(height in m)^2)\n", 68 | "- `diab_ped`: Diabetes pedigree function\n", 69 | "- `age`: Age (years)\n", 70 | "- `class`: Class variable (0 or 1)\n" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "metadata": { 76 | "colab_type": "code", 77 | "id": "EMQfyC7GUNhT", 78 | "colab": { 79 | "base_uri": "https://localhost:8080/", 80 | "height": 51 81 | }, 82 | "outputId": "d5eb31f1-293e-40bd-9dd4-8d2a693deb66" 83 | }, 84 | "source": [ 85 | "# Import libraries\n", 86 | "import pandas as pd\n", 87 | "import numpy as np\n", 88 | "from numpy import mean\n", 89 | "from numpy import std\n", 90 | "import matplotlib.pyplot as plt\n", 91 | "import seaborn as sns\n", 92 | "from collections import Counter\n", 93 | "from sklearn.preprocessing import LabelEncoder\n", 94 | "from sklearn.model_selection import cross_val_score\n", 95 | "from sklearn.model_selection import RepeatedStratifiedKFold\n", 96 | "from sklearn.dummy import DummyClassifier\n", 97 | "from sklearn.tree import DecisionTreeClassifier" 98 | ], 99 | "execution_count": null, 100 | "outputs": [ 101 | { 102 | "output_type": "stream", 103 | "text": [ 104 | "/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. 
Use the functions in the public API at pandas.testing instead.\n", 105 | " import pandas.util.testing as tm\n" 106 | ], 107 | "name": "stderr" 108 | } 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "metadata": { 114 | "colab_type": "code", 115 | "id": "l8t_EwRNZPLB", 116 | "colab": {} 117 | }, 118 | "source": [ 119 | "# Read in the dataset as Pandas DataFrame\n", 120 | "diabetes = pd.read_csv('https://github.com/datacamp/Applied-Machine-Learning-Ensemble-Modeling-live-training/blob/master/data/pima-indians-diabetes.csv?raw=true')" 121 | ], 122 | "execution_count": null, 123 | "outputs": [] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "metadata": { 128 | "id": "PRJPuinPZpGA", 129 | "colab_type": "code", 130 | "colab": { 131 | "base_uri": "https://localhost:8080/", 132 | "height": 289 133 | }, 134 | "outputId": "fe86f395-72c2-438a-c48c-4d16ec7a93af" 135 | }, 136 | "source": [ 137 | "# Look at data using the info() function\n", 138 | "diabetes.info()" 139 | ], 140 | "execution_count": null, 141 | "outputs": [ 142 | { 143 | "output_type": "stream", 144 | "text": [ 145 | "\n", 146 | "RangeIndex: 768 entries, 0 to 767\n", 147 | "Data columns (total 9 columns):\n", 148 | " # Column Non-Null Count Dtype \n", 149 | "--- ------ -------------- ----- \n", 150 | " 0 n_preg 768 non-null int64 \n", 151 | " 1 pl_glucose 768 non-null int64 \n", 152 | " 2 dia_bp 768 non-null int64 \n", 153 | " 3 tri_thick 768 non-null int64 \n", 154 | " 4 serum_ins 768 non-null int64 \n", 155 | " 5 bmi 768 non-null float64\n", 156 | " 6 diab_ped 768 non-null float64\n", 157 | " 7 age 768 non-null int64 \n", 158 | " 8 class 768 non-null int64 \n", 159 | "dtypes: float64(2), int64(7)\n", 160 | "memory usage: 54.1 KB\n" 161 | ], 162 | "name": "stdout" 163 | } 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": { 169 | "id": "C6OVOkU80oKP", 170 | "colab_type": "text" 171 | }, 172 | "source": [ 173 | "## **Observations:** \n", 174 | "- The `info()` function is critical to beginning to understand your data. Here, there are no missing values. However, that is not typical.\n", 175 | "- There is a mixture of integers and floats with the first 5 columns being `int64`, the next 2 `float64` and the last 2 'int64`." 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": { 181 | "id": "v3hAsYrhVi4L", 182 | "colab_type": "text" 183 | }, 184 | "source": [ 185 | "---\n", 186 | "\n", 187 | "## Q&A\n", 188 | "\n", 189 | "--- " 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "metadata": { 195 | "id": "E6UtlpG_Zo50", 196 | "colab_type": "code", 197 | "colab": { 198 | "base_uri": "https://localhost:8080/", 199 | "height": 297 200 | }, 201 | "outputId": "dadacd3e-88ab-4768-c3b4-a501f3b4c4aa" 202 | }, 203 | "source": [ 204 | "# Look at data using the describe() function\n", 205 | "diabetes.describe()" 206 | ], 207 | "execution_count": null, 208 | "outputs": [ 209 | { 210 | "output_type": "execute_result", 211 | "data": { 212 | "text/html": [ 213 | "
\n", 214 | "\n", 227 | "\n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | "
n_pregpl_glucosedia_bptri_thickserum_insbmidiab_pedageclass
count768.000000768.000000768.000000768.000000768.000000768.000000768.000000768.000000768.000000
mean3.845052120.89453169.10546920.53645879.79947931.9925780.47187633.2408850.348958
std3.36957831.97261819.35580715.952218115.2440027.8841600.33132911.7602320.476951
min0.0000000.0000000.0000000.0000000.0000000.0000000.07800021.0000000.000000
25%1.00000099.00000062.0000000.0000000.00000027.3000000.24375024.0000000.000000
50%3.000000117.00000072.00000023.00000030.50000032.0000000.37250029.0000000.000000
75%6.000000140.25000080.00000032.000000127.25000036.6000000.62625041.0000001.000000
max17.000000199.000000122.00000099.000000846.00000067.1000002.42000081.0000001.000000
\n", 341 | "
" 342 | ], 343 | "text/plain": [ 344 | " n_preg pl_glucose dia_bp ... diab_ped age class\n", 345 | "count 768.000000 768.000000 768.000000 ... 768.000000 768.000000 768.000000\n", 346 | "mean 3.845052 120.894531 69.105469 ... 0.471876 33.240885 0.348958\n", 347 | "std 3.369578 31.972618 19.355807 ... 0.331329 11.760232 0.476951\n", 348 | "min 0.000000 0.000000 0.000000 ... 0.078000 21.000000 0.000000\n", 349 | "25% 1.000000 99.000000 62.000000 ... 0.243750 24.000000 0.000000\n", 350 | "50% 3.000000 117.000000 72.000000 ... 0.372500 29.000000 0.000000\n", 351 | "75% 6.000000 140.250000 80.000000 ... 0.626250 41.000000 1.000000\n", 352 | "max 17.000000 199.000000 122.000000 ... 2.420000 81.000000 1.000000\n", 353 | "\n", 354 | "[8 rows x 9 columns]" 355 | ] 356 | }, 357 | "metadata": { 358 | "tags": [] 359 | }, 360 | "execution_count": 4 361 | } 362 | ] 363 | }, 364 | { 365 | "cell_type": "markdown", 366 | "metadata": { 367 | "id": "bCK9W_gk1HG8", 368 | "colab_type": "text" 369 | }, 370 | "source": [ 371 | "\n", 372 | "## **Observations:** \n", 373 | "- The `.describe()` function gives the summary statistics of the data. Notice that the min of the 1st six columns is zero. Even though there are no missing values, this is indicative of the measurements for those features having not been captured.\n", 374 | "- Although we previously saw there is a mixture of integer and float data types (as seen with `.info()`), the printout makes it appear as if all values are float. " 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "metadata": { 380 | "id": "UE5F_JUQ2X-0", 381 | "colab_type": "code", 382 | "colab": { 383 | "base_uri": "https://localhost:8080/", 384 | "height": 204 385 | }, 386 | "outputId": "7cdeee97-80fc-4553-a3ec-c755bf8f19d2" 387 | }, 388 | "source": [ 389 | "# Print the first 5 rows of the data using the head() function\n", 390 | "diabetes.head()" 391 | ], 392 | "execution_count": null, 393 | "outputs": [ 394 | { 395 | "output_type": "execute_result", 396 | "data": { 397 | "text/html": [ 398 | "
\n", 399 | "\n", 412 | "\n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | "
n_pregpl_glucosedia_bptri_thickserum_insbmidiab_pedageclass
061487235033.60.627501
11856629026.60.351310
28183640023.30.672321
318966239428.10.167210
40137403516843.12.288331
\n", 490 | "
" 491 | ], 492 | "text/plain": [ 493 | " n_preg pl_glucose dia_bp tri_thick serum_ins bmi diab_ped age class\n", 494 | "0 6 148 72 35 0 33.6 0.627 50 1\n", 495 | "1 1 85 66 29 0 26.6 0.351 31 0\n", 496 | "2 8 183 64 0 0 23.3 0.672 32 1\n", 497 | "3 1 89 66 23 94 28.1 0.167 21 0\n", 498 | "4 0 137 40 35 168 43.1 2.288 33 1" 499 | ] 500 | }, 501 | "metadata": { 502 | "tags": [] 503 | }, 504 | "execution_count": 5 505 | } 506 | ] 507 | }, 508 | { 509 | "cell_type": "markdown", 510 | "metadata": { 511 | "id": "A2VCIx0K2bT1", 512 | "colab_type": "text" 513 | }, 514 | "source": [ 515 | "\n", 516 | "## **Observation:**\n", 517 | "- Printing out the first 5 rows, we see that the data types of the columns are indeed as stated previously." 518 | ] 519 | }, 520 | { 521 | "cell_type": "markdown", 522 | "metadata": { 523 | "id": "ajAzhMDc2b1D", 524 | "colab_type": "text" 525 | }, 526 | "source": [ 527 | "## Let's check the number in each class:\n", 528 | "\n", 529 | "This avoids getting surprised by great results that are actually a side effect of class imbalance. This happens when the majority class far outweighs the minority class." 530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "metadata": { 535 | "id": "MKeXN3441-9W", 536 | "colab_type": "code", 537 | "colab": { 538 | "base_uri": "https://localhost:8080/", 539 | "height": 34 540 | }, 541 | "outputId": "a698fc39-4ac4-41bb-e77a-5cd6172f01a4" 542 | }, 543 | "source": [ 544 | "# Summarize class distribution\n", 545 | "target = diabetes['class']\n", 546 | "counter = Counter(target)\n", 547 | "print(counter)" 548 | ], 549 | "execution_count": null, 550 | "outputs": [ 551 | { 552 | "output_type": "stream", 553 | "text": [ 554 | "Counter({0: 500, 1: 268})\n" 555 | ], 556 | "name": "stdout" 557 | } 558 | ] 559 | }, 560 | { 561 | "cell_type": "markdown", 562 | "metadata": { 563 | "id": "FOpbGyQw55v3", 564 | "colab_type": "text" 565 | }, 566 | "source": [ 567 | "## **Observation:** For every two negative cases there is one positive case, not enough of a difference to be considered class imbalance. \n", 568 | "- Class imbalance tends to exist when the majority class is > 90% although there is no hard and fast rule about this threshold." 569 | ] 570 | }, 571 | { 572 | "cell_type": "code", 573 | "metadata": { 574 | "id": "n5XaYl9ZZ8B5", 575 | "colab_type": "code", 576 | "colab": {} 577 | }, 578 | "source": [ 579 | "# Convert Pandas DataFrame to numpy array - Return only the values of the DataFrame with DataFrame.to_numpy()\n", 580 | "diabetes = diabetes.to_numpy()" 581 | ], 582 | "execution_count": null, 583 | "outputs": [] 584 | }, 585 | { 586 | "cell_type": "markdown", 587 | "metadata": { 588 | "id": "MlGa9IBc7Gsr", 589 | "colab_type": "text" 590 | }, 591 | "source": [ 592 | "### Always verify that your X matrix and target array have the same number of rows to avoid errors during model training." 
593 | ] 594 | }, 595 | { 596 | "cell_type": "code", 597 | "metadata": { 598 | "id": "9FEvD6Ab6InP", 599 | "colab_type": "code", 600 | "colab": { 601 | "base_uri": "https://localhost:8080/", 602 | "height": 34 603 | }, 604 | "outputId": "5e5bd9ad-05ff-40c7-8c38-3f0c1877d1ee" 605 | }, 606 | "source": [ 607 | "# Create X matrix and y (target) array using slicing [row_start:row_end, col_start:target_col],[row_start:row_end, target_col]\n", 608 | "X, y = diabetes[:, :-1], diabetes[:, -1]\n", 609 | "\n", 610 | "# Print X matrix and y (target) array dimensions using .shape \n", 611 | "print('Shape: %s, %s' % (X.shape, y.shape))" 612 | ], 613 | "execution_count": null, 614 | "outputs": [ 615 | { 616 | "output_type": "stream", 617 | "text": [ 618 | "Shape: (768, 8), (768,)\n" 619 | ], 620 | "name": "stdout" 621 | } 622 | ] 623 | }, 624 | { 625 | "cell_type": "code", 626 | "metadata": { 627 | "id": "hoI7t4U-Z8LU", 628 | "colab_type": "code", 629 | "colab": {} 630 | }, 631 | "source": [ 632 | "# Convert X matrix data types to 'float32' for consistency using .astype()\n", 633 | "X = X.astype('float32')\n", 634 | "\n", 635 | "# Convert y (target) array to 'str' using .astype()\n", 636 | "y = y.astype('str')\n", 637 | "\n", 638 | "# Encode class labels in y array using dot notation with LabelEncoder().fit_transform()\n", 639 | "# Hint: y goes in the fit_transform function call\n", 640 | "y = LabelEncoder().fit_transform(y)" 641 | ], 642 | "execution_count": null, 643 | "outputs": [] 644 | }, 645 | { 646 | "cell_type": "markdown", 647 | "metadata": { 648 | "id": "djXWv2xp9v1q", 649 | "colab_type": "text" 650 | }, 651 | "source": [ 652 | "### Don't let the `.astype('str')` throw you! This is simply taking the class labels and label encoding them – regardless of their original format.\n", 653 | "\n", 654 | "\n" 655 | ] 656 | }, 657 | { 658 | "cell_type": "markdown", 659 | "metadata": { 660 | "id": "OHHu8uz7_yVa", 661 | "colab_type": "text" 662 | }, 663 | "source": [ 664 | "## **Creating a Naive Classifier**\n", 665 | "Here we'll use the `DummyClassifier` from `sklearn`. This creates a so-called 'naive' classifer and is simply a model that predicts a single class for all of the rows, regardless of their original class. \n", 666 | "\n", 667 | "1. `DummyClassifier()` arguments:\n", 668 | " - `strategy`: Strategy to use to generate predictions.\n", 669 | "\n", 670 | "2. `RepeatedStratifiedKFold()` arguments:\n", 671 | " - `n_splits`: Number of folds.\n", 672 | " - `n_repeats`: Number of times cross-validator needs to be repeated.\n", 673 | " - `random_state`: Controls the generation of the random states for each repetition. Pass an int for reproducible output across multiple function calls. (This is an equivalent argument to np.random.seed above, but will be specific to this naive model.)\n", 674 | "\n", 675 | "3. `cross_val_score()` arguments:\n", 676 | " - The model to use.\n", 677 | " - The data to fit. (X)\n", 678 | " - The target variable to try to predict. (y)\n", 679 | " - `scoring`: A single string scorer callable object/function such as 'accuracy' or 'roc_auc'. See https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter for more options.\n", 680 | " - `cv`: Cross-validation splitting strategy (default is 5)\n", 681 | " - `n_jobs`: Number of CPU cores used when parallelizing. Set to -1 helps to avoid non-convergence errors.\n", 682 | " - `error_score`: Value to assign to the score if an error occurs in estimator fitting. If set to ‘raise’, the error is raised. 
If a numeric value is given, FitFailedWarning is raised." 683 | ] 684 | }, 685 | { 686 | "cell_type": "code", 687 | "metadata": { 688 | "id": "BL4huFGPZ8RA", 689 | "colab_type": "code", 690 | "colab": { 691 | "base_uri": "https://localhost:8080/", 692 | "height": 34 693 | }, 694 | "outputId": "9353fb35-5d21-4b76-f3e4-794798345c7f" 695 | }, 696 | "source": [ 697 | "# Evaluate naive\n", 698 | "\n", 699 | "# Instantiate a DummyClassifier with 'most_frequent' strategy\n", 700 | "naive = DummyClassifier(strategy='most_frequent')\n", 701 | "\n", 702 | "# Create RepeatedStratifiedKFold cross-validator with 10 folds, 3 repeats and a seed of 1.\n", 703 | "cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\n", 704 | "\n", 705 | "# Calculate accuracy using `cross_val_score()` with model instantiated, data to fit, target variable, 'accuracy' scoring, cross validator, n_jobs=-1, and error_score set to 'raise'\n", 706 | "n_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\n", 707 | "\n", 708 | "# Print mean and standard deviation of n_scores: \n", 709 | "print('Naive score: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))\n" 710 | ], 711 | "execution_count": null, 712 | "outputs": [ 713 | { 714 | "output_type": "stream", 715 | "text": [ 716 | "Naive score: 0.651 (0.003)\n" 717 | ], 718 | "name": "stdout" 719 | } 720 | ] 721 | }, 722 | { 723 | "cell_type": "markdown", 724 | "metadata": { 725 | "id": "2tEgwsOfsoB6", 726 | "colab_type": "text" 727 | }, 728 | "source": [ 729 | "## **Observation** \n", 730 | "- We want to do better than 65% accuracy to consider any other models as an improvement to a totally naive model." 731 | ] 732 | }, 733 | { 734 | "cell_type": "markdown", 735 | "metadata": { 736 | "id": "l8QZOyg8s1eQ", 737 | "colab_type": "text" 738 | }, 739 | "source": [ 740 | "## **Creating a Baseline Classifier**\n", 741 | "Now we'll create a baseline classifier, one that seeks to correctly predict the class that each observation belongs to. Since the target variable is binary, we'll instantiate a `DecisionTreeClassifier` model. " 742 | ] 743 | }, 744 | { 745 | "cell_type": "code", 746 | "metadata": { 747 | "id": "QczFUGSfbQvl", 748 | "colab_type": "code", 749 | "colab": { 750 | "base_uri": "https://localhost:8080/", 751 | "height": 34 752 | }, 753 | "outputId": "3b362a0d-dd04-4f5d-ddad-0bee09b549db" 754 | }, 755 | "source": [ 756 | "# Evaluate baseline model\n", 757 | "\n", 758 | "# Instantiate a DecisionTreeClassifier\n", 759 | "model = DecisionTreeClassifier()\n", 760 | "\n", 761 | "# Calculate accuracy using `cross_val_score()` with model instantiated, data to fit, target variable, 'accuracy' scoring, cross validator 'cv', and error_score set to 'raise'\n", 762 | "m_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\n", 763 | "\n", 764 | "# Print mean and standard deviation of m_scores: \n", 765 | "print('Baseline score: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))" 766 | ], 767 | "execution_count": null, 768 | "outputs": [ 769 | { 770 | "output_type": "stream", 771 | "text": [ 772 | "Baseline score: 0.697 (0.062)\n" 773 | ], 774 | "name": "stdout" 775 | } 776 | ] 777 | }, 778 | { 779 | "cell_type": "markdown", 780 | "metadata": { 781 | "id": "GRUBiqqmtNA6", 782 | "colab_type": "text" 783 | }, 784 | "source": [ 785 | "## **Observation**\n", 786 | "- We want to do better than 70% with a Stacking Classifier to consider it an improvement over this baseline Decision Tree model." 
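
One detail worth making explicit (an added aside, not from the original notebook): with `n_splits=10` and `n_repeats=3`, the cross-validator fits the model 30 times, so `cross_val_score` returns 30 accuracy values and the mean/std above summarize all of them.

# The repeated stratified CV above yields 10 folds x 3 repeats = 30 scores
print(len(m_scores))  # expected: 30
print('Baseline score: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))  # same summary as above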
787 | ] 788 | }, 789 | { 790 | "cell_type": "markdown", 791 | "metadata": { 792 | "colab_type": "text", 793 | "id": "BMYfcKeDY85K" 794 | }, 795 | "source": [ 796 | "## **Getting started with Stacking Classifier**\n", 797 | "\n", 798 | "- We're going to compare several additional baseline classifiers to see if they perform better than the Decision Tree Classifier we just trained previously.\n" 799 | ] 800 | }, 801 | { 802 | "cell_type": "markdown", 803 | "metadata": { 804 | "id": "T2pwEXnQBEFf", 805 | "colab_type": "text" 806 | }, 807 | "source": [ 808 | "

\n", 809 | "\"Stacking\"\n", 810 | "

\n", 811 | "

\n", 812 | "\n", 813 | "- We'll start by importing additional packages that we'll need." 814 | ] 815 | }, 816 | { 817 | "cell_type": "code", 818 | "metadata": { 819 | "id": "eHCHmx7k5NeT", 820 | "colab_type": "code", 821 | "colab": {} 822 | }, 823 | "source": [ 824 | "# Import several other classifiers for ensemble\n", 825 | "from sklearn.neighbors import KNeighborsClassifier\n", 826 | "from sklearn.svm import SVC\n", 827 | "from sklearn.naive_bayes import GaussianNB\n", 828 | "from sklearn.linear_model import LogisticRegression\n", 829 | "from sklearn.ensemble import StackingClassifier" 830 | ], 831 | "execution_count": null, 832 | "outputs": [] 833 | }, 834 | { 835 | "cell_type": "markdown", 836 | "metadata": { 837 | "id": "teQMB0aWxhcN", 838 | "colab_type": "text" 839 | }, 840 | "source": [ 841 | "## Create custom functions\n", 842 | "1. get_stacking() - This function will create the layers of our `StackingClassifier()`.\n", 843 | "2. get_models() - This function will create a dictionary of models to be evaluated.\n", 844 | "3. evaluate_model() - This function will evaluate each of the models to be compared." 845 | ] 846 | }, 847 | { 848 | "cell_type": "markdown", 849 | "metadata": { 850 | "id": "wqtHxQFPvMqu", 851 | "colab_type": "text" 852 | }, 853 | "source": [ 854 | "## Custom function # 1: get_stacking()\n", 855 | "1. `StackingClassifier()` arguments:\n", 856 | " - `estimators`: List of baseline classifiers\n", 857 | " - `final_estimator`: Defined meta classifier \n", 858 | " - `cv`: Number of cross validations to perform." 859 | ] 860 | }, 861 | { 862 | "cell_type": "code", 863 | "metadata": { 864 | "id": "YFhBv6jR6FOe", 865 | "colab_type": "code", 866 | "colab": {} 867 | }, 868 | "source": [ 869 | "# Define get_stacking():\n", 870 | "def get_stacking():\n", 871 | "\n", 872 | "\t# Create an empty list for the base models called layer1\n", 873 | " layer1 = list()\n", 874 | "\n", 875 | " # Append tuple with classifier name and instantiations (no arguments) for KNeighborsClassifier, SVC, and GaussianNB base models\n", 876 | " # Hint: layer1.append(('ModelName', Classifier()))\n", 877 | " layer1.append(('DT', DecisionTreeClassifier()))\n", 878 | " layer1.append(('KNN', KNeighborsClassifier()))\n", 879 | " layer1.append(('SVM', SVC()))\n", 880 | " layer1.append(('Bayes', GaussianNB()))\n", 881 | "\n", 882 | " # Instantiate Logistic Regression as meta learner model called layer2\n", 883 | " layer2 = LogisticRegression()\n", 884 | "\n", 885 | "\t# Define StackingClassifier() called model passing layer1 model list and meta learner with 5 cross-validations\n", 886 | " model = StackingClassifier(estimators=layer1, final_estimator=layer2, cv=5)\n", 887 | "\n", 888 | " # return model\n", 889 | " return model" 890 | ], 891 | "execution_count": null, 892 | "outputs": [] 893 | }, 894 | { 895 | "cell_type": "markdown", 896 | "metadata": { 897 | "id": "d5szw9liyaxp", 898 | "colab_type": "text" 899 | }, 900 | "source": [ 901 | "## Custom function # 2: get_models()" 902 | ] 903 | }, 904 | { 905 | "cell_type": "code", 906 | "metadata": { 907 | "id": "0hEJlDLB4kv5", 908 | "colab_type": "code", 909 | "colab": {} 910 | }, 911 | "source": [ 912 | "# Define get_models():\n", 913 | "def get_models():\n", 914 | "\n", 915 | " # Create empty dictionary called models\n", 916 | " models = dict()\n", 917 | "\n", 918 | " # Add key:value pairs to dictionary with key as ModelName and value as instantiations (no arguments) for KNeighborsClassifier, SVC, and GaussianNB base models\n", 919 | " # Hint: 
models['ModelName'] = Classifier()\n", 920 | " models['DT'] = DecisionTreeClassifier() \n", 921 | " models['KNN'] = KNeighborsClassifier() \n", 922 | " models['SVM'] = SVC()\n", 923 | " models['Bayes'] = GaussianNB()\n", 924 | "\n", 925 | " # Add key:value pair to dictionary with key called Stacking and value that calls get_stacking() custom function\n", 926 | " models['Stacking'] = get_stacking()\n", 927 | "\n", 928 | " # return dictionary\n", 929 | " return models" 930 | ], 931 | "execution_count": null, 932 | "outputs": [] 933 | }, 934 | { 935 | "cell_type": "markdown", 936 | "metadata": { 937 | "id": "flSG4dH1zCTK", 938 | "colab_type": "text" 939 | }, 940 | "source": [ 941 | "## Custom function # 3: evaluate_model(model)" 942 | ] 943 | }, 944 | { 945 | "cell_type": "code", 946 | "metadata": { 947 | "id": "mGLKRr0j5Nit", 948 | "colab_type": "code", 949 | "colab": {} 950 | }, 951 | "source": [ 952 | "# Define evaluate_model:\n", 953 | "def evaluate_model(model):\n", 954 | "\n", 955 | " # Create RepeatedStratifiedKFold cross-validator with 10 folds, 3 repeats and a seed of 42.\n", 956 | " cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)\n", 957 | "\n", 958 | " # Calculate accuracy using `cross_val_score()` with model instantiated, data to fit, target variable, 'accuracy' scoring, cross validator 'cv', n_jobs=-1, and error_score set to 'raise'\n", 959 | " scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\n", 960 | "\n", 961 | " # return scores\n", 962 | " return scores" 963 | ], 964 | "execution_count": null, 965 | "outputs": [] 966 | }, 967 | { 968 | "cell_type": "code", 969 | "metadata": { 970 | "id": "Y5wmC-TH7B7E", 971 | "colab_type": "code", 972 | "colab": {} 973 | }, 974 | "source": [ 975 | "# Assign get_models() to a variable called models\n", 976 | "models = get_models()" 977 | ], 978 | "execution_count": null, 979 | "outputs": [] 980 | }, 981 | { 982 | "cell_type": "markdown", 983 | "metadata": { 984 | "id": "02tyK34l2eh7", 985 | "colab_type": "text" 986 | }, 987 | "source": [ 988 | "## Python Dictionary Review:\n", 989 | "- The items() method is used to return the list with all dictionary keys with values. Parameters: This method takes no parameters. Returns: A view object that displays a list of a given dictionary's (key, value) tuple pair.\n", 990 | "- For our purposes, we'll use the dictionary created when we call the get_models() custom function in a for loop to iterate over each key:value pair and store the results.\n", 991 | "- Then, we will plot the results as a `boxplot` for comparison using `seaborn`.\n", 992 | "\n", 993 | "1. `sns.boxplot()` arguments:\n", 994 | " - `x`: Names of the variables in the data\n", 995 | " - `y`: Names of the variables in the data\n", 996 | " - `showmeans`: Whether or not to show mark at the mean of the data." 
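
As a tiny standalone illustration of the `.items()` pattern described above (demo values only, not part of the session code):

# .items() yields (key, value) tuples that unpack directly in the for-loop header
demo = {'DT': 'decision tree', 'KNN': 'k-nearest neighbors'}
for name, model in demo.items():
    print(name, '->', model)
# prints: DT -> decision tree, then KNN -> k-nearest neighbors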
997 | ] 998 | }, 999 | { 1000 | "cell_type": "code", 1001 | "metadata": { 1002 | "id": "QzXmYt1o6FWh", 1003 | "colab_type": "code", 1004 | "colab": { 1005 | "base_uri": "https://localhost:8080/", 1006 | "height": 367 1007 | }, 1008 | "outputId": "b98ee82e-bf8f-4a33-f187-f2feaef5dc2e" 1009 | }, 1010 | "source": [ 1011 | "# Evaluate the models and store results\n", 1012 | "# Create an empty list for the results\n", 1013 | "results = list()\n", 1014 | "\n", 1015 | "# Create an empty list for the model names\n", 1016 | "names = list()\n", 1017 | "\n", 1018 | "# Create a for loop that iterates over each name, model in models dictionary \n", 1019 | "for name, model in models.items():\n", 1020 | "\n", 1021 | "\t# Call evaluate_model(model) and assign it to variable called scores\n", 1022 | "\tscores = evaluate_model(model)\n", 1023 | " \n", 1024 | " # Append output from scores to the results list\n", 1025 | "\tresults.append(scores)\n", 1026 | " \n", 1027 | " # Append name to the names list\n", 1028 | "\tnames.append(name)\n", 1029 | " \n", 1030 | " # Print name, mean and standard deviation of scores:\n", 1031 | "\tprint('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))\n", 1032 | " \n", 1033 | "# Plot model performance for comparison using names for x and results for y and setting showmeans to True\n", 1034 | "sns.boxplot(x=names, y=results, showmeans=True)" 1035 | ], 1036 | "execution_count": null, 1037 | "outputs": [ 1038 | { 1039 | "output_type": "stream", 1040 | "text": [ 1041 | ">DT 0.707 (0.049)\n", 1042 | ">KNN 0.713 (0.058)\n", 1043 | ">SVM 0.759 (0.045)\n", 1044 | ">Bayes 0.760 (0.049)\n", 1045 | ">Stacking 0.763 (0.050)\n" 1046 | ], 1047 | "name": "stdout" 1048 | }, 1049 | { 1050 | "output_type": "execute_result", 1051 | "data": { 1052 | "text/plain": [ 1053 | "" 1054 | ] 1055 | }, 1056 | "metadata": { 1057 | "tags": [] 1058 | }, 1059 | "execution_count": 36 1060 | }, 1061 | { 1062 | "output_type": "display_data", 1063 | "data": { 1064 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAXoAAAD4CAYAAADiry33AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAZ60lEQVR4nO3df5TddX3n8edrJoRMiBAgE6m5hARnIuLSBhxxLWsrq4GAdrFdV4LtaTjLabZrIQrqKWxdofF3e1ztUKpGzRJ1IdK6eFI3FFBh3aOhzQRSICOQS/h1A8rkFxAmJPPjvX98v+Nchvlx78z9Mfc7r8c5OXPv9/v93O/7e2fmlfd8f11FBGZmll1N9S7AzMyqy0FvZpZxDnozs4xz0JuZZZyD3sws42bVu4CRFixYEEuWLKl3GWZmDWX79u17I6J1tHnTLuiXLFlCV1dXvcswM2sokp4aa5533ZiZZZyD3sws4xz0ZmYZ56A3M8s4B72Na+/evVx11VXs27ev3qWY2SQ56G1cGzdu5MEHH2Tjxo31LsXMJslBb2Pau3cvd9xxBxHBHXfc4a7erEFNu/PobfrYuHEjQ7exHhwcZOPGjVxzzTV1rsoqqbOzk3w+X/a4QqEAQC6XK2tcW1sba9euLXt9NjXu6G1Md999N319fQD09fVx11131bkimy4OHz7M4cOH612GlcgdvY1pxYoVbNmyhb6+Po455hguuOCCepdkFTbZ7npoXGdnZyXLsSpxR29jWr16NZIAaGpqYvXq1XWuyMwmw0FvY1qwYAEXXXQRkrjooos4+eST612SmU2Cd93YuFavXs2TTz7pbt6sgTnobVwLFizgxhtvrHcZZjYF3nVjZpZxDnozs4xz0JuZZZz30duM46tBbTRZ/rlw0JuVyFeC2mga4efCQW8zjq8GtdFk+eeipH30klZKelRSXtK1o8xfLOkeSQ9IelDSxen0JZIOS9qR/vtapTfAzMzGN2FHL6kZuAlYARSAbZI2R0R30WKfBG6LiK9KOhPYAixJ5z0eEcsrW7aZmZWqlI7+XCAfEbsj4iiwCbhkxDIBHJ8+PgF4tnIlmpnZVJQS9IuAZ4qeF9JpxW4A/khSgaSbv6po3tJ0l87/lfTO0VYgaY2kLkldPT09pVdvZmYTqtR59JcBN0dEDrgY+I6kJuA5YHFEnA1cA9wi6fiRgyNifUR0RERHa2trhUoyMzMoLej3AKcWPc+l04pdAdwGEBFbgTnAgog4EhH70unbgceBZVMt2szMSldK0G8D2iUtlTQbWAVsHrHM08C7ASS9mSToeyS1pgdzkXQ60A7srlTxZmY2sQnPuomIfklXAncCzcCGiNgpaR3QFRGbgY8B35B0NcmB2csjIiT9DrBOUh8wCPxpROyv2taYmdlrlHTBVERsITnIWjztU0WPu4HzRhn3feD7U6zRzMymwDc1MzPLOAe9mVnGOejNzDLOQW9mlnEOejOzjHPQm5llnIPezCzjHPRmZhnnoDczyzgHvZlZxjnozcwyzkFvZpZxDnozs4xz0JuZZZyD3sws4xz0ZmYZ56A3M8s4B72ZWcY56M3MMs5Bb2aWcQ56M7OMc9CbmWWcg97MLOMc9GZmGeegNzPLuFn1LsBqo7Ozk3w+X/a4QqEAQC6XK2tcW1sba9euLXt9ZlZ5Dnob1+HDh+tdgplNUUlBL2kl8DdAM/DNiPjCiPmLgY3A/HSZayNiSzrvOuAKYABYGxF3Vq58K9Vku+uhcZ2dnZUsx8xqaMKgl9QM3ASsAArANkmbI6K7aLFPArdFxFclnQlsAZakj1cBbwHeAPxI0rKIGKj0hpiZDZnsrsrJ2LVrFzD5Zqpck9ktWkpHfy6Qj4jdAJI2AZcAxUEfwPHp4xOAZ9PHlwCbIuII8ISkfPp6W8uq0sysDPl8np0P/YL5cxdWfV2DRwXAnsf3VX1dB3ufn9S4UoJ+EfBM0fMC8PYRy9wA3CXpKuA44D1FY+8bMXbRyBVIWgOsAVi8eHEpdZuZjWv+3IWcf8aqepdRUfc8smlS4yp1euVlwM0RkQMuBr4jqeTXjoj1EdERER2tra0VKsnMzKC0jn4PcGrR81w6rdgVwEqAiNgqaQ6woMSxZmZWRaV03duAdklLJc0mObi6ecQyTwPvBpD0ZmAO0JMut0rSsZKWAu3Av1SqeDMzm9iEHX1E9Eu6EriT5NTJDRGxU9I6oCsiNgMfA74h6WqSA7OXR0QAOyXdRnLgth/4M59xY2ZWWyWdR5+eE79lxLRPFT3uBs4bY+xngc9OocYpmcxpVr4atHH4NLphfi9sLL4ydhS+GrRx5PN5Htj5QHKpXrUNJl8e2PNA9dd1sPwh+XyeR3bs4JTKV/MaQ/t8D+7YUfV1/bLqa8i+zAf9ZLoAXw3aYObD4LsG611FRTXdO7kT4k4BrkCVLabOvkXUu4SG57tXmpllnIPezCzjHPRmZhnnoDczyzgHvZlZxjnozcwyzkFvZpZxDnozs4xz0JuZZZyD3sws4xz0ZmYZ56A3M8s4B72ZWcY56M3MMs5Bb2aWcQ56M7OMc9CbmU3By7NeZPPp6+md9VK9SxmTg97MbAruX3gPzx33FNsX/qTepYzJQW9WokEN8vLclxlUtj62cDJeaBnkKxe/yIstM/u9eHnWizx60nZQ8OhJ26dtV++gNyvRkdlHGGge4MjsI/Uupe7uOPswj5/Szx3LD9e7lLq6f+E9RPqZtkFM267eQW9WgkEN0je7DwR9s/tmdFf/Qssg/9x+hBDct+zIjO3qh7r5waYBAAabBqZtVz+r3gWYNYKRXfyR2UdoOdJSp2pGVygUeAn4VtphVsvTZ/fSnz7uB760vJfFW4+r2vqeAw4VClV7/ckq7uaHDHX173z2kjpVNTp39GYTKO7mgRnd1fe1DLK//SiRtogxC/YvO0rfDOzqf3Xc07/u5ocMNg3wq+OerlNFY3NHbzaBsfbJT7euPpfLcXDvXq749f9Ilbfp7FdoAorjrQk4cfkrXFqlrv5bBPNzuaq89lR8YNdV9S6hZO7ozSYwMGuA12Sn0ukzzJML+xkY0R4OzIInXt8/+gCbFkrq6CWtBP4GaAa+GRFfGDH/y8D56dO5wMKImJ/OGwAeSuc9HRH/oRKFm9XKvJfn1buEaePaH5xQ7xJKUigUeKH3Je55ZFO9S6mog73PE4Xyz3SaMOglNQM3ASuAArBN0uaI6B5aJiKuLlr+KuDsopc4HBHLy67MzMwqopSO/lwgHxG7ASRtAi4BusdY/jLg+sqUZ2ZWvlwuh47s4/wzVtW7lIq655FNLMqdXPa4UvbRLwKeKXpeSKe9hqTTgKVA8VUDcyR1SbpP0vvHGLcmXaarp6enxNLNzKwUlT4Yuwr4h4goPkp1WkR0AB8CviLpjSMHRcT6iOiIiI7W1tYKl2RmNrOVEvR7gFOLnufSaaNZBdxaPCEi9qRfdwP38ur992ZmVmWl7KPfBrRLWkoS8KtIuvNXkXQGcCKwtWjaiUBvRByRtAA4D/irSh
RuBsnZFbwATfdm7Ezhg1CI6Xc1qDWmCYM+IvolXQncSXJ65YaI2ClpHdAVEZvTRVcBmyKi+JrgNwNflzRI8tfDF4rP1jEzs+or6Tz6iNgCbBkx7VMjnt8wyrifA2dNoT6zceVyOXrUw+C7snUJftO9TeQWTb+rQa0xNcwtEDo7O8nn8zVZ165duwBYu3ZtTdbX1tZW1rr8XphZORom6PP5PA881M3g3JOqvi4dTfY+bX/8l1VfV1Pv/rLH5PN5Hnv4fhbPq/4l+LP7kn3frzy5rerrevpQc9XXYTYTNUzQAwzOPYlXznxfvcuoqDndP5zUuMXzBvhkx6EKV1Nfn+nyrQbMqiFjpyqYmdlIDnozs4xz0JuZZZyD3sws4xz0ZmYZ56A3M8s4B72ZWcY56M3MMs5Bb2aWcQ11ZazZqA7W6DbFQxci1+IC3oOM8TluZuVz0FtDa2trq9m6hm7w1r6ovforW1TbbbNsa5igLxQKNPW+MOl7w0xXTb37KBT6yxpTKBR4+aXmzN0b5qmXmjmuUN6HbdTyTpdD6+rs7KzZOs0qwfvozcwyrmE6+lwux6+OzMrk3StzuVPKGpPL5Xil/7lM3r1yTs4ftmFWaQ0T9GY2sV8C3yImXG6q9qVfT676mpJtmj+JcQd7n+eeRzZVupzXOPTKAQDmzTmx6us62Ps8iybxrjvobVz7m+GLC+DPe+CkbH1aX+bU8uBtT3pgen579Q9Mz6f8bavtQfrkw4MWvbH6/+0t4uRJbZuD3sZ16/Gw81jYdAJ8+EC9q7Hx+MD0ML8Xr+aDsTam/c3wo3kQgrvnwX7/tJg1JP/q2phuPR4GlTweVNLVm1njcdDbqIa6+f406Pvd1Zs1LP/ajiJmHaJv6a3ErGydvliO4m5+iLt6s8bkoB/FQOtWYm6Bgdat9S6lbh45dribH9Iv+MWx9anHzCbPZ92MELMOMXjiwyAYPPFhoucdqD9btxooxY2/rHcFZlYp7uhHSLr4oQtOYkZ39WaWDSUFvaSVkh6VlJd07SjzvyxpR/rvMUkHi+atlrQr/be6ksVX2q+7+aaBZELTQNLVz+B99WbW+CbcdSOpGbgJWAEUgG2SNkdE99AyEXF10fJXAWenj08Crgc6SNrk7enYaXnpzau7+SFJVz/ruRX1KMnMbMpK6ejPBfIRsTsijgKbgEvGWf4y4Nb08YXA3RGxPw33u4GVUym4mgbn7hnu5oc0DSTTzcwaVCkHYxcBzxQ9LwBvH21BSacBS4GfjDN22n5uzuzHL693CWZmFVfps25WAf8QEQMTLllE0hpgDcDixYsrXJKZ2cQ6OzvJ5/Nljxv65LFy76/T1tZWs3vylLLrZg9watHzXDptNKsY3m1T8tiIWB8RHRHR0draWkJJZmbTQ0tLCy0tLfUuY1yldPTbgHZJS0lCehXwoZELSToDOBEoPh/xTuBzkoZu1HwBcN2UKjYzq4Ja3vGy1iYM+ojol3QlSWg3AxsiYqekdUBXRGxOF10FbIqIKBq7X9KnSf6zAFgXEfsruwlmZjaekvbRR8QWYMuIaZ8a8fyGMcZuADZMsj4zM5siXxlrZpZxDnozs4zzTc3MZrAsn1Jowxz0Zla26X46ob2ag95sBnN3PTN4H72ZWcY56M3MMs5Bb2aWcQ56M7OMc9CbmWWcg97MLOMc9GZmGeegNzPLOAe9mVnGOejNzDLOQW9mlnEOejOzjHPQm5llnO9e2aCePtTMZ7rmVX09v+pNeoHXzx2s+rqePtTMsqqvxWzmcdA3oLa2tpqt62j6ARNzlrRXfV3LqO22mc0UDvoGVMt7iA+tq7Ozs2brNLPKaqigb+rdz5zuH1Z9PXrlRQBizvFVX1dT737glKqvx4b54/NspmmYoK/ln/S7dr0EQPsbaxHAp3h3RYPwx+dZo2qYoPfuCqsUd9c20/j0SjOzjHPQm5llnIPezCzjHPRmZhlXUtBLWinpUUl5SdeOscwHJXVL2inplqLpA5J2pP82V6pwMzMrzYRn3UhqBm4CVgAFYJukzRHRXbRMO3AdcF5EHJC0sOglDkfE8grXbWZmJSqloz8XyEfE7og4CmwCLhmxzJ8AN0XEAYCIeL6yZZqZ2WSVEvSLgGeKnhfSacWWAcsk/UzSfZJWFs2bI6krnf7+0VYgaU26TFdPT09ZG2BmZuOr1AVTs4B24F1ADvippLMi4iBwWkTskXQ68BNJD0XE48WDI2I9sB6go6MjKlSTmZlRWke/Bzi16HkunVasAGyOiL6IeAJ4jCT4iYg96dfdwL3A2VOs2czMylBK0G8D2iUtlTQbWAWMPHvmByTdPJIWkOzK2S3pREnHFk0/D+jGzMxqZsJdNxHRL+lK4E6gGdgQETslrQO6ImJzOu8CSd3AAPCJiNgn6beBr0saJPlP5QvFZ+uYmVn1lbSPPiK2AFtGTPtU0eMArkn/FS/zc+CsqZdpZmaT5StjzcwyzkFvZpZxDnozs4xz0JuZZZyD3sws4xz0ZmYZ56A3M8s4B72ZWcY56M3MMs5Bb2aWcQ56M7OMc9CbmWWcg97MLOMc9GZmGeegNzPLOAe9mVnGOejNzDLOQW9mlnEOejOzjHPQm5llnIPezCzjHPRmZhk3q94FVFtnZyf5fL6sMbt27QJg7dq1ZY1ra2sre4yZWbVlPugno6Wlpd4lmJlVTOaD3h22mc103kdvZpZxDnozs4wrKeglrZT0qKS8pGvHWOaDkrol7ZR0S9H01ZJ2pf9WV6pwMzMrzYT76CU1AzcBK4ACsE3S5ojoLlqmHbgOOC8iDkhamE4/Cbge6AAC2J6OPVD5TTEzs9GU0tGfC+QjYndEHAU2AZeMWOZPgJuGAjwink+nXwjcHRH703l3AysrU7qZmZWilKBfBDxT9LyQTiu2DFgm6WeS7pO0soyxSFojqUtSV09PT+nVm5nZhCp1euUsoB14F5ADfirprFIHR8R6YD1AR0dHVKgmKzKZC8fAF4+ZZUEpHf0e4NSi57l0WrECsDki+iLiCeAxkuAvZaxNYy0tLb6AzKzBKWL8BlrSLJLgfjdJSG8DPhQRO4uWWQlcFhGrJS0AHgCWkx6ABc5JF70feGtE7B9rfR0dHdHV1TX5LTIzm4EkbY+IjtHmTbjrJiL6JV0J3Ak0AxsiYqekdUBXRGxO510gqRsYAD4REfvSlX+a5D8HgHXjhbyZmVXehB19rbmjNzMr33gdva+MNTPLOAe9mVnGOejNzDLOQW9mlnEOejOzjHPQm5ll3LQ7vVJSD/BUvesAFgB7613ENOH3Ypjfi2F+L4ZNh/fitIhoHW3GtAv66UJS11jnpM40fi+G+b0Y5vdi2HR/L7zrxsws4xz0ZmYZ56Af2/p6FzCN+L0Y5vdimN+LYdP6vfA+ejOzjHNHb2aWcQ56M7OMm/FBL2lA0g5JOyX9q6SPSWqSdGE6fYekQ5IeTR9/u941V4qkQ0WPL5b0mKTTJN0gqVfSwjGWDUlfKnr+cUk31KzwKpH0F+nPwYPp9/p6SZ8fscxySb9IHz8p6f+NmL9D0sO1r
Huqin4H/lXS/ZJ+u941Vdso3+u3S/qopLmTfL3LJf3tKNP/VNIfT73iqanUZ8Y2ssMRsRwgDbZbgOMj4nqSD1RB0r3AxyMikzfKl/RuoBO4MCKekgTJxR8fA/58lCFHgD+Q9PmIqPdFIhUh6R3A+4BzIuJI+klpZwI3A9cVLboKuLXo+esknRoRz0h6c80Krqzi34ELgc8Dv1vfkqpnjO/1bOB7wHeB3kqtKyK+VqnXmooZ39EXi4jngTXAlUrTLusk/Q7wDeB9EfF40awNwKWSThplWD/JWQZX16DEWvkNYG9EHAGIiL0R8VPggKS3Fy33QV4d9LcBl6aPLxsxrxEdDxwAkDRP0o/TLv8hSZek09dJ+ujQAEmflfSR9PEnJG1LO+W/TKcdJ+n/pH8xPCzp0lHWW0uv+V4DHwDeANwj6R4ASV+V1JV2/n85NFjS2yT9PN2ef5H0uuIXl/ReSVslLUj/Ov54Ov1eSV9Mxzwm6Z3p9LmSbpPULel2Sf8sqaIXXznoR4iI3SQfmbhwomUz4FjgB8D7I+KREfMOkYT9R8YYexPwh5JOqGJ9tXQXcGr6C/h3koY62ltJungk/Vtgf0TsKhr3feAP0se/B/xjrQquoJZ098UjwDeBT6fTXwF+PyLOAc4HvpQ2QBuAPwaQ1ETy/nxX0gVAO3AuyWdGvzVtJFYCz0bEb0XEvwH+qYbbNprXfK8johN4Fjg/Is5Pl/uL9GrX3wR+V9JvShrq/D8SEb8FvAc4PPTCkn4fuBa4eIy/dmdFxLnAR4Hr02kfBg5ExJnAfwfeWukNdtDPbH3Az4ErxpjfCawe2bEARMSLwLeBtdUrr3Yi4hDJL9gaoAf4nqTLSX6pP1AUaCM79n0kXf8q4BdU8M/+GjocEcsj4gySUP52GugCPifpQeBHwCLg9RHxJLBP0tnABcAD6WdEXzD0HLgfOIMk+B8CVqTd7Dsj4oUab9+rjPO9HumDku4n2Z63kOzKexPwXERsS1/rxYjoT5f/9yS7Ot8bEQfGWP3/Tr9uB5akj/8dsCl9vYeBBye9cWPwPvoRJJ1O8gHnz9e7lhoYJNkV8WNJ/y0iPlc8MyIOSroF+LMxxn+F5Bf6f1a3zNqIiAHgXuBeSQ8BqyPiZklPkOyz/o/AO0YZ+j2Sv3Aur1GpVRMRW9N91q3AxenXt0ZEn6QngTnpot8k2d5TSDp8SP5j+HxEfH3k60o6J329z0j6cUSsq+qGTGC073XxfElLgY8Db4uIA5JuZnjbx/I4cDqwDBjreN6R9OsANcxfd/RFJLUCXwP+NmbIlWQR0Qu8l2Q3zGid/f8A/guj/FBGxH6SfdRj/UXQMCS9SVJ70aTlDN9F9Vbgy8DuiCiMMvx24K9ID943MklnkOy63AecADyfhvz5wGlFi95O0v2/jeHtvhP4z5Lmpa+1SNJCSW8AeiPiu8BfA+fUZmtGN873+iVg6K/X44GXgRckvR64KJ3+KPAbkt6WvtbrJA39bjxF0gx8W9JbyijpZyQNF5LOBM4qf6vG544+3T8JHENykPE7JOE2Y0TEfkkrgZ8quU108by9km5n7AOvXwKurHaNNTAPuFHSfJKfgzzJn/YAf0+yG+uq0QZGxEvAFwEa9Bj+0O8AJF356ogYkPS/gH9MO94u4NfHcSLiaHrQ8mDaHRMRd6VnHm1N34dDwB8BbcBfSxok2V34X2u1YWMY63t9GfBPkp6NiPMlPUCyzc+QhPHQdl+ajm8h2T//nqEXjohHJP0h8PeSfq/Eev4O2CipO13fTqCiu7d8CwQzK1t6zOJ+4D+NODhtZZLUDBwTEa9IeiPJ8ZA3RcTRSq3DHb2ZlSXdvfBD4HaHfEXMJTmt8xiSv6g+XMmQB3f0ZmaZ54OxZmYZ56A3M8s4B72ZWcY56M3MMs5Bb2aWcf8fwkxYgKc6ZcUAAAAASUVORK5CYII=\n", 1065 | "text/plain": [ 1066 | "
" 1067 | ] 1068 | }, 1069 | "metadata": { 1070 | "tags": [], 1071 | "needs_background": "light" 1072 | } 1073 | } 1074 | ] 1075 | }, 1076 | { 1077 | "cell_type": "markdown", 1078 | "metadata": { 1079 | "id": "xUqeWsol5RAt", 1080 | "colab_type": "text" 1081 | }, 1082 | "source": [ 1083 | "## **Observation**\n", 1084 | "- Recall that we want to do better than 70% with a Stacking Classifier to consider it an improvement over the Decision Tree baseline model and, although we did achieve that, we can probably do even better with this dataset. \n", 1085 | "- Let's try some hyperparameter tuning via cross-validation next..." 1086 | ] 1087 | }, 1088 | { 1089 | "cell_type": "markdown", 1090 | "metadata": { 1091 | "id": "xwc_6_Qf4amu", 1092 | "colab_type": "text" 1093 | }, 1094 | "source": [ 1095 | "---\n", 1096 | "\n", 1097 | "## Q&A\n", 1098 | "\n", 1099 | "--- \n" 1100 | ] 1101 | }, 1102 | { 1103 | "cell_type": "code", 1104 | "metadata": { 1105 | "id": "yMZ8gTb6LGCP", 1106 | "colab_type": "code", 1107 | "colab": {} 1108 | }, 1109 | "source": [ 1110 | "# Import additional libraries\n", 1111 | "from xgboost import XGBClassifier \n", 1112 | "from sklearn.ensemble import RandomForestClassifier\n", 1113 | "from sklearn.preprocessing import StandardScaler\n", 1114 | "from sklearn.pipeline import Pipeline\n", 1115 | "from sklearn.model_selection import RandomizedSearchCV, GridSearchCV\n", 1116 | "import xgboost as xgb\n", 1117 | "from datetime import datetime" 1118 | ], 1119 | "execution_count": null, 1120 | "outputs": [] 1121 | }, 1122 | { 1123 | "cell_type": "markdown", 1124 | "metadata": { 1125 | "id": "BfctBvrs4ZcQ", 1126 | "colab_type": "text" 1127 | }, 1128 | "source": [ 1129 | "## Custom function # 4: best_model(name, model)\n", 1130 | "- We're going to create a Pipeline that scales the data before applying the parameter grid via cross-validation.\n", 1131 | "- Then it returns the model with the best hyperparameters from the search grid for each model." 
1132 | ] 1133 | }, 1134 | { 1135 | "cell_type": "code", 1136 | "metadata": { 1137 | "id": "5RG7lpMY3Bzz", 1138 | "colab_type": "code", 1139 | "colab": {} 1140 | }, 1141 | "source": [ 1142 | "# Define best_model:\n", 1143 | "def best_model(name, model):\n", 1144 | " pipe = Pipeline([('scaler', StandardScaler()), ('classifier',model)]) \n", 1145 | "\n", 1146 | " if name == 'SVM':\n", 1147 | " param_grid = {'classifier__kernel' : ['linear', 'poly', 'rbf', 'sigmoid', 'precomputed']} \n", 1148 | " # Create grid search object\n", 1149 | " # this uses k-fold cv\n", 1150 | " clf = GridSearchCV(pipe, param_grid = param_grid, cv = 5, n_jobs=-1)\n", 1151 | "\n", 1152 | " # Fit on data\n", 1153 | " best_clf = clf.fit(X, y)\n", 1154 | "\n", 1155 | " best_hyperparams = best_clf.best_estimator_.get_params()['classifier']\n", 1156 | "\n", 1157 | " return name, best_hyperparams \n", 1158 | "\n", 1159 | " if name == 'Bayes': \n", 1160 | " param_grid = {'classifier__var_smoothing' : np.array([1e-09, 1e-08])} \n", 1161 | " # Create grid search object\n", 1162 | " # this uses k-fold cv\n", 1163 | "\n", 1164 | " clf = GridSearchCV(pipe, param_grid = param_grid, cv = 5, n_jobs=-1)\n", 1165 | "\n", 1166 | " # Fit on data\n", 1167 | " best_clf = clf.fit(X, y)\n", 1168 | "\n", 1169 | " best_hyperparams = best_clf.best_estimator_.get_params()['classifier']\n", 1170 | "\n", 1171 | " return name, best_hyperparams \n", 1172 | "\n", 1173 | " if name == 'RF': \n", 1174 | " param_grid = {'classifier__criterion' : np.array(['gini', 'entropy']),\n", 1175 | " 'classifier__max_depth' : np.arange(5,11)} \n", 1176 | " # Create grid search object\n", 1177 | " # this uses k-fold cv\n", 1178 | "\n", 1179 | " clf = GridSearchCV(pipe, param_grid = param_grid, cv = 5, n_jobs=-1)\n", 1180 | "\n", 1181 | " # Fit on data\n", 1182 | " best_clf = clf.fit(X, y)\n", 1183 | "\n", 1184 | " best_hyperparams = best_clf.best_estimator_.get_params()['classifier']\n", 1185 | " \n", 1186 | " return name, best_hyperparams \n", 1187 | "\n", 1188 | " if name == 'XGB':\n", 1189 | " param_grid = {'classifier__learning_rate' : np.arange(0.022,0.04,.01),\n", 1190 | " 'classifier__max_depth' : np.arange(5,10)} \n", 1191 | " # Create grid search object\n", 1192 | " # this uses k-fold cv\n", 1193 | " clf = GridSearchCV(pipe, param_grid = param_grid, cv = 5, n_jobs=-1)\n", 1194 | "\n", 1195 | " # Fit on data\n", 1196 | " best_clf = clf.fit(X, y)\n", 1197 | " best_hyperparams = best_clf.best_estimator_.get_params()['classifier']\n", 1198 | "\n", 1199 | " return name, best_hyperparams " 1200 | ], 1201 | "execution_count": null, 1202 | "outputs": [] 1203 | }, 1204 | { 1205 | "cell_type": "markdown", 1206 | "metadata": { 1207 | "id": "8Ay2mPQo39mV", 1208 | "colab_type": "text" 1209 | }, 1210 | "source": [ 1211 | "## Adding Random Forest and XGBoost to our get_stacking() custom function in layer 1 (and removing the poorest performers DT and KNN):" 1212 | ] 1213 | }, 1214 | { 1215 | "cell_type": "code", 1216 | "metadata": { 1217 | "id": "4ow6Aqaz27GJ", 1218 | "colab_type": "code", 1219 | "colab": {} 1220 | }, 1221 | "source": [ 1222 | "# Define get_stacking(): \n", 1223 | "def get_stacking():\n", 1224 | "\n", 1225 | "\t# Create an empty list for the base models called layer1\n", 1226 | " layer1 = list()\n", 1227 | "\n", 1228 | " # Append tuple with classifier name and instantiations (no arguments) for SVC and GaussianNB base models AND call cust fx #4 best_model on each\n", 1229 | " # Hint: layer1.append((best_model('ModelName', Classifier())))\n", 1230 | " 
layer1.append((best_model('SVM', SVC())))\n", 1231 | " layer1.append((best_model('Bayes', GaussianNB())))\n", 1232 | "\n", 1233 | " # Add RandomForestClassifier and xgb.XGBClassifier as base models\n", 1234 | " layer1.append((best_model('RF', RandomForestClassifier())))\n", 1235 | " layer1.append((best_model('XGB', xgb.XGBClassifier())))\n", 1236 | "\n", 1237 | " # Instantiate Logistic Regression as meta learner model called layer2\n", 1238 | " layer2 = LogisticRegression()\n", 1239 | "\n", 1240 | "\t# Define StackingClassifier() called model passing layer1 model list and meta learner with 5 cross-validations\n", 1241 | " model = StackingClassifier(estimators=layer1, final_estimator=layer2, cv=5)\n", 1242 | "\n", 1243 | " # return model\n", 1244 | " return model" 1245 | ], 1246 | "execution_count": null, 1247 | "outputs": [] 1248 | }, 1249 | { 1250 | "cell_type": "markdown", 1251 | "metadata": { 1252 | "id": "dbp5PICC4HEk", 1253 | "colab_type": "text" 1254 | }, 1255 | "source": [ 1256 | "## Adding Random Forest and XGBoost to our get_models() custom function:" 1257 | ] 1258 | }, 1259 | { 1260 | "cell_type": "code", 1261 | "metadata": { 1262 | "id": "GQqQUH_P3Bto", 1263 | "colab_type": "code", 1264 | "colab": {} 1265 | }, 1266 | "source": [ 1267 | "# Define get_models():\n", 1268 | "def get_models():\n", 1269 | "\n", 1270 | " # Create empty dictionary called models\n", 1271 | " models = dict()\n", 1272 | "\n", 1273 | " # Add key:value pairs to dictionary with key as ModelName and value as instantiations (no arguments) for SVC and GaussianNB base models\n", 1274 | " # Hint: models['ModelName'] = Classifier() \n", 1275 | " models['SVM'] = SVC()\n", 1276 | " models['Bayes'] = GaussianNB()\n", 1277 | "\n", 1278 | " # we'll add two more classifers to the mix - RandomForestClassifier and xgb.XGBClassifier\n", 1279 | " models['RF'] = RandomForestClassifier()\n", 1280 | " models['XGB'] = xgb.XGBClassifier()\n", 1281 | "\n", 1282 | "\n", 1283 | " # Add key:value pair to dictionary with key called Stacking and value that calls get_stacking() custom function\n", 1284 | " models['Stacking'] = get_stacking()\n", 1285 | "\n", 1286 | " # return dictionary\n", 1287 | " return models" 1288 | ], 1289 | "execution_count": null, 1290 | "outputs": [] 1291 | }, 1292 | { 1293 | "cell_type": "code", 1294 | "metadata": { 1295 | "id": "JVTYjSno3B3s", 1296 | "colab_type": "code", 1297 | "colab": {} 1298 | }, 1299 | "source": [ 1300 | "# Assign get_models() to a variable called models\n", 1301 | "models = get_models()" 1302 | ], 1303 | "execution_count": null, 1304 | "outputs": [] 1305 | }, 1306 | { 1307 | "cell_type": "markdown", 1308 | "metadata": { 1309 | "id": "lNECWtJ74tZh", 1310 | "colab_type": "text" 1311 | }, 1312 | "source": [ 1313 | "## Custom function # 3: evaluate_model(model)" 1314 | ] 1315 | }, 1316 | { 1317 | "cell_type": "code", 1318 | "metadata": { 1319 | "id": "TsTJZKNk3XWc", 1320 | "colab_type": "code", 1321 | "colab": {} 1322 | }, 1323 | "source": [ 1324 | "# Define evaluate_model(model):\n", 1325 | "def evaluate_model(model):\n", 1326 | "\n", 1327 | " # Create RepeatedStratifiedKFold cross-validator with 10 folds, 3 repeats and a seed of 1.\n", 1328 | " cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)\n", 1329 | "\n", 1330 | " # Calculate accuracy using `cross_val_score()` with model instantiated, data to fit, target variable, 'accuracy' scoring, cross validator 'cv', n_jobs=-1, and error_score set to 'raise'\n", 1331 | " scores = cross_val_score(model, X, y, 
scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')\n", 1332 | "\n", 1333 | " # return scores\n", 1334 | " return scores" 1335 | ], 1336 | "execution_count": null, 1337 | "outputs": [] 1338 | }, 1339 | { 1340 | "cell_type": "markdown", 1341 | "metadata": { 1342 | "id": "3CxRVSe_DGlI", 1343 | "colab_type": "text" 1344 | }, 1345 | "source": [ 1346 | "# 10 minute break while the following runs..." 1347 | ] 1348 | }, 1349 | { 1350 | "cell_type": "code", 1351 | "metadata": { 1352 | "id": "rXrusVVBAbaJ", 1353 | "colab_type": "code", 1354 | "colab": { 1355 | "base_uri": "https://localhost:8080/", 1356 | "height": 1000 1357 | }, 1358 | "outputId": "79c1a4f3-f220-4efb-971d-1ac0f00b105d" 1359 | }, 1360 | "source": [ 1361 | "# Evaluate the models and store results\n", 1362 | "# Create an empty list for the results\n", 1363 | "results = list()\n", 1364 | "\n", 1365 | "# Create an empty list for the model names\n", 1366 | "names = list()\n", 1367 | "\n", 1368 | "# Create a for loop that iterates over each name, model in models dictionary \n", 1369 | "for name, model in models.items():\n", 1370 | "\n", 1371 | "\t# Call evaluate_model(model) and assign it to variable called scores\n", 1372 | "\tscores = evaluate_model(model)\n", 1373 | " \n", 1374 | " # Append output from scores to the results list\n", 1375 | "\tresults.append(scores)\n", 1376 | " \n", 1377 | " # Append name to the names list\n", 1378 | "\tnames.append(name)\n", 1379 | " \n", 1380 | " # Print name, mean and standard deviation of scores:\n", 1381 | "\tprint('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))\n", 1382 | "\n", 1383 | "# Plot model performance for comparison using names for x and results for y and setting showmeans to True\n", 1384 | "sns.boxplot(x=names, y=results, showmeans=True)" 1385 | ], 1386 | "execution_count": null, 1387 | "outputs": [ 1388 | { 1389 | "output_type": "stream", 1390 | "text": [ 1391 | ">SVM 0.757 (0.040) \n", 1392 | " SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,\n", 1393 | " decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',\n", 1394 | " max_iter=-1, probability=False, random_state=None, shrinking=True,\n", 1395 | " tol=0.001, verbose=False) \n", 1396 | " 2020-08-05 23:22:38.978574\n", 1397 | ">Bayes 0.759 (0.055) \n", 1398 | " GaussianNB(priors=None, var_smoothing=1e-09) \n", 1399 | " 2020-08-05 23:22:39.052228\n", 1400 | ">RF 0.763 (0.047) \n", 1401 | " RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n", 1402 | " criterion='gini', max_depth=None, max_features='auto',\n", 1403 | " max_leaf_nodes=None, max_samples=None,\n", 1404 | " min_impurity_decrease=0.0, min_impurity_split=None,\n", 1405 | " min_samples_leaf=1, min_samples_split=2,\n", 1406 | " min_weight_fraction_leaf=0.0, n_estimators=100,\n", 1407 | " n_jobs=None, oob_score=False, random_state=None,\n", 1408 | " verbose=0, warm_start=False) \n", 1409 | " 2020-08-05 23:22:44.240611\n", 1410 | ">XGB 0.754 (0.044) \n", 1411 | " XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n", 1412 | " colsample_bynode=1, colsample_bytree=1, gamma=0,\n", 1413 | " learning_rate=0.1, max_delta_step=0, max_depth=3,\n", 1414 | " min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,\n", 1415 | " nthread=None, objective='binary:logistic', random_state=0,\n", 1416 | " reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,\n", 1417 | " silent=None, subsample=1, verbosity=1) \n", 1418 | " 2020-08-05 23:22:45.491879\n", 1419 | ">Stacking 0.766 (0.042) \n", 
1420 | " StackingClassifier(cv=5,\n", 1421 | " estimators=[('SVM',\n", 1422 | " SVC(C=1.0, break_ties=False, cache_size=200,\n", 1423 | " class_weight=None, coef0=0.0,\n", 1424 | " decision_function_shape='ovr', degree=3,\n", 1425 | " gamma='scale', kernel='linear', max_iter=-1,\n", 1426 | " probability=False, random_state=None,\n", 1427 | " shrinking=True, tol=0.001, verbose=False)),\n", 1428 | " ('Bayes',\n", 1429 | " GaussianNB(priors=None, var_smoothing=1e-09)),\n", 1430 | " ('RF',\n", 1431 | " RandomForestClassif...\n", 1432 | " seed=None, silent=None,\n", 1433 | " subsample=1, verbosity=1))],\n", 1434 | " final_estimator=LogisticRegression(C=1.0, class_weight=None,\n", 1435 | " dual=False,\n", 1436 | " fit_intercept=True,\n", 1437 | " intercept_scaling=1,\n", 1438 | " l1_ratio=None,\n", 1439 | " max_iter=100,\n", 1440 | " multi_class='auto',\n", 1441 | " n_jobs=None, penalty='l2',\n", 1442 | " random_state=None,\n", 1443 | " solver='lbfgs',\n", 1444 | " tol=0.0001, verbose=0,\n", 1445 | " warm_start=False),\n", 1446 | " n_jobs=None, passthrough=False, stack_method='auto',\n", 1447 | " verbose=0) \n", 1448 | " 2020-08-05 23:33:16.365883\n" 1449 | ], 1450 | "name": "stdout" 1451 | }, 1452 | { 1453 | "output_type": "execute_result", 1454 | "data": { 1455 | "text/plain": [ 1456 | "" 1457 | ] 1458 | }, 1459 | "metadata": { 1460 | "tags": [] 1461 | }, 1462 | "execution_count": 74 1463 | }, 1464 | { 1465 | "output_type": "display_data", 1466 | "data": { 1467 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAD4CAYAAADiry33AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAYu0lEQVR4nO3dfXRc9X3n8fdHso1lHGODDSQWxg42AdJ2TaKQbknSsKmpoWnYnmYT0+as2eXUp9tgN+ThlJzNIZRNNnT7kFSU0DoJJ063wZB0k3WIwU0au5vTuKll7IItHjw2jhmHgB/wEzK2JX33j3sVDbIeZqQZjfTT53WOjmfuvb+533s9+ug7996ZUURgZmbpaqh3AWZmVlsOejOzxDnozcwS56A3M0ucg97MLHGT6l1AX7Nnz4758+fXuwwzs3Fl69atByNiTn/zxlzQz58/n7a2tnqXYWY2rkj6yUDzfOjGzCxxDnozs8Q56M3MEuegNzNLnIPezCxxDnozs8Q56M3MEjfmrqO32mhtbaVQKFQ8rlgsAtDc3FzRuIULF7Jq1aqK12dm1eegt0GdPHmy3iWY2Qg56CeI4XbXPeNaW1urWY6ZjSIfozczS5yD3swscQ56M7PEOejNzBLnoDczS5yD3swscQ56M7PEOejNzBLnoDczS5zfGWsTjj/3xyYaB71Zmfy5PzZeOehtwvHn/thE42P0ZmaJKyvoJS2V9IykgqQ7+pk/T9JGSdskPSHpxnz6fEknJW3Pf/662htgZmaDG/LQjaRG4D5gCVAEtkhaFxHtJYt9Cng4Iu6XdBWwHpifz9sdEYurW7aZmZWrnI7+GqAQEXsi4jSwFripzzIBzMhvnwf8tHolmpnZSJQT9HOB50vuF/Nppe4CPiSpSNbNryyZtyA/pPNPkt7Z3wokrZDUJqntwIED5VdvZmZDqtbJ2JuBr0ZEM3Aj8LeSGoAXgHkRcTXwUeDrkmb0HRwRqyOiJSJa5syZU6WSzMwMygv6/cAlJfeb82mlbgUeBoiIzcBUYHZEnIqIQ/n0rcBu4PKRFm1mZuUrJ+i3AIskLZA0BVgGrOuzzD7gPQCSriQL+gOS5uQnc5H0RmARsKdaxZuZ2dCGvOomIjol3QZsABqBByJip6S7gbaIWAd8DPiSpNvJTszeEhEh6V3A3ZLOAN3A70fE4ZptTT+G83Z3v9XdzFJS1jtjI2I92UnW0ml3ltxuB67tZ9zfA38/whpHnd/qbmYpSf4jEIbTYfut7maWEn8EgplZ4hz0ZmaJc9CbmSXOQW9mljgHvZlZ4hz0ZmaJc9CbmSXOQW9mlrjk3zBlZgMbzkeEQJofE5LyvnDQm1nF/DEhvcbDvnDQm01gw+0oU/yYkJT3hY/Rm5klzkFvZpY4B72ZWeIc9GZmiXPQm5klzkFvZpY4B72ZWeIc9GZmiXPQm5klzkFvZpY4B72ZWeIc9GZmiXPQm5klzkFvZpY4B72ZWeIc9GZmiXPQm5klzkFvZpY4B72ZWeIc9GZmiXPQm5klzkFvZpa4SfUuwMys2lpbWykUCqOyrl27dgGwatWqUVnfwoULK15XWUEvaSnwl0Aj8OWIuKfP/HnAGmBmvswdEbE+n/dJ4FagC1gVERsqqtDMrEKFQoGdTz7FzGkX1nxd3acFwP7dh2q+riMdLw1r3JBBL6kRuA9YAhSBLZLWRUR7yWKfAh6OiPslXQWsB+bnt5cBbwbeAHxf0uUR0TWsas3MyjRz2oVcd8WyepdRVRufXjusceUco78GKETEnog4DawFbuqzTAAz8tvnAT/Nb98ErI2IUxHxHFDIH8/MzEZJOUE/F3i+5H4xn1bqLuBDkopk3fzKCsYiaYWkNkltBw4cKLN0MzMrR7WuurkZ+GpENAM3An8rqezHjojVEdESES1z5sypUklmZgblnYzdD1xScr85n1bqVmApQERsljQVmF3mWDMzq6Fyuu4twCJJCyRNITu5uq7PMvuA9wBIuhKYChzIl1sm6RxJC4BFwL9Wq3gzMxvakB19RHRKu
g3YQHbp5AMRsVPS3UBbRKwDPgZ8SdLtZCdmb4mIAHZKehhoBzqBD/uKGzOz0VXWdfT5NfHr+0y7s+R2O3DtAGM/C3x2BDWamdkI+J2xNq75HZBmQ3PQ27hWKBTYtnNb9p7sWuvO/tm2f1vt13Wk9quwiWPcBL07NxvQTOh+d3e9q6iqhk3+vEGrnnET9IVCgW1PttM97fyar0unA4Ctu39W83U1dByu+TrMbGIbN0EP0D3tfF696r31LqOqprY/UvEYv7qx/vh5YQMZV0FvmUKhwLM7Hmfe9NpfqTrlTHYI4dW9W2q+rn0nGmu+jpQVCgWe3r6di0dhXT0Hlo5s317zddX+dXX6HPTj1LzpXXyq5US9y6iqz7RNr3cJ497FwK2o3mVU1VeIepcw7vmMj5lZ4hz0ZmaJc9CbmSXOQW9mljgHvZlZ4hz0ZmaJc9CbmSXOQW9mljgHvZlZ4hz0ZmaJc9CbmSXOQW9mljgHvZlZ4hz0ZmaJc9CbmY3AK5OOse6Nq+mYdLzepQzIn0dv41qxWISjCX7H6hEoRrHeVYxbxWKRox3H2fj02pqv68W3FDl67mHWTVnNRTvm1nRdRzpeIoonKx6X2G+Hmdno6Zx6hmPzXwbBsfmH6Zx6pt4l9csd/ThULBZ55Xhjct/I9JPjjZxbrKyLbW5u5oAO0P3u7hpV1atb3ZxsOknTySYaorY9UsOmBprnNtd0HSlrbm5Gpw5x3RXLarqeH77h/6KGBoIu1NDA1F+ZwTt/elPN1rfx6bXMbb6g4nHu6M3KdGrKKboauzg15VS9S7Ex4JVJx3jm/K10N2Tf3dzd0MUz528dk8fq3dGPQ83Nzbza+UKS3xk7tXlsdrHd6ubMlDMgODPlDOecPqfmXb2NbY9fuJHo8322QbD1wh/UtKsfDj9TbVCHG+GPLoLDE/yZ0reLd1dvL5677+fdfI/uhi5ePHdfnSoamDt6G9SDM2DnObD2PPiDl+tdTX2UdvOAu3oD4P27Vta7hLL5WdqPmHSCMwseJCaldWikUocb4fvTIQTfmz5xu/qBund39TZeTNBf3cF1zdlMTCvSNWdzvUupqwdnQHfexXYr6+onoq5JXb3dfA/l083GAQd9HzHpBN2zdoCge9aOCdvV93TznXnAdU7grn76K9OZcWzGWT/TX0nr8tZKHG3q5gs3HuNYU+0va7WRm4C/toPLuvieM+kxYbv60m6+x0Tu6u21Hr36JLsv7uTRxZW/S9NGn4O+xM+7+Z4z6Q1dE7arf/qc3m6+R6fgqXPqU4+NHUebuvnxolOE4F8uP+Wufhwo66obSUuBvwQagS9HxD195n8euC6/Ow24MCJm5vO6gCfzefsi4n3VKLwWXtvN98i6+kkvLKlHSXVz78/qXYFVqlgschz4ylnP4erad3UHnfntTuDPF3cwb/O5NVvfC8CJCt8xba81ZNBLagTuA5YARWCLpHUR0d6zTETcXrL8SuDqkoc4GRGLq1dy7XRP29/bzfdo6Mqmmxlnmro5vOg0kSdHTILDl5/m9dubmHzSBwjGqnI6+muAQkTsAZC0FrgJaB9g+ZuBT1envNE1Zfct9S7BbNiam5s5cvAgt551iVD1rL36VRqA0naoAZi1+FU+WKOu/isEM8foO6bHi3L+BM8Fni+5X8ynnUXSpcAC4Aclk6dKapP0L5L+47ArNbO623thJ1192sOuSfDcRZ39D7AxodrvjF0GfDMiSv/gXxoR+yW9EfiBpCcjYnfpIEkrgBUA8+bNq3JJZlYtd3zbl12NR+UE/X7gkpL7zfm0/iwDPlw6ISL25//ukbSJ7Pj97j7LrAZWA7S0tPR7JqlYLNLQcZSp7Y+UUfL40dBxiGLR3ZCZ1U45h262AIskLZA0hSzM1/VdSNIVwCxgc8m0WZLOyW/PBq5l4GP7ZmZWA0N29BHRKek2YAPZ5ZUPRMROSXcDbRHRE/rLgLURUdqRXwn8jaRusj8q95RerVOJ5uZmXjw1iVeveu9who9ZU9sfobn54nqXYWYJK+sYfUSsB9b3mXZnn/t39TPuR8AvjqA+MzMbIV/4amaWOAe9mVni/MUjZpakIx0vsfHptTVfz4lXs2/kmT51Vs3XdaTjJeZS+ZeDO+jNLDkLFy4ctXXt2nUYgLmXVR7AlZrLBcPaNge9mSVn1apVo76u1tbWUVtnpXyM3swscQ56M7PEOejNzBLnoDczS5yD3swscQ56M7PEjavLKxs6Do/KxxTr1WMAxNQZNV9XQ8dhoPIPNdt3opHPtE2vfkF9vNiR9QIXTav9F0DvO9HI5cMZeAQaNo1Cz9LzHfG13+1whAG+3mdwP6P23xkLcCj/t/ZXjmfbNHMU1pOycRP0o/sGiOMALLpsND5V8uKKt20098XpXbsAmDp/Uc3XdTmVb9voPi+yfbFobu33BXPH9r44kO+LmYtqvy9mMrrbliK99lOF66+lpSXa2trqWsN4eAPEaPG+6OV90cv7otdY2ReStkZES3/zfIzezCxxDnozs8Q56M3MEuegNzNLnIPezCxxDnozs8Q56M3MEuegNzNLnIPezCxxDnozs8Q56M3MEuegNzNLnIPezCxxDnozs8Q56M3MEuegNzNLnIPezCxxDnozs8Q56M3MEuegNzNLnIPezCxxZQW9pKWSnpFUkHRHP/M/L2l7/vOspCMl85ZL2pX/LK9m8WZmNrRJQy0gqRG4D1gCFIEtktZFRHvPMhFxe8nyK4Gr89vnA58GWoAAtuZjX67qVpiZ2YDK6eivAQoRsSciTgNrgZsGWf5m4MH89q8D34uIw3m4fw9YOpKCzcysMuUE/Vzg+ZL7xXzaWSRdCiwAflDJWEkrJLVJajtw4EA5dZuZWZmqfTJ2GfDNiOiqZFBErI6IlohomTNnTpVLMjOb2MoJ+v3AJSX3m/Np/VlG72GbSseamVkNlBP0W4BFkhZImkIW5uv6LiTpCmAWsLlk8gbgekmzJM0Crs+nmZnZKBnyqpuI6JR0G1lANwIPRMROSXcDbRHRE/rLgLURESVjD0v6H2R/LADujojD1d0EMzMbzJBBDxAR64H1fabd2ef+XQOMfQB4YJj1mZnZCPmdsWZmiXPQm5klzkFvZpY4B72ZWeLKOhlrZpa61tZWCoVCxeN27doFwKpVqyoat3DhworHDJeD3sxsBJqamupdwpAc9GZmVN6Rjyc+Rm9mljgHvZlZ4hz0ZmaJc9CbmSXOQW9mljgHvZlZ4hz0ZmaJc9CbmSXOQW9mljgHvZlZ4hz0ZmaJc9CbmSXOQW9mljgHvZlZ4hz0ZmaJc9CbmSXOQW9mljgHvZlZ4hz0ZmaJc9CbmSXOQW9mljgHvZlZ4hz0ZmaJm1TvAmqttbWVQqFQ0Zhdu3YBsGrVqorGLVy4sOIxNvqG85yANJ8X3hcTQ/JBPxxNTU31LsHGID8venlfjC/JB727B+vLz4le3hcTg4/Rm5klzkFvZpa4soJe0lJJz0gqSLpjgGU+IKld0k5JXy+Z3iVpe/6zrlqFm5lZeYY8Ri+pEbgPWAIUgS2S1kVE
e8kyi4BPAtdGxMuSLix5iJMRsbjKdZuZWZnK6eivAQoRsSciTgNrgZv6LPN7wH0R8TJARLxU3TLNzGy4yrnqZi7wfMn9IvD2PstcDiDpn4FG4K6IeCyfN1VSG9AJ3BMR3+67AkkrgBUA8+bNq2gDrDy+Xtps4qrWydhJwCLg3cDNwJckzcznXRoRLcDvAF+QdFnfwRGxOiJaIqJlzpw5VSrJqqGpqcnXTJsN4uDBg6xcuZJDhw7Vu5QBldPR7wcuKbnfnE8rVQR+HBFngOckPUsW/FsiYj9AROyRtAm4Gtg90sKtMu6uzWpjzZo1PPHEE6xZs4aPfvSj9S6nX+V09FuARZIWSJoCLAP6Xj3zbbJuHkmzyQ7l7JE0S9I5JdOvBdoxM0vAwYMHefTRR4kIHn300THb1Q8Z9BHRCdwGbACeAh6OiJ2S7pb0vnyxDcAhSe3ARuATEXEIuBJok/Rv+fR7Sq/WMTMbz9asWUNEANDd3c2aNWvqXFH/1FPkWNHS0hJtbW31LsPMbEhLly6lo6Pj5/enTZvGY489NsiI2pG0NT8feha/M9bMbJiWLFnC5MmTAZg8eTLXX399nSvqn4PezGyYli9fjiQAGhoaWL58eZ0r6p+D3sxsmGbPns0NN9yAJG644QYuuOCCepfUr+Q/ptjMrJaWL1/O3r17x2w3Dw56M7MRmT17Nvfee2+9yxiUD92YmSXOQW9mljgHvZlZ4hz0ZmaJG3PvjJV0APhJvesAZgMH613EGOF90cv7opf3Ra+xsC8ujYh+P/53zAX9WCGpbaC3E0803he9vC96eV/0Guv7woduzMwS56A3M0ucg35gq+tdwBjifdHL+6KX90WvMb0vfIzezCxx7ujNzBLnoDczS9yEDHpJ/13STklPSNou6dOSPtdnmcWSnspv75X0wz7zt0vaMZp1j5Skrrzuf5P0uKRfqXdNY1nJ/toh6TuSZubT50s6mc/r+ZlS73qrQdIlkp6TdH5+f1Z+f76kRZIekbRb0lZJGyW9K1/uFkkH8n2xU9I3JU2r79YMrJ8MeLukjwy35nz7/6qf6b8v6T+PvOKRmXBBL+nfA+8F3hIRvwT8Gtn32X6wz6LLgAdL7r9O0iX5Y1w5GrXWwMmIWBwR/w74JPC5oQZMcD376xeAw8CHS+btzuf1/JyuU41VFRHPA/cD9+ST7iE70fgz4LvA6oi4LCLeCqwE3lgy/KF8X7wZOM3Zv1NjwgAZ8DzwEaCqf5wi4q8j4mvVfMzhmHBBD7weOBgRpwAi4mBE/D/gZUlvL1nuA7w26B+m94l7c59549EM4GUASdMl/WPe5T8p6aZ8+t2SPtIzQNJnJf1hfvsTkrbkHdEf59POlfTd/BXDDklj8hd9mDYDc+tdxCj5PPDL+f/9O4A/A34X2BwR63oWiogdEfHVvoMlTQLOJX9+jUFnZQDwfuANwEZJGwEk3S+pLe/8/7hnsKS3SfpR/jz/V0mvK31wSb8habOk2ZLukvTxfPomSX+Sj3lW0jvz6dMkPSypXdK3JP1YUnXffBURE+oHmA5sB54Fvgj8aj7948Dn89u/DLSVjNkLvAn4UX5/G3AVsKPe21Phtnfl2/40cBR4az59EjAjvz0bKAAC5gOP59MbgN3ABcD1ZF2e8umPAO8Cfhv4Usn6zqv3No9wf53I/20EvgEsze/PB07m+3I7cF+9a63Btv86EMCS/P5fAH84yPK3AAfy/fEi8EOgsd7bMUCtA2XAXmB2yXLnl/z/bwJ+CZgC7AHels+bkf/+3AL8FfBb+bbPyuffBXw8v70J+PP89o3A9/PbHwf+Jr/9C0An0FLNbZ5wHX1EnADeCqwge2I+JOkW4CHg/ZIaOPuwDcAhsq5/GfAU0MH403Mo4gpgKfA1ZV94KeB/SnoC+D5Z53pRROwFDkm6mizct0XEofz29WR/8B4HrgAWAU8CS/Ku5Z0RcXSUt6/amiRtJztscRHwvZJ5pYduPtz/8HHtBuAFsuA5S9557pD0f0omPxQRi4GLyZ4Ln6h9mZUbJAP6+oCkx8me528ma+7eBLwQEVvyxzoWEZ358v8B+CPgNyJioFczPftrK1nDANmrprX54+0Anhj2xg1gwgU9QER0RcSmiPg0cBvw25Edm3wO+FWyzvShfoY+BNzH+D9sQ0RsJuve55C9LJ9D1uEvJuvIpuaLfpmsW/kvwAP5NAGfKwm6hRHxlYh4FngL2S/5ZyTdOWobVBsn8/1xKdk2pxjoZ5G0GFhC9sr2dkmvB3aS/d8CEBG/Rfa8OL/v+Mha0++Qvcobk/rLgNL5khaQddrview4/nfp/Z0YyG7gdcDlgyxzKv+3i1H8hr8JF/SS3iRpUcmkxfR+WuaDZMcn90REsZ/h3wL+F7ChtlXWnqQryF6SHgLOA16KiDOSriMLth7fIuv+30bvdm8A/quk6fljzZV0oaQ3AB0R8b+BP6UkGMaziOgAVgEfy48/Jyt/hXc/8JGI2Ef2//hnwNeBayW9r2TxwU5cvoMs+MacQTLgOFlQQ3ZI5hXgqKSLyF7hADwDvF7S2/LHel3Jc+InZH8wvibpzRWU9M9k5wSRdBXwi5Vv1eCSftIOYDpwb36pXCfZ8egV+bxvAK1kVxOcJSKOA38CkP0+jDs9hyIg61CXR0SXpL8DviPpSaCN7Bg+ABFxOj85dSQiuvJp/5BfebQ53w8ngA8BC4E/ldQNnAH+22htWK1FxLb80NbNZMdgU/V7wL6I6DlM9UWyV3PXkF2p8heSvkD2qu848JmSsR+U9A6yBrJI1vGPRQNlwM3AY5J+GhHXSdpG9rvwPFkY9/w+fDAf30R2rubXeh44Ip6W9LvANyT9Zpn1fBFYI6k9X99OsnNoVeOPQLBB5ecsHgf+U0Tsqnc9ZqmR1AhMjohXJV1Gdp7sTVHFS3YnYkdvZcpfRj4CfMshb1Yz08gu65xM9kr7D6oZ8uCO3swseRPuZKyZ2UTjoDczS5yD3swscQ56M7PEOejNzBL3/wFHLiavsTsD7AAAAABJRU5ErkJggg==\n", 1468 | "text/plain": [ 1469 | "
" 1470 | ] 1471 | }, 1472 | "metadata": { 1473 | "tags": [], 1474 | "needs_background": "light" 1475 | } 1476 | } 1477 | ] 1478 | }, 1479 | { 1480 | "cell_type": "markdown", 1481 | "metadata": { 1482 | "id": "uZlAHPaD419_", 1483 | "colab_type": "text" 1484 | }, 1485 | "source": [ 1486 | "## **Observation**\n", 1487 | "- Before we added XGBoost and hyperparameter tuning, our Stacking Classifier got ~ 76% accuracy. \n", 1488 | "- Here, we got just around 77% accuracy, a minor improvement, but an improvement nonetheless.\n", 1489 | "- We could continue fiddling with other algorithms in layer 1\n", 1490 | "- We could try other algorithms in layer 2.\n", 1491 | "- We could add more hyperparameters to our parameter grid.\n", 1492 | "- To this last point, keep in mind that the more parameters there are in a grid to search over, the longer it takes to train the Stacking Classifier." 1493 | ] 1494 | }, 1495 | { 1496 | "cell_type": "markdown", 1497 | "metadata": { 1498 | "id": "lj8WeJR__bUo", 1499 | "colab_type": "text" 1500 | }, 1501 | "source": [ 1502 | "---\n", 1503 | "\n", 1504 | "## Q&A\n", 1505 | "\n", 1506 | "--- " 1507 | ] 1508 | }, 1509 | { 1510 | "cell_type": "markdown", 1511 | "metadata": { 1512 | "id": "FPY3I2BlVxig", 1513 | "colab_type": "text" 1514 | }, 1515 | "source": [ 1516 | "# **Stacking Regressor**" 1517 | ] 1518 | }, 1519 | { 1520 | "cell_type": "code", 1521 | "metadata": { 1522 | "id": "ftxDhyDq2lrH", 1523 | "colab_type": "code", 1524 | "colab": {} 1525 | }, 1526 | "source": [ 1527 | "# Import libraries\n", 1528 | "from sklearn.model_selection import RepeatedKFold\n", 1529 | "from sklearn.dummy import DummyRegressor\n", 1530 | "from sklearn.svm import SVR" 1531 | ], 1532 | "execution_count": null, 1533 | "outputs": [] 1534 | }, 1535 | { 1536 | "cell_type": "markdown", 1537 | "metadata": { 1538 | "id": "nqDHD8A_nhPB", 1539 | "colab_type": "text" 1540 | }, 1541 | "source": [ 1542 | "## **2nd Dataset**\n", 1543 | "\n", 1544 | "\n", 1545 | "The second dataset we'll use is a CSV file named `abalone.csv`, which contains data on physical measurements of abalone shells used to determine the age of the abalone. 
It contains the following columns:\n", 1546 | "\n", 1547 | "- `Sex`: M, F, and I (infant) - (removed for our purposes)\n", 1548 | "- `Length`: Longest shell measurement (mm)\n", 1549 | "- `Diameter`: Perpendicular to length (mm)\n", 1550 | "- `Height`: with meat in shell (mm)\n", 1551 | "- `Whole weight`: whole abalone (grams)\n", 1552 | "- `Shucked weight`: weight of meat (grams)\n", 1553 | "- `Viscera weight`: gut weight (grams)\n", 1554 | "- `Shell weight`: after being dried (grams)\n", 1555 | "- `Rings`: +1.5 gives the age in years\n", 1556 | "\n", 1557 | "\t" 1558 | ] 1559 | }, 1560 | { 1561 | "cell_type": "markdown", 1562 | "metadata": { 1563 | "id": "HwNnn3ZKrh1o", 1564 | "colab_type": "text" 1565 | }, 1566 | "source": [ 1567 | "### **Get the dataset**" 1568 | ] 1569 | }, 1570 | { 1571 | "cell_type": "code", 1572 | "metadata": { 1573 | "id": "K4LeaM4PzyAh", 1574 | "colab_type": "code", 1575 | "colab": {} 1576 | }, 1577 | "source": [ 1578 | "# Read in the dataset as Pandas DataFrame\n", 1579 | "abalone = pd.read_csv('https://github.com/datacamp/Applied-Machine-Learning-Ensemble-Modeling-live-training/blob/master/data/abalone.csv?raw=true')" 1580 | ], 1581 | "execution_count": null, 1582 | "outputs": [] 1583 | }, 1584 | { 1585 | "cell_type": "code", 1586 | "metadata": { 1587 | "id": "KfsmhIBdApVp", 1588 | "colab_type": "code", 1589 | "colab": {} 1590 | }, 1591 | "source": [ 1592 | "# Look at data using the info() function\n", 1593 | "abalone.info()" 1594 | ], 1595 | "execution_count": null, 1596 | "outputs": [] 1597 | }, 1598 | { 1599 | "cell_type": "markdown", 1600 | "metadata": { 1601 | "id": "NZAeIFGwBhe6", 1602 | "colab_type": "text" 1603 | }, 1604 | "source": [ 1605 | "## **Observations:** \n", 1606 | "- Here, there are no missing values. Again, that is not typical.\n", 1607 | "- There is a mixture of object, float, and integers with the first column being `object` (categorical), the next 7 `float64` and the last 'int64`." 1608 | ] 1609 | }, 1610 | { 1611 | "cell_type": "code", 1612 | "metadata": { 1613 | "id": "8D4Gfh08Avb2", 1614 | "colab_type": "code", 1615 | "colab": {} 1616 | }, 1617 | "source": [ 1618 | "# Look at data using the describe() function\n", 1619 | "abalone.describe()" 1620 | ], 1621 | "execution_count": null, 1622 | "outputs": [] 1623 | }, 1624 | { 1625 | "cell_type": "markdown", 1626 | "metadata": { 1627 | "id": "WDGc7PPBBkGX", 1628 | "colab_type": "text" 1629 | }, 1630 | "source": [ 1631 | "## **Observations:** \n", 1632 | "- Notice that the min of the `Height` column is zero. Even though there are no missing values, this is indicative of the measurements for that feature having not been captured.\n", 1633 | "- Again, the printout makes it appear as if all numeric values are float. \n", 1634 | "\n" 1635 | ] 1636 | }, 1637 | { 1638 | "cell_type": "code", 1639 | "metadata": { 1640 | "id": "FVGtuWoDAvl2", 1641 | "colab_type": "code", 1642 | "colab": {} 1643 | }, 1644 | "source": [ 1645 | "# Print the first 5 rows of the data using the head() function\n", 1646 | "abalone.head()" 1647 | ], 1648 | "execution_count": null, 1649 | "outputs": [] 1650 | }, 1651 | { 1652 | "cell_type": "markdown", 1653 | "metadata": { 1654 | "id": "wnmVoSl8BmMY", 1655 | "colab_type": "text" 1656 | }, 1657 | "source": [ 1658 | "## **Observation:**\n", 1659 | "- Printing out the first 5 rows, we see that the 1st column is the only non-numeric feature in this dataset and is aligned with the `object` datatype as we saw above when we called `.info()`." 
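
Before converting to NumPy, it may be worth quantifying the zero-`Height` rows flagged in the `.describe()` observation above — a small sketch, assuming `abalone` is still the DataFrame loaded earlier:

# Count the suspicious Height == 0 rows (likely uncaptured measurements, per the observation above)
zero_height = (abalone['Height'] == 0).sum()
print('%d of %d rows have Height == 0' % (zero_height, len(abalone)))
# One option would be to treat them as missing values before modeling, e.g.:
# abalone.loc[abalone['Height'] == 0, 'Height'] = None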
1660 | ] 1661 | }, 1662 | { 1663 | "cell_type": "code", 1664 | "metadata": { 1665 | "id": "xPfVhWzRrm_w", 1666 | "colab_type": "code", 1667 | "colab": {} 1668 | }, 1669 | "source": [ 1670 | "# Convert Pandas DataFrame to numpy array - Return only the values of the DataFrame with DataFrame.to_numpy()\n", 1671 | "abalone = abalone.to_numpy()\n", 1672 | "\n", 1673 | "# Create X matrix and y (target) array using slicing [row_start:row_end, 1:target_col],[row_start:row_end, target_col] - Removing 1st column by starting at index 1\n", 1674 | "X, y = abalone[:, 1:-1], abalone[:, -1]\n", 1675 | "\n", 1676 | "# Print X matrix and y (target) array dimensions using .shape\n", 1677 | "print('Shape: %s, %s' % (X.shape,y.shape))" 1678 | ], 1679 | "execution_count": null, 1680 | "outputs": [] 1681 | }, 1682 | { 1683 | "cell_type": "code", 1684 | "metadata": { 1685 | "id": "fZ6CHfsVrpE7", 1686 | "colab_type": "code", 1687 | "colab": {} 1688 | }, 1689 | "source": [ 1690 | "# Convert y (target) array to 'float32' using .astype()\n", 1691 | "y = y.astype('float32')" 1692 | ], 1693 | "execution_count": null, 1694 | "outputs": [] 1695 | }, 1696 | { 1697 | "cell_type": "markdown", 1698 | "metadata": { 1699 | "id": "7bYvtBfSF7k7", 1700 | "colab_type": "text" 1701 | }, 1702 | "source": [ 1703 | "## **Creating a Naive Regressor**\n", 1704 | "Here we'll use the `DummyRegressor` from `sklearn`. This creates a so-called 'naive' regressor and is simply a model that predicts a single value for all of the rows, regardless of their original value. \n", 1705 | "\n", 1706 | "1. `DummyRegressor()` arguments:\n", 1707 | " - `strategy`: Strategy to use to generate predictions.\n", 1708 | "\n", 1709 | "2. `RepeatedKFold()` arguments:\n", 1710 | " - `n_splits`: Number of folds.\n", 1711 | " - `n_repeats`: Number of times cross-validator needs to be repeated.\n", 1712 | " - `random_state`: Controls the generation of the random states for each repetition. Pass an int for reproducible output across multiple function calls. (This is an equivalent argument to np.random.seed above, but will be specific to this naive model.)\n", 1713 | "\n", 1714 | "3. `cross_val_score()` arguments:\n", 1715 | " - The model to use.\n", 1716 | " - The data to fit. (X)\n", 1717 | " - The target variable to try to predict. (y)\n", 1718 | " - `scoring`: A single string scorer callable object/function such as 'accuracy' or 'roc_auc'. See https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter for more options.\n", 1719 | " - `cv`: Cross-validation splitting strategy (default is 5)\n", 1720 | " - `n_jobs`: Number of CPU cores used when parallelizing. Set to -1 helps to avoid non-convergence errors.\n", 1721 | " - `error_score`: Value to assign to the score if an error occurs in estimator fitting. If set to ‘raise’, the error is raised. If a numeric value is given, FitFailedWarning is raised." 
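
Two clarifications on those arguments (added notes, not from the original text): for this regression task the `scoring` string passed below is 'neg_mean_absolute_error' — scikit-learn negates error metrics so that greater is always better, so flip the sign to read a plain MAE. Also, `n_jobs=-1` simply parallelizes across all CPU cores to speed up the run. A quick sketch, to be run after the evaluation cell below produces `n_scores`:

# Recover the plain mean absolute error (in Rings units) from the negated scores
mae_scores = -n_scores
print('Naive MAE: %.3f (%.3f)' % (mean(mae_scores), std(mae_scores)))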
1722 | ] 1723 | }, 1724 | { 1725 | "cell_type": "code", 1726 | "metadata": { 1727 | "id": "jAJdcu_Hrrg8", 1728 | "colab_type": "code", 1729 | "colab": {} 1730 | }, 1731 | "source": [ 1732 | "# Evaluate naive\n", 1733 | "\n", 1734 | "# Instantiate a DummyRegressor with 'median' strategy\n", 1735 | "naive = DummyRegressor(strategy='median')\n", 1736 | "\n", 1737 | "# Create RepeatedKFold cross-validator with 10 folds, 3 repeats and a seed of 1.\n", 1738 | "cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)\n", 1739 | "\n", 1740 | "# Calculate accuracy using `cross_val_score()` with model instantiated, data to fit, target variable, 'neg_mean_absolute_error' scoring, cross validator, n_jobs=-1, and error_score set to 'raise'\n", 1741 | "n_scores = cross_val_score(naive, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')\n", 1742 | "\n", 1743 | "# Print mean and standard deviation of n_scores:\n", 1744 | "print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))" 1745 | ], 1746 | "execution_count": null, 1747 | "outputs": [] 1748 | }, 1749 | { 1750 | "cell_type": "markdown", 1751 | "metadata": { 1752 | "id": "dlYQmsCQHcdJ", 1753 | "colab_type": "text" 1754 | }, 1755 | "source": [ 1756 | "## **Observation** \n", 1757 | "- We want to do better than -2.37 to consider any other models as an improvement to a totally naive regressor model with the Abalone dataset." 1758 | ] 1759 | }, 1760 | { 1761 | "cell_type": "markdown", 1762 | "metadata": { 1763 | "id": "ZfiEdoUMHo-q", 1764 | "colab_type": "text" 1765 | }, 1766 | "source": [ 1767 | "## **Creating a Baseline Regressor**\n", 1768 | "Now we'll create a baseline regressor, one that seeks to correctly predict the value for each observation. Since the target variable is continuous, we'll instantiate a Support Vector Regression model.\n", 1769 | "\n", 1770 | "1. `SVR()` arguments:\n", 1771 | " - `kernel`: Specifies the kernel type to be used in the algorithm.\n", 1772 | " - `gamma`: Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. \n", 1773 | " - `C`: Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty." 
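
A practical aside (not in the original flow): RBF-kernel SVR is sensitive to feature scale, so in practice it is often wrapped in the same `StandardScaler` pipeline used in the classification half of this session. A minimal sketch, assuming `X` and `y` are the abalone arrays built above:

# Scale-then-fit: standardizing features usually helps an RBF-kernel SVR
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

scaled_svr = Pipeline([('scaler', StandardScaler()),
                       ('svr', SVR(kernel='rbf', gamma='scale', C=10))])
# scaled_svr can be passed to cross_val_score exactly like the unscaled model below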
1774 | ] 1775 | }, 1776 | { 1777 | "cell_type": "code", 1778 | "metadata": { 1779 | "id": "cFip40FPrvOn", 1780 | "colab_type": "code", 1781 | "colab": {} 1782 | }, 1783 | "source": [ 1784 | "# Evaluate baseline model\n", 1785 | "\n", 1786 | "# Instantiate a Support Vector Regressor with 'rbf' kernel, gamma set to 'scale', and regularization parameter set to 10\n", 1787 | "model = SVR(kernel='rbf', gamma='scale', C=10)\n", 1788 | "\n", 1789 | "# Estimate the (negated) mean absolute error using `cross_val_score()` with the model, data to fit, target variable, 'neg_mean_absolute_error' scoring, cross validator 'cv', n_jobs=-1, and error_score set to 'raise'\n", 1790 | "m_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')\n", 1791 | "\n", 1792 | "# Print mean and standard deviation of m_scores:\n", 1793 | "print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))" 1794 | ], 1795 | "execution_count": null, 1796 | "outputs": [] 1797 | }, 1798 | { 1799 | "cell_type": "markdown", 1800 | "metadata": { 1801 | "id": "Z_PMtVARKzBX", 1802 | "colab_type": "text" 1803 | }, 1804 | "source": [ 1805 | "## **Observation**\n", 1806 | "- A Stacking Regressor must now score above -1.48 for us to consider it an improvement over this baseline support vector regression model on the Abalone dataset." 1807 | ] 1808 | }, 1809 | { 1810 | "cell_type": "markdown", 1811 | "metadata": { 1812 | "id": "J-OGF_7bupzn", 1813 | "colab_type": "text" 1814 | }, 1815 | "source": [ 1816 | "## **Getting started with Stacking Regressor**\n", 1817 | "- We're going to compare several additional baseline regressors to see if they perform better than the SVR we just trained.\n", 1818 | "- We'll start by importing the additional packages that we'll need." 1819 | ] 1820 | }, 1821 | { 1822 | "cell_type": "code", 1823 | "metadata": { 1824 | "id": "jxbxTPkPrkNb", 1825 | "colab_type": "code", 1826 | "colab": {} 1827 | }, 1828 | "source": [ 1829 | "# Compare machine learning models for regression\n", 1830 | "from sklearn.linear_model import LinearRegression\n", 1831 | "from sklearn.neighbors import KNeighborsRegressor\n", 1832 | "from sklearn.tree import DecisionTreeRegressor\n", 1833 | "from sklearn.ensemble import StackingRegressor" 1834 | ], 1835 | "execution_count": null, 1836 | "outputs": [] 1837 | }, 1838 | { 1839 | "cell_type": "markdown", 1840 | "metadata": { 1841 | "id": "yixxr2JLN9UP", 1842 | "colab_type": "text" 1843 | }, 1844 | "source": [ 1845 | "## Create custom functions\n", 1846 | "1. get_stacking() - This function will create the layers of our `StackingRegressor()`.\n", 1847 | "2. get_models() - This function will create a dictionary of models to be evaluated.\n", 1848 | "3. evaluate_model() - This function will evaluate each of the models to be compared." 1849 | ] 1850 | }, 1851 | { 1852 | "cell_type": "markdown", 1853 | "metadata": { 1854 | "id": "FdF239ZRN92B", 1855 | "colab_type": "text" 1856 | }, 1857 | "source": [ 1858 | "## Custom function # 1: get_stacking()\n", 1859 | "1. `StackingRegressor()` arguments:\n", 1860 | " - `estimators`: List of (name, estimator) tuples for the base regressors.\n", 1861 | " - `final_estimator`: The meta regressor that combines the base estimators' predictions.\n", 1862 | " - `cv`: Number of cross-validation folds used to generate the out-of-fold predictions that train the `final_estimator`."
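, "\n", "Under the hood, the `cv` folds are used to build out-of-fold predictions from each base model, and those predictions become the training features for the `final_estimator`. A minimal hand-rolled sketch of that idea (not the exact sklearn internals), assuming the `X` and `y` defined above:\n", "```python\n", "import numpy as np\n", "from sklearn.model_selection import cross_val_predict\n", "\n", "base_models = [KNeighborsRegressor(), DecisionTreeRegressor(), SVR()]\n", "\n", "# One column of out-of-fold predictions per base model\n", "meta_X = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in base_models])\n", "\n", "# The meta learner trains on the base models' predictions, not on X itself\n", "meta_model = LinearRegression().fit(meta_X, y)\n", "```"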
1863 | ] 1864 | }, 1865 | { 1866 | "cell_type": "code", 1867 | "metadata": { 1868 | "id": "qoRNxZSj72bZ", 1869 | "colab_type": "code", 1870 | "colab": {} 1871 | }, 1872 | "source": [ 1873 | "# Define get_stacking():\n", 1874 | "def get_stacking():\n", 1875 | "\n", 1876 | "    # Create an empty list for the base models called layer1\n", 1877 | "    layer1 = list()\n", 1878 | "\n", 1879 | "    # Append tuples of regressor name and instantiation (no arguments) for the KNeighborsRegressor, DecisionTreeRegressor, and SVR base models\n", 1880 | "    # Hint: layer1.append(('ModelName', Regressor()))\n", 1881 | "    layer1.append(('KNN', KNeighborsRegressor()))\n", 1882 | "    layer1.append(('DT', DecisionTreeRegressor()))\n", 1883 | "    layer1.append(('SVM', SVR()))\n", 1884 | "\n", 1885 | "    # Instantiate Linear Regression as the meta learner model, called layer2\n", 1886 | "    layer2 = LinearRegression()\n", 1887 | "\n", 1888 | "    # Define StackingRegressor() called model, passing the layer1 model list and the meta learner with 5 cross-validation folds\n", 1889 | "    model = StackingRegressor(estimators=layer1, final_estimator=layer2, cv=5)\n", 1890 | "\n", 1891 | "    # return model\n", 1892 | "    return model" 1893 | ], 1894 | "execution_count": null, 1895 | "outputs": [] 1896 | }, 1897 | { 1898 | "cell_type": "markdown", 1899 | "metadata": { 1900 | "id": "KClsJExROLAZ", 1901 | "colab_type": "text" 1902 | }, 1903 | "source": [ 1904 | "## Custom function # 2: get_models()" 1905 | ] 1906 | }, 1907 | { 1908 | "cell_type": "code", 1909 | "metadata": { 1910 | "id": "PtYbhE_ps4yo", 1911 | "colab_type": "code", 1912 | "colab": {} 1913 | }, 1914 | "source": [ 1915 | "# Define get_models():\n", 1916 | "def get_models():\n", 1917 | "\n", 1918 | "    # Create empty dictionary called models\n", 1919 | "    models = dict()\n", 1920 | "\n", 1921 | "    # Add key:value pairs to the dictionary with the model name as key and an instantiation (no arguments) as value for the KNeighborsRegressor, DecisionTreeRegressor, and SVR base models\n", 1922 | "    # Hint: models['ModelName'] = Regressor()\n", 1923 | "    models['KNN'] = KNeighborsRegressor()\n", 1924 | "    models['DT'] = DecisionTreeRegressor()\n", 1925 | "    models['SVM'] = SVR()\n", 1926 | "\n", 1927 | "    # Add key:value pair to the dictionary with key 'Stacking' and a value that calls the get_stacking() custom function\n", 1928 | "    models['Stacking'] = get_stacking()\n", 1929 | "\n", 1930 | "    # return dictionary\n", 1931 | "    return models" 1932 | ], 1933 | "execution_count": null, 1934 | "outputs": [] 1935 | }, 1936 | { 1937 | "cell_type": "markdown", 1938 | "metadata": { 1939 | "id": "SYH3KcjcOc56", 1940 | "colab_type": "text" 1941 | }, 1942 | "source": [ 1943 | "## Custom function # 3: evaluate_model(model)" 1944 | ] 1945 | }, 1946 | { 1947 | "cell_type": "code", 1948 | "metadata": { 1949 | "id": "H95M82gks6EL", 1950 | "colab_type": "code", 1951 | "colab": {} 1952 | }, 1953 | "source": [ 1954 | "# Define evaluate_model:\n", 1955 | "def evaluate_model(model):\n", 1956 | "\n", 1957 | "    # Create RepeatedKFold cross-validator with 10 folds, 3 repeats and a seed of 1.\n", 1958 | "    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)\n", 1959 | "\n", 1960 | "    # Estimate the (negated) mean absolute error using `cross_val_score()` with the model, data to fit, target variable, 'neg_mean_absolute_error' scoring, cross validator 'cv', n_jobs=-1, and error_score set to 'raise'\n", 1961 | "    scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')\n", 1962 | "\n", 1963 | "    # return scores\n", 1964
| "\treturn scores" 1965 | ], 1966 | "execution_count": null, 1967 | "outputs": [] 1968 | }, 1969 | { 1970 | "cell_type": "code", 1971 | "metadata": { 1972 | "id": "2C6Hw-wj56eK", 1973 | "colab_type": "code", 1974 | "colab": {} 1975 | }, 1976 | "source": [ 1977 | "# Assign get_models() to a variable called models\n", 1978 | "models = get_models()" 1979 | ], 1980 | "execution_count": null, 1981 | "outputs": [] 1982 | }, 1983 | { 1984 | "cell_type": "code", 1985 | "metadata": { 1986 | "id": "BZl3DjmU58Lm", 1987 | "colab_type": "code", 1988 | "colab": {} 1989 | }, 1990 | "source": [ 1991 | "# Evaluate the models and store results\n", 1992 | "# Create an empty list for the results\n", 1993 | "results = list()\n", 1994 | "\n", 1995 | "# Create an empty list for the model names\n", 1996 | "names = list()\n", 1997 | "\n", 1998 | "# Create a for loop that iterates over each name, model in models dictionary \n", 1999 | "for name, model in models.items():\n", 2000 | "\n", 2001 | "\t# Call evaluate_model(model) and assign it to variable called scores\n", 2002 | "\tscores = evaluate_model(model)\n", 2003 | " \n", 2004 | " # Append output from scores to the results list\n", 2005 | "\tresults.append(scores)\n", 2006 | " \n", 2007 | " # Append name to the names list\n", 2008 | "\tnames.append(name)\n", 2009 | " \n", 2010 | " # Print name, mean and standard deviation of scores:\n", 2011 | "\tprint('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))\n", 2012 | " \n", 2013 | "# Plot model performance for comparison using names for x and results for y and setting showmeans to True\n", 2014 | "sns.boxplot(x=names, y=results, showmeans=True)" 2015 | ], 2016 | "execution_count": null, 2017 | "outputs": [] 2018 | }, 2019 | { 2020 | "cell_type": "markdown", 2021 | "metadata": { 2022 | "id": "d6EKNBV1UOuG", 2023 | "colab_type": "text" 2024 | }, 2025 | "source": [ 2026 | "## **Observation**\n", 2027 | "- Recall that we want to do better than -1.48 with a Stacking Regressor to consider it an improvement over this baseline SVR and, although close, we did not achieve that with this dataset.\n", 2028 | "- So what else can try to improve our results with stacking?\n", 2029 | "\n", 2030 | "### We'll add another layer to the mix..." 2031 | ] 2032 | }, 2033 | { 2034 | "cell_type": "markdown", 2035 | "metadata": { 2036 | "id": "N9DZ7iyZFxXo", 2037 | "colab_type": "text" 2038 | }, 2039 | "source": [ 2040 | "## **Double Stacking - 2 Layers**\n", 2041 | "- Can get a little tricky\n", 2042 | "- Just make sure that you name your layers VERY CLEARLY!\n", 2043 | "- Both the last layer (here it's layer 3) and the stacking model will use a call to `StackingRegressor()`\n", 2044 | "- The last layer will combine the 2nd layer with the final estimator while the model will combine the 1st layer with this last layer.\n", 2045 | "\n", 2046 | "
<center>\n", 2047 | "<img src=\"https://raw.githubusercontent.com/datacamp/Applied-Machine-Learning-Ensemble-Modeling-live-training/master/assets/DoubleStacking.png\" alt=\"Double Stacking\">\n", 2048 | "</center>\n", 2049 | "
" 2050 | ] 2051 | }, 2052 | { 2053 | "cell_type": "code", 2054 | "metadata": { 2055 | "id": "fXvUmmQQF6vq", 2056 | "colab_type": "code", 2057 | "colab": {} 2058 | }, 2059 | "source": [ 2060 | "# Define get_stacking() - adding another layer:\n", 2061 | "def get_stacking():\n", 2062 | "\n", 2063 | "\t# Create an empty list for the 1st layer of base models called layer1\n", 2064 | " layer1 = list()\n", 2065 | "\n", 2066 | " # Create an empty list for the 2nd layer of base models called layer2\n", 2067 | " layer2 = list()\n", 2068 | "\n", 2069 | " # Append tuple with classifier name and instantiations (no arguments) for KNeighborsRegressor, DecisionTreeRegressor, and SVR base models\n", 2070 | " # Hint: layer1.append(('ModelName', Classifier()))\n", 2071 | " layer1.append(('KNN', KNeighborsRegressor()))\n", 2072 | " layer1.append(('DT', DecisionTreeRegressor()))\n", 2073 | " layer1.append(('SVM', SVR()))\n", 2074 | "\n", 2075 | " # Append tuple with classifier name and instantiations (no arguments) for KNeighborsRegressor, DecisionTreeRegressor, and SVR base models\n", 2076 | " # Hint: layer2.append(('ModelName', Classifier()))\n", 2077 | " layer2.append(('KNN', KNeighborsRegressor()))\n", 2078 | " layer2.append(('DT', DecisionTreeRegressor()))\n", 2079 | " layer2.append(('SVM', SVR()))\n", 2080 | "\n", 2081 | "\t# Define meta learner StackingRegressor() called layer3 passing layer2 model list to estimators, LinearRegression() to final_estimator with 5 cross-validations\n", 2082 | " layer3 = StackingRegressor(estimators=layer2, final_estimator=LinearRegression(), cv=5)\n", 2083 | "\n", 2084 | "\t# Define Stackingregressor() called model passing layer1 model list to estimators and meta learner (layer3) to final_estimator with 5 cross-validations\n", 2085 | " model = StackingRegressor(estimators=layer1, final_estimator=layer3, cv=5)\n", 2086 | "\n", 2087 | " # return model\n", 2088 | " return model" 2089 | ], 2090 | "execution_count": null, 2091 | "outputs": [] 2092 | }, 2093 | { 2094 | "cell_type": "code", 2095 | "metadata": { 2096 | "id": "CnMMqOJ16Bft", 2097 | "colab_type": "code", 2098 | "colab": {} 2099 | }, 2100 | "source": [ 2101 | "# Assign get_models() to a variable called models\n", 2102 | "models = get_models()" 2103 | ], 2104 | "execution_count": null, 2105 | "outputs": [] 2106 | }, 2107 | { 2108 | "cell_type": "code", 2109 | "metadata": { 2110 | "id": "kvzSjLOEIKUx", 2111 | "colab_type": "code", 2112 | "colab": {} 2113 | }, 2114 | "source": [ 2115 | "# Evaluate the models and store results\n", 2116 | "# Create an empty list for the results\n", 2117 | "results = list()\n", 2118 | "\n", 2119 | "# Create an empty list for the model names\n", 2120 | "names = list()\n", 2121 | "\n", 2122 | "# Create a for loop that iterates over each name, model in models dictionary \n", 2123 | "for name, model in models.items():\n", 2124 | "\n", 2125 | "\t# Call evaluate_model(model) and assign it to variable called scores\n", 2126 | "\tscores = evaluate_model(model)\n", 2127 | " \n", 2128 | " # Append output from scores to the results list\n", 2129 | "\tresults.append(scores)\n", 2130 | " \n", 2131 | " # Append name to the names list\n", 2132 | "\tnames.append(name)\n", 2133 | " \n", 2134 | " # Print name, mean and standard deviation of scores:\n", 2135 | "\tprint('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))\n", 2136 | " \n", 2137 | "# Plot model performance for comparison using names for x and results for y and setting showmeans to True\n", 2138 | "sns.boxplot(x=names, y=results, 
showmeans=True)" 2139 | ], 2140 | "execution_count": null, 2141 | "outputs": [] 2142 | }, 2143 | { 2144 | "cell_type": "markdown", 2145 | "metadata": { 2146 | "id": "ZMgN44SwcJPG", 2147 | "colab_type": "text" 2148 | }, 2149 | "source": [ 2150 | "## **Final Observation**\n", 2151 | "- Adding a layer did not improve results.\n", 2152 | "- Complexity does not always make a better model.\n", 2153 | "- We could try stacking different base models for both of the datasets; that may show improvements over the baselines.\n", 2154 | "- Generate polynomial features.\n", 2155 | "- Try sklearn feature selection.\n", 2156 | "- Try feature engineering - creating new features from existing ones (but remember to remove the original features to avoid multicollinearity).\n", 2157 | "- Tune hyperparameters with grid search, as we did previously with the Stacking Classifier.\n", 2158 | "- When there is a tie between a baseline model and a stacked model, choose the simpler model!" 2159 | ] 2160 | }, 2161 | { 2162 | "cell_type": "markdown", 2163 | "metadata": { 2164 | "id": "Z4iX02EkDujS", 2165 | "colab_type": "text" 2166 | }, 2167 | "source": [ 2168 | "---\n", 2169 | "\n", 2170 | "# Q&A\n", 2171 | "\n", 2172 | "---" 2173 | ] 2174 | }, 2175 | { 2176 | "cell_type": "markdown", 2177 | "metadata": { 2178 | "id": "kNWB_J4QD0Ad", 2179 | "colab_type": "text" 2180 | }, 2181 | "source": [ 2182 | "# Back to the slides for wrap-up..." 2183 | ] 2184 | } 2185 | ] 2186 | } --------------------------------------------------------------------------------