├── 001_Decision_Tree_PlayGolf_ID3.ipynb ├── 002_Decision_Tree_PlayGolf_CART.ipynb ├── 003_Decision_Tree_Visualisation_Iris_Dataset.ipynb ├── 004_Decision_Tree_Classifier_Iris_Dataset.ipynb ├── LICENSE ├── README.md ├── dataset ├── playgolf_data.csv ├── playgolf_data.docx └── playgolf_data2.csv ├── img ├── CARTpg.png ├── ID3.png ├── ID3pg.png ├── decisiontree.png ├── dnld_rep.png ├── dt.png ├── dt0.png ├── dt1.png ├── dt2.png ├── dt3.png ├── iris.png ├── irisFlow.png └── playgolf.png └── output_DecisionTree ├── iris_DecisionTree (1).log ├── iris_DecisionTree_dtreeviz ├── iris_DecisionTree_dtreeviz.png ├── iris_DecisionTree_dtreeviz.svg ├── iris_DecisionTree_graphivz1 ├── iris_DecisionTree_graphivz1.png ├── iris_DecisionTree_graphivz2.png ├── iris_DecisionTree_plotTree.png ├── iris_DecisionTree_regression1.txt ├── iris_DecisionTree_regression2 ├── iris_DecisionTree_regression2.png ├── iris_DecisionTree_regression3 ├── iris_DecisionTree_regression3.png ├── iris_DecisionTree_regression3.svg ├── iris_DecisionTree_text.txt └── iris_DecisionTree_textRep.png /002_Decision_Tree_PlayGolf_CART.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "\n", 8 | "All the IPython Notebooks in this **Python Decision Tree and Random Forest** series by Dr. Milaan Parmar are available @ **[GitHub](https://github.com/milaan9/Python_Decision_Tree_and_Random_Forest)**\n", 9 | "" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "# Decision Tree\n", 17 | "\n", 18 | "A Decision Tree is one of the popular and powerful machine learning algorithms that I have learned. It is a non-parametric supervised learning method that can be used for both classification and regression tasks. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. 
For a classification model, the target values are discrete in nature, whereas, for a regression model, the target values are represented by continuous values. Unlike black-box algorithms such as Neural Networks, Decision Trees are comparatively easy to understand because they expose their internal decision-making logic (you will find details in the following section).\n", 19 | "\n", 20 | "Despite the fact that many data scientists believe it’s an old method and may have some doubts about its accuracy due to overfitting problems, more recent tree-based models, for example, Random Forest (bagging method), Gradient Boosting (boosting method) and XGBoost (boosting method), are built on top of the decision tree algorithm. Therefore, the concepts and algorithms behind Decision Trees are well worth understanding!" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "There are *four* popular types of decision tree algorithms: \n", 28 | "\n", 29 | "1. **ID3**\n", 30 | "2. **CART (Classification And Regression Trees)**\n", 31 | "3. **Chi-Square**\n", 32 | "4. **Reduction in Variance**\n", 33 | "\n", 34 | "In this class, we'll focus on classification trees and the explanation of **CART (Classification And Regression Trees)**." 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "**Example:**\n", 42 | "\n", 43 | ">You play golf every Sunday and you invite your best friend, Arthur, to come with you every time. Arthur sometimes joins you and sometimes does not. For him, it depends on a number of factors, for example, **Outlook**, **Temperature**, **Humidity** and **Wind**. We'll use the dataset of the last two weeks to predict whether or not Arthur will join you to play golf. An intuitive way to do this is through a Decision Tree.\n", 44 | "\n", 45 | "
\n", 46 | "\n", 47 | "
" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "
\n", 55 | "\n", 56 | "
" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "* **Root Node:** \n", 64 | " - The attribute that best classifies the training data; use this attribute at the root of the tree. \n", 65 | " - The first split, which decides how the entire population or sample data should be divided into two or more homogeneous sets.\n", 66 | " \n", 67 | "* **Splitting:** It is a process of dividing a node into two or more *sub-nodes*.\n", 68 | "\n", 69 | ">**Question:** Based on which attribute (feature) should we split? What is the best split?\n", 70 | "\n", 71 | ">**Answer:** Use the attribute with the highest **Information Gain** or **Gini Gain**.\n", 72 | "\n", 73 | "* **Decision Node:** This node decides whether/when a *sub-node* splits into further sub-nodes or not.\n", 74 | "\n", 75 | "* **Leaf:** A terminal node that predicts the outcome (a categorical or continuous value). The *coloured nodes*, i.e., the *Yes* and *No* nodes, are the leaves." 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "## CART (Classification and Regression Tree)\n", 83 | "\n", 84 | "ID3 uses information gain whereas C4.5 uses gain ratio for splitting. CART is an alternative decision-tree-building algorithm that can handle both classification and regression tasks. For classification tasks, CART creates decision points using the Gini method, which involves the Gini Index (Gini Impurity) and the Gini Gain.\n", 85 | "\n", 86 | "We will walk through a step-by-step CART decision tree example by hand, from scratch.\n", 87 | "\n", 88 | ">**Question:** What is the **“Gini Index”**, and what is its function?\n", 89 | "\n", 90 | ">**Answer:** The Gini index is a metric for classification tasks in CART. It is computed from the sum of the squared probabilities of each class. We can formulate it as illustrated below.\n", 91 | "\n", 92 | "1. 
Compute the Gini Index for each attribute/feature value: **Gini (Attribute='value')** \n", 93 | " $$Gini (Attribute=value) = Gini(A_v) = 1 - \sum_{j} {p_{j}}^2$$ \n", 94 | " where j = 1 to the number of classes\n", 95 | "\n", 96 | "2. Compute the weighted sum of Gini Indexes for the attribute/feature: **Gini (Attribute)**\n", 97 | " $$Gini (Attribute) = \sum_{v} {p_{v}} * Gini(A_v) $$ \n", 98 | " where v = values of the attribute/feature and $p_v$ = fraction of instances taking value v\n", 99 | " \n", 100 | "3. Pick the attribute with the **Lowest Gini Index**.\n", 101 | "\n", 102 | "4. **Repeat** until we get the desired tree.\n", 103 | "\n", 104 | "After calculating the Gini Gain for every attribute, **`sklearn.tree.DecisionTreeClassifier`** will choose the attribute with the **largest Gini Gain** (equivalently, the lowest weighted Gini Index) as the Root Node. A branch with a Gini of 0 is a leaf node, while a branch with a Gini greater than 0 needs further splitting. Nodes are grown recursively until all data is classified (see the details below)." 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "### Dataset\n", 112 | "\n", 113 | "We will work on the same **[playgolf](https://github.com/milaan9/Python_Decision_Tree_and_Random_Forest/blob/main/dataset/playgolf_data.csv)** dataset as in **[ID3](https://github.com/milaan9/Python_Decision_Tree_and_Random_Forest/blob/main/001_Decision_Tree_PlayGolf_ID3.ipynb)**. There are 14 instances of golf-playing decisions based on the **Outlook**, **Temperature**, **Humidity** and **Wind** factors. \n", 114 | "\n", 115 | "
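The two Gini formulas above translate directly into a few lines of Python. This is a minimal sketch (the helper names `gini` and `weighted_gini` are mine, not from the notebook), applied to the Outlook column of the playgolf dataset:

```python
from collections import Counter

def gini(labels):
    """Gini index of one subset: 1 - sum of squared class probabilities."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(values, labels):
    """Weighted sum of Gini indexes over the values of one attribute."""
    n = len(labels)
    total = 0.0
    for v in set(values):
        subset = [label for val, label in zip(values, labels) if val == v]
        total += len(subset) / n * gini(subset)
    return total

# Outlook column and PlayGolf decisions, grouped as in the tables of this notebook.
outlook = ["Sunny"] * 5 + ["Overcast"] * 4 + ["Rainy"] * 5
play = ["No", "No", "No", "Yes", "Yes"] + ["Yes"] * 4 + ["Yes", "Yes", "No", "Yes", "No"]
print(round(weighted_gini(outlook, play), 4))  # 0.3429, matching the hand calculation
```

The same two helpers can be reused for every attribute and every sub-dataset in the steps that follow.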
\n", 116 | "\n", 117 | "
" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "### 1. Calculate Gini Index for each Attribute of Dataset" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "### Gini Index for each Attribute: (say, Outlook)\n", 132 | "\n", 133 | "#### Calculate Gini Index for each value, i.e. for 'Sunny', 'Rainy' and 'Overcast'.\n", 134 | "\n", 135 | "| Outlook | PlayGolf | | Outlook | PlayGolf | | Outlook | PlayGolf |\n", 136 | "|:---------:|:--------:|:---:|:---------:|:--------:|:---:|:------------:|:--------:|\n", 137 | "| **Sunny** | **No**❌ | \| | **Rainy** | **Yes**✅| \| | **Overcast** | **Yes**✅|\n", 138 | "| **Sunny** | **No**❌ | \| | **Rainy** | **Yes**✅| \| | **Overcast** | **Yes**✅|\n", 139 | "| **Sunny** | **No**❌ | \| | **Rainy** | **No**❌ | \| | **Overcast** | **Yes**✅|\n", 140 | "| **Sunny** | **Yes**✅| \| | **Rainy** | **Yes**✅| \| | **Overcast** | **Yes**✅|\n", 141 | "| **Sunny** | **Yes**✅| \| | **Rainy** | **No**❌ | \| | | |\n", 142 | "\n", 143 | "| Outlook | Yes | No | Total |\n", 144 | "|:------------ |:----:|:-----:|:-----:|\n", 145 | "| **Sunny** |**2** | **3** | **5** |\n", 146 | "| **Overcast** |**4** | **0** | **4** |\n", 147 | "| **Rainy** |**3** | **2** | **5** |\n", 148 | "| **Total** |**9** | **5** | **14** |\n", 149 | "\n", 150 | "1. Calculate Gini Index(Outlook='Value'):\n", 151 | "\n", 152 | "$$ Gini(Outlook = Sunny) = 1 - {\Big(\frac{2}{5}\Big)}^2 - {\Big(\frac{3}{5}\Big)}^2 = 0.48 $$\n", 153 | "\n", 154 | "➡$$ Gini(Outlook = Overcast) = 1 - {\Big(\frac{4}{4}\Big)}^2 - {\Big(\frac{0}{4}\Big)}^2 = 0 $$\n", 155 | "\n", 156 | "➡$$ Gini(Outlook = Rainy) = 1 - {\Big(\frac{3}{5}\Big)}^2 - {\Big(\frac{2}{5}\Big)}^2 = 0.48 $$\n", 157 | "\n", 158 | "2. 
Calculate weighted sum of Gini indexes for the Outlook attribute.\n", 159 | "\n", 160 | "$$ Gini(Outlook) = \frac{5}{14} * (0.48) + \frac{4}{14} * (0) + \frac{5}{14} * (0.48) = 0.3429 $$" 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "### Gini Index for each Attribute: (say, Temperature)\n", 168 | "\n", 169 | "#### Calculate Gini Index for each value, i.e. for 'Hot', 'Mild' and 'Cool'.\n", 170 | "\n", 171 | "| Temperature | PlayGolf | | Temperature | PlayGolf | | Temperature | PlayGolf |\n", 172 | "|:-----------:|:--------:|:---:|:-----------:|:--------:|:---:|:-----------:|:--------:|\n", 173 | "| **Hot** | **No**❌ | \| | **Mild** | **Yes**✅ | \| | **Cool** | **Yes**✅ |\n", 174 | "| **Hot** | **No**❌ | \| | **Mild** | **No**❌ | \| | **Cool** | **No**❌ |\n", 175 | "| **Hot** | **Yes**✅ | \| | **Mild** | **Yes**✅ | \| | **Cool** | **Yes**✅ |\n", 176 | "| **Hot** | **Yes**✅ | \| | **Mild** | **Yes**✅ | \| | **Cool** | **Yes**✅ |\n", 177 | "| | | \| | **Mild** | **Yes**✅ | \| | | |\n", 178 | "| | | \| | **Mild** | **No**❌ | \| | | |\n", 179 | "\n", 180 | "| Temperature | Yes | No | Total |\n", 181 | "|:------------|:----:|:-----:|:-----:|\n", 182 | "| **Hot** |**2** | **2** | **4** |\n", 183 | "| **Mild** |**4** | **2** | **6** |\n", 184 | "| **Cool** |**3** | **1** | **4** |\n", 185 | "| **Total** |**9** | **5** | **14** |\n", 186 | "\n", 187 | "1. Calculate Gini Index(Temperature='Value'):\n", 188 | "\n", 189 | "$$ Gini(Temperature = Hot) = 1 - {\Big(\frac{2}{4}\Big)}^2 - {\Big(\frac{2}{4}\Big)}^2 = 0.5 $$\n", 190 | "\n", 191 | "➡$$ Gini(Temperature = Mild) = 1 - {\Big(\frac{4}{6}\Big)}^2 - {\Big(\frac{2}{6}\Big)}^2 = 0.444 $$\n", 192 | "\n", 193 | "➡$$ Gini(Temperature = Cool) = 1 - {\Big(\frac{3}{4}\Big)}^2 - {\Big(\frac{1}{4}\Big)}^2 = 0.375 $$\n", 194 | "\n", 195 | "2. 
Calculate weighted sum of Gini indexes for the Temperature attribute.\n", 196 | "\n", 197 | "$$ Gini(Temperature) = \frac{4}{14} * (0.5) + \frac{6}{14} * (0.444) + \frac{4}{14} * (0.375) = 0.440 $$" 198 | ] 199 | }, 200 | { 201 | "cell_type": "markdown", 202 | "metadata": {}, 203 | "source": [ 204 | "### Gini Index for each Attribute: (say, Humidity)\n", 205 | "\n", 206 | "#### Calculate Gini Index for each value, i.e. for 'Normal' and 'High'.\n", 207 | "\n", 208 | "| Humidity | PlayGolf | | Humidity | PlayGolf | \n", 209 | "|:--------:|:--------:|:---:|:--------:|:--------:|\n", 210 | "| **Normal** | **Yes**✅ | \| | **High** | **No**❌ | \n", 211 | "| **Normal** | **No**❌ | \| | **High** | **No**❌ | \n", 212 | "| **Normal** | **Yes**✅ | \| | **High** | **Yes**✅ | \n", 213 | "| **Normal** | **Yes**✅ | \| | **High** | **Yes**✅ | \n", 214 | "| **Normal** | **Yes**✅ | \| | **High** | **No**❌ | \n", 215 | "| **Normal** | **Yes**✅ | \| | **High** | **Yes**✅ | \n", 216 | "| **Normal** | **Yes**✅ | \| | **High** | **No**❌ | \n", 217 | "\n", 218 | "| Humidity | Yes | No | Total |\n", 219 | "|:----------|:----:|:-----:|:-----:|\n", 220 | "| **Normal**|**6** | **1** | **7** |\n", 221 | "| **High** |**3** | **4** | **7** |\n", 222 | "| **Total** |**9** | **5** | **14** |\n", 223 | "\n", 224 | "1. Calculate Gini Index(Humidity='Value'):\n", 225 | "\n", 226 | "$$ Gini(Humidity = Normal) = 1 - {\Big(\frac{6}{7}\Big)}^2 - {\Big(\frac{1}{7}\Big)}^2 = 0.244 $$\n", 227 | "\n", 228 | "➡$$ Gini(Humidity = High) = 1 - {\Big(\frac{3}{7}\Big)}^2 - {\Big(\frac{4}{7}\Big)}^2 = 0.489 $$\n", 229 | "\n", 230 | "2. 
Calculate weighted sum of Gini indexes for the Humidity attribute.\n", 231 | "\n", 232 | "$$ Gini(Humidity) = \frac{7}{14} * (0.244) + \frac{7}{14} * (0.489) = 0.367 $$" 233 | ] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "metadata": {}, 238 | "source": [ 239 | "### Gini Index for each Attribute: (say, Wind)\n", 240 | "\n", 241 | "#### Calculate Gini Index for each value, i.e. for 'Weak' and 'Strong'.\n", 242 | "\n", 243 | "| Wind | PlayGolf | | Wind | PlayGolf | \n", 244 | "|:--------:|:--------:|:---:|:--------:|:--------:|\n", 245 | "| **Weak** | **No**❌ | \| | **Strong** | **No**❌ | \n", 246 | "| **Weak** | **Yes**✅ | \| | **Strong** | **No**❌ | \n", 247 | "| **Weak** | **Yes**✅ | \| | **Strong** | **Yes**✅ | \n", 248 | "| **Weak** | **Yes**✅ | \| | **Strong** | **Yes**✅ | \n", 249 | "| **Weak** | **No**❌ | \| | **Strong** | **Yes**✅ | \n", 250 | "| **Weak** | **Yes**✅ | \| | **Strong** | **No**❌ | \n", 251 | "| **Weak** | **Yes**✅ | \| | | | \n", 252 | "| **Weak** | **Yes**✅ | \| | | | \n", 253 | "\n", 254 | "| Wind | Yes | No | Total |\n", 255 | "|:-------|:----:|:-----:|:-----:|\n", 256 | "| **Weak** |**6** | **2** | **8** |\n", 257 | "| **Strong**|**3** | **3** | **6** |\n", 258 | "| **Total** |**9** | **5** | **14** |\n", 259 | "\n", 260 | "1. Calculate Gini Index(Wind='Value'):\n", 261 | "\n", 262 | "$$ Gini(Wind = Weak) = 1 - {\Big(\frac{6}{8}\Big)}^2 - {\Big(\frac{2}{8}\Big)}^2 = 0.375 $$\n", 263 | "\n", 264 | "➡$$ Gini(Wind = Strong) = 1 - {\Big(\frac{3}{6}\Big)}^2 - {\Big(\frac{3}{6}\Big)}^2 = 0.5 $$\n", 265 | "\n", 266 | "2. Calculate weighted sum of Gini indexes for the Wind attribute.\n", 267 | "\n", 268 | "$$ Gini(Wind) = \frac{8}{14} * (0.375) + \frac{6}{14} * (0.5) = 0.428 $$" 269 | ] 270 | }, 271 | { 272 | "cell_type": "markdown", 273 | "metadata": {}, 274 | "source": [ 275 | "### 2. Select Root Node of Dataset\n", 276 | "\n", 277 | "We’ve calculated the Gini Index values for each feature. 
The winner is the **Outlook** attribute because its cost is the lowest (the lowest Gini Index value). That is why the Outlook decision appears in the root node of the tree.\n", 278 | "\n", 279 | "| Attributes | Gini Index | | \n", 280 | "|:----------------|:----------:|:-----------:|\n", 281 | "| **Outlook** | **0.342** |⬅️ Root node |\n", 282 | "| **Temperature** | **0.440** | |\n", 283 | "| **Humidity** | **0.367** | |\n", 284 | "| **Wind** | **0.428** | |\n", 285 | "\n", 286 | "
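The root-node selection can be reproduced programmatically. A small standard-library sketch (the 14-row dataset is transcribed from the playgolf table in this notebook; the helper and variable names are mine):

```python
from collections import Counter

data = {  # the 14-row playgolf dataset, transcribed from the table in this notebook
    "Outlook":     ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Rainy", "Overcast",
                    "Sunny", "Sunny", "Rainy", "Sunny", "Overcast", "Overcast", "Rainy"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                    "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity":    ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                    "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind":        ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                    "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
}
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

def gini(labels):
    """Gini index: 1 - sum of squared class probabilities."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(column, labels):
    """Weighted sum of Gini indexes over the values of one attribute."""
    n = len(labels)
    return sum(
        len(sub) / n * gini(sub)
        for v in set(column)
        for sub in [[label for c, label in zip(column, labels) if c == v]]
    )

scores = {attr: round(weighted_gini(col, play), 3) for attr, col in data.items()}
print(scores)                        # Outlook has the lowest weighted Gini
print(min(scores, key=scores.get))   # Outlook
```

This reproduces the table's ranking: Outlook wins with the lowest weighted Gini Index.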
\n", 287 | "\n", 288 | "
" 289 | ] 290 | }, 291 | { 292 | "cell_type": "markdown", 293 | "metadata": {}, 294 | "source": [ 295 | "### 3. Calculate Gini Index for each Attribute when Outlook is Sunny" 296 | ] 297 | }, 298 | { 299 | "cell_type": "markdown", 300 | "metadata": {}, 301 | "source": [ 302 | "Now, we need to examine the sub-datasets for each value of the **Outlook** attribute.\n", 303 | "\n", 304 | "You might notice that the sub-dataset in the **Overcast** branch has only **Yes** decisions. This means that the **Overcast** branch ends in a leaf.\n", 305 | "\n", 306 | "**Outlook = Overcast**\n", 307 | "\n", 308 | "| Outlook | Temperature | Humidity | Wind | PlayGolf | |\n", 309 | "|:-------:|:-----------:|:--------:|:-----:|:--------:|:--------:|\n", 310 | "| **Overcast** | **Hot** | **High** | **Weak** | **Yes** | ✅ |\n", 311 | "| **Overcast** | **Cool** | **Normal** | **Strong** | **Yes** | ✅ |\n", 312 | "| **Overcast** | **Mild** | **High** | **Weak** | **Yes** | ✅ |\n", 313 | "| **Overcast** | **Hot** | **Normal** | **Strong** | **Yes** | ✅ |\n", 314 | "\n", 315 | "Basically, the decision will always be **Yes** if Outlook is **Overcast**.\n", 316 | "\n", 317 | "
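This purity check is easy to verify in code; a tiny sketch (the Overcast decisions are transcribed from the table above):

```python
# PlayGolf decisions of the rows where Outlook == 'Overcast', from the table above.
overcast_decisions = ["Yes", "Yes", "Yes", "Yes"]

# A node is pure (a leaf) when it contains a single class, i.e. its Gini index is 0.
n = len(overcast_decisions)
overcast_gini = 1 - sum((overcast_decisions.count(c) / n) ** 2
                        for c in set(overcast_decisions))
print(overcast_gini)  # 0.0 -> the Overcast branch is a leaf predicting 'Yes'
```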
\n", 318 | "\n", 319 | "
" 320 | ] 321 | }, 322 | { 323 | "cell_type": "markdown", 324 | "metadata": {}, 325 | "source": [ 326 | "We will apply the same principles to the remaining sub-datasets in the following steps.\n", 327 | "\n", 328 | "Focus on the sub-dataset for the **Sunny** **Outlook**. We need to find the Gini Index scores for the **Temperature**, **Humidity** and **Wind** attributes respectively.\n", 329 | "\n", 330 | "**Outlook = Sunny**\n", 331 | "\n", 332 | "| Outlook | Temperature | Humidity | Wind | PlayGolf | |\n", 333 | "|:-------:|:-----------:|:--------:|:-----:|:--------:|:--------:|\n", 334 | "| **Sunny** | **Hot** | **High** | **Weak** | **No** | ❌ |\n", 335 | "| **Sunny** | **Hot** | **High** | **Strong** | **No** | ❌ | \n", 336 | "| **Sunny** | **Mild** | **High** | **Weak** | **No** | ❌ | \n", 337 | "| **Sunny** | **Cool** | **Normal** | **Weak** | **Yes** | ✅ | \n", 338 | "| **Sunny** | **Mild** | **Normal** | **Strong** | **Yes** | ✅ | " 339 | ] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "metadata": {}, 344 | "source": [ 345 | "### Gini Index for each Attribute (say, Temperature) for Sunny Outlook\n", 346 | "\n", 347 | "#### Calculate Gini Index for each Temperature value, i.e. for 'Hot', 'Mild' and 'Cool' for Sunny Outlook.\n", 348 | "\n", 349 | "| Outlook | Temperature | PlayGolf | |\n", 350 | "|:-------:|--------:|--------:|:--------:|\n", 351 | "| **Sunny** | **Cool** | **Yes** | ✅ |\n", 352 | "| **Sunny** | **Hot** | **No** | ❌ | \n", 353 | "| **Sunny** | **Hot** | **No** | ❌ | \n", 354 | "| **Sunny** | **Mild** | **No** | ❌ | \n", 355 | "| **Sunny** | **Mild** | **Yes** | ✅ |\n", 356 | "\n", 357 | "| Temperature | Yes | No | Total |\n", 358 | "|:------------|:----:|:-----:|:-----:|\n", 359 | "| **Hot** | **0** | **2** | **2** |\n", 360 | "| **Mild** | **1** | **1** | **2** |\n", 361 | "| **Cool** | **1** | **0** | **1** |\n", 362 | "| **Total** | **2** | **3** | **5** |\n", 363 | "\n", 364 | "1. 
Calculate Gini Index(Outlook=Sunny|Temperature='value'):\n", 365 | "\n", 366 | "$$ Gini(Outlook=Sunny|Temperature=Hot) = 1 - {\Big(\frac{0}{2}\Big)}^2 - {\Big(\frac{2}{2}\Big)}^2 = 0 $$\n", 367 | "\n", 368 | "➡$$ Gini(Outlook=Sunny|Temperature=Mild) = 1 - {\Big(\frac{1}{2}\Big)}^2 - {\Big(\frac{1}{2}\Big)}^2 = 0.5 $$\n", 369 | "\n", 370 | "➡$$ Gini(Outlook=Sunny|Temperature=Cool) = 1 - {\Big(\frac{1}{1}\Big)}^2 - {\Big(\frac{0}{1}\Big)}^2 = 0 $$\n", 371 | "\n", 372 | "2. Calculate weighted sum of Gini indexes for the Outlook=Sunny|Temperature attribute.\n", 373 | "\n", 374 | "$$ Gini(Outlook=Sunny|Temperature) = \frac{2}{5} * (0) + \frac{2}{5} * (0.5) + \frac{1}{5} * (0) = 0.2 $$" 375 | ] 376 | }, 377 | { 378 | "cell_type": "markdown", 379 | "metadata": {}, 380 | "source": [ 381 | "### Gini Index for each Attribute (say, Humidity) for Sunny Outlook\n", 382 | "\n", 383 | "#### Calculate Gini Index for each Humidity value, i.e. for 'High' and 'Normal' for Sunny Outlook.\n", 384 | "\n", 385 | "| Outlook | Humidity | PlayGolf | |\n", 386 | "|:--------:|:--------:|:--------:|:---:|\n", 387 | "| **Sunny** | **High** | **No** | ❌ |\n", 388 | "| **Sunny** | **High** | **No** | ❌ | \n", 389 | "| **Sunny** | **High** | **No** | ❌ | \n", 390 | "| **Sunny** | **Normal** | **Yes** | ✅ | \n", 391 | "| **Sunny** | **Normal** | **Yes** | ✅ |\n", 392 | "\n", 393 | "| Humidity | Yes | No | Total |\n", 394 | "|:-----------|:----:|:-----:|:-----:|\n", 395 | "| **Normal** | **2** | **0** | **2** |\n", 396 | "| **High** | **0** | **3** | **3** |\n", 397 | "| **Total** | **2** | **3** | **5** |\n", 398 | "\n", 399 | "1. Calculate Gini Index(Outlook=Sunny|Humidity='value'):\n", 400 | "\n", 401 | "$$ Gini(Outlook=Sunny|Humidity=Normal) = 1 - {\Big(\frac{2}{2}\Big)}^2 - {\Big(\frac{0}{2}\Big)}^2 = 0 $$\n", 402 | "\n", 403 | "➡$$ Gini(Outlook=Sunny|Humidity=High) = 1 - {\Big(\frac{0}{3}\Big)}^2 - {\Big(\frac{3}{3}\Big)}^2 = 0 $$\n", 404 | "\n", 405 | "2. 
Calculate weighted sum of Gini indexes for the Outlook=Sunny|Humidity attribute.\n", 406 | "\n", 407 | "$$ Gini(Outlook=Sunny|Humidity) = \frac{2}{5} * (0) + \frac{3}{5} * (0) = 0 $$" 408 | ] 409 | }, 410 | { 411 | "cell_type": "markdown", 412 | "metadata": {}, 413 | "source": [ 414 | "### Gini Index for each Attribute (say, Wind) for Sunny Outlook\n", 415 | "\n", 416 | "#### Calculate Gini Index for each Wind value, i.e. for 'Strong' and 'Weak' for Sunny Outlook.\n", 417 | "\n", 418 | "| Outlook | Wind |PlayGolf | |\n", 419 | "|:-------:|--------:|--------:|:--------:|\n", 420 | "| **Sunny** | **Strong** | **No** | ❌ |\n", 421 | "| **Sunny** | **Strong** | **Yes** | ✅ | \n", 422 | "| **Sunny** | **Weak** | **No** | ❌ | \n", 423 | "| **Sunny** | **Weak** | **No** | ❌ | \n", 424 | "| **Sunny** | **Weak** | **Yes** | ✅ |\n", 425 | "\n", 426 | "| Wind | Yes | No | Total |\n", 427 | "|:-----------|:----:|:-----:|:-----:|\n", 428 | "| **Weak** | **1** | **2** | **3** |\n", 429 | "| **Strong** | **1** | **1** | **2** |\n", 430 | "| **Total** | **2** | **3** | **5** |\n", 431 | "\n", 432 | "1. Calculate Gini Index(Outlook=Sunny|Wind='value'):\n", 433 | "\n", 434 | "$$ Gini(Outlook=Sunny|Wind=Weak) = 1 - {\Big(\frac{1}{3}\Big)}^2 - {\Big(\frac{2}{3}\Big)}^2 = 0.444 $$\n", 435 | "\n", 436 | "➡$$ Gini(Outlook=Sunny|Wind=Strong) = 1 - {\Big(\frac{1}{2}\Big)}^2 - {\Big(\frac{1}{2}\Big)}^2 = 0.5 $$\n", 437 | "\n", 438 | "2. Calculate weighted sum of Gini indexes for the Outlook=Sunny|Wind attribute.\n", 439 | "\n", 440 | "$$ Gini(Outlook=Sunny|Wind) = \frac{3}{5} * (0.444) + \frac{2}{5} * (0.5) = 0.466 $$" 441 | ] 442 | }, 443 | { 444 | "cell_type": "markdown", 445 | "metadata": {}, 446 | "source": [ 447 | "### 4. Select Root Node of Sub-Dataset when Outlook is Sunny\n", 448 | "\n", 449 | "Decision for the **Sunny** **Outlook**\n", 450 | "\n", 451 | "We’ve calculated the Gini Index values for each attribute when Outlook is Sunny. 
The winner is the **Humidity** attribute because its cost is the lowest (the lowest Gini Index value). That is why the **Humidity** decision appears in the next node under the Sunny branch.\n", 452 | "\n", 453 | "| Attributes | Gini Index | |\n", 454 | "|:----------------|:---------:|:---------:|\n", 455 | "| **Temperature** | **0.2** | |\n", 456 | "| **Humidity** | **0** | ⬅️ Root node|\n", 457 | "| **Wind** | **0.466** | |" 458 | ] 459 | }, 460 | { 461 | "cell_type": "markdown", 462 | "metadata": {}, 463 | "source": [ 464 | "Also, the decision is always **No** for High Humidity and Sunny Outlook. On the other hand, the decision is always **Yes** for Normal Humidity and Sunny Outlook. This branch is over.\n", 465 | "\n", 466 | "
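The Sunny-branch selection can be checked in code as well; a minimal sketch (the five Sunny rows are transcribed from the sub-dataset table above, and the helper names are mine):

```python
from collections import Counter

# The five Sunny rows: (Temperature, Humidity, Wind, PlayGolf).
sunny = [
    ("Hot",  "High",   "Weak",   "No"),
    ("Hot",  "High",   "Strong", "No"),
    ("Mild", "High",   "Weak",   "No"),
    ("Cool", "Normal", "Weak",   "Yes"),
    ("Mild", "Normal", "Strong", "Yes"),
]

def gini(labels):
    """Gini index: 1 - sum of squared class probabilities."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(rows, col):
    """Weighted Gini of attribute at index `col`; the label is the last field."""
    return sum(
        len(sub) / len(rows) * gini(sub)
        for v in {r[col] for r in rows}
        for sub in [[r[-1] for r in rows if r[col] == v]]
    )

names = ["Temperature", "Humidity", "Wind"]
scores = {name: round(weighted_gini(sunny, i), 3) for i, name in enumerate(names)}
print(scores)  # Humidity scores 0.0, so it splits the Sunny branch perfectly
```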
\n", 467 | "\n", 468 | "
" 469 | ] 470 | }, 471 | { 472 | "cell_type": "markdown", 473 | "metadata": {}, 474 | "source": [ 475 | "### 5. Calculate Gini Index for each Attribute when Outlook is Rainy" 476 | ] 477 | }, 478 | { 479 | "cell_type": "markdown", 480 | "metadata": {}, 481 | "source": [ 482 | "Finally, we need to focus on the **Rainy** **Outlook**.\n", 483 | "\n", 484 | "Focus on the sub-dataset for the **Rainy** **Outlook**. We need to find the Gini Index scores for the **Temperature**, **Humidity** and **Wind** attributes respectively.\n", 485 | "\n", 486 | "**Outlook = Rainy**\n", 487 | "\n", 488 | "| Outlook | Temperature | Humidity | Wind | PlayGolf | |\n", 489 | "|:-------:|:-----------:|:--------:|:-----:|:--------:|:--------:|\n", 490 | "| **Rainy** | **Mild** | **High** | **Weak** | **Yes** | ✅ |\n", 491 | "| **Rainy** | **Cool** | **Normal** | **Weak** | **Yes** | ✅ |\n", 492 | "| **Rainy** | **Cool** | **Normal** | **Strong** | **No** | ❌ |\n", 493 | "| **Rainy** | **Mild** | **Normal** | **Weak** | **Yes** | ✅ |\n", 494 | "| **Rainy** | **Mild** | **High** | **Strong** | **No** | ❌ |" 495 | ] 496 | }, 497 | { 498 | "cell_type": "markdown", 499 | "metadata": {}, 500 | "source": [ 501 | "### Gini Index for each Attribute (say, Temperature) for Rainy Outlook\n", 502 | "\n", 503 | "#### Calculate Gini Index for each Temperature value, i.e. for 'Mild' and 'Cool' for Rainy Outlook.\n", 504 | "\n", 505 | "\n", 506 | "| Outlook | Temperature | PlayGolf | |\n", 507 | "|:-------:|:-----------:|:--------:|:--------:|\n", 508 | "| **Rainy** | **Mild** | **Yes** | ✅ |\n", 509 | "| **Rainy** | **Cool** | **Yes** | ✅ | \n", 510 | "| **Rainy** | **Cool** | **No** | ❌ | \n", 511 | "| **Rainy** | **Mild** | **Yes** | ✅ | \n", 512 | "| **Rainy** | **Mild** | **No** | ❌ |\n", 513 | "\n", 514 | "| Temperature | Yes | No | Total |\n", 515 | "|:------------|:----:|:-----:|:-----:|\n", 516 | "| **Mild** | **2** | **1** | **3** |\n", 517 | "| **Cool** | **1** | **1** | **2** |\n", 518 | "| **Total** | **3** | 
**2** | **5** |\n", 519 | "\n", 520 | "1. Calculate Gini Index(Outlook=Rainy|Temperature='value'):\n", 521 | "\n", 522 | "$$ Gini(Outlook=Rainy|Temperature=Mild) = 1 - {\Big(\frac{2}{3}\Big)}^2 - {\Big(\frac{1}{3}\Big)}^2 = 0.444 $$\n", 523 | "\n", 524 | "➡$$ Gini(Outlook=Rainy|Temperature=Cool) = 1 - {\Big(\frac{1}{2}\Big)}^2 - {\Big(\frac{1}{2}\Big)}^2 = 0.5 $$\n", 525 | "\n", 526 | "2. Calculate weighted sum of Gini indexes for the Outlook=Rainy|Temperature attribute.\n", 527 | "\n", 528 | "$$ Gini(Outlook=Rainy|Temperature) = \frac{3}{5} * (0.444) + \frac{2}{5} * (0.5) = 0.466 $$" 529 | ] 530 | }, 531 | { 532 | "cell_type": "markdown", 533 | "metadata": {}, 534 | "source": [ 535 | "### Gini Index for each Attribute (say, Humidity) for Rainy Outlook\n", 536 | "\n", 537 | "#### Calculate Gini Index for each Humidity value, i.e. for 'High' and 'Normal' for Rainy Outlook.\n", 538 | "\n", 539 | "| Outlook | Humidity | PlayGolf | |\n", 540 | "|:-------:|--------:|--------:|:--------:|\n", 541 | "| **Rainy** | **High** | **Yes** | ✅ |\n", 542 | "| **Rainy** | **High** | **No** | ❌ | \n", 543 | "| **Rainy** | **Normal** | **Yes** | ✅ | \n", 544 | "| **Rainy** | **Normal** | **No** | ❌ | \n", 545 | "| **Rainy** | **Normal** | **Yes** | ✅ |\n", 546 | "\n", 547 | "| Humidity | Yes | No | Total |\n", 548 | "|:-----------|:----:|:-----:|:-----:|\n", 549 | "| **Normal** | **2** | **1** | **3** |\n", 550 | "| **High** | **1** | **1** | **2** |\n", 551 | "| **Total** | **3** | **2** | **5** |\n", 552 | "\n", 553 | "1. Calculate Gini Index(Outlook=Rainy|Humidity='value'):\n", 554 | "\n", 555 | "$$ Gini(Outlook=Rainy|Humidity=Normal) = 1 - {\Big(\frac{2}{3}\Big)}^2 - {\Big(\frac{1}{3}\Big)}^2 = 0.444 $$\n", 556 | "\n", 557 | "➡$$ Gini(Outlook=Rainy|Humidity=High) = 1 - {\Big(\frac{1}{2}\Big)}^2 - {\Big(\frac{1}{2}\Big)}^2 = 0.5 $$\n", 558 | "\n", 559 | "2. 
Calculate weighted sum of Gini indexes for the Outlook=Rainy|Humidity attribute.\n", 560 | "\n", 561 | "$$ Gini(Outlook=Rainy|Humidity) = \frac{3}{5} * (0.444) + \frac{2}{5} * (0.5) = 0.466 $$" 562 | ] 563 | }, 564 | { 565 | "cell_type": "markdown", 566 | "metadata": {}, 567 | "source": [ 568 | "### Gini Index for each Attribute (say, Wind) for Rainy Outlook\n", 569 | "\n", 570 | "#### Calculate Gini Index for each Wind value, i.e. for 'Strong' and 'Weak' for Rainy Outlook.\n", 571 | "\n", 572 | "| Outlook | Wind | PlayGolf | |\n", 573 | "|:-------:|--------:|--------:|:--------:|\n", 574 | "| **Rainy** | **Strong** | **No** | ❌ |\n", 575 | "| **Rainy** | **Strong** | **No** | ❌ | \n", 576 | "| **Rainy** | **Weak** | **Yes** | ✅ | \n", 577 | "| **Rainy** | **Weak** | **Yes** | ✅ | \n", 578 | "| **Rainy** | **Weak** | **Yes** | ✅ |\n", 579 | "\n", 580 | "| Wind | Yes | No | Total |\n", 581 | "|:-----------|:----:|:-----:|:-----:|\n", 582 | "| **Weak** | **3** | **0** | **3** |\n", 583 | "| **Strong** | **0** | **2** | **2** |\n", 584 | "| **Total** | **3** | **2** | **5** |\n", 585 | "\n", 586 | "1. Calculate Gini Index(Outlook=Rainy|Wind='value'):\n", 587 | "\n", 588 | "$$ Gini(Outlook=Rainy|Wind=Weak) = 1 - {\Big(\frac{3}{3}\Big)}^2 - {\Big(\frac{0}{3}\Big)}^2 = 0 $$\n", 589 | "\n", 590 | "➡$$ Gini(Outlook=Rainy|Wind=Strong) = 1 - {\Big(\frac{0}{2}\Big)}^2 - {\Big(\frac{2}{2}\Big)}^2 = 0 $$\n", 591 | "\n", 592 | "2. Calculate weighted sum of Gini indexes for the Outlook=Rainy|Wind attribute.\n", 593 | "\n", 594 | "$$ Gini(Outlook=Rainy|Wind) = \frac{3}{5} * (0) + \frac{2}{5} * (0) = 0 $$" 595 | ] 596 | }, 597 | { 598 | "cell_type": "markdown", 599 | "metadata": {}, 600 | "source": [ 601 | "### 6. Select Root Node of Sub-Dataset when Outlook is Rainy\n", 602 | "\n", 603 | "Decision for the **Rainy** **Outlook**\n", 604 | "\n", 605 | "We’ve calculated the Gini Index values for each attribute when Outlook is Rainy. 
The winner is the **Wind** attribute because its cost is the lowest (the lowest Gini Index value). That is why the **Wind** decision appears in the next node under the Rainy branch.\n", 606 | "\n", 607 | "\n", 608 | "| Attributes | Gini Index | |\n", 609 | "|:----------------|:---------:|:---------:|\n", 610 | "| **Temperature** | **0.466** | |\n", 611 | "| **Humidity** | **0.466** | |\n", 612 | "| **Wind** | **0** | ⬅️ Root node|" 613 | ] 614 | }, 615 | { 616 | "cell_type": "markdown", 617 | "metadata": {}, 618 | "source": [ 619 | "Also, the decision is always **Yes** when Wind is Weak. On the other hand, the decision is always **No** when Wind is Strong. This means that this branch is over.\n", 620 | "\n", 621 | "
\n", 622 | "\n", 623 | "
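The finished branches give the complete tree: Outlook at the root, Humidity under Sunny, and Wind under Rainy. As a sketch, it reduces to plain if/else rules (the function name is mine):

```python
def predict_playgolf(outlook, humidity, wind):
    """The CART tree derived above: Outlook at the root, then
    Humidity under the Sunny branch and Wind under the Rainy branch."""
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Rainy":
        return "Yes" if wind == "Weak" else "No"
    raise ValueError(f"unknown outlook: {outlook!r}")

print(predict_playgolf("Sunny", "High", "Weak"))    # No
print(predict_playgolf("Rainy", "Normal", "Weak"))  # Yes
```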
\n", 624 | "\n", 625 | "So, decision tree construction is over. We can use the following rules for decision-making." 626 | ] 627 | }, 628 | { 629 | "cell_type": "markdown", 630 | "metadata": {}, 631 | "source": [ 632 | ">**Question:** So… which should I use, Gini Index or Entropy? \n", 633 | "\n", 634 | ">**Answer:** Generally the result should be the same… I personally prefer the Gini Index because it doesn’t involve the more computationally intensive logarithm. But why not try both?\n", 635 | "\n", 636 | "Let us compare both:\n", 637 | "\n", 638 | "| Training Algorithm | CART (Classification and Regression Tree)| ID3 (Iterative Dichotomiser) |\n", 639 | "|:-------------------|:--|:-----|\n", 640 | "|**Target(s)** | **Classification and Regression** | **Classification** |\n", 641 | "|**Metric** | **Gini Index** | **Entropy function and Information gain** |\n", 642 | "|**Cost Function (Based on what to split?)** | **Selects its splits to achieve the subsets that minimize Gini Impurity** | **Yields the largest Information Gain for categorical targets** |" 643 | ] 644 | }, 645 | { 646 | "cell_type": "markdown", 647 | "metadata": {}, 648 | "source": [ 649 | "## Building a Decision Tree using `scikit-learn`" 650 | ] 651 | }, 652 | { 653 | "cell_type": "code", 654 | "execution_count": 1, 655 | "metadata": { 656 | "ExecuteTime": { 657 | "end_time": "2021-08-19T07:33:24.442589Z", 658 | "start_time": "2021-08-19T07:33:05.345081Z" 659 | } 660 | }, 661 | "outputs": [], 662 | "source": [ 663 | "# Importing the necessary modules\n", 664 | "\n", 665 | "import numpy as np\n", 666 | "import pandas as pd" 667 | ] 668 | }, 669 | { 670 | "cell_type": "code", 671 | "execution_count": 2, 672 | "metadata": { 673 | "ExecuteTime": { 674 | "end_time": "2021-08-19T07:33:33.717907Z", 675 | "start_time": "2021-08-19T07:33:33.172016Z" 676 | }, 677 | "scrolled": true 678 | }, 679 | "outputs": [ 680 | { 681 | "data": { 682 | "text/html": [ 683 | "
\n", 684 | "\n", 697 | "\n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | " \n", 795 | " \n", 796 | " \n", 797 | " \n", 798 | " \n", 799 | " \n", 800 | " \n", 801 | " \n", 802 | " \n", 803 | " \n", 804 | " \n", 805 | " \n", 806 | " \n", 807 | " \n", 808 | " \n", 809 | " \n", 810 | " \n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | "
OutlookTemperatureHumidityWindPlayGolf
0SunnyHotHighWeakNo
1SunnyHotHighStrongNo
2OvercastHotHighWeakYes
3RainyMildHighWeakYes
4RainyCoolNormalWeakYes
5RainyCoolNormalStrongNo
6OvercastCoolNormalStrongYes
7SunnyMildHighWeakNo
8SunnyCoolNormalWeakYes
9RainyMildNormalWeakYes
10SunnyMildNormalStrongYes
11OvercastMildHighStrongYes
12OvercastHotNormalWeakYes
13RainyMildHighStrongNo
\n", 823 | "
" 824 | ], 825 | "text/plain": [ 826 | " Outlook Temperature Humidity Wind PlayGolf\n", 827 | "0 Sunny Hot High Weak No\n", 828 | "1 Sunny Hot High Strong No\n", 829 | "2 Overcast Hot High Weak Yes\n", 830 | "3 Rainy Mild High Weak Yes\n", 831 | "4 Rainy Cool Normal Weak Yes\n", 832 | "5 Rainy Cool Normal Strong No\n", 833 | "6 Overcast Cool Normal Strong Yes\n", 834 | "7 Sunny Mild High Weak No\n", 835 | "8 Sunny Cool Normal Weak Yes\n", 836 | "9 Rainy Mild Normal Weak Yes\n", 837 | "10 Sunny Mild Normal Strong Yes\n", 838 | "11 Overcast Mild High Strong Yes\n", 839 | "12 Overcast Hot Normal Weak Yes\n", 840 | "13 Rainy Mild High Strong No" 841 | ] 842 | }, 843 | "execution_count": 2, 844 | "metadata": {}, 845 | "output_type": "execute_result" 846 | } 847 | ], 848 | "source": [ 849 | "# Importing data\n", 850 | "\n", 851 | "df = pd.read_csv(\"dataset/playgolf_data.csv\")\n", 852 | "df" 853 | ] 854 | }, 855 | { 856 | "cell_type": "code", 857 | "execution_count": 3, 858 | "metadata": { 859 | "ExecuteTime": { 860 | "end_time": "2021-08-19T07:33:38.375095Z", 861 | "start_time": "2021-08-19T07:33:38.341896Z" 862 | } 863 | }, 864 | "outputs": [ 865 | { 866 | "data": { 867 | "text/plain": [ 868 | "Outlook object\n", 869 | "Temperature object\n", 870 | "Humidity object\n", 871 | "Wind object\n", 872 | "PlayGolf object\n", 873 | "dtype: object" 874 | ] 875 | }, 876 | "execution_count": 3, 877 | "metadata": {}, 878 | "output_type": "execute_result" 879 | } 880 | ], 881 | "source": [ 882 | "df.dtypes" 883 | ] 884 | }, 885 | { 886 | "cell_type": "code", 887 | "execution_count": 4, 888 | "metadata": { 889 | "ExecuteTime": { 890 | "end_time": "2021-08-19T07:33:40.311117Z", 891 | "start_time": "2021-08-19T07:33:40.115321Z" 892 | } 893 | }, 894 | "outputs": [ 895 | { 896 | "name": "stdout", 897 | "output_type": "stream", 898 | "text": [ 899 | "\n", 900 | "RangeIndex: 14 entries, 0 to 13\n", 901 | "Data columns (total 5 columns):\n", 902 | " # Column Non-Null Count Dtype \n", 903 
| "--- ------ -------------- ----- \n", 904 | " 0 Outlook 14 non-null object\n", 905 | " 1 Temperature 14 non-null object\n", 906 | " 2 Humidity 14 non-null object\n", 907 | " 3 Wind 14 non-null object\n", 908 | " 4 PlayGolf 14 non-null object\n", 909 | "dtypes: object(5)\n", 910 | "memory usage: 688.0+ bytes\n" 911 | ] 912 | } 913 | ], 914 | "source": [ 915 | "df.info()" 916 | ] 917 | }, 918 | { 919 | "cell_type": "code", 920 | "execution_count": 5, 921 | "metadata": { 922 | "ExecuteTime": { 923 | "end_time": "2021-08-19T07:35:27.905990Z", 924 | "start_time": "2021-08-19T07:35:27.805410Z" 925 | }, 926 | "scrolled": true 927 | }, 928 | "outputs": [ 929 | { 930 | "data": { 931 | "text/html": [ 932 | "
\n", 933 | "\n", 946 | "\n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | " \n", 1005 | " \n", 1006 | " \n", 1007 | " \n", 1008 | " \n", 1009 | " \n", 1010 | " \n", 1011 | " \n", 1012 | " \n", 1013 | " \n", 1014 | " \n", 1015 | " \n", 1016 | " \n", 1017 | " \n", 1018 | " \n", 1019 | " \n", 1020 | " \n", 1021 | " \n", 1022 | " \n", 1023 | " \n", 1024 | " \n", 1025 | " \n", 1026 | " \n", 1027 | " \n", 1028 | " \n", 1029 | " \n", 1030 | " \n", 1031 | " \n", 1032 | " \n", 1033 | " \n", 1034 | " \n", 1035 | " \n", 1036 | " \n", 1037 | " \n", 1038 | " \n", 1039 | " \n", 1040 | " \n", 1041 | " \n", 1042 | " \n", 1043 | " \n", 1044 | " \n", 1045 | " \n", 1046 | " \n", 1047 | " \n", 1048 | " \n", 1049 | " \n", 1050 | " \n", 1051 | " \n", 1052 | " \n", 1053 | " \n", 1054 | " \n", 1055 | " \n", 1056 | " \n", 1057 | " \n", 1058 | " \n", 1059 | " \n", 1060 | " \n", 1061 | " \n", 1062 | " \n", 1063 | " \n", 1064 | " \n", 1065 | " \n", 1066 | " \n", 1067 | " \n", 1068 | " \n", 1069 | " \n", 1070 | " \n", 1071 | " \n", 1072 | " \n", 1073 | " \n", 1074 | " \n", 1075 | " \n", 1076 | " \n", 1077 | " \n", 1078 | " \n", 1079 | " \n", 1080 | " \n", 1081 | " \n", 1082 | " \n", 1083 | " \n", 1084 | " \n", 1085 | " \n", 1086 | " \n", 1087 | " \n", 1088 | " \n", 1089 | " \n", 1090 | " \n", 1091 | 
" \n", 1092 | " \n", 1093 | " \n", 1094 | " \n", 1095 | " \n", 1096 | " \n", 1097 | " \n", 1098 | " \n", 1099 | " \n", 1100 | " \n", 1101 | " \n", 1102 | " \n", 1103 | " \n", 1104 | " \n", 1105 | " \n", 1106 | " \n", 1107 | " \n", 1108 | " \n", 1109 | " \n", 1110 | " \n", 1111 | " \n", 1112 | " \n", 1113 | " \n", 1114 | " \n", 1115 | " \n", 1116 | " \n", 1117 | " \n", 1118 | " \n", 1119 | " \n", 1120 | " \n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | " \n", 1130 | " \n", 1131 | " \n", 1132 | " \n", 1133 | " \n", 1134 | " \n", 1135 | " \n", 1136 | " \n", 1137 | " \n", 1138 | " \n", 1139 | " \n", 1140 | " \n", 1141 | " \n", 1142 | " \n", 1143 | " \n", 1144 | " \n", 1145 | " \n", 1146 | " \n", 1147 | " \n", 1148 | " \n", 1149 | " \n", 1150 | " \n", 1151 | " \n", 1152 | " \n", 1153 | " \n", 1154 | " \n", 1155 | " \n", 1156 | " \n", 1157 | " \n", 1158 | " \n", 1159 | " \n", 1160 | " \n", 1161 | "
PlayGolfTemperature_CoolTemperature_HotTemperature_MildHumidity_HighHumidity_NormalOutlook_OvercastOutlook_RainyOutlook_SunnyWind_StrongWind_Weak
0No0101000101
1No0101000110
2Yes0101010001
3Yes0011001001
4Yes1000101001
5No1000101010
6Yes1000110010
7No0011000101
8Yes1000100101
9Yes0010101001
10Yes0010100110
11Yes0011010010
12Yes0100110001
13No0011001010
\n", 1162 | "
" 1163 | ], 1164 | "text/plain": [ 1165 | " PlayGolf Temperature_Cool Temperature_Hot Temperature_Mild \\\n", 1166 | "0 No 0 1 0 \n", 1167 | "1 No 0 1 0 \n", 1168 | "2 Yes 0 1 0 \n", 1169 | "3 Yes 0 0 1 \n", 1170 | "4 Yes 1 0 0 \n", 1171 | "5 No 1 0 0 \n", 1172 | "6 Yes 1 0 0 \n", 1173 | "7 No 0 0 1 \n", 1174 | "8 Yes 1 0 0 \n", 1175 | "9 Yes 0 0 1 \n", 1176 | "10 Yes 0 0 1 \n", 1177 | "11 Yes 0 0 1 \n", 1178 | "12 Yes 0 1 0 \n", 1179 | "13 No 0 0 1 \n", 1180 | "\n", 1181 | " Humidity_High Humidity_Normal Outlook_Overcast Outlook_Rainy \\\n", 1182 | "0 1 0 0 0 \n", 1183 | "1 1 0 0 0 \n", 1184 | "2 1 0 1 0 \n", 1185 | "3 1 0 0 1 \n", 1186 | "4 0 1 0 1 \n", 1187 | "5 0 1 0 1 \n", 1188 | "6 0 1 1 0 \n", 1189 | "7 1 0 0 0 \n", 1190 | "8 0 1 0 0 \n", 1191 | "9 0 1 0 1 \n", 1192 | "10 0 1 0 0 \n", 1193 | "11 1 0 1 0 \n", 1194 | "12 0 1 1 0 \n", 1195 | "13 1 0 0 1 \n", 1196 | "\n", 1197 | " Outlook_Sunny Wind_Strong Wind_Weak \n", 1198 | "0 1 0 1 \n", 1199 | "1 1 1 0 \n", 1200 | "2 0 0 1 \n", 1201 | "3 0 0 1 \n", 1202 | "4 0 0 1 \n", 1203 | "5 0 1 0 \n", 1204 | "6 0 1 0 \n", 1205 | "7 1 0 1 \n", 1206 | "8 1 0 1 \n", 1207 | "9 0 0 1 \n", 1208 | "10 1 1 0 \n", 1209 | "11 0 1 0 \n", 1210 | "12 0 0 1 \n", 1211 | "13 0 1 0 " 1212 | ] 1213 | }, 1214 | "execution_count": 5, 1215 | "metadata": {}, 1216 | "output_type": "execute_result" 1217 | } 1218 | ], 1219 | "source": [ 1220 | "# Converting categorical variables into dummies/indicator variables\n", 1221 | "\n", 1222 | "df_getdummy=pd.get_dummies(data=df, columns=['Temperature', 'Humidity', 'Outlook', 'Wind'])\n", 1223 | "df_getdummy" 1224 | ] 1225 | }, 1226 | { 1227 | "cell_type": "markdown", 1228 | "metadata": {}, 1229 | "source": [ 1230 | "The major disadvantage of Decision Trees is overfitting, especially when a tree is particularly deep. 
Fortunately, more recent tree-based models such as Random Forest and XGBoost are built on top of the decision tree algorithm; they are stronger, more flexible modeling techniques and generally perform better than a single decision tree. Therefore, a thorough understanding of the concepts and algorithms behind Decision Trees gives you a solid foundation for learning data science and machine learning." 1231 | ] 1232 | }, 1233 | { 1234 | "cell_type": "markdown", 1235 | "metadata": {}, 1236 | "source": [ 1237 | "### Summary: Now you should know:\n", 1238 | "\n", 1239 | "1. How to construct a Decision Tree \n", 1240 | "2. How to calculate ‘Entropy’ and ‘Information Gain’ \n", 1241 | "3. How to calculate ‘Gini Index’ and ‘Gini Gain’ \n", 1242 | "4. What is the best split? \n", 1243 | "5. How to plot a Decision Tree Diagram in Python" 1244 | ] 1245 | }, 1246 | { 1247 | "cell_type": "code", 1248 | "execution_count": null, 1249 | "metadata": {}, 1250 | "outputs": [], 1251 | "source": [] 1252 | } 1253 | ], 1254 | "metadata": { 1255 | "hide_input": false, 1256 | "kernelspec": { 1257 | "display_name": "Python 3", 1258 | "language": "python", 1259 | "name": "python3" 1260 | }, 1261 | "language_info": { 1262 | "codemirror_mode": { 1263 | "name": "ipython", 1264 | "version": 3 1265 | }, 1266 | "file_extension": ".py", 1267 | "mimetype": "text/x-python", 1268 | "name": "python", 1269 | "nbconvert_exporter": "python", 1270 | "pygments_lexer": "ipython3", 1271 | "version": "3.8.8" 1272 | }, 1273 | "toc": { 1274 | "base_numbering": 1, 1275 | "nav_menu": {}, 1276 | "number_sections": true, 1277 | "sideBar": true, 1278 | "skip_h1_title": false, 1279 | "title_cell": "Table of Contents", 1280 | "title_sidebar": "Contents", 1281 | "toc_cell": false, 1282 | "toc_position": {}, 1283 | "toc_section_display": true, 1284 | "toc_window_display": false 1285 | }, 1286 | "varInspector": { 1287 | "cols": { 1288 | "lenName": 16, 1289 | "lenType": 16, 1290 | "lenVar": 40
1291 | }, 1292 | "kernels_config": { 1293 | "python": { 1294 | "delete_cmd_postfix": "", 1295 | "delete_cmd_prefix": "del ", 1296 | "library": "var_list.py", 1297 | "varRefreshCmd": "print(var_dic_list())" 1298 | }, 1299 | "r": { 1300 | "delete_cmd_postfix": ") ", 1301 | "delete_cmd_prefix": "rm(", 1302 | "library": "var_list.r", 1303 | "varRefreshCmd": "cat(var_dic_list()) " 1304 | } 1305 | }, 1306 | "types_to_exclude": [ 1307 | "module", 1308 | "function", 1309 | "builtin_function_or_method", 1310 | "instance", 1311 | "_Feature" 1312 | ], 1313 | "window_display": false 1314 | } 1315 | }, 1316 | "nbformat": 4, 1317 | "nbformat_minor": 4 1318 | } 1319 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 milaan9 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |

2 | Last Commit 3 | 4 | 5 | 6 | 7 | Stars Badge 8 | Forks Badge 9 | Size 10 | Pull Requests Badge 11 | Issues Badge 12 | Language 13 | MIT License 14 |

15 | 16 |

17 | binder 18 | colab 19 |

20 | 21 | 22 | # Python Decision Tree and Random Forest 23 | 24 | ## Decision Tree 25 | 26 | A Decision Tree is one of the popular and powerful machine learning algorithms that I have learned. The basics of Decision Tree is explained in detail with clear explanation. 27 | 28 | 29 |

30 | 31 |

32 | 33 | I have given a complete theoretical, step-by-step explanation of how a decision tree is computed with **`ID3 (Iterative Dichotomiser)`** and **`CART (Classification And Regression Trees)`**, along with a successful implementation of both **`ID3`** and **`CART`** in Python on **[playgolf_data](https://github.com/milaan9/Python_Decision_Tree_and_Random_Forest/blob/main/dataset/playgolf_data.csv)** and the **[Iris dataset](https://archive.ics.uci.edu/ml/datasets/iris)**. 34 | 35 | ### Play Golf dataset: 36 | | ![ID3](https://github.com/milaan9/Python_Decision_Tree_and_Random_Forest/blob/main/img/ID3pg.png) | ![CART](https://github.com/milaan9/Python_Decision_Tree_and_Random_Forest/blob/main/img/CARTpg.png) | 37 | |:---:|:---:| 38 | | ID3 dataset analysis | CART dataset analysis | 39 | 40 |
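The two splitting criteria behind ID3 and CART, entropy and the Gini index, can be reproduced in a few lines of plain Python. The sketch below is not part of this repository; it simply evaluates both impurity measures on the `PlayGolf` target column (9 "Yes", 5 "No") of `playgolf_data.csv`:

```python
# Minimal sketch (not part of this repo): the two impurity measures the
# tutorials compare: entropy (used by ID3) and Gini index (used by CART).
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

# PlayGolf target column from dataset/playgolf_data.csv: 9 "Yes", 5 "No"
target = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
          "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

print(round(entropy(target), 3))  # 0.94
print(round(gini(target), 3))     # 0.459
```

Both measures agree that the root node is impure; the notebooks then compare these values across candidate splits (weighted by subset size) to pick the best feature at each node.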

41 | 42 |

43 | 44 | ### Iris dataset 45 | 46 | 1. Method 1: Print Text Representation 47 | 48 |

49 | 50 |

51 | 52 | 2. Method 2: Plot Tree with plot_tree 53 | 54 |

55 | 56 |

57 | 58 | 3. Method 3: Plot Decision Tree with graphviz 59 | 60 |

61 | 62 |

63 | 64 | 4. Method 4: Plot Decision Tree with dtreeviz Package 65 | 66 |

67 | 68 |

69 | 70 | 5. Method 5: Visualizing the Decision Tree in Regression Task 71 | 72 |

73 | 74 |

75 | 76 |

77 | 78 |

79 | 80 | 81 | --- 82 | 83 | ## Table of contents 📋 84 | 85 | | **No.** | **Name** | 86 | | ------- | -------- | 87 | | 01 | **[Decision_Tree_PlayGolf_ID3](https://github.com/milaan9/Python_Decision_Tree_and_Random_Forest/blob/main/001_Decision_Tree_PlayGolf_ID3.ipynb)** | 88 | | 02 | **[Decision_Tree_PlayGolf_CART](https://github.com/milaan9/Python_Decision_Tree_and_Random_Forest/blob/main/001_Decision_Tree_PlayGolf_CART.ipynb)** | 89 | | 03 | **[Decision_Tree_Visualisation_Iris_Dataset](https://github.com/milaan9/Python_Decision_Tree_and_Random_Forest/blob/main/003_Decision_Tree_Visualisation_Iris_Dataset.ipynb)** | 90 | | 04 | **[Decision_Tree_Classifier_Iris_Dataset](https://github.com/milaan9/Python_Decision_Tree_and_Random_Forest/blob/main/004_Decision_Tree_Classifier_Iris_Dataset.ipynb)** | 91 | 92 | 93 | These are online **read-only** versions. However you can **`Run ▶`** all the codes **online** by clicking here ➞ binder 94 | 95 | --- 96 | 97 | ## Frequently asked questions ❔ 98 | 99 | ### How can I thank you for writing and sharing this tutorial? 🌷 100 | 101 | You can Star Badge and Fork Badge Starring and Forking is free for you, but it tells me and other people that it was helpful and you like this tutorial. 102 | 103 | Go [**`here`**](https://github.com/milaan9/Python_Decision_Tree_and_Random_Forest) if you aren't here already and click ➞ **`✰ Star`** and **`ⵖ Fork`** button in the top right corner. You will be asked to create a GitHub account if you don't already have one. 104 | 105 | --- 106 | 107 | ### How can I read this tutorial without an Internet connection? GIF 108 | 109 | 1. Go [**`here`**](https://github.com/milaan9/Python_Decision_Tree_and_Random_Forest) and click the big green ➞ **`Code`** button in the top right of the page, then click ➞ [**`Download ZIP`**](https://github.com/milaan9/Python_Decision_Tree_and_Random_Forest/archive/refs/heads/main.zip). 110 | 111 | ![Download ZIP](img/dnld_rep.png) 112 | 113 | 2. 
Extract the ZIP and open it. Unfortunately I can't give more specific instructions, because exactly how this is done depends on which operating system you run. 114 | 115 | 3. Launch Jupyter Notebook from the folder which contains the notebooks, open each one of them, and select 116 | 117 | **`Kernel > Restart & Clear Output`** 118 | 119 | This will clear all the outputs, and now you can step through each statement and learn interactively. 120 | 121 | If you have git and you know how to use it, you can also clone the repository instead of downloading a ZIP and extracting it. The advantage of doing it this way is that you don't need to download the whole tutorial again to get the latest version; all you need to do is pull with git and run Jupyter Notebook again. 122 | 123 | --- 124 | 125 | ## Authors ✍️ 126 | 127 | I'm Dr. Milaan Parmar and I have written this tutorial. If you think you can add/correct/edit and enhance this tutorial, you are most welcome 🙏 128 | 129 | See [GitHub's contributors page](https://github.com/milaan9/Python_Decision_Tree_and_Random_Forest/graphs/contributors) for details. 130 | 131 | If you have trouble with this tutorial, please tell me about it by [creating an issue on GitHub](https://github.com/milaan9/Python_Decision_Tree_and_Random_Forest/issues/new), and I'll make this tutorial better. This is probably the best choice if you had trouble following the tutorial and something in it should be explained better. You will be asked to create a GitHub account if you don't already have one. 132 | 133 | If you like this tutorial, please [give it a ⭐ star](https://github.com/milaan9/Python_Decision_Tree_and_Random_Forest). 134 | 135 | --- 136 | 137 | ## Licence 📜 138 | 139 | You may use this tutorial freely at your own risk. See [LICENSE](./LICENSE).
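If you just want to see the core of the tutorial run end to end, here is a minimal quick-start sketch. It is not part of the repository: it rebuilds the 14-row Play Golf table inline (the same rows as `dataset/playgolf_data.csv`, so no files are needed) and assumes `pandas` and `scikit-learn` are installed:

```python
# Minimal quick-start sketch (not part of this repo): fit a CART-style
# decision tree on the Play Golf data, mirroring notebook 002.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Same 14 rows as dataset/playgolf_data.csv, rebuilt inline.
df = pd.DataFrame({
    "Outlook":     ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Rainy",
                    "Overcast", "Sunny", "Sunny", "Rainy", "Sunny",
                    "Overcast", "Overcast", "Rainy"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                    "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity":    ["High", "High", "High", "High", "Normal", "Normal",
                    "Normal", "High", "Normal", "Normal", "Normal", "High",
                    "Normal", "High"],
    "Wind":        ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong",
                    "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                    "Weak", "Strong"],
    "PlayGolf":    ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No",
                    "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

# One-hot encode the categorical features, as notebook 002 does with
# pd.get_dummies, then fit a tree using the Gini criterion (CART).
X = pd.get_dummies(df.drop(columns="PlayGolf"))
y = df["PlayGolf"]
clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

print(clf.score(X, y))  # 1.0: the fully grown tree fits all 14 rows
```

Note that perfect training accuracy on 14 rows is exactly the overfitting behaviour the notebooks warn about; the point of the sketch is only to show the load, encode, and fit pipeline in one place.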
140 | -------------------------------------------------------------------------------- /dataset/playgolf_data.csv: -------------------------------------------------------------------------------- 1 | Outlook,Temperature,Humidity,Wind,PlayGolf 2 | Sunny,Hot,High,Weak,No 3 | Sunny,Hot,High,Strong,No 4 | Overcast,Hot,High,Weak,Yes 5 | Rainy,Mild,High,Weak,Yes 6 | Rainy,Cool,Normal,Weak,Yes 7 | Rainy,Cool,Normal,Strong,No 8 | Overcast,Cool,Normal,Strong,Yes 9 | Sunny,Mild,High,Weak,No 10 | Sunny,Cool,Normal,Weak,Yes 11 | Rainy,Mild,Normal,Weak,Yes 12 | Sunny,Mild,Normal,Strong,Yes 13 | Overcast,Mild,High,Strong,Yes 14 | Overcast,Hot,Normal,Weak,Yes 15 | Rainy,Mild,High,Strong,No 16 | -------------------------------------------------------------------------------- /dataset/playgolf_data.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/milaan9/Python_Decision_Tree_and_Random_Forest/67db760669d9d5e1813ee690fbba989c0a55c53e/dataset/playgolf_data.docx -------------------------------------------------------------------------------- /dataset/playgolf_data2.csv: -------------------------------------------------------------------------------- 1 | Day,Outlook,Temperature,Humidity,Wind,PlayGolf 2 | D1,Sunny,Mild,80,No,Yes 3 | D2,Sunny,Hot,75,Yes,No 4 | D3,Overcast,Hot,77,No,Yes 5 | D4,Rainy,Cool,70,No,Yes 6 | D5,Overcast,Cool,72,Yes,Yes 7 | D6,Sunny,Mild,77,No,No 8 | D7,Sunny,Cool,70,No,Yes 9 | D8,Rainy,Mild,69,No,Yes 10 | D9,Sunny,Mild,65,Yes,Yes 11 | D10,Overcast,Mild,77,Yes,Yes 12 | D11,Overcast,Hot,74,No,Yes 13 | D12,Rain,Mild,77,Yes,No 14 | D13,Rain,Cool,73,Yes,No 15 | D14,Rain,Mild,78,No,Yes 16 | -------------------------------------------------------------------------------- /img/CARTpg.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/milaan9/Python_Decision_Tree_and_Random_Forest/67db760669d9d5e1813ee690fbba989c0a55c53e/img/CARTpg.png -------------------------------------------------------------------------------- /img/ID3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/milaan9/Python_Decision_Tree_and_Random_Forest/67db760669d9d5e1813ee690fbba989c0a55c53e/img/ID3.png -------------------------------------------------------------------------------- /img/ID3pg.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/milaan9/Python_Decision_Tree_and_Random_Forest/67db760669d9d5e1813ee690fbba989c0a55c53e/img/ID3pg.png -------------------------------------------------------------------------------- /img/decisiontree.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/milaan9/Python_Decision_Tree_and_Random_Forest/67db760669d9d5e1813ee690fbba989c0a55c53e/img/decisiontree.png -------------------------------------------------------------------------------- /img/dnld_rep.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/milaan9/Python_Decision_Tree_and_Random_Forest/67db760669d9d5e1813ee690fbba989c0a55c53e/img/dnld_rep.png -------------------------------------------------------------------------------- /img/dt.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/milaan9/Python_Decision_Tree_and_Random_Forest/67db760669d9d5e1813ee690fbba989c0a55c53e/img/dt.png -------------------------------------------------------------------------------- /img/dt0.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/milaan9/Python_Decision_Tree_and_Random_Forest/67db760669d9d5e1813ee690fbba989c0a55c53e/img/dt0.png -------------------------------------------------------------------------------- /img/dt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/milaan9/Python_Decision_Tree_and_Random_Forest/67db760669d9d5e1813ee690fbba989c0a55c53e/img/dt1.png -------------------------------------------------------------------------------- /img/dt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/milaan9/Python_Decision_Tree_and_Random_Forest/67db760669d9d5e1813ee690fbba989c0a55c53e/img/dt2.png -------------------------------------------------------------------------------- /img/dt3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/milaan9/Python_Decision_Tree_and_Random_Forest/67db760669d9d5e1813ee690fbba989c0a55c53e/img/dt3.png -------------------------------------------------------------------------------- /img/iris.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/milaan9/Python_Decision_Tree_and_Random_Forest/67db760669d9d5e1813ee690fbba989c0a55c53e/img/iris.png -------------------------------------------------------------------------------- /img/irisFlow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/milaan9/Python_Decision_Tree_and_Random_Forest/67db760669d9d5e1813ee690fbba989c0a55c53e/img/irisFlow.png -------------------------------------------------------------------------------- /img/playgolf.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/milaan9/Python_Decision_Tree_and_Random_Forest/67db760669d9d5e1813ee690fbba989c0a55c53e/img/playgolf.png -------------------------------------------------------------------------------- /output_DecisionTree/iris_DecisionTree (1).log: -------------------------------------------------------------------------------- 1 | |--- feature_2 <= 2.45 2 | | |--- class: 0 3 | |--- feature_2 > 2.45 4 | | |--- feature_3 <= 1.75 5 | | | |--- feature_2 <= 4.95 6 | | | | |--- feature_3 <= 1.65 7 | | | | | |--- class: 1 8 | | | | |--- feature_3 > 1.65 9 | | | | | |--- class: 2 10 | | | |--- feature_2 > 4.95 11 | | | | |--- feature_3 <= 1.55 12 | | | | | |--- class: 2 13 | | | | |--- feature_3 > 1.55 14 | | | | | |--- feature_2 <= 5.45 15 | | | | | | |--- class: 1 16 | | | | | |--- feature_2 > 5.45 17 | | | | | | |--- class: 2 18 | | |--- feature_3 > 1.75 19 | | | |--- feature_2 <= 4.85 20 | | | | |--- feature_1 <= 3.10 21 | | | | | |--- class: 2 22 | | | | |--- feature_1 > 3.10 23 | | | | | |--- class: 1 24 | | | |--- feature_2 > 4.85 25 | | | | |--- class: 2 26 | -------------------------------------------------------------------------------- /output_DecisionTree/iris_DecisionTree_dtreeviz: -------------------------------------------------------------------------------- 1 | 2 | digraph G { 3 | splines=line; 4 | nodesep=0.1; 5 | ranksep=.2; 6 | rankdir=TD; 7 | margin=0.0; 8 | 9 | node [margin="0.03" penwidth="0.5" width=.1, height=.1]; 10 | edge [arrowsize=.4 penwidth="0.3"] 11 | 12 | node4 [margin="0" shape=none label=< 13 | 14 | 15 | 16 | 17 |
>] 18 | node9 [margin="0" shape=none label=< 19 | 20 | 21 | 22 | 23 |
>] 24 | node7 [margin="0" shape=none label=< 25 | 26 | 27 | 28 | 29 |
>] 30 | node3 [margin="0" shape=none label=< 31 | 32 | 33 | 34 | 35 |
>] 36 | node13 [margin="0" shape=none label=< 37 | 38 | 39 | 40 | 41 |
>] 42 | node12 [margin="0" shape=none label=< 43 | 44 | 45 | 46 | 47 |
>] 48 | node2 [margin="0" shape=none label=< 49 | 50 | 51 | 52 | 53 |
>] 54 | node0 [margin="0" shape=none label=< 55 | 56 | 57 | 58 | 59 |
>] 60 | node4 -> leaf5 [penwidth=0.3 color="#444443" label=<>] 61 | node4 -> leaf6 [penwidth=0.3 color="#444443" label=<>] 62 | 63 | { 64 | rank=same; 65 | leaf5 -> leaf6 [style=invis] 66 | } 67 | 68 | node9 -> leaf10 [penwidth=0.3 color="#444443" label=<>] 69 | node9 -> leaf11 [penwidth=0.3 color="#444443" label=<>] 70 | 71 | { 72 | rank=same; 73 | leaf10 -> leaf11 [style=invis] 74 | } 75 | 76 | node7 -> leaf8 [penwidth=0.3 color="#444443" label=<>] 77 | node7 -> node9 [penwidth=0.3 color="#444443" label=<>] 78 | 79 | { 80 | rank=same; 81 | leaf8 -> node9 [style=invis] 82 | } 83 | 84 | node3 -> node4 [penwidth=0.3 color="#444443" label=<>] 85 | node3 -> node7 [penwidth=0.3 color="#444443" label=<>] 86 | 87 | { 88 | rank=same; 89 | node4 -> node7 [style=invis] 90 | } 91 | 92 | node13 -> leaf14 [penwidth=0.3 color="#444443" label=<>] 93 | node13 -> leaf15 [penwidth=0.3 color="#444443" label=<>] 94 | 95 | { 96 | rank=same; 97 | leaf14 -> leaf15 [style=invis] 98 | } 99 | 100 | node12 -> node13 [penwidth=0.3 color="#444443" label=<>] 101 | node12 -> leaf16 [penwidth=0.3 color="#444443" label=<>] 102 | 103 | { 104 | rank=same; 105 | node13 -> leaf16 [style=invis] 106 | } 107 | 108 | node2 -> node3 [penwidth=0.3 color="#444443" label=<>] 109 | node2 -> node12 [penwidth=0.3 color="#444443" label=<>] 110 | 111 | { 112 | rank=same; 113 | node3 -> node12 [style=invis] 114 | } 115 | 116 | node0 -> leaf1 [penwidth=0.3 color="#444443" label=<≤>] 117 | node0 -> node2 [penwidth=0.3 color="#444443" label=<>>] 118 | 119 | { 120 | rank=same; 121 | leaf1 -> node2 [style=invis] 122 | } 123 | 124 | leaf1 [margin="0" shape=box penwidth="0" color="#444443" label=< 125 | 126 | 127 | 128 | 129 |
>] 130 | leaf5 [margin="0" shape=box penwidth="0" color="#444443" label=< 131 | 132 | 133 | 134 | 135 |
>] 136 | leaf6 [margin="0" shape=box penwidth="0" color="#444443" label=< 137 | 138 | 139 | 140 | 141 |
>] 142 | leaf8 [margin="0" shape=box penwidth="0" color="#444443" label=< 143 | 144 | 145 | 146 | 147 |
>] 148 | leaf10 [margin="0" shape=box penwidth="0" color="#444443" label=< 149 | 150 | 151 | 152 | 153 |
>] 154 | leaf11 [margin="0" shape=box penwidth="0" color="#444443" label=< 155 | 156 | 157 | 158 | 159 |
>] 160 | leaf14 [margin="0" shape=box penwidth="0" color="#444443" label=< 161 | 162 | 163 | 164 | 165 |
>] 166 | leaf15 [margin="0" shape=box penwidth="0" color="#444443" label=< 167 | 168 | 169 | 170 | 171 |
>] 172 | leaf16 [margin="0" shape=box penwidth="0" color="#444443" label=< 173 | 174 | 175 | 176 | 177 |
>] 178 | 179 | 180 | subgraph cluster_legend { 181 | style=invis; 182 | legend [penwidth="0" margin="0" shape=box margin="0.03" width=.1, height=.1 label=< 183 | 184 | 185 | 186 | 187 | 188 |
189 | 190 | >] 191 | } 192 | 193 | 194 | } 195 | 196 | -------------------------------------------------------------------------------- /output_DecisionTree/iris_DecisionTree_dtreeviz.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/milaan9/Python_Decision_Tree_and_Random_Forest/67db760669d9d5e1813ee690fbba989c0a55c53e/output_DecisionTree/iris_DecisionTree_dtreeviz.png -------------------------------------------------------------------------------- /output_DecisionTree/iris_DecisionTree_graphivz1: -------------------------------------------------------------------------------- 1 | digraph Tree { 2 | node [shape=box, style="filled", color="black"] ; 3 | 0 [label="petal width (cm) <= 0.8\ngini = 0.667\nsamples = 150\nvalue = [50, 50, 50]\nclass = setosa", fillcolor="#ffffff"] ; 4 | 1 [label="gini = 0.0\nsamples = 50\nvalue = [50, 0, 0]\nclass = setosa", fillcolor="#e58139"] ; 5 | 0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ; 6 | 2 [label="petal width (cm) <= 1.75\ngini = 0.5\nsamples = 100\nvalue = [0, 50, 50]\nclass = versicolor", fillcolor="#ffffff"] ; 7 | 0 -> 2 [labeldistance=2.5, labelangle=-45, headlabel="False"] ; 8 | 3 [label="petal length (cm) <= 4.95\ngini = 0.168\nsamples = 54\nvalue = [0, 49, 5]\nclass = versicolor", fillcolor="#4de88e"] ; 9 | 2 -> 3 ; 10 | 4 [label="petal width (cm) <= 1.65\ngini = 0.041\nsamples = 48\nvalue = [0, 47, 1]\nclass = versicolor", fillcolor="#3de684"] ; 11 | 3 -> 4 ; 12 | 5 [label="gini = 0.0\nsamples = 47\nvalue = [0, 47, 0]\nclass = versicolor", fillcolor="#39e581"] ; 13 | 4 -> 5 ; 14 | 6 [label="gini = 0.0\nsamples = 1\nvalue = [0, 0, 1]\nclass = virginica", fillcolor="#8139e5"] ; 15 | 4 -> 6 ; 16 | 7 [label="petal width (cm) <= 1.55\ngini = 0.444\nsamples = 6\nvalue = [0, 2, 4]\nclass = virginica", fillcolor="#c09cf2"] ; 17 | 3 -> 7 ; 18 | 8 [label="gini = 0.0\nsamples = 3\nvalue = [0, 0, 3]\nclass = virginica", fillcolor="#8139e5"] ; 
19 | 7 -> 8 ; 20 | 9 [label="sepal length (cm) <= 6.95\ngini = 0.444\nsamples = 3\nvalue = [0, 2, 1]\nclass = versicolor", fillcolor="#9cf2c0"] ; 21 | 7 -> 9 ; 22 | 10 [label="gini = 0.0\nsamples = 2\nvalue = [0, 2, 0]\nclass = versicolor", fillcolor="#39e581"] ; 23 | 9 -> 10 ; 24 | 11 [label="gini = 0.0\nsamples = 1\nvalue = [0, 0, 1]\nclass = virginica", fillcolor="#8139e5"] ; 25 | 9 -> 11 ; 26 | 12 [label="petal length (cm) <= 4.85\ngini = 0.043\nsamples = 46\nvalue = [0, 1, 45]\nclass = virginica", fillcolor="#843de6"] ; 27 | 2 -> 12 ; 28 | 13 [label="sepal width (cm) <= 3.1\ngini = 0.444\nsamples = 3\nvalue = [0, 1, 2]\nclass = virginica", fillcolor="#c09cf2"] ; 29 | 12 -> 13 ; 30 | 14 [label="gini = 0.0\nsamples = 2\nvalue = [0, 0, 2]\nclass = virginica", fillcolor="#8139e5"] ; 31 | 13 -> 14 ; 32 | 15 [label="gini = 0.0\nsamples = 1\nvalue = [0, 1, 0]\nclass = versicolor", fillcolor="#39e581"] ; 33 | 13 -> 15 ; 34 | 16 [label="gini = 0.0\nsamples = 43\nvalue = [0, 0, 43]\nclass = virginica", fillcolor="#8139e5"] ; 35 | 12 -> 16 ; 36 | } 37 | -------------------------------------------------------------------------------- /output_DecisionTree/iris_DecisionTree_graphivz1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/milaan9/Python_Decision_Tree_and_Random_Forest/67db760669d9d5e1813ee690fbba989c0a55c53e/output_DecisionTree/iris_DecisionTree_graphivz1.png -------------------------------------------------------------------------------- /output_DecisionTree/iris_DecisionTree_graphivz2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/milaan9/Python_Decision_Tree_and_Random_Forest/67db760669d9d5e1813ee690fbba989c0a55c53e/output_DecisionTree/iris_DecisionTree_graphivz2.png -------------------------------------------------------------------------------- /output_DecisionTree/iris_DecisionTree_plotTree.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/milaan9/Python_Decision_Tree_and_Random_Forest/67db760669d9d5e1813ee690fbba989c0a55c53e/output_DecisionTree/iris_DecisionTree_plotTree.png -------------------------------------------------------------------------------- /output_DecisionTree/iris_DecisionTree_regression1.txt: -------------------------------------------------------------------------------- 1 | |--- feature_2 <= 2.45 2 | | |--- value: [0.00] 3 | |--- feature_2 > 2.45 4 | | |--- feature_3 <= 1.75 5 | | | |--- feature_2 <= 4.95 6 | | | | |--- value: [1.02] 7 | | | |--- feature_2 > 4.95 8 | | | | |--- value: [1.67] 9 | | |--- feature_3 > 1.75 10 | | | |--- feature_2 <= 4.85 11 | | | | |--- value: [1.67] 12 | | | |--- feature_2 > 4.85 13 | | | | |--- value: [2.00] 14 | -------------------------------------------------------------------------------- /output_DecisionTree/iris_DecisionTree_regression2: -------------------------------------------------------------------------------- 1 | digraph Tree { 2 | node [shape=box, style="filled", color="black"] ; 3 | 0 [label="petal length (cm) <= 2.45\nmse = 0.667\nsamples = 150\nvalue = 1.0", fillcolor="#f2c09c"] ; 4 | 1 [label="mse = 0.0\nsamples = 50\nvalue = 0.0", fillcolor="#ffffff"] ; 5 | 0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ; 6 | 2 [label="petal width (cm) <= 1.75\nmse = 0.25\nsamples = 100\nvalue = 1.5", fillcolor="#eca06a"] ; 7 | 0 -> 2 [labeldistance=2.5, labelangle=-45, headlabel="False"] ; 8 | 3 [label="petal length (cm) <= 4.95\nmse = 0.084\nsamples = 54\nvalue = 1.093", fillcolor="#f1ba93"] ; 9 | 2 -> 3 ; 10 | 4 [label="mse = 0.02\nsamples = 48\nvalue = 1.021", fillcolor="#f2bf9a"] ; 11 | 3 -> 4 ; 12 | 5 [label="mse = 0.222\nsamples = 6\nvalue = 1.667", fillcolor="#e9965a"] ; 13 | 3 -> 5 ; 14 | 6 [label="petal length (cm) <= 4.85\nmse = 0.021\nsamples = 46\nvalue = 1.978", fillcolor="#e5823b"] ; 15 | 2 -> 6 ; 16 | 7 
[label="mse = 0.222\nsamples = 3\nvalue = 1.667", fillcolor="#e9965a"] ; 17 | 6 -> 7 ; 18 | 8 [label="mse = 0.0\nsamples = 43\nvalue = 2.0", fillcolor="#e58139"] ; 19 | 6 -> 8 ; 20 | } 21 | -------------------------------------------------------------------------------- /output_DecisionTree/iris_DecisionTree_regression2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/milaan9/Python_Decision_Tree_and_Random_Forest/67db760669d9d5e1813ee690fbba989c0a55c53e/output_DecisionTree/iris_DecisionTree_regression2.png -------------------------------------------------------------------------------- /output_DecisionTree/iris_DecisionTree_regression3: -------------------------------------------------------------------------------- 1 | 2 | digraph G { 3 | splines=line; 4 | nodesep=0.1; 5 | ranksep=.2; 6 | rankdir=TD; 7 | margin=0.0; 8 | 9 | node [margin="0.03" penwidth="0.5" width=.1, height=.1]; 10 | edge [arrowsize=.4 penwidth="0.3"] 11 | 12 | node3 [margin="0" shape=none label=< 13 | 14 | 15 | 16 | 17 |
>] 18 | node6 [margin="0" shape=none label=< 19 | 20 | 21 | 22 | 23 |
>] 24 | node2 [margin="0" shape=none label=< 25 | 26 | 27 | 28 | 29 |
>] 30 | node0 [margin="0" shape=none label=< 31 | 32 | 33 | 34 | 35 |
>] 36 | node3 -> leaf4 [penwidth=0.3 color="#444443" label=<>] 37 | node3 -> leaf5 [penwidth=0.3 color="#444443" label=<>] 38 | 39 | { 40 | rank=same; 41 | leaf4 -> leaf5 [style=invis] 42 | } 43 | 44 | node6 -> leaf7 [penwidth=0.3 color="#444443" label=<>] 45 | node6 -> leaf8 [penwidth=0.3 color="#444443" label=<>] 46 | 47 | { 48 | rank=same; 49 | leaf7 -> leaf8 [style=invis] 50 | } 51 | 52 | node2 -> node3 [penwidth=0.3 color="#444443" label=<>] 53 | node2 -> node6 [penwidth=0.3 color="#444443" label=<>] 54 | 55 | { 56 | rank=same; 57 | node3 -> node6 [style=invis] 58 | } 59 | 60 | node0 -> leaf1 [penwidth=0.3 color="#444443" label=<≤>] 61 | node0 -> node2 [penwidth=0.3 color="#444443" label=<>>] 62 | 63 | { 64 | rank=same; 65 | leaf1 -> node2 [style=invis] 66 | } 67 | 68 | leaf1 [margin="0" shape=box penwidth="0" color="#444443" label=< 69 | 70 | 71 | 72 | 73 |
>] 74 | leaf4 [margin="0" shape=box penwidth="0" color="#444443" label=< 75 | 76 | 77 | 78 | 79 |
>] 80 | leaf5 [margin="0" shape=box penwidth="0" color="#444443" label=< 81 | 82 | 83 | 84 | 85 |
>] 86 | leaf7 [margin="0" shape=box penwidth="0" color="#444443" label=< 87 | 88 | 89 | 90 | 91 |
>] 92 | leaf8 [margin="0" shape=box penwidth="0" color="#444443" label=< 93 | 94 | 95 | 96 | 97 |
>] 98 | 99 | 100 | 101 | } 102 | 103 | -------------------------------------------------------------------------------- /output_DecisionTree/iris_DecisionTree_regression3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/milaan9/Python_Decision_Tree_and_Random_Forest/67db760669d9d5e1813ee690fbba989c0a55c53e/output_DecisionTree/iris_DecisionTree_regression3.png -------------------------------------------------------------------------------- /output_DecisionTree/iris_DecisionTree_text.txt: -------------------------------------------------------------------------------- 1 | |--- feature_3 <= 0.80 2 | | |--- class: 0 3 | |--- feature_3 > 0.80 4 | | |--- feature_3 <= 1.75 5 | | | |--- feature_2 <= 4.95 6 | | | | |--- feature_3 <= 1.65 7 | | | | | |--- class: 1 8 | | | | |--- feature_3 > 1.65 9 | | | | | |--- class: 2 10 | | | |--- feature_2 > 4.95 11 | | | | |--- feature_3 <= 1.55 12 | | | | | |--- class: 2 13 | | | | |--- feature_3 > 1.55 14 | | | | | |--- feature_0 <= 6.95 15 | | | | | | |--- class: 1 16 | | | | | |--- feature_0 > 6.95 17 | | | | | | |--- class: 2 18 | | |--- feature_3 > 1.75 19 | | | |--- feature_2 <= 4.85 20 | | | | |--- feature_1 <= 3.10 21 | | | | | |--- class: 2 22 | | | | |--- feature_1 > 3.10 23 | | | | | |--- class: 1 24 | | | |--- feature_2 > 4.85 25 | | | | |--- class: 2 26 | -------------------------------------------------------------------------------- /output_DecisionTree/iris_DecisionTree_textRep.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/milaan9/Python_Decision_Tree_and_Random_Forest/67db760669d9d5e1813ee690fbba989c0a55c53e/output_DecisionTree/iris_DecisionTree_textRep.png --------------------------------------------------------------------------------
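
The exports collected above — `iris_DecisionTree_text.txt`, the `iris_DecisionTree_graphivz*` DOT sources, and `iris_DecisionTree_regression1.txt` — match the formats produced by scikit-learn's `export_text` and `export_graphviz` utilities on the Iris dataset. A minimal sketch that regenerates comparable output is below; the exact splits, impurity labels (`gini`/`mse`), and colors depend on the scikit-learn version and the `random_state` used, so treat it as an illustration rather than the script that produced these files.

```python
from sklearn.datasets import load_iris
from sklearn.tree import (DecisionTreeClassifier, DecisionTreeRegressor,
                          export_graphviz, export_text)

iris = load_iris()

# Classification tree: text rules comparable to iris_DecisionTree_text.txt
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
print(export_text(clf))

# Graphviz DOT source comparable to iris_DecisionTree_graphivz1
# (out_file=None returns the DOT string instead of writing a file)
dot = export_graphviz(
    clf,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    filled=True,
)
print(dot[:60])

# Regression tree on the class labels, comparable to
# iris_DecisionTree_regression1.txt (leaves hold mean target values)
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(iris.data, iris.target)
print(export_text(reg))
```

The returned DOT string can be rendered to PNG/SVG (as in the `*_graphivz*.png` files) with the `graphviz` Python package or the `dot` command-line tool; the `*_dtreeviz*` files come from the separate `dtreeviz` library, which emits the HTML-labeled DOT seen in `iris_DecisionTree_regression3`.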