\n", 171 | " | sl_no | \n", 172 | "gender | \n", 173 | "ssc_p | \n", 174 | "ssc_b | \n", 175 | "hsc_p | \n", 176 | "hsc_b | \n", 177 | "hsc_s | \n", 178 | "degree_p | \n", 179 | "degree_t | \n", 180 | "workex | \n", 181 | "etest_p | \n", 182 | "specialisation | \n", 183 | "mba_p | \n", 184 | "status | \n", 185 | "salary | \n", 186 | "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", 191 | "1 | \n", 192 | "M | \n", 193 | "67.00 | \n", 194 | "Others | \n", 195 | "91.00 | \n", 196 | "Others | \n", 197 | "Commerce | \n", 198 | "58.00 | \n", 199 | "Sci&Tech | \n", 200 | "No | \n", 201 | "55.0 | \n", 202 | "Mkt&HR | \n", 203 | "58.80 | \n", 204 | "Placed | \n", 205 | "270000.0 | \n", 206 | "
1 | \n", 209 | "2 | \n", 210 | "M | \n", 211 | "79.33 | \n", 212 | "Central | \n", 213 | "78.33 | \n", 214 | "Others | \n", 215 | "Science | \n", 216 | "77.48 | \n", 217 | "Sci&Tech | \n", 218 | "Yes | \n", 219 | "86.5 | \n", 220 | "Mkt&Fin | \n", 221 | "66.28 | \n", 222 | "Placed | \n", 223 | "200000.0 | \n", 224 | "
2 | \n", 227 | "3 | \n", 228 | "M | \n", 229 | "65.00 | \n", 230 | "Central | \n", 231 | "68.00 | \n", 232 | "Central | \n", 233 | "Arts | \n", 234 | "64.00 | \n", 235 | "Comm&Mgmt | \n", 236 | "No | \n", 237 | "75.0 | \n", 238 | "Mkt&Fin | \n", 239 | "57.80 | \n", 240 | "Placed | \n", 241 | "250000.0 | \n", 242 | "
3 | \n", 245 | "4 | \n", 246 | "M | \n", 247 | "56.00 | \n", 248 | "Central | \n", 249 | "52.00 | \n", 250 | "Central | \n", 251 | "Science | \n", 252 | "52.00 | \n", 253 | "Sci&Tech | \n", 254 | "No | \n", 255 | "66.0 | \n", 256 | "Mkt&HR | \n", 257 | "59.43 | \n", 258 | "Not Placed | \n", 259 | "NaN | \n", 260 | "
4 | \n", 263 | "5 | \n", 264 | "M | \n", 265 | "85.80 | \n", 266 | "Central | \n", 267 | "73.60 | \n", 268 | "Central | \n", 269 | "Commerce | \n", 270 | "73.30 | \n", 271 | "Comm&Mgmt | \n", 272 | "No | \n", 273 | "96.8 | \n", 274 | "Mkt&Fin | \n", 275 | "55.50 | \n", 276 | "Placed | \n", 277 | "425000.0 | \n", 278 | "
AdaBoostClassifier(n_estimators=100, random_state=0)
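The repr above shows an `AdaBoostClassifier(n_estimators=100, random_state=0)` fitted in the notebook. A hedged sketch of how such a model might be trained on the placement data follows; the feature/target choice, the label encoding, and the train/test split are assumptions, not necessarily the notebook's exact preprocessing:

```python
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# "placement.csv" is an assumed file name for the dataset previewed above.
df = pd.read_csv("placement.csv")

# Assumed preprocessing: drop the serial number and salary (salary is only
# defined for placed students) and integer-encode the categorical columns.
df_model = df.drop(columns=["sl_no", "salary"]).copy()
for col in df_model.select_dtypes(include="object").columns:
    df_model[col] = LabelEncoder().fit_transform(df_model[col])

X = df_model.drop(columns=["status"])
y = df_model["status"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the boosted ensemble with the hyperparameters shown in the output above.
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```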
Saeed Aghabozorgi, PhD, is a Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients' ability to turn data into actionable knowledge. He is a researcher in the data mining field and an expert in developing advanced analytic methods such as machine learning and statistical modelling on large datasets.
\n", 469 | "\n", 470 | "Copyright © 2018 Cognitive Class. This notebook and its source code are released under the terms of the MIT License.
" 473 | ] 474 | } 475 | ], 476 | "metadata": { 477 | "kernelspec": { 478 | "display_name": "Python 3", 479 | "language": "python", 480 | "name": "python3" 481 | }, 482 | "language_info": { 483 | "codemirror_mode": { 484 | "name": "ipython", 485 | "version": 3 486 | }, 487 | "file_extension": ".py", 488 | "mimetype": "text/x-python", 489 | "name": "python", 490 | "nbconvert_exporter": "python", 491 | "pygments_lexer": "ipython3", 492 | "version": "3.6.6" 493 | }, 494 | "widgets": { 495 | "state": {}, 496 | "version": "1.1.2" 497 | } 498 | }, 499 | "nbformat": 4, 500 | "nbformat_minor": 2 501 | } 502 | -------------------------------------------------------------------------------- /Sklearn/supervised algorithm/Reg-NoneLinearRegression-py-v1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Saeed Aghabozorgi, PhD is a Data Scientist in IBM with a track record of developing enterprise level applications that substantially increases clients’ ability to turn data into actionable knowledge. He is a researcher in data mining field and expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.
\n", 536 | "\n", 537 | "Copyright © 2018 Cognitive Class. This notebook and its source code are released under the terms of the MIT License.
" 540 | ] 541 | } 542 | ], 543 | "metadata": { 544 | "kernelspec": { 545 | "display_name": "Python 3", 546 | "language": "python", 547 | "name": "python3" 548 | }, 549 | "language_info": { 550 | "codemirror_mode": { 551 | "name": "ipython", 552 | "version": 3 553 | }, 554 | "file_extension": ".py", 555 | "mimetype": "text/x-python", 556 | "name": "python", 557 | "nbconvert_exporter": "python", 558 | "pygments_lexer": "ipython3", 559 | "version": "3.6.6" 560 | } 561 | }, 562 | "nbformat": 4, 563 | "nbformat_minor": 2 564 | } 565 | -------------------------------------------------------------------------------- /Sklearn/supervised algorithm/Voting_Classifiers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Voting Classifiers.ipynb", 7 | "provenance": [], 8 | "collapsed_sections": [], 9 | "toc_visible": true 10 | }, 11 | "kernelspec": { 12 | "name": "python3", 13 | "display_name": "Python 3" 14 | }, 15 | "language_info": { 16 | "name": "python" 17 | } 18 | }, 19 | "cells": [ 20 | { 21 | "cell_type": "markdown", 22 | "metadata": { 23 | "id": "gL0AffZGOZxZ" 24 | }, 25 | "source": [ 26 | "# **Introduction**\n" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": { 32 | "id": "sEtW2gI7KVy5" 33 | }, 34 | "source": [ 35 | "Similarly, if we aggregate the predictions of a group of models (such as classifiers or regressors), we will often get better predictions than the best individual predictor. A group of predictors is called an **ensemble**. Thus this technique is called **ensemble learning**, and an ensemble learning algorithm is called an Ensemble Method." 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": { 41 | "id": "lIDjC_tsK0HY" 42 | }, 43 | "source": [ 44 | "As an example of an ensemble method, we can train a **group of decision tree classifiers**, each on a random subset of the training data. **Such an ensemble of decision trees is called a random forest**. Despite its simplicity, this is one of the most powerful machine learning algorithms available today. In this chapter, we will discuss the most famous ensemble learning methods, including: **Bagging, Boosting, & Stacking.**" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": { 50 | "id": "Lr8nF4iiAxoS" 51 | }, 52 | "source": [ 53 | "# **Voting Classifiers**" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": { 59 | "id": "eJIKaFHZA7MB" 60 | }, 61 | "source": [ 62 | "Suppose we have trained a few classifiers, each achieving an 80% accuracy. A very simple way to create an even better classifiers is to aggregate the predictions of all our classifiers and choose the prediction that is the most frequent.\n", 63 | "\n", 64 | "**Majority voting classification is called Hard Voting**" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": { 70 | "id": "cUPzZRuhB53n" 71 | }, 72 | "source": [ 73 | "" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": { 80 | "id": "ytZeVqhkNGE_" 81 | }, 82 | "source": [ 83 | "Somewhat surprisingly, this classifier achieves an even better accuracy than the best predictor in the ensemble. Even if each classifier is a weak learner (does slightly better then random guessing). 
Assuming that we have a sufficient number of weak learners and enough diversity.\n", 84 | "\n", 85 | "Due to the law of large numbers, if we build an ensemble containing 1,000 classifiers with individual accuracies of $51%$ & trained for binary classification, If we predict the majority voting class, we can hope for up to $75%$ accuracy.\n", 86 | "\n", 87 | "This is only true if all classifiers are completely independent, making uncorrelated errors, which is clearly not the case because they are trained on the same data.\n", 88 | "\n", 89 | "One way to get diverse classifiers is use different algorithms for each one of them & train them on different subset of the training data.\n", 90 | "\n", 91 | "Let's implement a hard voting ensemble learner using scikit-learn:" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": { 97 | "id": "Bh-YhPsZCG7S" 98 | }, 99 | "source": [ 100 | "**Python implmentation**" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "metadata": { 106 | "id": "gMfrhXQhNVob" 107 | }, 108 | "source": [ 109 | "import numpy as np\n", 110 | "import pandas as pd\n", 111 | "import matplotlib.pyplot as plt\n", 112 | "import sklearn" 113 | ], 114 | "execution_count": 1, 115 | "outputs": [] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "metadata": { 120 | "id": "hprKmZLBNdNZ" 121 | }, 122 | "source": [ 123 | "from sklearn.ensemble import RandomForestClassifier\n", 124 | "from sklearn.ensemble import VotingClassifier\n", 125 | "from sklearn.linear_model import LogisticRegression\n", 126 | "from sklearn.svm import SVC" 127 | ], 128 | "execution_count": 2, 129 | "outputs": [] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "metadata": { 134 | "id": "oRZKUesUNiNn" 135 | }, 136 | "source": [ 137 | "log_clf = LogisticRegression(solver='lbfgs')\n", 138 | "rf_clf = RandomForestClassifier(n_estimators=100)\n", 139 | "svm_clf = SVC(gamma='scale')" 140 | ], 141 | "execution_count": 3, 142 | "outputs": [] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "metadata": { 147 | "id": "fXyBuAIjNnBZ" 148 | }, 149 | "source": [ 150 | "from sklearn import datasets\n", 151 | "from sklearn.model_selection import train_test_split" 152 | ], 153 | "execution_count": 4, 154 | "outputs": [] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "metadata": { 159 | "id": "xBa0B1EhNspZ" 160 | }, 161 | "source": [ 162 | "X, y = datasets.make_moons(n_samples=10000, noise=0.5)\n", 163 | "X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33)" 164 | ], 165 | "execution_count": 5, 166 | "outputs": [] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "metadata": { 171 | "colab": { 172 | "base_uri": "https://localhost:8080/" 173 | }, 174 | "id": "57JwyteWNw1Q", 175 | "outputId": "967b0632-dda0-4799-cf94-6a17410fdea2" 176 | }, 177 | "source": [ 178 | "X_train.shape, y_train.shape, X_val.shape, y_val.shape\n" 179 | ], 180 | "execution_count": 6, 181 | "outputs": [ 182 | { 183 | "output_type": "execute_result", 184 | "data": { 185 | "text/plain": [ 186 | "((6700, 2), (6700,), (3300, 2), (3300,))" 187 | ] 188 | }, 189 | "metadata": { 190 | "tags": [] 191 | }, 192 | "execution_count": 6 193 | } 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "metadata": { 199 | "id": "xUzphfgFN2V2" 200 | }, 201 | "source": [ 202 | "voting_clf = VotingClassifier(estimators=[('lr', log_clf), ('rf', rf_clf), ('svc', svm_clf)], voting='hard')" 203 | ], 204 | "execution_count": 7, 205 | "outputs": [] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "metadata": { 210 | "colab": { 211 | 
"base_uri": "https://localhost:8080/" 212 | }, 213 | "id": "zNROV_7TN6dO", 214 | "outputId": "fc519296-81a5-47c0-af5e-f6c8e8a599fb" 215 | }, 216 | "source": [ 217 | "voting_clf.fit(X_train, y_train)\n" 218 | ], 219 | "execution_count": 8, 220 | "outputs": [ 221 | { 222 | "output_type": "execute_result", 223 | "data": { 224 | "text/plain": [ 225 | "VotingClassifier(estimators=[('lr',\n", 226 | " LogisticRegression(C=1.0, class_weight=None,\n", 227 | " dual=False, fit_intercept=True,\n", 228 | " intercept_scaling=1,\n", 229 | " l1_ratio=None, max_iter=100,\n", 230 | " multi_class='auto',\n", 231 | " n_jobs=None, penalty='l2',\n", 232 | " random_state=None,\n", 233 | " solver='lbfgs', tol=0.0001,\n", 234 | " verbose=0, warm_start=False)),\n", 235 | " ('rf',\n", 236 | " RandomForestClassifier(bootstrap=True,\n", 237 | " ccp_alpha=0.0,\n", 238 | " class_weight=None,\n", 239 | " cr...\n", 240 | " oob_score=False,\n", 241 | " random_state=None,\n", 242 | " verbose=0,\n", 243 | " warm_start=False)),\n", 244 | " ('svc',\n", 245 | " SVC(C=1.0, break_ties=False, cache_size=200,\n", 246 | " class_weight=None, coef0=0.0,\n", 247 | " decision_function_shape='ovr', degree=3,\n", 248 | " gamma='scale', kernel='rbf', max_iter=-1,\n", 249 | " probability=False, random_state=None,\n", 250 | " shrinking=True, tol=0.001, verbose=False))],\n", 251 | " flatten_transform=True, n_jobs=None, voting='hard',\n", 252 | " weights=None)" 253 | ] 254 | }, 255 | "metadata": { 256 | "tags": [] 257 | }, 258 | "execution_count": 8 259 | } 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": { 265 | "id": "HqIKJAZrOD6u" 266 | }, 267 | "source": [ 268 | "Let's take a look at the performance of each classifier + ensemble method on the validation set:\n", 269 | "\n" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "metadata": { 275 | "id": "I77QMfXHOHRq" 276 | }, 277 | "source": [ 278 | "from sklearn.metrics import accuracy_score\n" 279 | ], 280 | "execution_count": 9, 281 | "outputs": [] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "metadata": { 286 | "colab": { 287 | "base_uri": "https://localhost:8080/" 288 | }, 289 | "id": "lQLu9-ZtOI7n", 290 | "outputId": "a09fc59f-13d1-4d4c-99f1-3206e4c877d9" 291 | }, 292 | "source": [ 293 | "for clf in [log_clf, rf_clf, svm_clf, voting_clf]:\n", 294 | " clf.fit(X_train, y_train)\n", 295 | " y_hat = clf.predict(X_val)\n", 296 | " print(clf.__class__.__name__, accuracy_score(y_val, y_hat))" 297 | ], 298 | "execution_count": 10, 299 | "outputs": [ 300 | { 301 | "output_type": "stream", 302 | "text": [ 303 | "LogisticRegression 0.8151515151515152\n", 304 | "RandomForestClassifier 0.803939393939394\n", 305 | "SVC 0.8303030303030303\n", 306 | "VotingClassifier 0.8254545454545454\n" 307 | ], 308 | "name": "stdout" 309 | } 310 | ] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "metadata": { 315 | "id": "vodRbHN8OVam" 316 | }, 317 | "source": [ 318 | "There we have it! The voting classifier slightly outperforms the individual classifiers.\n", 319 | "\n", 320 | "If all ensemble method learners can estimate class probabilities, we can average their probabilities per class then predict the class with the highest probability. This is called Soft voting. It often yields results better than hard voting because it weights confidence." 
321 | ] 322 | }, 323 | { 324 | "cell_type": "markdown", 325 | "metadata": { 326 | "id": "EgWJOp-40ADb" 327 | }, 328 | "source": [ 329 | "# **References**" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": { 335 | "id": "MIXjbz4hOO6i" 336 | }, 337 | "source": [ 338 | "[Chapter 7. Ensemble Learning & Random Forests](https://github.com/Akramz/Hands-on-Machine-Learning-with-Scikit-Learn-Keras-and-TensorFlow/blob/master/07.Ensembles_RFs.ipynb)" 339 | ] 340 | } 341 | ] 342 | } -------------------------------------------------------------------------------- /Sklearn/supervised algorithm/XGBoost_in_Machine_Learning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "XGBoost in Machine Learning.ipynb", 7 | "provenance": [], 8 | "collapsed_sections": [] 9 | }, 10 | "kernelspec": { 11 | "name": "python3", 12 | "display_name": "Python 3" 13 | }, 14 | "language_info": { 15 | "name": "python" 16 | } 17 | }, 18 | "cells": [ 19 | { 20 | "cell_type": "markdown", 21 | "metadata": { 22 | "id": "bIsI9dS-FNPO" 23 | }, 24 | "source": [ 25 | "# **Introduction**" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": { 31 | "id": "ZdupLwSYFUHh" 32 | }, 33 | "source": [ 34 | "XGBoost or Gradient Boosting is a machine learning algorithm that goes through cycles to iteratively add models to a set. In this article, I will take you through the XGBoost algorithm in Machine Learning." 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": { 40 | "id": "0EpPpnxScrD4" 41 | }, 42 | "source": [ 43 | "The cycle of the XGBoost algorithm begins by initializing the whole with a unique model, the predictions of which can be quite naive." 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": { 49 | "id": "22KV7aATHIU5" 50 | }, 51 | "source": [ 52 | "# **The Process of XGBoost Algorithm:**\n" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": { 58 | "id": "nQkZsGzvc1pi" 59 | }, 60 | "source": [ 61 | "- First, we use the current set to generate predictions for each observation in the dataset. To make a prediction, we add the predictions of all the models in the set.\n", 62 | "- These predictions are used to calculate a loss function.\n", 63 | "- Then we use the loss function to fit a new model which will be added to the set. Specifically, we determine the parameters of the model so that adding this new model to the set reduces the loss.\n", 64 | "- Finally, we add the new model to the set, and …\n", 65 | "then repeat!" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": { 71 | "id": "zpOJPL34rP5Z" 72 | }, 73 | "source": [ 74 | "# **XGBoost Algorithm in Action**\n" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": { 80 | "id": "ePqKoUWTre6h" 81 | }, 82 | "source": [ 83 | "I’ll start by loading the training and validation data into X_train, X_valid, y_train and y_valid. The dataset, I am using here can be easily downloaded from here." 
84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "metadata": { 89 | "colab": { 90 | "base_uri": "https://localhost:8080/" 91 | }, 92 | "id": "HoOKm_KQsBWL", 93 | "outputId": "1d16352d-393f-46a2-88dc-22c1ac52da25" 94 | }, 95 | "source": [ 96 | "\n", 97 | "from google.colab import drive\n", 98 | "drive.mount('/content/drive')" 99 | ], 100 | "execution_count": 9, 101 | "outputs": [ 102 | { 103 | "output_type": "stream", 104 | "text": [ 105 | "Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n" 106 | ], 107 | "name": "stdout" 108 | } 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "metadata": { 114 | "id": "Fw0YMemUsOWj" 115 | }, 116 | "source": [ 117 | "import pandas as pd\n", 118 | "from sklearn.model_selection import train_test_split\n", 119 | "\n", 120 | "# Read the data\n", 121 | "data = pd.read_csv('/content/drive/MyDrive/Datasets/melb_data.csv')\n", 122 | "\n", 123 | "# Select subset of predictors\n", 124 | "cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']\n", 125 | "X = data[cols_to_use]\n", 126 | "\n", 127 | "# Select target\n", 128 | "y = data.Price\n", 129 | "\n", 130 | "# Separate data into training and validation sets\n", 131 | "X_train, X_valid, y_train, y_valid = train_test_split(X,y) " 132 | ], 133 | "execution_count": 6, 134 | "outputs": [] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": { 139 | "id": "otrycWU6sqTK" 140 | }, 141 | "source": [ 142 | "Now, here you will learn how to use the XGBoost algorithm. Here we need to import the scikit-learn API for XGBoost (xgboost.XGBRegressor). This allows us to create and adjust a model like we would in scikit-learn. As you will see in the output, the XGBRegressor class has many adjustable parameters:" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "metadata": { 148 | "id": "IsTLbK6GssO6" 149 | }, 150 | "source": [ 151 | "from xgboost import XGBRegressor\n", 152 | "\n", 153 | "my_model = XGBRegressor()\n", 154 | "my_model.fit(X_train, y_train)" 155 | ], 156 | "execution_count": null, 157 | "outputs": [] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": { 162 | "id": "ZAYMrKnXs3aj" 163 | }, 164 | "source": [ 165 | "Now, we need to make predictions and evaluate our model:\n", 166 | "\n" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "metadata": { 172 | "colab": { 173 | "base_uri": "https://localhost:8080/" 174 | }, 175 | "id": "V_A5LMrws-5i", 176 | "outputId": "2c451630-6381-4aec-a62c-590e92000d0e" 177 | }, 178 | "source": [ 179 | "from sklearn.metrics import mean_absolute_error\n", 180 | "\n", 181 | "predictions = my_model.predict(X_valid)\n", 182 | "print(\"Mean Absolute Error: \" + str(mean_absolute_error(predictions, y_valid)))" 183 | ], 184 | "execution_count": 8, 185 | "outputs": [ 186 | { 187 | "output_type": "stream", 188 | "text": [ 189 | "Mean Absolute Error: 279829.9009295499\n" 190 | ], 191 | "name": "stdout" 192 | } 193 | ] 194 | }, 195 | { 196 | "cell_type": "markdown", 197 | "metadata": { 198 | "id": "XD-d44CUH_8n" 199 | }, 200 | "source": [ 201 | "# **Parameter Tuning**\n" 202 | ] 203 | }, 204 | { 205 | "cell_type": "markdown", 206 | "metadata": { 207 | "id": "fYkOlpS737BL" 208 | }, 209 | "source": [ 210 | "XGBoost has a few features that can drastically affect the accuracy and speed of training. 
The first feature you need to understand are:\n", 211 | "\n" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": { 217 | "id": "eJwABNAd3_Jc" 218 | }, 219 | "source": [ 220 | "**n_estimators**\n" 221 | ] 222 | }, 223 | { 224 | "cell_type": "markdown", 225 | "metadata": { 226 | "id": "0zaqY98h4Ct6" 227 | }, 228 | "source": [ 229 | "n_estimators specifies the number of times to skip the modelling cycle described above. It is equal to the number of models we include in the set." 230 | ] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "metadata": { 235 | "id": "BECHRnND4Ih2" 236 | }, 237 | "source": [ 238 | "- Too low a value results in an underfitting, leading to inaccurate predictions on training data and test data.\n", 239 | "- Too high a value results in overfitting, resulting in accurate predictions on training data, but inaccurate predictions on test data (which is important to us)." 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "metadata": { 245 | "id": "m45TqHHI4THz" 246 | }, 247 | "source": [ 248 | "Typical the values lie between 100 to 1000, although it all depends a lot on the learning_rate parameter described below. Here is the code to set the number of models in the set:" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "metadata": { 254 | "id": "Uc_9nhDP4U19" 255 | }, 256 | "source": [ 257 | "XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n", 258 | " colsample_bynode=1, colsample_bytree=1, gamma=0,\n", 259 | " importance_type='gain', learning_rate=0.1, max_delta_step=0,\n", 260 | " max_depth=3, min_child_weight=1, missing=None, n_estimators=500,\n", 261 | " n_jobs=1, nthread=None, objective='reg:linear', random_state=0,\n", 262 | " reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,\n", 263 | " silent=None, subsample=1, verbosity=1)" 264 | ], 265 | "execution_count": null, 266 | "outputs": [] 267 | }, 268 | { 269 | "cell_type": "markdown", 270 | "metadata": { 271 | "id": "OA6GZob_4cCd" 272 | }, 273 | "source": [ 274 | "**early_stopping_rounds**\n" 275 | ] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": { 280 | "id": "U6wv3Tnv4eNL" 281 | }, 282 | "source": [ 283 | "early_stopping_rounds provides a way to automatically find the ideal value for n_estimators. Stopping early causes the iteration of the model to stop when the validation score stops improving, even though we are not stopping hard for n_estimators. It’s a good idea to set n_estimators high and then use early_stopping_rounds to find the optimal time to stop the iteration." 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "metadata": { 289 | "id": "tbHxfNEq4jH1" 290 | }, 291 | "source": [ 292 | "Since random chance sometimes causes a single round where validation scores do not improve, you must specify a number for the number of direct deterioration turns to allow before stopping. Setting early_stopping_rounds = 5 is a reasonable choice. In this case, we stop after 5 consecutive rounds of deterioration of validation scores. 
Now let’s see how we can use early_stopping:" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "metadata": { 298 | "id": "Q4nFErKq4r1T" 299 | }, 300 | "source": [ 301 | "my_model = XGBRegressor(n_estimators=500)\n", 302 | "my_model.fit(X_train, y_train, \n", 303 | " early_stopping_rounds=5, \n", 304 | " eval_set=[(X_valid, y_valid)],\n", 305 | " verbose=False)" 306 | ], 307 | "execution_count": null, 308 | "outputs": [] 309 | }, 310 | { 311 | "cell_type": "markdown", 312 | "metadata": { 313 | "id": "4FdGAi104xp0" 314 | }, 315 | "source": [ 316 | "**learning_rate**\n" 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "metadata": { 322 | "id": "dTaHClVE43Z0" 323 | }, 324 | "source": [ 325 | "Instead of getting predictions by simply adding up the predictions of each component model, we can multiply the predictions of each model by a small number before adding them.\n", 326 | "\n", 327 | "This means that every tree we add to the set helps us less. So we can set a high value for the n_estimators without overfitting. If we use early shutdown, the appropriate number of trees will be determined automatically. Now, let’s see how we can use learning_rate in XGBoost algorithm:" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "metadata": { 333 | "colab": { 334 | "base_uri": "https://localhost:8080/" 335 | }, 336 | "id": "C0TxFTgp45S6", 337 | "outputId": "cc0ed6cb-e257-431b-a1f9-28c3d9d9f431" 338 | }, 339 | "source": [ 340 | "my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)\n", 341 | "my_model.fit(X_train, y_train, \n", 342 | " early_stopping_rounds=5, \n", 343 | " eval_set=[(X_valid, y_valid)], \n", 344 | " verbose=False)" 345 | ], 346 | "execution_count": 12, 347 | "outputs": [ 348 | { 349 | "output_type": "stream", 350 | "text": [ 351 | "[07:22:39] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.\n" 352 | ], 353 | "name": "stdout" 354 | }, 355 | { 356 | "output_type": "execute_result", 357 | "data": { 358 | "text/plain": [ 359 | "XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n", 360 | " colsample_bynode=1, colsample_bytree=1, gamma=0,\n", 361 | " importance_type='gain', learning_rate=0.05, max_delta_step=0,\n", 362 | " max_depth=3, min_child_weight=1, missing=None, n_estimators=1000,\n", 363 | " n_jobs=1, nthread=None, objective='reg:linear', random_state=0,\n", 364 | " reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,\n", 365 | " silent=None, subsample=1, verbosity=1)" 366 | ] 367 | }, 368 | "metadata": {}, 369 | "execution_count": 12 370 | } 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "metadata": { 376 | "id": "8fxR6L_c5E6F" 377 | }, 378 | "source": [ 379 | "**n_jobs**\n" 380 | ] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "metadata": { 385 | "id": "sxepfNxS5HMr" 386 | }, 387 | "source": [ 388 | "On larger datasets where execution is a consideration, you can use parallelism to build your models faster. It is common to set the n_jobs parameter equal to the number of cores on your machine. On smaller data sets, this won’t help.\n", 389 | "\n", 390 | "The resulting model will not be better, so micro-optimizing the timing of the fit is usually just a distraction. But it’s very useful in large datasets where you would spend a lot of time waiting for the fit command. 
Now, let’s see how to use this parameter in the XGBoost algorithm:" 391 | ] 392 | }, 393 | { 394 | "cell_type": "code", 395 | "metadata": { 396 | "id": "k4CmOX1O5Ov2" 397 | }, 398 | "source": [ 399 | "my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)\n", 400 | "my_model.fit(X_train, y_train, \n", 401 | " early_stopping_rounds=5, \n", 402 | " eval_set=[(X_valid, y_valid)], \n", 403 | " verbose=False)" 404 | ], 405 | "execution_count": null, 406 | "outputs": [] 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "metadata": { 411 | "id": "Fby7CCu5E0-C" 412 | }, 413 | "source": [ 414 | "# **References**\n", 415 | "[XGBoost in Machine Learning](https://thecleverprogrammer.com/2020/09/04/xgboost-in-machine-learning/)" 416 | ] 417 | } 418 | ] 419 | } -------------------------------------------------------------------------------- /Sklearn/supervised algorithm/dataset/readme: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Sklearn/supervised algorithm/readm: -------------------------------------------------------------------------------- 1 | Supervised learning is a type of machine learning problem where users are given targets which they need to predict. 2 | Classification is a type of supervised learning where an algorithm predicts one output from a list of given classes. 3 | It can be a binary classification task where there are 2-classes or multi-class problems where there are more than 2-classes. 4 | Scikit-Learn - Naive Bayes¶ 5 | https://coderzcolumn.com/tutorials/machine-learning/scikit-learn-sklearn-naive-bayes?fbclid=IwAR2EUHN0XwJlCQ8hxjvYHh9Vl4g0AjllmD1ktHsNd7Mwu5g2bOLZEjdKld4 6 | -------------------------------------------------------------------------------- /Statistics/Readme: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /The-Art-of-Linear-Algebra.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dr-mushtaq/Machine-Learning/6c5b957b7088d99ac86cc65988448f064b6fdd98/The-Art-of-Linear-Algebra.pdf -------------------------------------------------------------------------------- /readme: -------------------------------------------------------------------------------- 1 | Semi Supervised Learning – A Gentle Introduction for Beginners 2 | https://machinelearningknowledge.ai/semi-supervised-learning-a-gentle-introduction-for-beginners/?fbclid=IwAR2hWWec_bhDJpjr9nOSEUkS1zjW4LJ-IsqLXtL8dm3mCPT-JHFfjVUThWY 3 | Machine Learning with python for everyone 4 | https://drive.google.com/file/d/16q7D0W0CIGS4qOAjt18BpEquodqpE7EV/view?usp%3Ddrivesdk&fbclid=IwAR0y98UMt5ts7FFCN32AN29o8gUHnTGlB1sMNR_wvqEXV_GCefLqvCVlheE 5 | 6 | 7 | --------------------------------------------------------------------------------