├── GB KNN Model.ipynb ├── Linear_Regression.ipynb └── Logistic Regression.ipynb /Linear_Regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "GzfdMfk10NE6" 7 | }, 8 | "source": [ 9 | "## **Linear Regression with Python Scikit Learn**\n", 10 | "In this section we will see how the Python Scikit-Learn library for machine learning can be used to implement regression functions. We will start with simple linear regression involving two variables.\n", 11 | "\n", 12 | "### **Simple Linear Regression**\n", 13 | "In this regression task we will predict the percentage of marks that a student is expected to score based upon the number of hours they studied. This is a simple linear regression task as it involves just two variables." 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": 1, 19 | "metadata": { 20 | "id": "V9QN2ZxC38pB" 21 | }, 22 | "outputs": [], 23 | "source": [ 24 | "# Importing all libraries required in this notebook\n", 25 | "import pandas as pd\n", 26 | "import numpy as np \n", 27 | "import matplotlib.pyplot as plt " 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 2, 33 | "metadata": { 34 | "colab": { 35 | "base_uri": "https://localhost:8080/", 36 | "height": 380 37 | }, 38 | "id": "LtU4YMEhqm9m", 39 | "outputId": "cae4d898-e81c-4cb0-cae3-b668a921a074" 40 | }, 41 | "outputs": [ 42 | { 43 | "name": "stdout", 44 | "output_type": "stream", 45 | "text": [ 46 | "Data imported successfully\n" 47 | ] 48 | }, 49 | { 50 | "data": { 51 | "text/html": [ 52 | "
\n", 53 | "
   Hours  Scores
0    2.5      21
1    5.1      47
2    3.2      27
3    8.5      75
4    3.5      30
5    1.5      20
6    9.2      88
7    5.5      60
8    8.3      81
9    2.7      25
\n", 127 | "
" 128 | ], 129 | "text/plain": [ 130 | " Hours Scores\n", 131 | "0 2.5 21\n", 132 | "1 5.1 47\n", 133 | "2 3.2 27\n", 134 | "3 8.5 75\n", 135 | "4 3.5 30\n", 136 | "5 1.5 20\n", 137 | "6 9.2 88\n", 138 | "7 5.5 60\n", 139 | "8 8.3 81\n", 140 | "9 2.7 25" 141 | ] 142 | }, 143 | "execution_count": 2, 144 | "metadata": {}, 145 | "output_type": "execute_result" 146 | } 147 | ], 148 | "source": [ 149 | "# Reading data from remote link\n", 150 | "url = \"https://raw.githubusercontent.com/mhassandata/Regression_model/main/score.csv\"\n", 151 | "s_data = pd.read_csv(url)\n", 152 | "print(\"Data imported successfully\")\n", 153 | "s_data.head(10)" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": { 159 | "id": "RHsPneuM4NgB" 160 | }, 161 | "source": [ 162 | "Let's plot our data points on a 2-D graph to eyeball the dataset and see whether we can spot any relationship between the two variables by eye. We can create the plot with the following script:" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 3, 168 | "metadata": { 169 | "colab": { 170 | "base_uri": "https://localhost:8080/", 171 | "height": 472 172 | }, 173 | "id": "qxYBZkhAqpn9", 174 | "outputId": "36a7e99a-3ffa-41bf-da42-ce089c3e5dbd" 175 | }, 176 | "outputs": [ 177 | { 178 | "data": { 179 | "image/png": "", 180 | "text/plain": [ 181 | "
" 182 | ] 183 | }, 184 | "metadata": {}, 185 | "output_type": "display_data" 186 | } 187 | ], 188 | "source": [ 189 | "# Plotting the distribution of scores\n", 190 | "s_data.plot(x='Hours', y='Scores', style='o') \n", 191 | "plt.title('Hours vs Percentage') \n", 192 | "plt.xlabel('Hours Studied') \n", 193 | "plt.ylabel('Percentage Score') \n", 194 | "plt.show()" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": { 200 | "id": "fiQaULio4Rzr" 201 | }, 202 | "source": [ 203 | "**From the graph above, we can clearly see that there is a positive linear relation between the number of hours studied and percentage of score.**" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": { 209 | "id": "WWtEr64M4jdz" 210 | }, 211 | "source": [ 212 | "### **Preparing the data**\n", 213 | "\n", 214 | "The next step is to divide the data into \"attributes\" (inputs) and \"labels\" (outputs)." 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": 4, 220 | "metadata": { 221 | "id": "LiJ5210e4tNX" 222 | }, 223 | "outputs": [], 224 | "source": [ 225 | "X = s_data.iloc[:, :-1].values \n", 226 | "y = s_data.iloc[:, 1].values " 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": 5, 232 | "metadata": { 233 | "colab": { 234 | "base_uri": "https://localhost:8080/" 235 | }, 236 | "id": "0DrNCxfV_0sS", 237 | "outputId": "31344c84-6445-4b80-f7a9-17ea56604245" 238 | }, 239 | "outputs": [ 240 | { 241 | "data": { 242 | "text/plain": [ 243 | "array([[2.5],\n", 244 | " [5.1],\n", 245 | " [3.2],\n", 246 | " [8.5],\n", 247 | " [3.5],\n", 248 | " [1.5],\n", 249 | " [9.2],\n", 250 | " [5.5],\n", 251 | " [8.3],\n", 252 | " [2.7],\n", 253 | " [7.7],\n", 254 | " [5.9],\n", 255 | " [4.5],\n", 256 | " [3.3],\n", 257 | " [1.1],\n", 258 | " [8.9],\n", 259 | " [2.5],\n", 260 | " [1.9],\n", 261 | " [6.1],\n", 262 | " [7.4],\n", 263 | " [2.7],\n", 264 | " [4.8],\n", 265 | " [3.8],\n", 266 | " [6.9],\n", 267 | " [7.8]])" 268 | 
] 269 | }, 270 | "execution_count": 5, 271 | "metadata": {}, 272 | "output_type": "execute_result" 273 | } 274 | ], 275 | "source": [ 276 | "X" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": 6, 282 | "metadata": { 283 | "colab": { 284 | "base_uri": "https://localhost:8080/" 285 | }, 286 | "id": "HKJA37KL_0sT", 287 | "outputId": "6ecc3cb9-084d-42eb-de9f-1c57256cca6d" 288 | }, 289 | "outputs": [ 290 | { 291 | "data": { 292 | "text/plain": [ 293 | "array([21, 47, 27, 75, 30, 20, 88, 60, 81, 25, 85, 62, 41, 42, 17, 95, 30,\n", 294 | " 24, 67, 69, 30, 54, 35, 76, 86])" 295 | ] 296 | }, 297 | "execution_count": 6, 298 | "metadata": {}, 299 | "output_type": "execute_result" 300 | } 301 | ], 302 | "source": [ 303 | "y" 304 | ] 305 | }, 306 | { 307 | "cell_type": "markdown", 308 | "metadata": { 309 | "id": "Riz-ZiZ34fO4" 310 | }, 311 | "source": [ 312 | "Now that we have our attributes and labels, the next step is to split this data into training and test sets. We'll do this by using Scikit-Learn's built-in train_test_split() method:" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": 7, 318 | "metadata": { 319 | "id": "udFYso1M4BNw" 320 | }, 321 | "outputs": [], 322 | "source": [ 323 | "from sklearn.model_selection import train_test_split\n", 324 | "\n", 325 | "X_train, X_test, y_train, y_test = train_test_split(X, y, \n", 326 | " test_size=0.2, random_state=0) " 327 | ] 328 | }, 329 | { 330 | "cell_type": "code", 331 | "execution_count": 8, 332 | "metadata": { 333 | "colab": { 334 | "base_uri": "https://localhost:8080/" 335 | }, 336 | "id": "LDQkaigQ_0sT", 337 | "outputId": "6162a217-b98c-4c7d-a51a-e417cd8b95cd" 338 | }, 339 | "outputs": [ 340 | { 341 | "data": { 342 | "text/plain": [ 343 | "array([[3.8],\n", 344 | " [1.9],\n", 345 | " [7.8],\n", 346 | " [6.9],\n", 347 | " [1.1],\n", 348 | " [5.1],\n", 349 | " [7.7],\n", 350 | " [3.3],\n", 351 | " [8.3],\n", 352 | " [9.2],\n", 353 | " [6.1],\n", 354 | " [3.5],\n", 
355 | " [2.7],\n", 356 | " [5.5],\n", 357 | " [2.7],\n", 358 | " [8.5],\n", 359 | " [2.5],\n", 360 | " [4.8],\n", 361 | " [8.9],\n", 362 | " [4.5]])" 363 | ] 364 | }, 365 | "execution_count": 8, 366 | "metadata": {}, 367 | "output_type": "execute_result" 368 | } 369 | ], 370 | "source": [ 371 | "X_train" 372 | ] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "metadata": { 377 | "id": "a6WXptFU5CkC" 378 | }, 379 | "source": [ 380 | "### **Training the Algorithm**\n", 381 | "Now that we have split our data into training and testing sets, we can finally train our algorithm on it. " 382 | ] 383 | }, 384 | { 385 | "cell_type": "code", 386 | "execution_count": 9, 387 | "metadata": { 388 | "colab": { 389 | "base_uri": "https://localhost:8080/" 390 | }, 391 | "id": "qddCuaS84fpK", 392 | "outputId": "41a29249-38ab-4773-f7f6-e1ede34e7536" 393 | }, 394 | "outputs": [], 395 | "source": [ 396 | "from sklearn.linear_model import LinearRegression \n", 397 | "\n", 398 | "regressor = LinearRegression()" 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": 10, 404 | "metadata": {}, 405 | "outputs": [ 406 | { 407 | "name": "stdout", 408 | "output_type": "stream", 409 | "text": [ 410 | "Training complete.\n" 411 | ] 412 | } 413 | ], 414 | "source": [ 415 | "regressor.fit(X_train, y_train) \n", 416 | "\n", 417 | "print(\"Training complete.\")" 418 | ] 419 | }, 420 | { 421 | "cell_type": "code", 422 | "execution_count": 11, 423 | "metadata": {}, 424 | "outputs": [ 425 | { 426 | "data": { 427 | "text/plain": [ 428 | "array([16.88414476, 33.73226078, 75.357018 , 26.79480124, 60.49103328])" 429 | ] 430 | }, 431 | "execution_count": 11, 432 | "metadata": {}, 433 | "output_type": "execute_result" 434 | } 435 | ], 436 | "source": [ 437 | "regressor.predict(X_test)" 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": 12, 443 | "metadata": { 444 | "colab": { 445 | "base_uri": "https://localhost:8080/", 446 | "height": 430 447 | }, 448 | 
"id": "LO3cIcQV_0sU", 449 | "outputId": "9f399f85-9909-4797-b6e2-8b54b017aeda" 450 | }, 451 | "outputs": [ 452 | { 453 | "data": { 454 | "text/plain": [ 455 | "" 456 | ] 457 | }, 458 | "execution_count": 12, 459 | "metadata": {}, 460 | "output_type": "execute_result" 461 | }, 462 | { 463 | "data": { 464 | "image/png": "", 465 | "text/plain": [ 466 | "
" 467 | ] 468 | }, 469 | "metadata": {}, 470 | "output_type": "display_data" 471 | } 472 | ], 473 | "source": [ 474 | "plt.scatter(X,y)" 475 | ] 476 | }, 477 | { 478 | "cell_type": "code", 479 | "execution_count": 13, 480 | "metadata": {}, 481 | "outputs": [ 482 | { 483 | "name": "stdout", 484 | "output_type": "stream", 485 | "text": [ 486 | "The calculated parameters are theta_1: 9.910656480642233, and theta_2: 2.0181600414346974\n" 487 | ] 488 | } 489 | ], 490 | "source": [ 491 | "print(f\"The calculated parameters are theta_1: {regressor.coef_[0]}, and theta_2: {regressor.intercept_}\")" 492 | ] 493 | }, 494 | { 495 | "cell_type": "code", 496 | "execution_count": 14, 497 | "metadata": {}, 498 | "outputs": [], 499 | "source": [ 500 | "# Plotting the regression line\n", 501 | "line = regressor.coef_*X_test+regressor.intercept_" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": 15, 507 | "metadata": {}, 508 | "outputs": [ 509 | { 510 | "data": { 511 | "text/plain": [ 512 | "array([21, 47, 27, 75, 30, 20, 88, 60, 81, 25, 85, 62, 41, 42, 17, 95, 30,\n", 513 | " 24, 67, 69, 30, 54, 35, 76, 86])" 514 | ] 515 | }, 516 | "execution_count": 15, 517 | "metadata": {}, 518 | "output_type": "execute_result" 519 | } 520 | ], 521 | "source": [ 522 | "y" 523 | ] 524 | }, 525 | { 526 | "cell_type": "code", 527 | "execution_count": 16, 528 | "metadata": { 529 | "colab": { 530 | "base_uri": "https://localhost:8080/", 531 | "height": 430 532 | }, 533 | "id": "J61NX2_2-px7", 534 | "outputId": "20d96bf4-8f2c-4a15-c004-b34a63f0f815" 535 | }, 536 | "outputs": [ 537 | { 538 | "data": { 539 | "image/png": "", 540 | "text/plain": [ 541 | "
" 542 | ] 543 | }, 544 | "metadata": {}, 545 | "output_type": "display_data" 546 | }, 547 | { 548 | "data": { 549 | "image/png": "", 550 | "text/plain": [ 551 | "
" 552 | ] 553 | }, 554 | "metadata": {}, 555 | "output_type": "display_data" 556 | } 557 | ], 558 | "source": [ 559 | "preds = regressor.predict(X)\n", 560 | "\n", 561 | "plt.subplot(1, 2, 1)\n", 562 | "plt.scatter(X, y)\n", 563 | "plt.plot(X, preds)\n", 564 | "plt.show()\n", 565 | "\n", 566 | "# Plotting for the test data\n", 567 | "plt.subplot(1, 2, 2)\n", 568 | "plt.scatter(X_test, y_test, color=\"red\")\n", 569 | "plt.plot(X_test, line)\n", 570 | "plt.show()" 571 | ] 572 | }, 573 | { 574 | "cell_type": "markdown", 575 | "metadata": { 576 | "id": "JCQn-g4m5OK2" 577 | }, 578 | "source": [ 579 | "### **Making Predictions**\n", 580 | "Now that we have trained our algorithm, it's time to make some predictions." 581 | ] 582 | }, 583 | { 584 | "cell_type": "code", 585 | "execution_count": 15, 586 | "metadata": { 587 | "colab": { 588 | "base_uri": "https://localhost:8080/" 589 | }, 590 | "id": "Tt-Fmzu55EGM", 591 | "outputId": "f03a010b-399e-49c2-ab18-1f41f4b13a89" 592 | }, 593 | "outputs": [ 594 | { 595 | "name": "stdout", 596 | "output_type": "stream", 597 | "text": [ 598 | "[[1.5]\n", 599 | " [3.2]\n", 600 | " [7.4]\n", 601 | " [2.5]\n", 602 | " [5.9]]\n" 603 | ] 604 | } 605 | ], 606 | "source": [ 607 | "print(X_test) # Testing data - In Hours\n", 608 | "y_pred = regressor.predict(X_test) # Predicting the scores" 609 | ] 610 | }, 611 | { 612 | "cell_type": "code", 613 | "execution_count": 16, 614 | "metadata": { 615 | "colab": { 616 | "base_uri": "https://localhost:8080/", 617 | "height": 206 618 | }, 619 | "id": "6bmZUMZh5QLb", 620 | "outputId": "943e567f-fe9d-43ad-9c81-a841d1026ef1" 621 | }, 622 | "outputs": [ 623 | { 624 | "data": { 625 | "text/html": [ 626 | "\n", 627 | "
\n", 628 | "
   Actual  Predicted
0      20  16.884145
1      27  33.732261
2      69  75.357018
3      30  26.794801
4      62  60.491033
\n", 754 | "
\n", 755 | " " 756 | ], 757 | "text/plain": [ 758 | " Actual Predicted\n", 759 | "0 20 16.884145\n", 760 | "1 27 33.732261\n", 761 | "2 69 75.357018\n", 762 | "3 30 26.794801\n", 763 | "4 62 60.491033" 764 | ] 765 | }, 766 | "execution_count": 16, 767 | "metadata": {}, 768 | "output_type": "execute_result" 769 | } 770 | ], 771 | "source": [ 772 | "# Comparing Actual vs Predicted\n", 773 | "df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred}) \n", 774 | "df " 775 | ] 776 | }, 777 | { 778 | "cell_type": "code", 779 | "execution_count": 21, 780 | "metadata": { 781 | "colab": { 782 | "base_uri": "https://localhost:8080/" 783 | }, 784 | "id": "KAFO8zbx-AH1", 785 | "outputId": "3c003ec3-e681-47fb-a684-e71ab2cee7f5" 786 | }, 787 | "outputs": [ 788 | { 789 | "name": "stdout", 790 | "output_type": "stream", 791 | "text": [ 792 | "No of Hours = 9.5\n", 793 | "Predicted Score = 96.16939660753593\n" 794 | ] 795 | } 796 | ], 797 | "source": [ 798 | "# You can also test with your own data\n", 799 | "hours = 9.5\n", 800 | "own_pred = regressor.predict([[hours]])\n", 801 | "print(\"No of Hours = {}\".format(hours))\n", 802 | "print(\"Predicted Score = {}\".format(own_pred[0]))" 803 | ] 804 | }, 805 | { 806 | "cell_type": "markdown", 807 | "metadata": { 808 | "id": "0AAsPVA_6KmK" 809 | }, 810 | "source": [ 811 | "### **Evaluating the model**\n", 812 | "\n", 813 | "The final step is to evaluate the performance of the algorithm. This step is particularly important for comparing how well different algorithms perform on a particular dataset. For simplicity, we use the mean absolute error here; many other metrics exist." 
814 | ] 815 | }, 816 | { 817 | "cell_type": "code", 818 | "execution_count": 22, 819 | "metadata": { 820 | "colab": { 821 | "base_uri": "https://localhost:8080/" 822 | }, 823 | "id": "r5UOrRH-5VCQ", 824 | "outputId": "a4bc5295-e596-40c4-faee-f72ef065f366" 825 | }, 826 | "outputs": [ 827 | { 828 | "name": "stdout", 829 | "output_type": "stream", 830 | "text": [ 831 | "Mean Absolute Error: 4.183859899002982\n" 832 | ] 833 | } 834 | ], 835 | "source": [ 836 | "from sklearn import metrics \n", 837 | "print('Mean Absolute Error:', \n", 838 | " metrics.mean_absolute_error(y_test, y_pred)) " 839 | ] 840 | }, 841 | { 842 | "cell_type": "code", 843 | "execution_count": 18, 844 | "metadata": { 845 | "id": "1MzDbtLh_0sX" 846 | }, 847 | "outputs": [], 848 | "source": [] 849 | } 850 | ], 851 | "metadata": { 852 | "colab": { 853 | "provenance": [] 854 | }, 855 | "kernelspec": { 856 | "display_name": "Python 3 (ipykernel)", 857 | "language": "python", 858 | "name": "python3" 859 | }, 860 | "language_info": { 861 | "codemirror_mode": { 862 | "name": "ipython", 863 | "version": 3 864 | }, 865 | "file_extension": ".py", 866 | "mimetype": "text/x-python", 867 | "name": "python", 868 | "nbconvert_exporter": "python", 869 | "pygments_lexer": "ipython3", 870 | "version": "3.12.4" 871 | } 872 | }, 873 | "nbformat": 4, 874 | "nbformat_minor": 0 875 | } 876 | --------------------------------------------------------------------------------
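The evaluation step in the notebook reports a single number, the mean absolute error. As a minimal sketch of what that metric computes, the snippet below recalculates it by hand from the values shown in the notebook's "Actual vs Predicted" table, and adds the mean squared error and its root for comparison. The helper names `mae_by_hand` and `mse_by_hand` are hypothetical and not part of the notebook or of scikit-learn; the predicted values are the rounded figures from the table, so results may differ from the notebook in the last few digits.

```python
import math

# Values copied from the "Actual vs Predicted" table in the notebook
# (predictions are rounded to six decimals there).
y_test = [20, 27, 69, 30, 62]
y_pred = [16.884145, 33.732261, 75.357018, 26.794801, 60.491033]


def mae_by_hand(actual, predicted):
    """Hypothetical helper: mean of the absolute residuals |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse_by_hand(actual, predicted):
    """Hypothetical helper: mean of the squared residuals (actual - predicted)^2."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


mae = mae_by_hand(y_test, y_pred)
mse = mse_by_hand(y_test, y_pred)
rmse = math.sqrt(mse)  # same units as the scores, unlike the MSE

print(f"MAE:  {mae:.4f}")  # agrees with the 4.1839 reported by the notebook
print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
```

The MAE weights every residual equally, while the MSE and RMSE penalise large misses more heavily, so the choice between them depends on how costly large errors are for the task at hand.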