├── README.md
├── feature-encoding.ipynb
├── feature-scaling.ipynb
└── missing-values-imputation.ipynb
/README.md:
--------------------------------------------------------------------------------
1 | # Scikit-learn Tutorial
2 |
3 | ## Introduction
4 | [Scikit-learn](https://scikit-learn.org/stable/) is a free software machine learning library for the Python programming language. It features various classification,
5 | regression and clustering algorithms including support vector machines, random forests, gradient boosting and k-means and is
6 | designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
7 |
8 | ## Repository description
9 | This repository contains 3 separate notebooks, each covering different aspects of data preprocessing for machine learning
10 | using scikit-learn, namely:
11 | - Feature encoding
12 | - Feature scaling
13 | - Missing values imputation
14 |
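As a taste of what the feature scaling notebook covers, the sketch below computes min-max normalisation and z-score standardisation by hand for a single column. It is a dependency-free illustration only: the helper names `min_max_scale` and `standardise` and the sample values are hypothetical, and the notebook itself relies on scikit-learn's `MinMaxScaler` and `StandardScaler`. The population standard deviation is used here, which is also what `StandardScaler` uses.

```python
from statistics import fmean, pstdev


def min_max_scale(values):
    """Rescale values to the range [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; mapping it to 0.0 is an assumption of this sketch.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


def standardise(values):
    """Rescale values to mean 0 and variance 1 via (x - mean) / std."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (divides by n)
    if std == 0:
        # Same assumption as above for a constant column.
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]


# Hypothetical sample column with a wide range of values
tax = [187.0, 279.0, 330.0, 666.0, 711.0]

normalised = min_max_scale(tax)
standardised = standardise(tax)

print("min-max:", [round(v, 3) for v in normalised])
print("z-score:", [round(v, 3) for v in standardised])
```

After min-max scaling the smallest value maps to 0 and the largest to 1, while after standardisation the column has a mean of 0 and a standard deviation of 1.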
15 | ## Medium (Towards Data Science) articles
16 | - [Guide to Encoding Categorical Features using Scikit-Learn for Machine Learning](https://towardsdatascience.com/guide-to-encoding-categorical-features-using-scikit-learn-for-machine-learning-5048997a5c79)
17 | - [What is Feature Scaling & Why is it Important in Machine Learning?](https://towardsdatascience.com/what-is-feature-scaling-why-is-it-important-in-machine-learning-2854ae877048)
18 | - [Stop Wasting Useful Information When Imputing Missing Values](https://towardsdatascience.com/stop-wasting-useful-information-when-imputing-missing-values-d6ef91ef4c21)
19 |
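The idea behind the imputation notebook and article, that a univariate fill ignores information held in the other columns, can be illustrated without any dependencies. The sketch below is a simplified stand-in rather than scikit-learn's implementation: the helper names `mean_impute` and `knn_impute` are hypothetical, and plain Euclidean distance over the two complete columns is an assumption made for brevity. The notebook itself uses `SimpleImputer`, `IterativeImputer` and `KNNImputer`.

```python
from math import dist
from statistics import fmean

# Small sample table: (SibSp, Fare, Age); the last row has a missing Age
rows = [
    (1, 7.25, 22.0),
    (1, 71.2833, 38.0),
    (0, 7.925, 26.0),
    (1, 53.1, 35.0),
    (0, 8.05, 35.0),
    (0, 8.4583, None),
]


def mean_impute(rows):
    """Univariate fill: the mean of the observed Age values."""
    return fmean(age for _, _, age in rows if age is not None)


def knn_impute(rows, target, k=2):
    """Multivariate fill: mean Age of the k rows closest to target on SibSp and Fare."""
    donors = [r for r in rows if r[2] is not None]
    donors.sort(key=lambda r: dist(r[:2], target[:2]))
    return fmean(r[2] for r in donors[:k])


mean_age = mean_impute(rows)
knn_age = knn_impute(rows, rows[-1])

print("Mean imputation:", mean_age)
print("2-nearest-neighbour imputation:", knn_age)
```

The two approaches give different fills for the same missing value because the nearest-neighbour version only averages over the rows that resemble the incomplete one.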
--------------------------------------------------------------------------------
/feature-scaling.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# 0. Introduction"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "Feature scaling is the process of normalising the range of features in a dataset. \n",
15 | "\n",
16 | "Real-world datasets often contain features that vary in magnitude, range and units. Therefore, in order for machine learning models to interpret these features on the same scale, we have to perform feature scaling.\n",
17 | "\n",
18 | "In science, we all know the importance of comparing apples to apples, and yet many people, especially beginners, tend to overlook feature scaling as part of the preprocessing steps for machine learning. This can cause models to make inaccurate predictions. \n",
19 | "\n",
20 | "In this tutorial, we will discuss why feature scaling is important, the difference between normalisation and standardisation as well as how feature scaling affects model accuracy. More specifically, we will explore the applications of 3 different types of scalers in the Scikit-learn library: \n",
21 | "\n",
22 | "1. [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)\n",
23 | "2. [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)\n",
24 | "3. [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html)"
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "# 1. Import libraries"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 1,
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "# Data wrangling\n",
41 | "import pandas as pd\n",
42 | "import numpy as np\n",
43 | "\n",
44 | "# Data visualisation\n",
45 | "import seaborn as sns\n",
46 | "import matplotlib.pyplot as plt\n",
47 | "\n",
48 | "# Machine learning\n",
49 | "from sklearn.model_selection import train_test_split\n",
50 | "from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler\n",
51 | "from sklearn.neighbors import KNeighborsRegressor\n",
52 | "from sklearn.svm import SVR\n",
53 | "from sklearn.tree import DecisionTreeRegressor\n",
54 | "from sklearn.pipeline import make_pipeline\n",
55 | "from sklearn.metrics import mean_squared_error"
56 | ]
57 | },
58 | {
59 | "cell_type": "markdown",
60 | "metadata": {},
61 | "source": [
62 | "# 2. Import dataset\n",
63 | "\n",
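64 | "For the purpose of this tutorial, we will use one of the toy datasets in the Scikit-learn library, the [Boston house prices dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html). Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in version 1.2, so this notebook needs an older release to run as written.\n",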
65 | "\n",
66 | "You can find the description of the features [here](https://scikit-learn.org/stable/datasets/toy_dataset.html#boston-dataset)."
67 | ]
68 | },
69 | {
70 | "cell_type": "code",
71 | "execution_count": 2,
72 | "metadata": {},
73 | "outputs": [
74 | {
75 | "data": {
480 | "text/plain": [
481 | " count mean std min 25% 50% \\\n",
482 | "CRIM 506.0 3.613524 8.601545 0.00632 0.082045 0.25651 \n",
483 | "ZN 506.0 11.363636 23.322453 0.00000 0.000000 0.00000 \n",
484 | "INDUS 506.0 11.136779 6.860353 0.46000 5.190000 9.69000 \n",
485 | "CHAS 506.0 0.069170 0.253994 0.00000 0.000000 0.00000 \n",
486 | "NOX 506.0 0.554695 0.115878 0.38500 0.449000 0.53800 \n",
487 | "RM 506.0 6.284634 0.702617 3.56100 5.885500 6.20850 \n",
488 | "AGE 506.0 68.574901 28.148861 2.90000 45.025000 77.50000 \n",
489 | "DIS 506.0 3.795043 2.105710 1.12960 2.100175 3.20745 \n",
490 | "RAD 506.0 9.549407 8.707259 1.00000 4.000000 5.00000 \n",
491 | "TAX 506.0 408.237154 168.537116 187.00000 279.000000 330.00000 \n",
492 | "PTRATIO 506.0 18.455534 2.164946 12.60000 17.400000 19.05000 \n",
493 | "B 506.0 356.674032 91.294864 0.32000 375.377500 391.44000 \n",
494 | "LSTAT 506.0 12.653063 7.141062 1.73000 6.950000 11.36000 \n",
495 | "target 506.0 22.532806 9.197104 5.00000 17.025000 21.20000 \n",
496 | "\n",
497 | " 75% max \n",
498 | "CRIM 3.677083 88.9762 \n",
499 | "ZN 12.500000 100.0000 \n",
500 | "INDUS 18.100000 27.7400 \n",
501 | "CHAS 0.000000 1.0000 \n",
502 | "NOX 0.624000 0.8710 \n",
503 | "RM 6.623500 8.7800 \n",
504 | "AGE 94.075000 100.0000 \n",
505 | "DIS 5.188425 12.1265 \n",
506 | "RAD 24.000000 24.0000 \n",
507 | "TAX 666.000000 711.0000 \n",
508 | "PTRATIO 20.200000 22.0000 \n",
509 | "B 396.225000 396.9000 \n",
510 | "LSTAT 16.955000 37.9700 \n",
511 | "target 25.000000 50.0000 "
512 | ]
513 | },
514 | "execution_count": 4,
515 | "metadata": {},
516 | "output_type": "execute_result"
517 | }
518 | ],
519 | "source": [
520 | "# Summary statistics\n",
521 | "\n",
522 | "data.describe().transpose()"
523 | ]
524 | },
525 | {
526 | "cell_type": "markdown",
527 | "metadata": {},
528 | "source": [
529 | "We can clearly observe that our features span very different ranges of values. This is largely attributable to the different units in which these features were measured and recorded.\n",
530 | "\n",
531 | "This is the issue that feature scaling helps us solve."
532 | ]
533 | },
534 | {
535 | "cell_type": "markdown",
536 | "metadata": {},
537 | "source": [
538 | "# 4. Understand the effects of different scalers\n",
539 | "\n",
540 | "In this section, we will learn the distinction between normalisation and standardisation. Subsequently, we will look at the effects of 3 different feature scaling techniques in Scikit-learn. "
541 | ]
542 | },
543 | {
544 | "cell_type": "markdown",
545 | "metadata": {},
546 | "source": [
547 | "# 4.1 Theory\n",
548 | "\n",
549 | "Before we examine the effects of feature scaling, let us first go over some of the theory behind normalisation and standardisation."
550 | ]
551 | },
552 | {
553 | "cell_type": "markdown",
554 | "metadata": {},
555 | "source": [
556 | "## 4.1.1 Normalisation\n",
557 | "\n",
558 | "Normalisation, also known as min-max scaling, is a scaling technique whereby the values in a column are shifted and rescaled so that they lie in the fixed range between 0 and 1.\n",
559 | "\n",
560 | "X_new = (X - X_min) / (X_max - X_min)\n",
561 | "\n",
562 | "[MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) is the Scikit-learn transformer for normalisation. "
563 | ]
564 | },
565 | {
566 | "cell_type": "markdown",
567 | "metadata": {},
568 | "source": [
569 | "## 4.1.2 Standardisation\n",
570 | "\n",
571 | "On the other hand, standardisation or Z-score normalisation is another scaling technique whereby the values in a column are rescaled so that they have mean = 0 and variance = 1, the same mean and variance as a standard Gaussian distribution. Note that standardisation does not change the shape of the distribution, so skewed data remain skewed. \n",
572 | "\n",
573 | "X_new = (X - mean) / std\n",
574 | "\n",
575 | "[StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) is the Scikit-learn transformer for standardisation.\n",
576 | "\n",
577 | "Unlike StandardScaler, [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html) scales features using statistics that are robust to outliers. More specifically, this scaler subtracts the median and divides by a quantile range (by default, the interquartile range), which makes it less susceptible to outliers. "
578 | ]
579 | },
580 | {
581 | "cell_type": "markdown",
582 | "metadata": {},
583 | "source": [
584 | "## 4.1.3 Normalisation vs standardisation \n",
585 | "\n",
586 | "The choice between normalisation and standardisation comes down to the application.\n",
587 | "\n",
588 | "Standardisation is often preferred over normalisation in machine learning contexts because it puts features on a comparable footing when similarity is measured with a distance metric. It is also important in Principal Component Analysis (PCA), where we are interested in the components that maximise the variance and unscaled features with large variances would otherwise dominate.\n",
589 | "\n",
590 | "Normalisation, on the other hand, also has many practical applications, particularly in computer vision and image processing, where pixel intensities have to be normalised to fit within the RGB colour range between 0 and 255. Furthermore, neural networks typically train better when the input data are scaled to a small range such as 0-1. \n",
591 | "\n",
592 | "At the end of the day, there is no definitive answer as to whether you should normalise or standardise your data. One can always apply both techniques and compare the model performance for the best results. "
593 | ]
594 | },
595 | {
596 | "cell_type": "markdown",
597 | "metadata": {},
598 | "source": [
599 | "# 4.2 Application \n",
600 | "\n",
601 | "Now that we have a theoretical understanding of feature scaling, let's see how the scalers work in practice. "
602 | ]
603 | },
604 | {
605 | "cell_type": "code",
606 | "execution_count": 5,
607 | "metadata": {},
608 | "outputs": [
609 | {
610 | "name": "stdout",
611 | "output_type": "stream",
612 | "text": [
613 | "X shape: (506, 13)\n",
614 | "Y shape: (506,)\n"
615 | ]
616 | }
617 | ],
618 | "source": [
619 | "# Get predictor and target variables\n",
620 | "X = data.drop('target', axis = 1)\n",
621 | "Y = data['target']\n",
622 | "\n",
623 | "# X, Y shape\n",
624 | "print(\"X shape: \", X.shape)\n",
625 | "print(\"Y shape: \", Y.shape)"
626 | ]
627 | },
628 | {
629 | "cell_type": "code",
630 | "execution_count": 6,
631 | "metadata": {},
632 | "outputs": [],
633 | "source": [
634 | "# Instantiate MinMaxScaler, StandardScaler and RobustScaler\n",
635 | "\n",
636 | "norm = MinMaxScaler()\n",
637 | "standard = StandardScaler()\n",
638 | "robust = RobustScaler()"
639 | ]
640 | },
641 | {
642 | "cell_type": "code",
643 | "execution_count": 7,
644 | "metadata": {},
645 | "outputs": [],
646 | "source": [
647 | "# MinMaxScaler\n",
648 | "normalised_features = norm.fit_transform(X)\n",
649 | "normalised_df = pd.DataFrame(normalised_features, index = X.index, columns = X.columns)\n",
650 | "\n",
651 | "# StandardScaler\n",
652 | "standardised_features = standard.fit_transform(X)\n",
653 | "standardised_df = pd.DataFrame(standardised_features, index = X.index, columns = X.columns)\n",
654 | "\n",
655 | "# RobustScaler\n",
656 | "robust_features = robust.fit_transform(X)\n",
657 | "robust_df = pd.DataFrame(robust_features, index = X.index, columns = X.columns)"
658 | ]
659 | },
660 | {
661 | "cell_type": "markdown",
662 | "metadata": {},
663 | "source": [
664 | "To demonstrate the effects of different scalers, I have chosen to examine the following features in our dataset before and after implementing feature scaling: \n",
665 | "\n",
666 | "- ZN\n",
667 | "- AGE\n",
668 | "- TAX\n",
669 | "- B "
670 | ]
671 | },
672 | {
673 | "cell_type": "code",
674 | "execution_count": 8,
675 | "metadata": {
676 | "scrolled": false
677 | },
678 | "outputs": [
679 | {
680 | "data": {
681 | "text/plain": [
682 | "Text(0, 0.5, '')"
683 | ]
684 | },
685 | "execution_count": 8,
686 | "metadata": {},
687 | "output_type": "execute_result"
688 | },
689 | {
690 | "data": {
692 | "text/plain": [
693 | "[Figure: 2 x 2 grid of box plots of ZN, AGE, TAX and B; panels: Original, MinMaxScaler, StandardScaler, RobustScaler]"
694 | ]
695 | },
696 | "metadata": {
697 | "needs_background": "light"
698 | },
699 | "output_type": "display_data"
700 | }
701 | ],
702 | "source": [
703 | "# Create subplots\n",
704 | "fig, ax = plt.subplots(2, 2, figsize = (12, 9))\n",
705 | "\n",
706 | "# Original\n",
707 | "sns.boxplot(x = 'variable', y = 'value', data = pd.melt(data[['ZN', 'AGE', 'TAX', 'B']]), ax = ax[0, 0])\n",
708 | "ax[0, 0].set_title('Original')\n",
709 | "ax[0, 0].set_xlabel('')\n",
710 | "ax[0, 0].set_ylabel('')\n",
711 | "\n",
712 | "# MinMaxScaler\n",
713 | "sns.boxplot(x = 'variable', y = 'value', data = pd.melt(normalised_df[['ZN', 'AGE', 'TAX', 'B']]), ax = ax[0, 1])\n",
714 | "ax[0, 1].set_title('MinMaxScaler')\n",
715 | "ax[0, 1].set_xlabel('')\n",
716 | "ax[0, 1].set_ylabel('')\n",
717 | "\n",
718 | "# StandardScaler\n",
719 | "sns.boxplot(x = 'variable', y = 'value', data = pd.melt(standardised_df[['ZN', 'AGE', 'TAX', 'B']]), ax = ax[1, 0])\n",
720 | "ax[1, 0].set_title('StandardScaler')\n",
721 | "ax[1, 0].set_xlabel('')\n",
722 | "ax[1, 0].set_ylabel('')\n",
723 | "\n",
724 | "# RobustScaler\n",
725 | "sns.boxplot(x = 'variable', y = 'value', data = pd.melt(robust_df[['ZN', 'AGE', 'TAX', 'B']]), ax = ax[1, 1])\n",
726 | "ax[1, 1].set_title('RobustScaler')\n",
727 | "ax[1, 1].set_xlabel('')\n",
728 | "ax[1, 1].set_ylabel('')"
729 | ]
730 | },
731 | {
732 | "cell_type": "markdown",
733 | "metadata": {},
734 | "source": [
735 | "As we can see, our original features have wildly different ranges.\n",
736 | "\n",
737 | "MinMaxScaler has rescaled our features so that their values are bounded between 0 and 1.\n",
738 | "\n",
739 | "StandardScaler and RobustScaler, on the other hand, have centred our features around 0: StandardScaler gives each feature a mean of 0, while RobustScaler gives each feature a median of 0. "
740 | ]
741 | },
742 | {
743 | "cell_type": "markdown",
744 | "metadata": {},
745 | "source": [
746 | "# 5. Compare model accuracy"
747 | ]
748 | },
749 | {
750 | "cell_type": "markdown",
751 | "metadata": {},
752 | "source": [
753 | "I mentioned in the introduction of this tutorial that unscaled data can adversely impact a model's ability to make accurate predictions but so far, we have not discussed exactly how and why it does. In fact, feature scaling does not always improve a model's performance. Some models do not require feature scaling. \n",
754 | "\n",
755 | "In this section, we will explore the following classes of machine learning algorithms and discuss whether or not feature scaling impacts their performance:\n",
756 | "\n",
757 | "1. Gradient descent based algorithms\n",
758 | "2. Distance-based algorithms\n",
759 | "3. Tree-based algorithms "
760 | ]
761 | },
762 | {
763 | "cell_type": "markdown",
764 | "metadata": {},
765 | "source": [
766 | "# 5.1 Theory\n",
767 | "\n",
768 | "Let's first go over some concepts behind those algorithms and think about how and why feature scaling might be important to each of them."
769 | ]
770 | },
771 | {
772 | "cell_type": "markdown",
773 | "metadata": {},
774 | "source": [
775 | "## 5.1.1 Gradient descent based algorithms\n",
776 | "\n",
777 | "Gradient descent is an iterative optimisation algorithm that takes us towards a minimum of a function. Machine learning algorithms like linear regression and logistic regression can rely on gradient descent to minimise their loss functions or in other words, to reduce the error between the predicted values and the actual values. \n",
778 | "\n",
779 | "Features with very different ranges of values lead to very different step sizes for the parameters associated with each feature. Therefore, to ensure that gradient descent converges more smoothly and quickly, we need to scale our features so that they have a similar scale."
780 | ]
781 | },
782 | {
783 | "cell_type": "markdown",
784 | "metadata": {},
785 | "source": [
786 | "## 5.1.2 Distance-based algorithms\n",
787 | "\n",
788 | "The way distance-based models work makes them the most susceptible to unscaled data. \n",
789 | "\n",
790 | "Algorithms like k-nearest neighbours, support vector machines and k-means clustering use the distance between data points to determine their similarity. Hence, features with a greater magnitude will be given a greater weight by the model. This is not an ideal scenario as we do not want our algorithm to be heavily biased towards a single feature.\n",
791 | "\n",
792 | "It is therefore important that we apply feature scaling to our data before fitting distance-based algorithms, to ensure that all features contribute equally to the result. "
793 | ]
794 | },
795 | {
796 | "cell_type": "markdown",
797 | "metadata": {},
798 | "source": [
799 | "## 5.1.3 Tree-based algorithms \n",
800 | "\n",
801 | "Each internal node in a classification and regression tree (CART) model, otherwise known as a decision tree, splits the data on a single feature. The tree chooses each split so that it increases the homogeneity of the resulting nodes. Because a split depends only on the ordering of that feature's values, it is not affected by the scale of the feature or by the other features in the dataset. \n",
802 | "\n",
803 | "For that reason, we can conclude that decision trees are invariant to the scale of the features and therefore do not require feature scaling. The same holds for tree-based ensemble models such as random forest and gradient boosting. "
804 | ]
805 | },
806 | {
807 | "cell_type": "markdown",
808 | "metadata": {},
809 | "source": [
810 | "# 5.2 Proof of concept\n",
811 | "\n",
812 | "Now that we understand which types of models are sensitive and insensitive to feature scaling, let us convince ourselves with a concrete example using the Boston house prices dataset. \n",
813 | "\n",
814 | "Here, I have chosen 2 distance-based algorithms (KNN and SVR) as well as 1 tree-based algorithm (decision tree regressor) to predict the house prices.\n",
815 | "\n",
816 | "We should expect to see an improved model performance with feature scaling under KNN and SVR and a constant model performance under decision trees with and without feature scaling.\n",
817 | "\n",
818 | "Feel free to experiment with other types of models like linear regression, random forest and gradient boosting!"
819 | ]
820 | },
821 | {
822 | "cell_type": "code",
823 | "execution_count": 9,
824 | "metadata": {},
825 | "outputs": [],
826 | "source": [
827 | "# Instantiate models \n",
828 | "knn = KNeighborsRegressor()\n",
829 | "svr = SVR()\n",
830 | "tree = DecisionTreeRegressor(max_depth = 10, random_state = 42)\n",
831 | "\n",
832 | "# Create a list which contains different scalers \n",
833 | "scalers = [norm, standard, robust]"
834 | ]
835 | },
836 | {
837 | "cell_type": "code",
838 | "execution_count": 10,
839 | "metadata": {},
840 | "outputs": [
841 | {
842 | "name": "stdout",
843 | "output_type": "stream",
844 | "text": [
845 | "X_train shape: (354, 13)\n",
846 | "Y_train shape: (354,)\n",
847 | "X_test shape: (152, 13)\n",
848 | "Y_test shape: (152,)\n"
849 | ]
850 | }
851 | ],
852 | "source": [
853 | "# Train test split\n",
854 | "\n",
855 | "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 42)\n",
856 | "\n",
857 | "print(\"X_train shape: \", X_train.shape)\n",
858 | "print(\"Y_train shape: \", Y_train.shape)\n",
859 | "print(\"X_test shape: \", X_test.shape)\n",
860 | "print(\"Y_test shape: \", Y_test.shape)"
861 | ]
862 | },
863 | {
864 | "cell_type": "markdown",
865 | "metadata": {},
866 | "source": [
867 | "Before we get started, I think it is important to highlight the good practice of first fitting the scalers to the training set only and then using the fitted scalers to transform the data in the test set. This prevents data leakage and misleading accuracy scores.\n",
868 | "\n",
869 | "Here, I will construct a pipeline which contains a scaler and a model to fit and transform the features and subsequently make predictions using each model. The accuracy of these predictions is then evaluated using root mean squared error. The smaller the error, the better the model performance. "
870 | ]
871 | },
872 | {
873 | "cell_type": "markdown",
874 | "metadata": {},
875 | "source": [
876 | "## 5.2.1 KNN"
877 | ]
878 | },
879 | {
880 | "cell_type": "code",
881 | "execution_count": 11,
882 | "metadata": {},
883 | "outputs": [
884 | {
885 | "data": {
693 | "text/plain": [
694 | " SibSp Fare Age\n",
695 | "0 1 7.2500 22.0\n",
696 | "1 1 71.2833 38.0\n",
697 | "2 0 7.9250 26.0\n",
698 | "3 1 53.1000 35.0\n",
699 | "4 0 8.0500 35.0\n",
700 | "5 0 8.4583 NaN"
701 | ]
702 | },
703 | "execution_count": 11,
704 | "metadata": {},
705 | "output_type": "execute_result"
706 | }
707 | ],
708 | "source": [
709 | "# Create sample dataframe\n",
710 | "\n",
711 | "df = pd.DataFrame({'SibSp': [1, 1, 0, 1, 0, 0], \n",
712 | " 'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583], \n",
713 | " 'Age': [22, 38, 26, 35, 35, np.nan]})\n",
714 | "df"
715 | ]
716 | },
717 | {
718 | "cell_type": "markdown",
719 | "metadata": {},
720 | "source": [
721 | "# 4.1 Simple imputer\n",
722 | "\n",
723 | "Simple imputer follows a univariate approach to imputing missing values, i.e. it only takes a single feature into consideration. Its most common strategies fill the missing values with the column's:\n",
724 | "\n",
725 | "- Mean\n",
726 | "- Median\n",
727 | "- Most frequent (mode)"
728 | ]
729 | },
730 | {
731 | "cell_type": "code",
732 | "execution_count": 12,
733 | "metadata": {},
734 | "outputs": [
735 | {
736 | "name": "stdout",
737 | "output_type": "stream",
738 | "text": [
739 | "Average age: 31.2\n"
740 | ]
741 | },
742 | {
743 | "data": {
744 | "text/plain": [
745 | "array([[ 1. , 7.25 , 22. ],\n",
746 | " [ 1. , 71.2833, 38. ],\n",
747 | " [ 0. , 7.925 , 26. ],\n",
748 | " [ 1. , 53.1 , 35. ],\n",
749 | " [ 0. , 8.05 , 35. ],\n",
750 | " [ 0. , 8.4583, 31.2 ]])"
751 | ]
752 | },
753 | "execution_count": 12,
754 | "metadata": {},
755 | "output_type": "execute_result"
756 | }
757 | ],
758 | "source": [
759 | "print(\"Average age: \", df['Age'].mean())\n",
760 | "simple_imp = SimpleImputer(missing_values = np.nan, strategy = 'mean')\n",
761 | "simple_imp.fit_transform(df)"
762 | ]
763 | },
764 | {
765 | "cell_type": "markdown",
766 | "metadata": {},
767 | "source": [
768 | "As we can see, simple imputer has filled the missing value in the Age column with the average age which is 31.2."
769 | ]
770 | },
771 | {
772 | "cell_type": "markdown",
773 | "metadata": {},
774 | "source": [
775 | "# 4.2 Iterative imputer\n",
776 | "\n",
777 | "Iterative imputer is an example of a multivariate approach to imputation. It models the missing values in a column by using information from the other columns in a dataset. More specifically, it treats the column with missing values as a target variable while the remaining columns are used as predictor variables to predict the target variable. \n",
778 | "\n",
779 | "In our sample data frame, the Age column has one missing value in row 6 and is therefore assigned as the target variable in this scenario. This leaves the SibSp and Fare columns as our predictor variables. \n",
780 | "\n",
781 | "Iterative imputer will use the first 5 rows of the data frame to train a predictive model. Once the model is ready, it will then use the values in the SibSp and Fare columns of row 6 as inputs and predict the Age value for that row."
782 | ]
783 | },
784 | {
785 | "cell_type": "code",
786 | "execution_count": 13,
787 | "metadata": {
788 | "scrolled": true
789 | },
790 | "outputs": [
791 | {
792 | "data": {
793 | "text/plain": [
794 | "array([[ 1. , 7.25 , 22. ],\n",
795 | " [ 1. , 71.2833 , 38. ],\n",
796 | " [ 0. , 7.925 , 26. ],\n",
797 | " [ 1. , 53.1 , 35. ],\n",
798 | " [ 0. , 8.05 , 35. ],\n",
799 | " [ 0. , 8.4583 , 28.50639495]])"
800 | ]
801 | },
802 | "execution_count": 13,
803 | "metadata": {},
804 | "output_type": "execute_result"
805 | }
806 | ],
807 | "source": [
808 | "iterative_imp = IterativeImputer()\n",
809 | "iterative_imp.fit_transform(df)"
810 | ]
811 | },
812 | {
813 | "cell_type": "markdown",
814 | "metadata": {},
815 | "source": [
816 | "As we can see, the value predicted under iterative imputer is different to that under simple imputer.\n",
817 | "\n",
818 | "This approach is often more accurate for predicting the missing Age value because it takes the other features in our dataframe into account. "
819 | ]
820 | },
821 | {
822 | "cell_type": "markdown",
823 | "metadata": {},
824 | "source": [
825 | "# 4.3 KNN imputer\n",
826 | "\n",
827 | "Last but not least, we have KNN Imputer, which is another multivariate imputation technique. KNN Imputer scans our dataframe for the k nearest observations to the row with a missing value. It then fills the missing value with the average of those nearest observations. \n",
828 | "\n",
829 | "Here, I have set k equal to 2 or in other words, I want KNN imputer to look for the 2 observations that are nearest to row 6 and fill the missing age with the average age of those 2 rows."
830 | ]
831 | },
832 | {
833 | "cell_type": "code",
834 | "execution_count": 14,
835 | "metadata": {},
836 | "outputs": [
837 | {
838 | "data": {
839 | "text/plain": [
840 | "array([[ 1. , 7.25 , 22. ],\n",
841 | " [ 1. , 71.2833, 38. ],\n",
842 | " [ 0. , 7.925 , 26. ],\n",
843 | " [ 1. , 53.1 , 35. ],\n",
844 | " [ 0. , 8.05 , 35. ],\n",
845 | " [ 0. , 8.4583, 30.5 ]])"
846 | ]
847 | },
848 | "execution_count": 14,
849 | "metadata": {},
850 | "output_type": "execute_result"
851 | }
852 | ],
853 | "source": [
854 | "knn_imp = KNNImputer(n_neighbors = 2)\n",
855 | "knn_imp.fit_transform(df)"
856 | ]
857 | },
858 | {
859 | "cell_type": "markdown",
860 | "metadata": {},
861 | "source": [
862 | "As a result, KNN imputer has taken row 3 and row 5 as the nearest observations for row 6.\n",
863 | "\n",
864 | "The imputed value is therefore the average age of row 3 and row 5, which is (26 + 35) / 2 = 30.5. "
865 | ]
866 | },
867 | {
868 | "cell_type": "markdown",
869 | "metadata": {},
870 | "source": [
871 | "# 5. Model accuracy under simple imputer and iterative imputer\n",
872 | "\n",
873 | "Now that we have a better understanding of how the different imputers work, we can move on to apply these techniques to our Titanic dataset and compare the model accuracy under each approach.\n",
874 | "\n",
875 | "We might expect our model to perform better under multivariate imputation than under univariate imputation, as multivariate imputation can provide a more accurate estimate of the missing values, thus allowing our model to make better predictions. \n",
876 | "\n",
877 | "In this section, we will build a column transformer which consists of a OneHotEncoder for encoding the Sex and Embarked columns as well as an imputer to impute the missing values in the Age column.\n",
878 | "\n",
879 | "Following that, we will chain the column transformer with a random forest classifier to predict the survival of the passengers on the Titanic. Finally, we will perform 10-fold cross-validation to compare the prediction results under univariate imputation versus under multivariate imputation. "
880 | ]
881 | },
882 | {
883 | "cell_type": "code",
884 | "execution_count": 15,
885 | "metadata": {},
886 | "outputs": [
887 | {
888 | "data": {
889 | "text/html": [
890 | "