├── .gitignore ├── Automated Loan Repayment.ipynb ├── LICENSE ├── README.md └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | 49 | # Translations 50 | *.mo 51 | *.pot 52 | 53 | # Django stuff: 54 | *.log 55 | local_settings.py 56 | 57 | # Flask stuff: 58 | instance/ 59 | .webassets-cache 60 | 61 | # Scrapy stuff: 62 | .scrapy 63 | 64 | # Sphinx documentation 65 | docs/_build/ 66 | 67 | # PyBuilder 68 | target/ 69 | 70 | # Jupyter Notebook 71 | .ipynb_checkpoints 72 | 73 | # pyenv 74 | .python-version 75 | 76 | # celery beat schedule file 77 | celerybeat-schedule 78 | 79 | # SageMath parsed files 80 | *.sage.py 81 | 82 | # dotenv 83 | .env 84 | 85 | # virtualenv 86 | .venv 87 | venv/ 88 | ENV/ 89 | 90 | # Spyder project settings 91 | .spyderproject 92 | .spyproject 93 | 94 | # Rope project settings 95 | .ropeproject 96 | 97 | # mkdocs documentation 98 | /site 99 | 100 | # mypy 101 | .mypy_cache/ 102 | -------------------------------------------------------------------------------- /Automated Loan Repayment.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "![](../images/featuretools.png)\n", 8 | "\n", 9 | "# Predicting Loan Repayment with Automated Feature Engineering in Featuretools\n", 10 | "\n", 11 | "Feature engineering is the process of creating new features (also called predictors or explanatory variables) out of an existing dataset. Traditionally, this process is done by hand using domain knowledge to build new features one at a time. Feature engineering is crucial for a data science problem and a manual approach is time-consuming, tedious, error-prone, and must be re-done for each problem. Automated feature engineering aims to aid the data scientist in this critical process by automatically creating hundreds or thousands of new features from a set of related tables in a fraction of the time as the manual approach. In this notebook, we will apply automated feature engineering to the Home Credit Default Risk loan dataset using [Featuretools, an open-source Python library](https://www.featuretools.com/) for automated feature engineering. \n", 12 | "\n", 13 | "This problem is a machine learning competition on Kaggle where the objective is to predict if an applicant will default on a loan given comprehensive data on past loans and applicants. 
The data is spread across seven different tables, making this an ideal problem for automated feature engineering: all of the data must be gathered into a single dataframe for training (and one for testing) with the aim of capturing as much usable information for the prediction problem as possible. As we will see, featuretools can efficiently carry out the tedious process of using all of these tables to make new features with only a few lines of code. Moreover, this code is generally applicable to any data science problem! \n", 14 | "\n", 15 | "The general idea of automated feature engineering is pictured below:\n", 16 | "\n", 17 | "![](../../images/AutomatedFeatureEngineering.png)\n", 18 | "\n", 19 | "\n", 20 | "## Approach \n", 21 | "\n", 22 | "In this notebook, we will implement an automated feature engineering approach to the loan repayment problem. While Featuretools allows plenty of options for customizing the library to improve accuracy, we'll focus on a fairly high-level implementation.\n", 23 | "\n", 24 | "Our approach will be as follows, with the background covered as we go:\n", 25 | "\n", 26 | "1. Read in the set of related data tables\n", 27 | "2. Create a featuretools `EntitySet` and add `entities` to it \n", 28 | " * Identify correct variable types as required\n", 29 | " * Identify indices in data\n", 30 | "3. Add relationships between `entities`\n", 31 | "4. Select feature primitives to use to create new features\n", 32 | " * Use basic set of primitives\n", 33 | " * Examine features that will be created\n", 34 | "5. Run Deep Feature Synthesis to generate thousands of new features\n", 35 | "\n", 36 | "\n", 37 | "## Problem and Dataset\n", 38 | "\n", 39 | "The [Home Credit Default Risk competition](https://www.kaggle.com/c/home-credit-default-risk) currently running on Kaggle is a supervised classification task where the objective is to predict whether or not an applicant for a loan (known as a client) will default on the loan. The data comprises socio-economic indicators for the clients, loan-specific financial information, and comprehensive data on previous loans at Home Credit (the institution sponsoring the competition) and other credit agencies. The metric for this competition is Receiver Operating Characteristic Area Under the Curve (ROC AUC), with predictions made in terms of the probability of default. We can evaluate our submissions either through cross-validation on the training data (for which we have the labels) or by submitting our test predictions to Kaggle to see where we place on the public leaderboard (which is calculated with only 10% of the testing data). \n", 40 | "\n", 41 | "The Home Credit Default Risk dataset ([available for download here](https://www.kaggle.com/c/home-credit-default-risk/data)) consists of seven related tables of data:\n", 42 | "\n", 43 | "* application_train/application_test: the main training/testing data for each client at Home Credit. The information includes both socioeconomic indicators for the client and loan-specific characteristics. Each loan has its own row and is uniquely identified by the feature `SK_ID_CURR`. The training application data comes with the `TARGET` indicating 0: the loan was repaid, or 1: the loan was not repaid. \n", 44 | "* bureau: data concerning clients' previous credits from other financial institutions (not Home Credit). Each previous credit has its own row in bureau, but one client in the application data can have multiple previous credits. 
The previous credits are uniquely identified by the feature `SK_ID_BUREAU`.\n", 45 | "* bureau_balance: monthly balance data about the credits in bureau. Each row has information for one month about a previous credit, and a single previous credit can have multiple rows. This is linked back to the bureau loan data by `SK_ID_BUREAU` (not unique in this dataframe).\n", 46 | "* previous_application: previous applications for loans at Home Credit of clients who have loans in the application data. Each client in the application data can have multiple previous loans. Each previous application has one row in this dataframe and is uniquely identified by the feature `SK_ID_PREV`. \n", 47 | "* POS_CASH_BALANCE: monthly data about previous point of sale or cash loans from the previous loan data. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows. This is linked back to the previous loan data by `SK_ID_PREV` (not unique in this dataframe).\n", 48 | "* credit_card_balance: monthly data about previous credit card loans from the previous loan data. Each row is one month of a credit card balance, and a single credit card can have many rows. This is linked back to the previous loan data by `SK_ID_PREV` (not unique in this dataframe).\n", 49 | "* installments_payments: payment history for previous loans at Home Credit. There is one row for every payment made and one row for every missed payment. This is linked back to the previous loan data by `SK_ID_PREV` (not unique in this dataframe).\n", 50 | "\n", 51 | "The image below shows the seven tables and the variables linking them:\n", 52 | "\n", 53 | "![](../images/kaggle_home_credit/home_credit_data.png)\n", 54 | "\n", 55 | "The variables that tie the tables together will be important to understand when it comes to adding `relationships` between entities. __The only domain knowledge we need for a full Featuretools approach to the problem is the indexes of the tables and the relationships between the tables.__" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 1, 61 | "metadata": {}, 62 | "outputs": [], 63 | "source": [ 64 | "# pandas and numpy for data manipulation\n", 65 | "import pandas as pd\n", 66 | "import numpy as np\n", 67 | "\n", 68 | "# featuretools for automated feature engineering\n", 69 | "import featuretools as ft" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "### Read in Data\n", 77 | "\n", 78 | "First we can read in the seven data tables. We also replace the anomalous values previously identified (we did the same process with manual feature engineering). 
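For context, the anomalous value being replaced is 365243, which this dataset uses as a placeholder for missing day-count fields (most visibly `DAYS_EMPLOYED` in the application data); 365243 days is roughly 1,000 years, so it cannot be a real value. A quick sanity check along these lines (a sketch, assuming the same `input/` paths used in the next cell) shows why it is treated as missing:

```python
import pandas as pd

# Peek at the raw application data before any cleaning
app_train_raw = pd.read_csv('input/application_train.csv')

# Count how many rows use the 365243 placeholder and inspect the distribution
print((app_train_raw['DAYS_EMPLOYED'] == 365243).sum(), 'rows use the placeholder')
print(app_train_raw['DAYS_EMPLOYED'].describe())
```

Converting the placeholder to `np.nan`, as done below, lets the aggregations we build later ignore it instead of skewing statistics such as the mean or max.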
" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 2, 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "# Read in the datasets and replace the anomalous values\n", 88 | "app_train = pd.read_csv('input/application_train.csv').replace({365243: np.nan})\n", 89 | "app_test = pd.read_csv('input/application_test.csv').replace({365243: np.nan})\n", 90 | "bureau = pd.read_csv('input/bureau.csv').replace({365243: np.nan})\n", 91 | "bureau_balance = pd.read_csv('input/bureau_balance.csv').replace({365243: np.nan})\n", 92 | "cash = pd.read_csv('input/POS_CASH_balance.csv').replace({365243: np.nan})\n", 93 | "credit = pd.read_csv('input/credit_card_balance.csv').replace({365243: np.nan})\n", 94 | "previous = pd.read_csv('input/previous_application.csv').replace({365243: np.nan})\n", 95 | "installments = pd.read_csv('input/installments_payments.csv').replace({365243: np.nan})" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "We will join together the training and testing datasets to make sure we build the same features for each set. Later, after the feature matrix is built, we can separate out the two sets. " 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 3, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "app_test['TARGET'] = np.nan\n", 112 | "\n", 113 | "# Join together training and testing\n", 114 | "app = app_train.append(app_test, ignore_index = True, sort = True)" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "Several of the indexes are an incorrect data type (floats) so we need to make these all the same (integers) for adding relationships. " 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 4, 127 | "metadata": {}, 128 | "outputs": [], 129 | "source": [ 130 | "for index in ['SK_ID_CURR', 'SK_ID_PREV', 'SK_ID_BUREAU']:\n", 131 | " for dataset in [app, bureau, bureau_balance, cash, credit, previous, installments]:\n", 132 | " if index in list(dataset.columns):\n", 133 | " dataset[index] = dataset[index].fillna(0).astype(np.int64)" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "# Featuretools Basics\n", 141 | "\n", 142 | "[Featuretools](https://docs.featuretools.com/#minute-quick-start) is an open-source Python library for automatically creating features out of a set of related tables using a technique called [Deep Feature Synthesis](http://www.jmaxkanter.com/static/papers/DSAA_DSM_2015.pdf). Automated feature engineering, like many topics in machine learning, is a complex subject built upon a foundation of simpler ideas. 
By going through these ideas one at a time, we can build up our understanding of Featuretools, which will later allow us to get the most out of it.\n", 143 | "\n", 144 | "There are a few concepts that we will cover along the way:\n", 145 | "\n", 146 | "* [Entities and EntitySets](https://docs.featuretools.com/en/stable/loading_data/using_entitysets.html): our tables and a data structure for keeping track of them all\n", 147 | "* [Relationships between tables](https://docs.featuretools.com/en/stable/loading_data/using_entitysets.html#adding-a-relationship): how the tables can be related to one another\n", 148 | "* [Feature primitives](https://docs.featuretools.com/en/stable/automated_feature_engineering/primitives.html): aggregations and transformations that are stacked to build features\n", 149 | "* [Deep feature synthesis](https://docs.featuretools.com/en/stable/automated_feature_engineering/afe.html): the method that uses feature primitives to generate thousands of new features\n", 150 | "\n", 151 | "# Entities and Entitysets\n", 152 | "\n", 153 | "An entity is simply a table or, in Pandas, a `dataframe`. The observations must be in the rows and the features in the columns. An entity in featuretools must have a unique index where none of the elements are duplicated. Currently, only `app`, `bureau`, and `previous` have unique indices (`SK_ID_CURR`, `SK_ID_BUREAU`, and `SK_ID_PREV` respectively). For the other dataframes, when we create entities from them, we must pass in `make_index = True` and then specify the name of the index. \n", 154 | "\n", 155 | "Entities can also have time indices that represent when the information in the row became known. (There are no datetimes in any of the data, but there are relative times, given in months or days, that could be treated as time variables, although we will not use them as such in this notebook.)\n", 156 | "\n", 157 | "An [EntitySet](https://docs.featuretools.com/en/stable/loading_data/using_entitysets.html) is a collection of tables and the relationships between them. This can be thought of as a data structure with its own methods and attributes. Using an EntitySet allows us to group together multiple tables and will make creating the features much simpler than keeping track of individual tables and relationships. __EntitySets and entities are abstractions that can be applied to any dataset because they do not depend on the underlying data.__\n", 158 | "\n", 159 | "First we'll make an empty entityset named clients to keep track of all the data." 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": 5, 165 | "metadata": {}, 166 | "outputs": [], 167 | "source": [ 168 | "# Entity set with the id 'clients'\n", 169 | "es = ft.EntitySet(id = 'clients')" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "### Variable Types\n", 177 | "\n", 178 | "Featuretools will automatically infer the variable types. However, there may be some cases where we need to explicitly tell featuretools the variable type, such as when a boolean variable is represented as an integer. Variable types in featuretools can be specified as a dictionary. \n", 179 | "\n", 180 | "We will first work with the `app` data to specify the proper variable types. To identify the `Boolean` variables that are recorded as numbers (1.0 or 0.0), we can iterate through the data and find any columns where there are only 2 unique values and the data type is numeric. 
We can also use the column definitions to find any other data types that should be identified, such as `Ordinal` variables. Identifying the correct variable types is important because Featuretools applies different operations to different data types (just as we do in manual feature engineering)." 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 6, 186 | "metadata": {}, 187 | "outputs": [], 188 | "source": [ 189 | "import featuretools.variable_types as vtypes" 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": 7, 195 | "metadata": {}, 196 | "outputs": [ 197 | { 198 | "name": "stdout", 199 | "output_type": "stream", 200 | "text": [ 201 | "There are 32 Boolean variables in the application data.\n" 202 | ] 203 | } 204 | ], 205 | "source": [ 206 | "app_types = {}\n", 207 | "\n", 208 | "# Handle the Boolean variables:\n", 209 | "for col in app:\n", 210 | " if (app[col].nunique() == 2) and (app[col].dtype == float):\n", 211 | " app_types[col] = vtypes.Boolean\n", 212 | "\n", 213 | "# Remove the `TARGET`\n", 214 | "del app_types['TARGET']\n", 215 | "\n", 216 | "print('There are {} Boolean variables in the application data.'.format(len(app_types)))" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": 8, 222 | "metadata": {}, 223 | "outputs": [], 224 | "source": [ 225 | "# Ordinal variables\n", 226 | "app_types['REGION_RATING_CLIENT'] = vtypes.Ordinal\n", 227 | "app_types['REGION_RATING_CLIENT_W_CITY'] = vtypes.Ordinal\n", 228 | "app_types['HOUR_APPR_PROCESS_START'] = vtypes.Ordinal" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "The `previous` table is the only other `entity` that has features which should be recorded as Boolean. Correctly identifying the column type will prevent featuretools from making irrelevant features such as the mean or max of a `Boolean`. " 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 9, 241 | "metadata": {}, 242 | "outputs": [ 243 | { 244 | "name": "stdout", 245 | "output_type": "stream", 246 | "text": [ 247 | "There are 2 Boolean variables in the previous data.\n" 248 | ] 249 | } 250 | ], 251 | "source": [ 252 | "previous_types = {}\n", 253 | "\n", 254 | "# Handle the Boolean variables:\n", 255 | "for col in previous:\n", 256 | " if (previous[col].nunique() == 2) and (previous[col].dtype == float):\n", 257 | " previous_types[col] = vtypes.Boolean\n", 258 | "\n", 259 | "print('There are {} Boolean variables in the previous data.'.format(len(previous_types)))" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": {}, 265 | "source": [ 266 | "In addition to identifying Boolean variables, we want to make sure featuretools does not create nonsense features such as statistical aggregations (mean, max, etc.) of ids. The `credit`, `cash`, and `installments` data all have the `SK_ID_CURR` variable. However, we do not actually need this variable in these dataframes because we link them to `app` through the `previous` dataframe with the `SK_ID_PREV` variable. \n", 267 | "\n", 268 | "We don't want to make features from `SK_ID_CURR` since it is an arbitrary id and should have no predictive power. \n", 269 | "Our options for handling these variables are either to tell featuretools to ignore them or to drop them before including the dataframes in the entityset. We will take the latter approach."
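For reference, the first option would be handled later, at feature-synthesis time, rather than by modifying the dataframes. A minimal sketch of that route (assuming the `ignore_variables` argument of `ft.dfs` in the 0.x API used throughout this notebook) looks like this:

```python
# Alternative to dropping the id columns: tell Deep Feature Synthesis to skip them.
# This assumes the entityset was built with the SK_ID_CURR columns still present.
feature_names = ft.dfs(entityset = es, target_entity = 'app',
                       ignore_variables = {'cash': ['SK_ID_CURR'],
                                           'credit': ['SK_ID_CURR'],
                                           'installments': ['SK_ID_CURR']},
                       features_only = True)
```

Dropping the columns up front keeps the entityset smaller and makes the intent explicit, which is why we take that route here.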
270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": 10, 275 | "metadata": {}, 276 | "outputs": [], 277 | "source": [ 278 | "installments = installments.drop(columns = ['SK_ID_CURR'])\n", 279 | "credit = credit.drop(columns = ['SK_ID_CURR'])\n", 280 | "cash = cash.drop(columns = ['SK_ID_CURR'])" 281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "metadata": {}, 286 | "source": [ 287 | "## Adding Entities\n", 288 | "\n", 289 | "Now we define each entity, or table of data, and add it to the `EntitySet`. We need to pass in an index if the table has one or `make_index = True` if not. In the cases where we need to make an index, we must supply a name for the index. We also need to pass in the dictionary of variable types if there are any specific variables we should identify. The following code adds all seven tables to the `EntitySet`." 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": 11, 295 | "metadata": {}, 296 | "outputs": [], 297 | "source": [ 298 | "# Entities with a unique index\n", 299 | "es = es.entity_from_dataframe(entity_id = 'app', dataframe = app, index = 'SK_ID_CURR',\n", 300 | " variable_types = app_types)\n", 301 | "\n", 302 | "es = es.entity_from_dataframe(entity_id = 'bureau', dataframe = bureau, index = 'SK_ID_BUREAU')\n", 303 | "\n", 304 | "es = es.entity_from_dataframe(entity_id = 'previous', dataframe = previous, index = 'SK_ID_PREV',\n", 305 | " variable_types = previous_types)\n", 306 | "\n", 307 | "# Entities that do not have a unique index\n", 308 | "es = es.entity_from_dataframe(entity_id = 'bureau_balance', dataframe = bureau_balance, \n", 309 | " make_index = True, index = 'bureaubalance_index')\n", 310 | "\n", 311 | "es = es.entity_from_dataframe(entity_id = 'cash', dataframe = cash, \n", 312 | " make_index = True, index = 'cash_index')\n", 313 | "\n", 314 | "es = es.entity_from_dataframe(entity_id = 'installments', dataframe = installments,\n", 315 | " make_index = True, index = 'installments_index')\n", 316 | "\n", 317 | "es = es.entity_from_dataframe(entity_id = 'credit', dataframe = credit,\n", 318 | " make_index = True, index = 'credit_index')" 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": 12, 324 | "metadata": {}, 325 | "outputs": [ 326 | { 327 | "data": { 328 | "text/plain": [ 329 | "Entityset: clients\n", 330 | " Entities:\n", 331 | " app [Rows: 356255, Columns: 122]\n", 332 | " bureau [Rows: 1716428, Columns: 17]\n", 333 | " previous [Rows: 1670214, Columns: 37]\n", 334 | " bureau_balance [Rows: 27299925, Columns: 4]\n", 335 | " cash [Rows: 10001358, Columns: 8]\n", 336 | " installments [Rows: 13605401, Columns: 8]\n", 337 | " credit [Rows: 3840312, Columns: 23]\n", 338 | " Relationships:\n", 339 | " No relationships" 340 | ] 341 | }, 342 | "execution_count": 12, 343 | "metadata": {}, 344 | "output_type": "execute_result" 345 | } 346 | ], 347 | "source": [ 348 | "# Display entityset so far\n", 349 | "es" 350 | ] 351 | }, 352 | { 353 | "cell_type": "markdown", 354 | "metadata": {}, 355 | "source": [ 356 | "The `EntitySet` allows us to group together all of our tables as one data structure. This is much easier than manipulating the tables one at a time (as we have to do in manual feature engineering)." 357 | ] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": {}, 362 | "source": [ 363 | "# Relationships\n", 364 | "\n", 365 | "Relationships are a fundamental concept not only in featuretools, but in any relational database. 
The most common type of relationship is one-to-many. The best way to think of a one-to-many relationship is with the analogy of parent-to-child. A parent is a single individual, but can have multiple children. In the context of tables, a parent table will have one row (observation) for every individual while a child table can have many observations for each parent. In a _parent table_, each individual has a single row and is uniquely identified by an index (also called a key). Each individual in the parent table can have multiple rows in the _child table_. Things get a little more complicated because child tables can have children of their own, making these grandchildren of the original parent. \n", 366 | "\n", 367 | "As an example of a parent-to-child relationship, the `app` dataframe has one row for each client (identified by `SK_ID_CURR`) while the `bureau` dataframe has multiple previous loans for each client. Therefore, the `bureau` dataframe is the child of the `app` dataframe. The `bureau` dataframe in turn is the parent of `bureau_balance` because each loan has one row in `bureau` (identified by `SK_ID_BUREAU`) but multiple monthly records in `bureau_balance`. When we do manual feature engineering, keeping track of all these relationships is a massive time investment (and a potential source of error), but we can add these relationships to our `EntitySet` and let featuretools worry about keeping the tables straight!" 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": 13, 373 | "metadata": {}, 374 | "outputs": [ 375 | { 376 | "name": "stdout", 377 | "output_type": "stream", 378 | "text": [ 379 | "Parent: app, Parent Variable of bureau: SK_ID_CURR\n", 380 | "\n", 381 | " SK_ID_CURR TARGET TOTALAREA_MODE WALLSMATERIAL_MODE\n", 382 | "0 100002 1.0 0.0149 Stone, brick\n", 383 | "1 100003 0.0 0.0714 Block\n", 384 | "2 100004 0.0 NaN NaN\n", 385 | "3 100006 0.0 NaN NaN\n", 386 | "4 100007 0.0 NaN NaN\n", 387 | "\n", 388 | "Child: bureau, Child Variable of app: SK_ID_CURR\n", 389 | "\n", 390 | " SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT\n", 391 | "0 215354 5714462 Closed currency 1 -497.0\n", 392 | "1 215354 5714463 Active currency 1 -208.0\n", 393 | "2 215354 5714464 Active currency 1 -203.0\n", 394 | "3 215354 5714465 Active currency 1 -203.0\n", 395 | "4 215354 5714466 Active currency 1 -629.0\n" 396 | ] 397 | } 398 | ], 399 | "source": [ 400 | "print('Parent: app, Parent Variable of bureau: SK_ID_CURR\\n\\n', app.iloc[:, 111:115].head())\n", 401 | "print('\\nChild: bureau, Child Variable of app: SK_ID_CURR\\n\\n', bureau.iloc[:, :5].head())" 402 | ] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "metadata": {}, 407 | "source": [ 408 | "The `SK_ID_CURR` 215354 has one row in the parent table and multiple rows in the child. \n", 409 | "\n", 410 | "Two tables are linked via a shared variable. The `app` and `bureau` dataframes are linked by the `SK_ID_CURR` variable while the `bureau` and `bureau_balance` dataframes are linked with the `SK_ID_BUREAU` variable. The linking variable is called the `parent` variable in the parent table and the `child` variable in the child table."
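To make the one-to-many idea concrete, this is roughly what a single hand-built aggregation looks like in plain pandas (a sketch using columns from `bureau` and `bureau_balance`; the new column names are purely illustrative):

```python
# Parent-child aggregation: average previous credit amount per client
client_credit_mean = (bureau.groupby('SK_ID_CURR')['AMT_CREDIT_SUM']
                            .mean()
                            .rename('MEAN_BUREAU_AMT_CREDIT_SUM')
                            .reset_index())

# Grandchild aggregation takes two steps: bureau_balance -> bureau -> client
months_per_credit = (bureau_balance.groupby('SK_ID_BUREAU')['MONTHS_BALANCE']
                                   .count()
                                   .rename('BUREAU_BALANCE_COUNT')
                                   .reset_index())
months_per_client = (months_per_credit
                     .merge(bureau[['SK_ID_BUREAU', 'SK_ID_CURR']], on = 'SK_ID_BUREAU')
                     .groupby('SK_ID_CURR')['BUREAU_BALANCE_COUNT']
                     .mean()
                     .reset_index())
```

Every feature built this way needs its own grouping, renaming, and merge back onto the training data; that bookkeeping is exactly what Featuretools takes over once the relationships are declared.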
411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": 14, 416 | "metadata": {}, 417 | "outputs": [ 418 | { 419 | "name": "stdout", 420 | "output_type": "stream", 421 | "text": [ 422 | "Parent: bureau, Parent Variable of bureau_balance: SK_ID_BUREAU\n", 423 | "\n", 424 | " SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT\n", 425 | "0 215354 5714462 Closed currency 1 -497.0\n", 426 | "1 215354 5714463 Active currency 1 -208.0\n", 427 | "2 215354 5714464 Active currency 1 -203.0\n", 428 | "3 215354 5714465 Active currency 1 -203.0\n", 429 | "4 215354 5714466 Active currency 1 -629.0\n", 430 | "\n", 431 | "Child: bureau_balance, Child Variable of bureau: SK_ID_BUREAU\n", 432 | "\n", 433 | " bureaubalance_index SK_ID_BUREAU MONTHS_BALANCE STATUS\n", 434 | "0 0 5715448 0 C\n", 435 | "1 1 5715448 -1 C\n", 436 | "2 2 5715448 -2 C\n", 437 | "3 3 5715448 -3 C\n", 438 | "4 4 5715448 -4 C\n" 439 | ] 440 | } 441 | ], 442 | "source": [ 443 | "print('Parent: bureau, Parent Variable of bureau_balance: SK_ID_BUREAU\\n\\n', bureau.iloc[:, :5].head())\n", 444 | "print('\\nChild: bureau_balance, Child Variable of bureau: SK_ID_BUREAU\\n\\n', bureau_balance.head())" 445 | ] 446 | }, 447 | { 448 | "cell_type": "markdown", 449 | "metadata": {}, 450 | "source": [ 451 | "Traditionally, we use the relationships between parents and children to aggregate data by grouping together all the children for a single parent and calculating statistics. For example, we might group together all the loans for a single client and calculate the average loan amount. This is straightforward, but can grow extremely tedious when we want to make hundreds of these features. Doing so one at a time is extremely inefficient especially because we end up re-writing much of the code over and over again and this code cannot be used for any different problem! \n", 452 | "\n", 453 | "Things get even worse when we have to aggregate the grandchildren because we have to use two steps: first aggregate at the parent level, and then at the grandparent level. Soon we will see that Featuretools can do this work automatically for us, generating thousands of features from __all__ of the data tables. When we did this manually it took about 15 minutes per feature so Featuretools potentially saves us hundreds of hours.\n", 454 | "\n", 455 | "### Adding Relationships\n", 456 | "\n", 457 | "Defining the relationships is straightforward using the diagram for the data tables. For each relationship, we need to first specify the parent variable and then the child variable. Altogether, there are a total of 6 relationships between the tables (counting the training and testing relationships as one). Below we specify these relationships and then add them to the EntitySet." 
458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": 15, 463 | "metadata": {}, 464 | "outputs": [], 465 | "source": [ 466 | "# Relationship between app_train and bureau\n", 467 | "r_app_bureau = ft.Relationship(es['app']['SK_ID_CURR'], es['bureau']['SK_ID_CURR'])\n", 468 | "\n", 469 | "# Relationship between bureau and bureau balance\n", 470 | "r_bureau_balance = ft.Relationship(es['bureau']['SK_ID_BUREAU'], es['bureau_balance']['SK_ID_BUREAU'])\n", 471 | "\n", 472 | "# Relationship between current app and previous apps\n", 473 | "r_app_previous = ft.Relationship(es['app']['SK_ID_CURR'], es['previous']['SK_ID_CURR'])\n", 474 | "\n", 475 | "# Relationships between previous apps and cash, installments, and credit\n", 476 | "r_previous_cash = ft.Relationship(es['previous']['SK_ID_PREV'], es['cash']['SK_ID_PREV'])\n", 477 | "r_previous_installments = ft.Relationship(es['previous']['SK_ID_PREV'], es['installments']['SK_ID_PREV'])\n", 478 | "r_previous_credit = ft.Relationship(es['previous']['SK_ID_PREV'], es['credit']['SK_ID_PREV'])" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": 16, 484 | "metadata": {}, 485 | "outputs": [ 486 | { 487 | "data": { 488 | "text/plain": [ 489 | "Entityset: clients\n", 490 | " Entities:\n", 491 | " app [Rows: 356255, Columns: 122]\n", 492 | " bureau [Rows: 1716428, Columns: 17]\n", 493 | " previous [Rows: 1670214, Columns: 37]\n", 494 | " bureau_balance [Rows: 27299925, Columns: 4]\n", 495 | " cash [Rows: 10001358, Columns: 8]\n", 496 | " installments [Rows: 13605401, Columns: 8]\n", 497 | " credit [Rows: 3840312, Columns: 23]\n", 498 | " Relationships:\n", 499 | " bureau.SK_ID_CURR -> app.SK_ID_CURR\n", 500 | " bureau_balance.SK_ID_BUREAU -> bureau.SK_ID_BUREAU\n", 501 | " previous.SK_ID_CURR -> app.SK_ID_CURR\n", 502 | " cash.SK_ID_PREV -> previous.SK_ID_PREV\n", 503 | " installments.SK_ID_PREV -> previous.SK_ID_PREV\n", 504 | " credit.SK_ID_PREV -> previous.SK_ID_PREV" 505 | ] 506 | }, 507 | "execution_count": 16, 508 | "metadata": {}, 509 | "output_type": "execute_result" 510 | } 511 | ], 512 | "source": [ 513 | "# Add in the defined relationships\n", 514 | "es = es.add_relationships([r_app_bureau, r_bureau_balance, r_app_previous,\n", 515 | " r_previous_cash, r_previous_installments, r_previous_credit])\n", 516 | "# Print out the EntitySet\n", 517 | "es" 518 | ] 519 | }, 520 | { 521 | "cell_type": "markdown", 522 | "metadata": {}, 523 | "source": [ 524 | "Again, we can see the benefits of using an `EntitySet` that is able to track all of the relationships for us. This allows us to work at a higher level of abstraction, thinking about the entire dataset rather than each individual table, greatly increasing our efficiency." 525 | ] 526 | }, 527 | { 528 | "cell_type": "markdown", 529 | "metadata": {}, 530 | "source": [ 531 | "__Slightly advanced note__: we need to be careful to not create a [diamond graph](https://en.wikipedia.org/wiki/Diamond_graph) where there are multiple paths from a parent to a child. If we directly link `app` and `cash` via `SK_ID_CURR`; `previous` and `cash` via `SK_ID_PREV`; and `app` and `previous` via `SK_ID_CURR`, then we have created two paths from `app` to `cash`. This results in ambiguity, so the approach we have to take instead is to link `app` to `cash` through `previous`. We establish a relationship between `previous` (the parent) and `cash` (the child) using `SK_ID_PREV`. 
Then we establish a relationship between `app` (the parent) and `previous` (now the child) using `SK_ID_CURR`. Then featuretools will be able to create features on `app` derived from both `previous` and `cash` by stacking multiple primitives. \n", 532 | "\n", 533 | "If this doesn't make too much sense, then just remember to only include one path from a parent to any descendents. For example, link a grandparent to a grandchild through the parent instead of directly through a shared variable." 534 | ] 535 | }, 536 | { 537 | "cell_type": "markdown", 538 | "metadata": {}, 539 | "source": [ 540 | "All entities in the entity can be linked through these relationships. In theory this allows us to calculate features for any of the entities, but in practice, we will only calculate features for the `app` dataframe since that will be used for training/testing. The end outcome will be a dataframe that has one row for each client in `app` with thousands of features for each individual. \n", 541 | "\n", 542 | "We are almost to the point where we can start creating thousands of features but we still have a few foundational topics to understand. The next building block to cover is feature primitives." 543 | ] 544 | }, 545 | { 546 | "cell_type": "markdown", 547 | "metadata": {}, 548 | "source": [ 549 | "## Visualize EntitySet" 550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "execution_count": 17, 555 | "metadata": {}, 556 | "outputs": [ 557 | { 558 | "data": { 559 | "image/svg+xml": [ 560 | "\n", 561 | "\n", 563 | "\n", 565 | "\n", 566 | "\n", 568 | "\n", 569 | "clients\n", 570 | "\n", 571 | "\n", 572 | "\n", 573 | "app\n", 574 | "\n", 575 | "app (356255 rows)\n", 576 | "\n", 577 | "SK_ID_CURR : index\n", 578 | "AMT_ANNUITY : numeric\n", 579 | "AMT_CREDIT : numeric\n", 580 | "AMT_GOODS_PRICE : numeric\n", 581 | "AMT_INCOME_TOTAL : numeric\n", 582 | "AMT_REQ_CREDIT_BUREAU_DAY : numeric\n", 583 | "AMT_REQ_CREDIT_BUREAU_HOUR : numeric\n", 584 | "AMT_REQ_CREDIT_BUREAU_MON : numeric\n", 585 | "AMT_REQ_CREDIT_BUREAU_QRT : numeric\n", 586 | "AMT_REQ_CREDIT_BUREAU_WEEK : numeric\n", 587 | "AMT_REQ_CREDIT_BUREAU_YEAR : numeric\n", 588 | "APARTMENTS_AVG : numeric\n", 589 | "APARTMENTS_MEDI : numeric\n", 590 | "APARTMENTS_MODE : numeric\n", 591 | "BASEMENTAREA_AVG : numeric\n", 592 | "BASEMENTAREA_MEDI : numeric\n", 593 | "BASEMENTAREA_MODE : numeric\n", 594 | "CNT_CHILDREN : numeric\n", 595 | "CNT_FAM_MEMBERS : numeric\n", 596 | "CODE_GENDER : categorical\n", 597 | "COMMONAREA_AVG : numeric\n", 598 | "COMMONAREA_MEDI : numeric\n", 599 | "COMMONAREA_MODE : numeric\n", 600 | "DAYS_BIRTH : numeric\n", 601 | "DAYS_EMPLOYED : numeric\n", 602 | "DAYS_ID_PUBLISH : numeric\n", 603 | "DAYS_LAST_PHONE_CHANGE : numeric\n", 604 | "DAYS_REGISTRATION : numeric\n", 605 | "DEF_30_CNT_SOCIAL_CIRCLE : numeric\n", 606 | "DEF_60_CNT_SOCIAL_CIRCLE : numeric\n", 607 | "ELEVATORS_AVG : numeric\n", 608 | "ELEVATORS_MEDI : numeric\n", 609 | "ELEVATORS_MODE : numeric\n", 610 | "EMERGENCYSTATE_MODE : categorical\n", 611 | "ENTRANCES_AVG : numeric\n", 612 | "ENTRANCES_MEDI : numeric\n", 613 | "ENTRANCES_MODE : numeric\n", 614 | "EXT_SOURCE_1 : numeric\n", 615 | "EXT_SOURCE_2 : numeric\n", 616 | "EXT_SOURCE_3 : numeric\n", 617 | "FLAG_OWN_CAR : categorical\n", 618 | "FLAG_OWN_REALTY : categorical\n", 619 | "FLOORSMAX_AVG : numeric\n", 620 | "FLOORSMAX_MEDI : numeric\n", 621 | "FLOORSMAX_MODE : numeric\n", 622 | "FLOORSMIN_AVG : numeric\n", 623 | "FLOORSMIN_MEDI : numeric\n", 624 | "FLOORSMIN_MODE : numeric\n", 625 | "FONDKAPREMONT_MODE 
: categorical\n", 626 | "HOUSETYPE_MODE : categorical\n", 627 | "LANDAREA_AVG : numeric\n", 628 | "LANDAREA_MEDI : numeric\n", 629 | "LANDAREA_MODE : numeric\n", 630 | "LIVINGAPARTMENTS_AVG : numeric\n", 631 | "LIVINGAPARTMENTS_MEDI : numeric\n", 632 | "LIVINGAPARTMENTS_MODE : numeric\n", 633 | "LIVINGAREA_AVG : numeric\n", 634 | "LIVINGAREA_MEDI : numeric\n", 635 | "LIVINGAREA_MODE : numeric\n", 636 | "NAME_CONTRACT_TYPE : categorical\n", 637 | "NAME_EDUCATION_TYPE : categorical\n", 638 | "NAME_FAMILY_STATUS : categorical\n", 639 | "NAME_HOUSING_TYPE : categorical\n", 640 | "NAME_INCOME_TYPE : categorical\n", 641 | "NAME_TYPE_SUITE : categorical\n", 642 | "NONLIVINGAPARTMENTS_AVG : numeric\n", 643 | "NONLIVINGAPARTMENTS_MEDI : numeric\n", 644 | "NONLIVINGAPARTMENTS_MODE : numeric\n", 645 | "NONLIVINGAREA_AVG : numeric\n", 646 | "NONLIVINGAREA_MEDI : numeric\n", 647 | "NONLIVINGAREA_MODE : numeric\n", 648 | "OBS_30_CNT_SOCIAL_CIRCLE : numeric\n", 649 | "OBS_60_CNT_SOCIAL_CIRCLE : numeric\n", 650 | "OCCUPATION_TYPE : categorical\n", 651 | "ORGANIZATION_TYPE : categorical\n", 652 | "OWN_CAR_AGE : numeric\n", 653 | "REGION_POPULATION_RELATIVE : numeric\n", 654 | "TARGET : numeric\n", 655 | "TOTALAREA_MODE : numeric\n", 656 | "WALLSMATERIAL_MODE : categorical\n", 657 | "WEEKDAY_APPR_PROCESS_START : categorical\n", 658 | "YEARS_BEGINEXPLUATATION_AVG : numeric\n", 659 | "YEARS_BEGINEXPLUATATION_MEDI : numeric\n", 660 | "YEARS_BEGINEXPLUATATION_MODE : numeric\n", 661 | "YEARS_BUILD_AVG : numeric\n", 662 | "YEARS_BUILD_MEDI : numeric\n", 663 | "YEARS_BUILD_MODE : numeric\n", 664 | "FLAG_CONT_MOBILE : boolean\n", 665 | "FLAG_DOCUMENT_10 : boolean\n", 666 | "FLAG_DOCUMENT_11 : boolean\n", 667 | "FLAG_DOCUMENT_12 : boolean\n", 668 | "FLAG_DOCUMENT_13 : boolean\n", 669 | "FLAG_DOCUMENT_14 : boolean\n", 670 | "FLAG_DOCUMENT_15 : boolean\n", 671 | "FLAG_DOCUMENT_16 : boolean\n", 672 | "FLAG_DOCUMENT_17 : boolean\n", 673 | "FLAG_DOCUMENT_18 : boolean\n", 674 | "FLAG_DOCUMENT_19 : boolean\n", 675 | "FLAG_DOCUMENT_2 : boolean\n", 676 | "FLAG_DOCUMENT_20 : boolean\n", 677 | "FLAG_DOCUMENT_21 : boolean\n", 678 | "FLAG_DOCUMENT_3 : boolean\n", 679 | "FLAG_DOCUMENT_4 : boolean\n", 680 | "FLAG_DOCUMENT_5 : boolean\n", 681 | "FLAG_DOCUMENT_6 : boolean\n", 682 | "FLAG_DOCUMENT_7 : boolean\n", 683 | "FLAG_DOCUMENT_8 : boolean\n", 684 | "FLAG_DOCUMENT_9 : boolean\n", 685 | "FLAG_EMAIL : boolean\n", 686 | "FLAG_EMP_PHONE : boolean\n", 687 | "FLAG_MOBIL : boolean\n", 688 | "FLAG_PHONE : boolean\n", 689 | "FLAG_WORK_PHONE : boolean\n", 690 | "LIVE_CITY_NOT_WORK_CITY : boolean\n", 691 | "LIVE_REGION_NOT_WORK_REGION : boolean\n", 692 | "REG_CITY_NOT_LIVE_CITY : boolean\n", 693 | "REG_CITY_NOT_WORK_CITY : boolean\n", 694 | "REG_REGION_NOT_LIVE_REGION : boolean\n", 695 | "REG_REGION_NOT_WORK_REGION : boolean\n", 696 | "REGION_RATING_CLIENT : ordinal\n", 697 | "REGION_RATING_CLIENT_W_CITY : ordinal\n", 698 | "HOUR_APPR_PROCESS_START : ordinal\n", 699 | "\n", 700 | "\n", 701 | "\n", 702 | "bureau\n", 703 | "\n", 704 | "bureau (1716428 rows)\n", 705 | "\n", 706 | "SK_ID_BUREAU : index\n", 707 | "SK_ID_CURR : id\n", 708 | "CREDIT_ACTIVE : categorical\n", 709 | "CREDIT_CURRENCY : categorical\n", 710 | "DAYS_CREDIT : numeric\n", 711 | "CREDIT_DAY_OVERDUE : numeric\n", 712 | "DAYS_CREDIT_ENDDATE : numeric\n", 713 | "DAYS_ENDDATE_FACT : numeric\n", 714 | "AMT_CREDIT_MAX_OVERDUE : numeric\n", 715 | "CNT_CREDIT_PROLONG : numeric\n", 716 | "AMT_CREDIT_SUM : numeric\n", 717 | "AMT_CREDIT_SUM_DEBT : numeric\n", 718 | 
"AMT_CREDIT_SUM_LIMIT : numeric\n", 719 | "AMT_CREDIT_SUM_OVERDUE : numeric\n", 720 | "CREDIT_TYPE : categorical\n", 721 | "DAYS_CREDIT_UPDATE : numeric\n", 722 | "AMT_ANNUITY : numeric\n", 723 | "\n", 724 | "\n", 725 | "\n", 726 | "bureau->app\n", 727 | "\n", 728 | "\n", 729 | "SK_ID_CURR\n", 730 | "\n", 731 | "\n", 732 | "\n", 733 | "previous\n", 734 | "\n", 735 | "previous (1670214 rows)\n", 736 | "\n", 737 | "SK_ID_PREV : index\n", 738 | "SK_ID_CURR : id\n", 739 | "NAME_CONTRACT_TYPE : categorical\n", 740 | "AMT_ANNUITY : numeric\n", 741 | "AMT_APPLICATION : numeric\n", 742 | "AMT_CREDIT : numeric\n", 743 | "AMT_DOWN_PAYMENT : numeric\n", 744 | "AMT_GOODS_PRICE : numeric\n", 745 | "WEEKDAY_APPR_PROCESS_START : categorical\n", 746 | "HOUR_APPR_PROCESS_START : numeric\n", 747 | "FLAG_LAST_APPL_PER_CONTRACT : categorical\n", 748 | "RATE_DOWN_PAYMENT : numeric\n", 749 | "RATE_INTEREST_PRIMARY : numeric\n", 750 | "RATE_INTEREST_PRIVILEGED : numeric\n", 751 | "NAME_CASH_LOAN_PURPOSE : categorical\n", 752 | "NAME_CONTRACT_STATUS : categorical\n", 753 | "DAYS_DECISION : numeric\n", 754 | "NAME_PAYMENT_TYPE : categorical\n", 755 | "CODE_REJECT_REASON : categorical\n", 756 | "NAME_TYPE_SUITE : categorical\n", 757 | "NAME_CLIENT_TYPE : categorical\n", 758 | "NAME_GOODS_CATEGORY : categorical\n", 759 | "NAME_PORTFOLIO : categorical\n", 760 | "NAME_PRODUCT_TYPE : categorical\n", 761 | "CHANNEL_TYPE : categorical\n", 762 | "SELLERPLACE_AREA : numeric\n", 763 | "NAME_SELLER_INDUSTRY : categorical\n", 764 | "CNT_PAYMENT : numeric\n", 765 | "NAME_YIELD_GROUP : categorical\n", 766 | "PRODUCT_COMBINATION : categorical\n", 767 | "DAYS_FIRST_DRAWING : numeric\n", 768 | "DAYS_FIRST_DUE : numeric\n", 769 | "DAYS_LAST_DUE_1ST_VERSION : numeric\n", 770 | "DAYS_LAST_DUE : numeric\n", 771 | "DAYS_TERMINATION : numeric\n", 772 | "NFLAG_LAST_APPL_IN_DAY : boolean\n", 773 | "NFLAG_INSURED_ON_APPROVAL : boolean\n", 774 | "\n", 775 | "\n", 776 | "\n", 777 | "previous->app\n", 778 | "\n", 779 | "\n", 780 | "SK_ID_CURR\n", 781 | "\n", 782 | "\n", 783 | "\n", 784 | "bureau_balance\n", 785 | "\n", 786 | "bureau_balance (27299925 rows)\n", 787 | "\n", 788 | "bureaubalance_index : index\n", 789 | "SK_ID_BUREAU : id\n", 790 | "MONTHS_BALANCE : numeric\n", 791 | "STATUS : categorical\n", 792 | "\n", 793 | "\n", 794 | "\n", 795 | "bureau_balance->bureau\n", 796 | "\n", 797 | "\n", 798 | "SK_ID_BUREAU\n", 799 | "\n", 800 | "\n", 801 | "\n", 802 | "cash\n", 803 | "\n", 804 | "cash (10001358 rows)\n", 805 | "\n", 806 | "cash_index : index\n", 807 | "SK_ID_PREV : id\n", 808 | "MONTHS_BALANCE : numeric\n", 809 | "CNT_INSTALMENT : numeric\n", 810 | "CNT_INSTALMENT_FUTURE : numeric\n", 811 | "NAME_CONTRACT_STATUS : categorical\n", 812 | "SK_DPD : numeric\n", 813 | "SK_DPD_DEF : numeric\n", 814 | "\n", 815 | "\n", 816 | "\n", 817 | "cash->previous\n", 818 | "\n", 819 | "\n", 820 | "SK_ID_PREV\n", 821 | "\n", 822 | "\n", 823 | "\n", 824 | "installments\n", 825 | "\n", 826 | "installments (13605401 rows)\n", 827 | "\n", 828 | "installments_index : index\n", 829 | "SK_ID_PREV : id\n", 830 | "NUM_INSTALMENT_VERSION : numeric\n", 831 | "NUM_INSTALMENT_NUMBER : numeric\n", 832 | "DAYS_INSTALMENT : numeric\n", 833 | "DAYS_ENTRY_PAYMENT : numeric\n", 834 | "AMT_INSTALMENT : numeric\n", 835 | "AMT_PAYMENT : numeric\n", 836 | "\n", 837 | "\n", 838 | "\n", 839 | "installments->previous\n", 840 | "\n", 841 | "\n", 842 | "SK_ID_PREV\n", 843 | "\n", 844 | "\n", 845 | "\n", 846 | "credit\n", 847 | "\n", 848 | "credit (3840312 rows)\n", 849 | "\n", 
850 | "credit_index : index\n", 851 | "SK_ID_PREV : id\n", 852 | "MONTHS_BALANCE : numeric\n", 853 | "AMT_BALANCE : numeric\n", 854 | "AMT_CREDIT_LIMIT_ACTUAL : numeric\n", 855 | "AMT_DRAWINGS_ATM_CURRENT : numeric\n", 856 | "AMT_DRAWINGS_CURRENT : numeric\n", 857 | "AMT_DRAWINGS_OTHER_CURRENT : numeric\n", 858 | "AMT_DRAWINGS_POS_CURRENT : numeric\n", 859 | "AMT_INST_MIN_REGULARITY : numeric\n", 860 | "AMT_PAYMENT_CURRENT : numeric\n", 861 | "AMT_PAYMENT_TOTAL_CURRENT : numeric\n", 862 | "AMT_RECEIVABLE_PRINCIPAL : numeric\n", 863 | "AMT_RECIVABLE : numeric\n", 864 | "AMT_TOTAL_RECEIVABLE : numeric\n", 865 | "CNT_DRAWINGS_ATM_CURRENT : numeric\n", 866 | "CNT_DRAWINGS_CURRENT : numeric\n", 867 | "CNT_DRAWINGS_OTHER_CURRENT : numeric\n", 868 | "CNT_DRAWINGS_POS_CURRENT : numeric\n", 869 | "CNT_INSTALMENT_MATURE_CUM : numeric\n", 870 | "NAME_CONTRACT_STATUS : categorical\n", 871 | "SK_DPD : numeric\n", 872 | "SK_DPD_DEF : numeric\n", 873 | "\n", 874 | "\n", 875 | "\n", 876 | "credit->previous\n", 877 | "\n", 878 | "\n", 879 | "SK_ID_PREV\n", 880 | "\n", 881 | "\n", 882 | "\n" 883 | ], 884 | "text/plain": [ 885 | "" 886 | ] 887 | }, 888 | "execution_count": 17, 889 | "metadata": {}, 890 | "output_type": "execute_result" 891 | } 892 | ], 893 | "source": [ 894 | "es.plot()" 895 | ] 896 | }, 897 | { 898 | "cell_type": "markdown", 899 | "metadata": {}, 900 | "source": [ 901 | "# Feature Primitives\n", 902 | "\n", 903 | "A [feature primitive](https://docs.featuretools.com/en/stable/automated_feature_engineering/primitives.html) is an operation applied to a table or a set of tables to create a feature. These represent simple calculations, many of which we already use in manual feature engineering, that can be stacked on top of each other to create complex deep features. Feature primitives fall into two categories:\n", 904 | "\n", 905 | "* __Aggregation__: function that groups together children for each parent and calculates a statistic such as mean, min, max, or standard deviation across the children. An example is the maximum previous loan amount for each client. An aggregation covers multiple tables using relationships between tables.\n", 906 | "* __Transformation__: an operation applied to one or more columns in a single table. An example would be taking the absolute value of a column, or finding the difference between two columns in one table.\n", 907 | "\n", 908 | "A list of the available features primitives in featuretools can be viewed below." 909 | ] 910 | }, 911 | { 912 | "cell_type": "code", 913 | "execution_count": 18, 914 | "metadata": {}, 915 | "outputs": [ 916 | { 917 | "data": { 918 | "text/html": [ 919 | "
\n", 920 | "\n", 933 | "\n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | " \n", 1005 | " \n", 1006 | " \n", 1007 | " \n", 1008 | " \n", 1009 | " \n", 1010 | " \n", 1011 | " \n", 1012 | " \n", 1013 | " \n", 1014 | " \n", 1015 | "
nametypedask_compatibledescription
0num_trueaggregationTrueCounts the number of `True` values.
1trendaggregationFalseCalculates the trend of a variable over time.
2maxaggregationTrueCalculates the highest value, ignoring `NaN` values.
3time_since_lastaggregationFalseCalculates the time elapsed since the last datetime (default in seconds).
4anyaggregationTrueDetermines if any value is 'True' in a list.
5modeaggregationFalseDetermines the most commonly repeated value.
6firstaggregationFalseDetermines the first value in a list.
7skewaggregationFalseComputes the extent to which a distribution differs from a normal distribution.
8countaggregationTrueDetermines the total number of values, excluding `NaN`.
9n_most_commonaggregationFalseDetermines the `n` most common elements.
\n", 1016 | "
" 1017 | ], 1018 | "text/plain": [ 1019 | " name type dask_compatible \\\n", 1020 | "0 num_true aggregation True \n", 1021 | "1 trend aggregation False \n", 1022 | "2 max aggregation True \n", 1023 | "3 time_since_last aggregation False \n", 1024 | "4 any aggregation True \n", 1025 | "5 mode aggregation False \n", 1026 | "6 first aggregation False \n", 1027 | "7 skew aggregation False \n", 1028 | "8 count aggregation True \n", 1029 | "9 n_most_common aggregation False \n", 1030 | "\n", 1031 | " description \n", 1032 | "0 Counts the number of `True` values. \n", 1033 | "1 Calculates the trend of a variable over time. \n", 1034 | "2 Calculates the highest value, ignoring `NaN` values. \n", 1035 | "3 Calculates the time elapsed since the last datetime (default in seconds). \n", 1036 | "4 Determines if any value is 'True' in a list. \n", 1037 | "5 Determines the most commonly repeated value. \n", 1038 | "6 Determines the first value in a list. \n", 1039 | "7 Computes the extent to which a distribution differs from a normal distribution. \n", 1040 | "8 Determines the total number of values, excluding `NaN`. \n", 1041 | "9 Determines the `n` most common elements. " 1042 | ] 1043 | }, 1044 | "execution_count": 18, 1045 | "metadata": {}, 1046 | "output_type": "execute_result" 1047 | } 1048 | ], 1049 | "source": [ 1050 | "# List the primitives in a dataframe\n", 1051 | "primitives = ft.list_primitives()\n", 1052 | "pd.options.display.max_colwidth = 100\n", 1053 | "\n", 1054 | "primitives[primitives['type'] == 'aggregation'].head(10)" 1055 | ] 1056 | }, 1057 | { 1058 | "cell_type": "code", 1059 | "execution_count": 19, 1060 | "metadata": {}, 1061 | "outputs": [ 1062 | { 1063 | "data": { 1064 | "text/html": [ 1065 | "
\n", 1066 | "\n", 1079 | "\n", 1080 | " \n", 1081 | " \n", 1082 | " \n", 1083 | " \n", 1084 | " \n", 1085 | " \n", 1086 | " \n", 1087 | " \n", 1088 | " \n", 1089 | " \n", 1090 | " \n", 1091 | " \n", 1092 | " \n", 1093 | " \n", 1094 | " \n", 1095 | " \n", 1096 | " \n", 1097 | " \n", 1098 | " \n", 1099 | " \n", 1100 | " \n", 1101 | " \n", 1102 | " \n", 1103 | " \n", 1104 | " \n", 1105 | " \n", 1106 | " \n", 1107 | " \n", 1108 | " \n", 1109 | " \n", 1110 | " \n", 1111 | " \n", 1112 | " \n", 1113 | " \n", 1114 | " \n", 1115 | " \n", 1116 | " \n", 1117 | " \n", 1118 | " \n", 1119 | " \n", 1120 | " \n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | " \n", 1130 | " \n", 1131 | " \n", 1132 | " \n", 1133 | " \n", 1134 | " \n", 1135 | " \n", 1136 | " \n", 1137 | " \n", 1138 | " \n", 1139 | " \n", 1140 | " \n", 1141 | " \n", 1142 | " \n", 1143 | " \n", 1144 | " \n", 1145 | " \n", 1146 | " \n", 1147 | " \n", 1148 | " \n", 1149 | " \n", 1150 | " \n", 1151 | " \n", 1152 | " \n", 1153 | " \n", 1154 | " \n", 1155 | " \n", 1156 | " \n", 1157 | " \n", 1158 | " \n", 1159 | " \n", 1160 | " \n", 1161 | "
nametypedask_compatibledescription
22is_weekendtransformTrueDetermines if a date falls on a weekend.
23divide_by_featuretransformTrueDivide a scalar by each value in the list.
24equal_scalartransformTrueDetermines if values in a list are equal to a given scalar.
25longitudetransformFalseReturns the second tuple value in a list of LatLong tuples.
26monthtransformTrueDetermines the month value of a datetime.
27divide_numerictransformTrueElement-wise division of two lists.
28less_than_equal_to_scalartransformTrueDetermines if values are less than or equal to a given scalar.
29percentiletransformFalseDetermines the percentile rank for each value in a list.
30minutetransformTrueDetermines the minutes value of a datetime.
31cum_mintransformFalseCalculates the cumulative minimum.
\n", 1162 | "
" 1163 | ], 1164 | "text/plain": [ 1165 | " name type dask_compatible \\\n", 1166 | "22 is_weekend transform True \n", 1167 | "23 divide_by_feature transform True \n", 1168 | "24 equal_scalar transform True \n", 1169 | "25 longitude transform False \n", 1170 | "26 month transform True \n", 1171 | "27 divide_numeric transform True \n", 1172 | "28 less_than_equal_to_scalar transform True \n", 1173 | "29 percentile transform False \n", 1174 | "30 minute transform True \n", 1175 | "31 cum_min transform False \n", 1176 | "\n", 1177 | " description \n", 1178 | "22 Determines if a date falls on a weekend. \n", 1179 | "23 Divide a scalar by each value in the list. \n", 1180 | "24 Determines if values in a list are equal to a given scalar. \n", 1181 | "25 Returns the second tuple value in a list of LatLong tuples. \n", 1182 | "26 Determines the month value of a datetime. \n", 1183 | "27 Element-wise division of two lists. \n", 1184 | "28 Determines if values are less than or equal to a given scalar. \n", 1185 | "29 Determines the percentile rank for each value in a list. \n", 1186 | "30 Determines the minutes value of a datetime. \n", 1187 | "31 Calculates the cumulative minimum. " 1188 | ] 1189 | }, 1190 | "execution_count": 19, 1191 | "metadata": {}, 1192 | "output_type": "execute_result" 1193 | } 1194 | ], 1195 | "source": [ 1196 | "primitives[primitives['type'] == 'transform'].head(10)" 1197 | ] 1198 | }, 1199 | { 1200 | "cell_type": "markdown", 1201 | "metadata": {}, 1202 | "source": [ 1203 | "# Deep Feature Synthesis\n", 1204 | "\n", 1205 | "[Deep Feature Synthesis (DFS)](https://docs.featuretools.com/en/stable/automated_feature_engineering/afe.html) is the method Featuretools uses to make new features. DFS stacks feature primitives to form features with a \"depth\" equal to the number of primitives. For example, if we take the maximum value of a client's previous loans (say `MAX(previous.loan_amount)`), that is a \"deep feature\" with a depth of 1. To create a feature with a depth of two, we could stack primitives by taking the maximum value of a client's average monthly payments per previous loan (such as `MAX(previous(MEAN(installments.payment)))`). In manual feature engineering, this would require two separate groupings and aggregations and took more than 15 minutes to write the code per feature. \n", 1206 | "\n", 1207 | "Deep Feature Synthesis is an extremely powerful method that allows us to overcome our human limitations on time and creativity by building features that we would never be able to think of on our own (or would not have the patience to implement). Furthermore, DFS is applicable to any dataset with only very minor changes in syntax. In feature engineering, we generally apply the same functions to multiple datasets, but when we do it by hand, we have to re-write the code because it is problem-specific. Featuretools code can be applied to any dataset because it is written at a higher level of abstraction.\n", 1208 | "\n", 1209 | "The [original paper on automated feature engineering using Deep Feature Synthesis](https://dai.lids.mit.edu/wp-content/uploads/2017/10/DSAA_DSM_2015.pdf) is worth a read if you want to understand the concepts at a deeper level.\n", 1210 | "\n", 1211 | "To perform DFS in featuretools, we use the `dfs` function passing it an `entityset`, the `target_entity` (where we want to make the features), the `agg_primitives` to use, the `trans_primitives` to use, the `max_depth` of the features, and a number of other arguments depending on our use case. 
There are also options for multi-processing with `n_jobs` and for controlling the information that is printed out with `verbose`. \n", 1212 | "\n", 1213 | "One other important argument is __`features_only`__. If we set this to `True`, `dfs` will only make the feature names and not calculate the actual values of the features (called the feature matrix). This is useful when we want to inspect the features that will be created, and we can also save the features to use with a different dataset (for example when we have training and testing data)." 1214 | ] 1215 | }, 1216 | { 1217 | "cell_type": "markdown", 1218 | "metadata": {}, 1219 | "source": [ 1220 | "## Deep Feature Synthesis with Default Primitives\n", 1221 | "\n", 1222 | "Without using any domain knowledge we can make thousands of features by using the default primitives in featuretools. This first call will use the default aggregation and transformation primitives, a max depth of 2, and calculate primitives for the `app` entity. We will only generate the features themselves (the names and not the values), which we can save and inspect." 1223 | ] 1224 | }, 1225 | { 1226 | "cell_type": "code", 1227 | "execution_count": 20, 1228 | "metadata": {}, 1229 | "outputs": [ 1230 | { 1231 | "name": "stdout", 1232 | "output_type": "stream", 1233 | "text": [ 1234 | "Built 2069 features\n" 1235 | ] 1236 | } 1237 | ], 1238 | "source": [ 1239 | "# Default primitives from featuretools\n", 1240 | "default_agg_primitives = [\"sum\", \"std\", \"max\", \"skew\", \"min\", \"mean\", \"count\", \"percent_true\", \"num_unique\", \"mode\"]\n", 1241 | "default_trans_primitives = [\"day\", \"year\", \"month\", \"weekday\", \"haversine\", \"num_words\", \"num_characters\"]\n", 1242 | "\n", 1243 | "# DFS with specified primitives\n", 1244 | "feature_names = ft.dfs(entityset = es, target_entity = 'app',\n", 1245 | " trans_primitives = default_trans_primitives,\n", 1246 | " agg_primitives=default_agg_primitives, \n", 1247 | " where_primitives = [], seed_features = [],\n", 1248 | " max_depth = 2, n_jobs = -1, verbose = 1,\n", 1249 | " features_only=True)" 1250 | ] 1251 | }, 1252 | { 1253 | "cell_type": "markdown", 1254 | "metadata": {}, 1255 | "source": [ 1256 | "Even a basic call to deep feature synthesis gives us over 2000 features to work with. Granted, not all of these will be important, but this still represents hundreds of hours that we saved. Moreover, `dfs` might be able to find important features that we would never have thought of in the first place. \n", 1257 | "\n", 1258 | "We can look at some of the feature names:" 1259 | ] 1260 | }, 1261 | { 1262 | "cell_type": "code", 1263 | "execution_count": 21, 1264 | "metadata": {}, 1265 | "outputs": [ 1266 | { 1267 | "data": { 1268 | "text/plain": [ 1269 | "[,\n", 1270 | " ,\n", 1271 | " ,\n", 1272 | " ,\n", 1273 | " ,\n", 1274 | " ,\n", 1275 | " ,\n", 1276 | " ,\n", 1277 | " ,\n", 1278 | " ,\n", 1279 | " ,\n", 1280 | " ,\n", 1281 | " ,\n", 1282 | " ,\n", 1283 | " ,\n", 1284 | " ,\n", 1285 | " ,\n", 1286 | " ,\n", 1287 | " ,\n", 1288 | " ]" 1289 | ] 1290 | }, 1291 | "execution_count": 21, 1292 | "metadata": {}, 1293 | "output_type": "execute_result" 1294 | } 1295 | ], 1296 | "source": [ 1297 | "feature_names[1000:1020]" 1298 | ] 1299 | }, 1300 | { 1301 | "cell_type": "markdown", 1302 | "metadata": {}, 1303 | "source": [ 1304 | "Notice how featuretools stacks multiple primitives on top of each other. This is one of the ideas behind Deep Feature Synthesis and automated feature engineering. 
Rather than having to do these groupings and aggregations ourselves, Featuretools handles it all using the framework (`entities`, `relationships`, and `primitives`) that we provide. We can also use Featuretools to expand on our domain knowledge. " 1305 | ] 1306 | }, 1307 | { 1308 | "cell_type": "markdown", 1309 | "metadata": {}, 1310 | "source": [ 1311 | "# Building on Top of Domain Features\n", 1312 | "\n", 1313 | "Featuretools will automatically build thousands of features for us, but that does not mean we can't use our own knowledge to improve the predictive performance. Featuretools can augment our domain knowledge by stacking additional features on top of the features we build by hand. We identified and created numerous useful features in the manual feature engineering notebook, based on our own knowledge and that of thousands of data scientists working on this problem on Kaggle. Rather than getting only one domain knowledge feature, we can effectively get dozens or even hundreds. __Here we'll explain the options for using domain knowledge, but we'll stick with the simple implementation of Featuretools for comparison purposes.__\n", 1314 | "\n", 1315 | "For more information on any of these topics, see the [documentation](https://docs.featuretools.com/en/stable/guides/tuning_dfs.html) or the other notebooks in this repository. \n", 1316 | "\n", 1317 | "### Seed Features \n", 1318 | "\n", 1319 | "Seed features are domain features that we create ourselves and that Featuretools can then build on top of. For example, we saw that the rate of a loan is an important feature because a higher-rate loan is likely more risky. In Featuretools, we can encode the loan rate (both for the current loan and for previous loans) as a seed feature and Featuretools will build additional explanatory variables on this domain knowledge wherever possible. \n", 1320 | "\n", 1321 | "### Interesting Values\n", 1322 | "\n", 1323 | "Interesting values are similar in spirit to seed features, except they allow us to make conditional features. For example, we might want to find, for each client, the mean amount of previous loans that have been closed and the mean amount of previous loans that are still active. By specifying interesting values in `bureau` on the `CREDIT_ACTIVE` variable we can have Featuretools do exactly that! Carrying this out by hand would be extremely tedious and present numerous opportunities for errors.\n", 1324 | "\n", 1325 | "### Custom Primitives\n", 1326 | "\n", 1327 | "If we aren't satisfied with the primitives available in Featuretools, we can write our own functions to transform or aggregate the data. This is one of the most powerful capabilities in featuretools because it allows us to define very specific operations that can then be applied to multiple datasets. \n", 1328 | "\n", 1329 | "__In this notebook we concentrate on a basic implementation of Featuretools, but keep in mind these capabilities are available for tuning the library and incorporating domain knowledge!__ A sketch of how these options plug into `dfs` is shown below." 1330 | ] 1331 | }, 1332 | { 1333 | "cell_type": "markdown", 1334 | "metadata": {}, 1335 | "source": [ 1336 | "# Selecting Primitives\n", 1337 | "\n", 1338 | "For our actual set of features, we will use a select group of primitives rather than just the defaults. This will generate over 2100 features to use for modeling.
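For reference, the three domain-knowledge options described above can be wired into the same `dfs` call. The sketch below uses the 0.x-style Featuretools API that this notebook uses and assumes the `es` EntitySet built earlier; the seed-feature columns (`AMT_ANNUITY`, `AMT_CREDIT`), the interesting values on `CREDIT_ACTIVE`, and the custom `range` primitive are illustrative choices rather than settings actually used in this notebook.

```python
import numpy as np
import featuretools as ft
from featuretools.primitives import make_agg_primitive
from featuretools.variable_types import Numeric

# Seed feature: loan rate, annuity relative to total credit (illustrative columns)
loan_rate = ft.Feature(es['app']['AMT_ANNUITY']) / ft.Feature(es['app']['AMT_CREDIT'])

# Interesting values: build conditional aggregations split by previous credit status
es['bureau']['CREDIT_ACTIVE'].interesting_values = ['Active', 'Closed']

# Custom primitive: range (max minus min) of a numeric column within each group
def range_calc(numeric):
    return np.max(numeric) - np.min(numeric)

Range = make_agg_primitive(function=range_calc, input_types=[Numeric],
                           return_type=Numeric, name='range')

# The same style of dfs call, now building on the domain knowledge above
feature_names = ft.dfs(entityset=es, target_entity='app',
                       agg_primitives=['mean', 'max', 'min', Range],
                       trans_primitives=['percentile'],
                       where_primitives=['mean', 'sum'],
                       seed_features=[loan_rate],
                       max_depth=2, features_only=True, verbose=1)
```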
" 1339 | ] 1340 | }, 1341 | { 1342 | "cell_type": "code", 1343 | "execution_count": 22, 1344 | "metadata": {}, 1345 | "outputs": [], 1346 | "source": [ 1347 | "# Specify primitives\n", 1348 | "agg_primitives = [\"sum\", \"max\", \"min\", \"mean\", \"count\", \"percent_true\", \"num_unique\", \"mode\"]\n", 1349 | "trans_primitives = ['percentile', 'and']" 1350 | ] 1351 | }, 1352 | { 1353 | "cell_type": "code", 1354 | "execution_count": 23, 1355 | "metadata": { 1356 | "scrolled": true 1357 | }, 1358 | "outputs": [ 1359 | { 1360 | "name": "stdout", 1361 | "output_type": "stream", 1362 | "text": [ 1363 | "Built 2188 features\n" 1364 | ] 1365 | } 1366 | ], 1367 | "source": [ 1368 | "# Deep feature synthesis \n", 1369 | "feature_names = ft.dfs(entityset=es, target_entity='app',\n", 1370 | " agg_primitives = agg_primitives,\n", 1371 | " trans_primitives = trans_primitives,\n", 1372 | " n_jobs = -1, verbose = 1,\n", 1373 | " features_only = True,\n", 1374 | " max_depth = 2)" 1375 | ] 1376 | }, 1377 | { 1378 | "cell_type": "code", 1379 | "execution_count": 24, 1380 | "metadata": {}, 1381 | "outputs": [], 1382 | "source": [ 1383 | "ft.save_features(feature_names, 'input/features.txt')" 1384 | ] 1385 | }, 1386 | { 1387 | "cell_type": "markdown", 1388 | "metadata": {}, 1389 | "source": [ 1390 | "If we save the features, we can then use them with `calculate_feature_matrix`. This is useful when we want to apply the same features across datasets (such as when we have separate training/testing data); a short sketch is shown below." 1391 | ] 1392 | }, 1393 | { 1394 | "cell_type": "markdown", 1395 | "metadata": {}, 1396 | "source": [ 1397 | "## Run Full Deep Feature Synthesis\n", 1398 | "\n", 1399 | "If we are content with the features that will be built, we can run deep feature synthesis and create the feature matrix. The following call runs the full deep feature synthesis. This might take a long time depending on your machine. Featuretools does allow for parallel processing, but each core must be able to handle the entire entityset. \n", 1400 | "\n", 1401 | "__An actual run of this code was completed using Dask which can be seen in the [Featuretools on Dask notebook](https://github.com/Featuretools/Automated-Manual-Comparison/blob/master/Loan%20Repayment/notebooks/Featuretools%20on%20Dask.ipynb).__ The Dask code takes under 2 hours to run and is a great example of how we can use parallel processing to use our resources in the most efficient manner."
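As referenced above, the saved feature definitions can be recomputed on another `EntitySet` without re-running `dfs`. A minimal sketch, assuming a second EntitySet named `es_test` built from the test-side tables in the same way as `es`:

```python
import featuretools as ft

# Reload the feature definitions saved above and compute them on the test EntitySet
features = ft.load_features('input/features.txt')
feature_matrix_test = ft.calculate_feature_matrix(features,
                                                  entityset=es_test,
                                                  n_jobs=1,
                                                  chunk_size=100,
                                                  verbose=1)
```

This guarantees that the training and testing feature matrices are built from exactly the same definitions and end up with the same columns.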
1402 | ] 1403 | }, 1404 | { 1405 | "cell_type": "code", 1406 | "execution_count": null, 1407 | "metadata": {}, 1408 | "outputs": [], 1409 | "source": [ 1410 | "import sys\n", 1411 | "print('Total size of entityset: {:.5f} gb.'.format(sys.getsizeof(es) / 1e9))" 1412 | ] 1413 | }, 1414 | { 1415 | "cell_type": "code", 1416 | "execution_count": null, 1417 | "metadata": {}, 1418 | "outputs": [], 1419 | "source": [ 1420 | "import psutil\n", 1421 | "\n", 1422 | "print('Total number of cpus detected: {}.'.format(psutil.cpu_count()))\n", 1423 | "print('Total size of system memory: {:.5f} gb.'.format(psutil.virtual_memory().total / 1e9))" 1424 | ] 1425 | }, 1426 | { 1427 | "cell_type": "code", 1428 | "execution_count": null, 1429 | "metadata": {}, 1430 | "outputs": [], 1431 | "source": [ 1432 | "feature_matrix, feature_names = ft.dfs(entityset=es, target_entity='app',\n", 1433 | " agg_primitives = agg_primitives,\n", 1434 | " trans_primitives = trans_primitives,\n", 1435 | " n_jobs = 1, verbose = 1, features_only = False,\n", 1436 | " max_depth = 2, chunk_size = 100)" 1437 | ] 1438 | }, 1439 | { 1440 | "cell_type": "code", 1441 | "execution_count": null, 1442 | "metadata": {}, 1443 | "outputs": [], 1444 | "source": [ 1445 | "# feature_matrix.reset_index(inplace = True)\n", 1446 | "# feature_matrix.to_csv('../input/feature_matrix.csv', index = False)" 1447 | ] 1448 | }, 1449 | { 1450 | "cell_type": "markdown", 1451 | "metadata": {}, 1452 | "source": [ 1453 | "To download the feature matrix, head to https://www.kaggle.com/willkoehrsen/home-credit-default-risk-feature-tools and select `feature_matrix_article.csv`. There are several other versions of automatically engineered feature matrices available there as well. " 1454 | ] 1455 | }, 1456 | { 1457 | "cell_type": "markdown", 1458 | "metadata": {}, 1459 | "source": [ 1460 | "# Conclusions\n", 1461 | "\n", 1462 | "In this notebook, we saw how to implement automated feature engineering for a data science problem. __Automated feature engineering allows us to create thousands of new features from a set of related data tables, significantly increasing our efficiency as data scientists.__ Moreover, we can still use domain knowledge in our features and even augment that knowledge by building on top of our own hand-built features. The main takeaways are:\n", 1463 | "\n", 1464 | "* Automated feature engineering took 1 hour to implement compared to 10 hours for manual feature engineering.\n", 1465 | "* Automated feature engineering built thousands of features in a few lines of code compared to dozens of lines of code per feature for manual engineering.\n", 1466 | "* Overall, the performance of the automated features is comparable to or better than that of the manual features (see the Results notebook).\n", 1467 | "\n", 1468 | "The benefits of automated feature engineering are significant and will considerably help us in our role as data scientists. It won't eliminate the need for data scientists, but rather will make us more efficient, letting us build better predictive pipelines in less time. " 1469 | ] 1470 | }, 1471 | { 1472 | "cell_type": "markdown", 1473 | "metadata": {}, 1474 | "source": [ 1475 | "## Next Steps\n", 1476 | "\n", 1477 | "After creating a full set of features, we can apply feature selection and then proceed with modeling. To optimize the model for the features, we use random search for 100 iterations over a grid of hyperparameters. To see how to use Dask to run Featuretools in parallel, refer to the Featuretools Implementation with Dask notebook.
For feature selection refer to the Feature Selection notebook. Final results are presented in the Results notebook." 1478 | ] 1479 | }, 1480 | { 1481 | "cell_type": "markdown", 1482 | "metadata": {}, 1483 | "source": [ 1484 | "
\n", 1485 | " \"Featuretools\"\n", 1486 | "
\n", 1487 | "\n", 1488 | "Featuretools was created by the developers at [Feature Labs](https://www.featurelabs.com/). If building impactful data science pipelines is important to you or your business, please [get in touch](https://www.featurelabs.com/contact)." 1489 | ] 1490 | } 1491 | ], 1492 | "metadata": { 1493 | "kernelspec": { 1494 | "display_name": "Python 3", 1495 | "language": "python", 1496 | "name": "python3" 1497 | }, 1498 | "language_info": { 1499 | "codemirror_mode": { 1500 | "name": "ipython", 1501 | "version": 3 1502 | }, 1503 | "file_extension": ".py", 1504 | "mimetype": "text/x-python", 1505 | "name": "python", 1506 | "nbconvert_exporter": "python", 1507 | "pygments_lexer": "ipython3", 1508 | "version": "3.7.4" 1509 | } 1510 | }, 1511 | "nbformat": 4, 1512 | "nbformat_minor": 2 1513 | } 1514 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2017, Featuretools 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | * Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | * Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | * Neither the name of the copyright holder nor the names of its 17 | contributors may be used to endorse or promote products derived from 18 | this software without specific prior written permission. 19 | 20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 30 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Predicting whether an applicant is capable of repaying a loan 2 | 3 | Featuretools 4 | 5 | 6 | **As a bank decides which applicants to provide loans to, it may wish to predict if the applicant will default on the loan. Through automated feature engineering, we can identify the predictive patterns in the financial data that can be used to ensure that clients capable of repayment are not rejected.** 7 | 8 | In this tutorial, we show how [Featuretools](https://www.featuretools.com) can be used to perform feature engineering on a multi-table dataset of financial information for roughly 300 thousand applicants, provided by Home Credit, to train an accurate machine learning model to predict whether an applicant will repay a loan.
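As a rough sketch of what the notebook does, the snippet below builds a tiny `EntitySet` from just two of the seven tables and runs Deep Feature Synthesis on it (the notebook itself uses all of the tables and the 0.x Featuretools API shown here; entity and relationship choices are illustrative):

```
import featuretools as ft
import pandas as pd

# Two of the seven Home Credit tables; the notebook uses all of them
app = pd.read_csv('input/application_train.csv')
bureau = pd.read_csv('input/bureau.csv')

# Build an EntitySet, add the tables as entities, and link them by client id
es = ft.EntitySet(id='clients')
es = es.entity_from_dataframe(entity_id='app', dataframe=app, index='SK_ID_CURR')
es = es.entity_from_dataframe(entity_id='bureau', dataframe=bureau, index='SK_ID_BUREAU')
es = es.add_relationship(ft.Relationship(es['app']['SK_ID_CURR'], es['bureau']['SK_ID_CURR']))

# Deep Feature Synthesis: automatically build client-level features
feature_matrix, feature_names = ft.dfs(entityset=es, target_entity='app', max_depth=2, verbose=1)
```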
9 | 10 | ## Highlights 11 | 12 | * We automatically generate 1820 features using Deep Feature Synthesis. 13 | * We are able to generate features, check that we are content with those features, and create the feature matrix. 14 | * We are able to generate our features in 1 hour vs 10 hours with manual feature engineering. 15 | 16 | ## Running the tutorial 17 | 1. Clone the repo 18 | 19 | ``` 20 | git clone https://github.com/Featuretools/predict-loan-repayment.git 21 | ``` 22 | 23 | 2. Install the requirements 24 | 25 | ``` 26 | pip install -r requirements.txt 27 | ``` 28 | 29 | *You will also need to install **graphviz** for this demo. Please install graphviz according to the instructions in the [Featuretools Documentation](https://docs.featuretools.com/getting_started/install.html).* 30 | 31 | 3. Download the data 32 | 33 | You can download the data from [Kaggle](https://www.kaggle.com/c/home-credit-default-risk/data). After downloading, save the CSVs to a directory called `input` in the root of this repository. 34 | 35 | 4. Run the Tutorial notebook:
36 | [Automated Loan Repayment](https://github.com/Featuretools/Automated-Manual-Comparison/blob/master/Loan%20Repayment/notebooks/Automated%20Loan%20Repayment.ipynb) 37 | 38 | ``` 39 | jupyter notebook 40 | ``` 41 | 42 | ## Feature Labs 43 | 44 | Featuretools 45 | 46 | 47 | Featuretools is an open source project created by [Feature Labs](https://www.featurelabs.com/). To see the other open source projects we're working on visit Feature Labs [Open Source](https://www.featurelabs.com/open). If building impactful data science pipelines is important to you or your business, please [get in touch](https://www.featurelabs.com/contact/). 48 | 49 | ### Contact 50 | 51 | Any questions can be directed to help@featurelabs.com 52 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | bokeh>=1.0.2 2 | featuretools>=0.16.0 3 | ipython>=5.8.0 4 | jupyter>=1.0.0 5 | lightgbm>=2.2.2 6 | matplotlib>=2.2.3 7 | psutil>=5.4.8 8 | pyarrow>=0.11.1 9 | scikit-learn>=0.20.2 10 | seaborn>=0.9.0 11 | tsfresh>=0.11.1 12 | umap-learn>=0.3.7 13 | graphviz>=0.10.1 14 | --------------------------------------------------------------------------------