├── README.md └── notebooks ├── Automated Feature Engineering Basics.ipynb ├── Introduction to Manual Feature Engineering.ipynb ├── Manual Feature Engineering Part Two.ipynb └── Tuning Automated Feature Engineering.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # kaggle-automated-feature-engineering 2 | Applying automated feature engineering to the [Kaggle Home Credit Default Risk Competition](https://www.kaggle.com/c/home-credit-default-risk) 3 | 4 | This repository documents my application of featuretools for automated feature engineering in a Kaggle competition. The complete set of notebooks can be viewed and run on Kaggle. I have also included manual feature engineering notebooks for comparison purposes. The competition is still ongoing and I will update with new results and notebooks as they are completed. 5 | 6 | ## Notebooks 7 | 8 | * Applied Automated Feature Engineering Basics [Kaggle Link](https://www.kaggle.com/willkoehrsen/applied-automated-feature-engineering-basics/notebook) 9 | * Intro to Tuning Automated Feature Engineering [Kaggle Link](https://www.kaggle.com/willkoehrsen/intro-to-tuning-automated-feature-engineering) 10 | * Intro to Manual Feature Engineering [Kaggle Link](https://www.kaggle.com/willkoehrsen/introduction-to-manual-feature-engineering) 11 | * Manual Feature Engineering Part Two [Kaggle Link](https://www.kaggle.com/willkoehrsen/introduction-to-manual-feature-engineering-p2) 12 | 13 | ## Datasets 14 | 15 | * [Featuretools default featureset](https://www.kaggle.com/willkoehrsen/home-credit-default-risk-feature-tools) 16 | * [Manual Feature Engineering Part One](https://www.kaggle.com/willkoehrsen/home-credit-manual-engineered-features) 17 | * [Manual Feature Engineering Part Two](https://www.kaggle.com/willkoehrsen/home-credit-manual-engineered-features) 18 | -------------------------------------------------------------------------------- /notebooks/Manual Feature Engineering 
Part Two.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "metadata": { 5 | "_uuid": "44bbae45bfef6ef3547b66e18166d9b99a2f4462" 6 | }, 7 | "cell_type": "markdown", 8 | "source": "# Introduction: Manual Feature Engineering (part two)\n\nIn this notebook we will expand on the [Introduction to Manual Feature Engineering](https://www.kaggle.com/willkoehrsen/introduction-to-manual-feature-engineering/output) notebook. We will use the aggregation and value counting functions developed in that notebook in order to incorporate information from the `previous_application`, `POS_CASH_balance`, `installments_payments`, and `credit_card_balance` data files. We already used the information from the `bureau` and `bureau_balance` in the previous notebook and were able to improve our competition score compared to using only the `application` data. After running a model with the features included here, performance does increase, but we run into issues with an explosion in the number of features! I'm working on a notebook of feature selection, but for this notebook we will continue building up a rich set of data for our model. \n\nThe definitions of the four additional data files are:\n\n* previous_application (called `previous`): previous applications for loans at Home Credit of clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature SK_ID_PREV.\n* POS_CASH_BALANCE (called `cash`): monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows.\n* credit_card_balance (called `credit`): monthly data about previous credit cards clients have had with Home Credit. 
Each row is one month of a credit card balance, and a single credit card can have many rows.\n* installments_payments (called `installments`): payment history for previous loans at Home Credit. There is one row for every payment made and one row for every missed payment." 9 | }, 10 | { 11 | "metadata": { 12 | "_uuid": "04c8826306314ba274e409667aa58a806c6f462c" 13 | }, 14 | "cell_type": "markdown", 15 | "source": "# Functions \n\nWe spent quite a bit of time developing two functions in the previous notebook:\n\n* `agg_numeric`: calculate aggregation statistics (`mean`, `max`, `min`, `sum`, plus a row count) for numeric variables.\n* `count_categorical`: compute counts and normalized counts of each category in a categorical variable.\n\nTogether, these two functions can extract information about both the numeric and categorical data in a dataframe. Our general approach will be to apply both of these functions to the dataframes, grouping by the client id, `SK_ID_CURR`. For `POS_CASH_balance`, `credit_card_balance`, and `installments_payments`, we can first group by `SK_ID_PREV`, the unique id for the previous loan. Then we will group the resulting dataframe by `SK_ID_CURR` to calculate the aggregation statistics for each client across all of their previous loans. 
If that's a little confusing, I'd suggest heading back to the [first feature engineering notebook](https://www.kaggle.com/willkoehrsen/introduction-to-manual-feature-engineering/output)." 16 | }, 17 | { 18 | "metadata": { 19 | "trusted": true, 20 | "collapsed": true, 21 | "_uuid": "9b3c521b3cb7916e7473668356a6348d272c0655" 22 | }, 23 | "cell_type": "code", 24 | "source": "# pandas and numpy for data manipulation\nimport pandas as pd\nimport numpy as np\n\n# matplotlib and seaborn for plotting\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n\n# Suppress warnings from pandas\nimport warnings\nwarnings.filterwarnings('ignore')\n\nplt.style.use('fivethirtyeight')\n\n# Memory management\nimport gc ", 25 | "execution_count": 1, 26 | "outputs": [] 27 | }, 28 | { 29 | "metadata": { 30 | "_uuid": "fa91c4f6d654dea9c61780fae217acac0b322835" 31 | }, 32 | "cell_type": "markdown", 33 | "source": "## Function to Aggregate Numeric Data\n\nThis groups data by the `group_var` and calculates `mean`, `max`, `min`, and `sum`. Only numeric columns are aggregated (older pandas versions drop non-numeric columns from this aggregation by default)." 34 | }, 35 | { 36 | "metadata": { 37 | "trusted": true, 38 | "collapsed": true, 39 | "_uuid": "c5a7850897a9f7ea20be066f379b29d46e670584" 40 | }, 41 | "cell_type": "code", 42 | "source": "def agg_numeric(df, group_var, df_name):\n    \"\"\"Aggregates the numeric values in a dataframe. This can\n    be used to create features for each instance of the grouping variable.\n    \n    Parameters\n    --------\n        df (dataframe): \n            the dataframe to calculate the statistics on\n        group_var (string): \n            the variable by which to group df\n        df_name (string): \n            the variable used to rename the columns\n        \n    Return\n    --------\n        agg (dataframe): \n            a dataframe with the statistics aggregated for \n            all numeric columns. Each instance of the grouping variable will have \n            the statistics (mean, min, max, sum; currently supported) calculated. 
\n            The columns are also renamed to keep track of features created.\n    \n    \"\"\"\n    \n    # First calculate counts\n    counts = pd.DataFrame(df.groupby(group_var, as_index = False)[df.columns[1]].count()).rename(columns = {df.columns[1]: '%s_counts' % df_name})\n    \n    # Select the numeric columns explicitly; recent pandas versions no longer drop non-numeric columns automatically\n    numeric_cols = [col for col in df.select_dtypes('number').columns if col != group_var]\n    \n    # Group by the specified variable and calculate the statistics\n    agg = df.groupby(group_var)[numeric_cols].agg(['mean', 'max', 'min', 'sum']).reset_index()\n    \n    # Need to create new column names\n    columns = [group_var]\n    \n    # Iterate through the (variable, stat) column pairs in order\n    for var, stat in agg.columns:\n        # Skip the grouping variable\n        if var != group_var:\n            # Make a new column name for the variable and stat\n            columns.append('%s_%s_%s' % (df_name, var, stat))\n    \n    # Rename the columns\n    agg.columns = columns\n    \n    # Merge with the counts\n    agg = agg.merge(counts, on = group_var, how = 'left')\n    \n    return agg", 43 | "execution_count": 2, 44 | "outputs": [] 45 | }, 46 | { 47 | "metadata": { 48 | "_uuid": "c99bb22a98377fda42944fbd04ad688cb53e3177" 49 | }, 50 | "cell_type": "markdown", 51 | "source": "### Function to Calculate Categorical Counts\n\nThis function calculates the occurrences (counts) of each category in a categorical variable for each client. It also calculates the normed count, which is the count for a category divided by the total counts for all categories in a categorical variable. " 52 | }, 53 | { 54 | "metadata": { 55 | "trusted": true, 56 | "collapsed": true, 57 | "_uuid": "6899ea01794c3c3e44c7bbc6bde3613202d545d4" 58 | }, 59 | "cell_type": "code", 60 | "source": "def count_categorical(df, group_var, df_name):\n    \"\"\"Computes counts and normalized counts for each observation\n    of `group_var` of each unique category in every categorical variable\n    \n    Parameters\n    --------\n    df : dataframe \n        The dataframe to calculate the value counts for.\n        \n    group_var : string\n        The variable by which to group the dataframe. 
For each unique\n        value of this variable, the final dataframe will have one row\n        \n    df_name : string\n        Variable added to the front of column names to keep track of columns\n\n    \n    Return\n    --------\n    categorical : dataframe\n        A dataframe with counts and normalized counts of each unique category in every categorical variable\n        with one row for every unique value of the `group_var`.\n        \n    \"\"\"\n    \n    # One-hot encode the categorical columns\n    categorical = pd.get_dummies(df.select_dtypes('object'))\n\n    # Add the identifying id column back\n    categorical[group_var] = df[group_var]\n\n    # Group by the group var; the sum of a one-hot column is the count and the mean is the normalized count\n    categorical = categorical.groupby(group_var).agg(['sum', 'mean'])\n    \n    column_names = []\n    \n    # Iterate through the columns in level 0\n    for var in categorical.columns.levels[0]:\n        # Iterate through the stats in level 1\n        for stat in ['count', 'count_norm']:\n            # Make a new column name\n            column_names.append('%s_%s_%s' % (df_name, var, stat))\n    \n    categorical.columns = column_names\n    \n    return categorical", 61 | "execution_count": 3, 62 | "outputs": [] 63 | }, 64 | { 65 | "metadata": { 66 | "_uuid": "eda8a6b2429cde1fcbaf5f5beec0707f739b9e73" 67 | }, 68 | "cell_type": "markdown", 69 | "source": "### Function for KDE Plots of Variable\n\nWe also made a function that plots the distribution of a variable colored by the value of `TARGET` (either 1 for did not repay the loan or 0 for did repay the loan). We can use this function to visually examine any new variables we create. It also calculates the correlation coefficient of the variable with the target, which can be used as a rough indication of whether or not the created variable will be useful. 
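The numeric part of that check can be sketched on its own, without the plot. This is a minimal illustration on made-up data: the frame `toy` and the column `example_feature` are hypothetical and are not part of the competition files.

```python
import pandas as pd

# Toy data: TARGET is 1 when the loan was not repaid, 0 when it was repaid.
# The feature name and its values are invented for illustration only.
toy = pd.DataFrame({
    "TARGET": [0, 0, 0, 1, 1, 1],
    "example_feature": [1.0, 2.0, 3.0, 4.0, 5.0, 9.0],
})

# Pearson correlation between the feature and the target
corr = toy["TARGET"].corr(toy["example_feature"])

# Median of the feature within each target group, selected with .loc
median_repaid = toy.loc[toy["TARGET"] == 0, "example_feature"].median()
median_not_repaid = toy.loc[toy["TARGET"] == 1, "example_feature"].median()

print("Correlation with TARGET: %0.4f" % corr)
print("Median when repaid: %0.4f" % median_repaid)
print("Median when not repaid: %0.4f" % median_not_repaid)
```

A correlation far from zero together with clearly separated medians suggests the variable may carry signal, but it is only a first look and not a substitute for evaluating the feature in a model.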
" 70 | }, 71 | { 72 | "metadata": { 73 | "trusted": true, 74 | "collapsed": true, 75 | "_uuid": "557b945ebd7663bee123b9074389141d38c18b6b" 76 | }, 77 | "cell_type": "code", 78 | "source": "# Plots the distribution of a variable colored by value of the target\ndef kde_target(var_name, df):\n    \n    # Calculate the correlation coefficient between the new variable and the target\n    corr = df['TARGET'].corr(df[var_name])\n    \n    # Calculate medians for repaid vs not repaid (.loc replaces the removed .ix indexer)\n    median_repaid = df.loc[df['TARGET'] == 0, var_name].median()\n    median_not_repaid = df.loc[df['TARGET'] == 1, var_name].median()\n    \n    plt.figure(figsize = (12, 6))\n    \n    # Plot the distribution for target == 0 and target == 1\n    sns.kdeplot(df.loc[df['TARGET'] == 0, var_name], label = 'TARGET == 0')\n    sns.kdeplot(df.loc[df['TARGET'] == 1, var_name], label = 'TARGET == 1')\n    \n    # Label the plot\n    plt.xlabel(var_name); plt.ylabel('Density'); plt.title('%s Distribution' % var_name)\n    plt.legend();\n    \n    # Print out the correlation\n    print('The correlation between %s and the TARGET is %0.4f' % (var_name, corr))\n    # Print out median values\n    print('Median value for loan that was not repaid = %0.4f' % median_not_repaid)\n    print('Median value for loan that was repaid =     %0.4f' % median_repaid)\n    ", 79 | "execution_count": 4, 80 | "outputs": [] 81 | }, 82 | { 83 | "metadata": { 84 | "_uuid": "0ed0bafec723c5206f130895838245fcc8d91b26" 85 | }, 86 | "cell_type": "markdown", 87 | "source": "Let's deal with one dataframe at a time. First up is `previous_application`. This has one row for every previous loan a client had at Home Credit. A client can have multiple previous loans, which is why we need to aggregate statistics for each client." 
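To make the per-client aggregation concrete, here is a minimal sketch on a made-up miniature of the previous-loans table. The frame `toy_previous` and all of its values are hypothetical; only the column names and the `previous_loans_` naming pattern follow the notebook.

```python
import pandas as pd

# Hypothetical miniature of previous_application: one row per previous loan
toy_previous = pd.DataFrame({
    "SK_ID_CURR": [100001, 100003, 100003, 100003],
    "AMT_CREDIT": [100.0, 300.0, 50.0, 250.0],
    "CNT_PAYMENT": [8.0, 12.0, 6.0, 12.0],
})

# One row per client: aggregate every numeric column across that client's loans
per_client = toy_previous.groupby("SK_ID_CURR").agg(["mean", "max", "min", "sum"])

# Flatten the (variable, statistic) column pairs into single names
per_client.columns = [
    "previous_loans_%s_%s" % (var, stat) for var, stat in per_client.columns
]

# Number of previous loans for each client (aligned on the SK_ID_CURR index)
per_client["previous_loans_counts"] = toy_previous.groupby("SK_ID_CURR").size()
per_client = per_client.reset_index()

print(per_client)
```

The client with three previous loans collapses to a single row whose columns summarize all three, which is the shape needed to join the features back onto the main application data.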
88 | }, 89 | { 90 | "metadata": { 91 | "_uuid": "063a93d01f0afaebd41f256c13cbecd67d177f19" 92 | }, 93 | "cell_type": "markdown", 94 | "source": "### `previous_application`" 95 | }, 96 | { 97 | "metadata": { 98 | "trusted": true, 99 | "_uuid": "4f81a6dce526a8627039e58d1f7025941fde2798" 100 | }, 101 | "cell_type": "code", 102 | "source": "previous = pd.read_csv('../input/previous_application.csv')\nprevious.head()", 103 | "execution_count": 5, 104 | "outputs": [ 105 | { 106 | "output_type": "execute_result", 107 | "execution_count": 5, 108 | "data": { 109 | "text/plain": " SK_ID_PREV ... NFLAG_INSURED_ON_APPROVAL\n0 2030495 ... 0.0\n1 2802425 ... 1.0\n2 2523466 ... 1.0\n3 2819243 ... 1.0\n4 1784265 ... NaN\n\n[5 rows x 37 columns]", 110 | "text/html": "
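<div>\n
<table>\n
<thead>\n
<tr><th></th><th>SK_ID_PREV</th><th>SK_ID_CURR</th><th>NAME_CONTRACT_TYPE</th><th>AMT_ANNUITY</th><th>AMT_APPLICATION</th><th>AMT_CREDIT</th><th>AMT_DOWN_PAYMENT</th><th>AMT_GOODS_PRICE</th><th>WEEKDAY_APPR_PROCESS_START</th><th>HOUR_APPR_PROCESS_START</th><th>FLAG_LAST_APPL_PER_CONTRACT</th><th>NFLAG_LAST_APPL_IN_DAY</th><th>RATE_DOWN_PAYMENT</th><th>RATE_INTEREST_PRIMARY</th><th>RATE_INTEREST_PRIVILEGED</th><th>NAME_CASH_LOAN_PURPOSE</th><th>NAME_CONTRACT_STATUS</th><th>DAYS_DECISION</th><th>NAME_PAYMENT_TYPE</th><th>CODE_REJECT_REASON</th><th>NAME_TYPE_SUITE</th><th>NAME_CLIENT_TYPE</th><th>NAME_GOODS_CATEGORY</th><th>NAME_PORTFOLIO</th><th>NAME_PRODUCT_TYPE</th><th>CHANNEL_TYPE</th><th>SELLERPLACE_AREA</th><th>NAME_SELLER_INDUSTRY</th><th>CNT_PAYMENT</th><th>NAME_YIELD_GROUP</th><th>PRODUCT_COMBINATION</th><th>DAYS_FIRST_DRAWING</th><th>DAYS_FIRST_DUE</th><th>DAYS_LAST_DUE_1ST_VERSION</th><th>DAYS_LAST_DUE</th><th>DAYS_TERMINATION</th><th>NFLAG_INSURED_ON_APPROVAL</th></tr>\n
</thead>\n
<tbody>\n
<tr><th>0</th><td>2030495</td><td>271877</td><td>Consumer loans</td><td>1730.430</td><td>17145.0</td><td>17145.0</td><td>0.0</td><td>17145.0</td><td>SATURDAY</td><td>15</td><td>Y</td><td>1</td><td>0.0</td><td>0.182832</td><td>0.867336</td><td>XAP</td><td>Approved</td><td>-73</td><td>Cash through the bank</td><td>XAP</td><td>NaN</td><td>Repeater</td><td>Mobile</td><td>POS</td><td>XNA</td><td>Country-wide</td><td>35</td><td>Connectivity</td><td>12.0</td><td>middle</td><td>POS mobile with interest</td><td>365243.0</td><td>-42.0</td><td>300.0</td><td>-42.0</td><td>-37.0</td><td>0.0</td></tr>\n
<tr><th>1</th><td>2802425</td><td>108129</td><td>Cash loans</td><td>25188.615</td><td>607500.0</td><td>679671.0</td><td>NaN</td><td>607500.0</td><td>THURSDAY</td><td>11</td><td>Y</td><td>1</td><td>NaN</td><td>NaN</td><td>NaN</td><td>XNA</td><td>Approved</td><td>-164</td><td>XNA</td><td>XAP</td><td>Unaccompanied</td><td>Repeater</td><td>XNA</td><td>Cash</td><td>x-sell</td><td>Contact center</td><td>-1</td><td>XNA</td><td>36.0</td><td>low_action</td><td>Cash X-Sell: low</td><td>365243.0</td><td>-134.0</td><td>916.0</td><td>365243.0</td><td>365243.0</td><td>1.0</td></tr>\n
<tr><th>2</th><td>2523466</td><td>122040</td><td>Cash loans</td><td>15060.735</td><td>112500.0</td><td>136444.5</td><td>NaN</td><td>112500.0</td><td>TUESDAY</td><td>11</td><td>Y</td><td>1</td><td>NaN</td><td>NaN</td><td>NaN</td><td>XNA</td><td>Approved</td><td>-301</td><td>Cash through the bank</td><td>XAP</td><td>Spouse, partner</td><td>Repeater</td><td>XNA</td><td>Cash</td><td>x-sell</td><td>Credit and cash offices</td><td>-1</td><td>XNA</td><td>12.0</td><td>high</td><td>Cash X-Sell: high</td><td>365243.0</td><td>-271.0</td><td>59.0</td><td>365243.0</td><td>365243.0</td><td>1.0</td></tr>\n
<tr><th>3</th><td>2819243</td><td>176158</td><td>Cash loans</td><td>47041.335</td><td>450000.0</td><td>470790.0</td><td>NaN</td><td>450000.0</td><td>MONDAY</td><td>7</td><td>Y</td><td>1</td><td>NaN</td><td>NaN</td><td>NaN</td><td>XNA</td><td>Approved</td><td>-512</td><td>Cash through the bank</td><td>XAP</td><td>NaN</td><td>Repeater</td><td>XNA</td><td>Cash</td><td>x-sell</td><td>Credit and cash offices</td><td>-1</td><td>XNA</td><td>12.0</td><td>middle</td><td>Cash X-Sell: middle</td><td>365243.0</td><td>-482.0</td><td>-152.0</td><td>-182.0</td><td>-177.0</td><td>1.0</td></tr>\n
<tr><th>4</th><td>1784265</td><td>202054</td><td>Cash loans</td><td>31924.395</td><td>337500.0</td><td>404055.0</td><td>NaN</td><td>337500.0</td><td>THURSDAY</td><td>9</td><td>Y</td><td>1</td><td>NaN</td><td>NaN</td><td>NaN</td><td>Repairs</td><td>Refused</td><td>-781</td><td>Cash through the bank</td><td>HC</td><td>NaN</td><td>Repeater</td><td>XNA</td><td>Cash</td><td>walk-in</td><td>Credit and cash offices</td><td>-1</td><td>XNA</td><td>24.0</td><td>high</td><td>Cash Street: high</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n
</tbody>\n
</table>\n
</div>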
" 111 | }, 112 | "metadata": {} 113 | } 114 | ] 115 | }, 116 | { 117 | "metadata": { 118 | "trusted": true, 119 | "_uuid": "f70f561a3075d5c0f68c31cf33449d9db8583ae9" 120 | }, 121 | "cell_type": "code", 122 | "source": "# Calculate aggregate statistics for each numeric column\nprevious_agg = agg_numeric(previous.drop(columns = ['SK_ID_PREV']), group_var = 'SK_ID_CURR', df_name = 'previous_loans')\nprevious_agg.head()", 123 | "execution_count": 6, 124 | "outputs": [ 125 | { 126 | "output_type": "execute_result", 127 | "execution_count": 6, 128 | "data": { 129 | "text/plain": " SK_ID_CURR ... previous_loans_counts\n0 100001 ... 1\n1 100002 ... 1\n2 100003 ... 3\n3 100004 ... 1\n4 100005 ... 2\n\n[5 rows x 78 columns]", 130 | "text/html": "
SK_ID_CURRprevious_loans_AMT_ANNUITY_meanprevious_loans_AMT_ANNUITY_maxprevious_loans_AMT_ANNUITY_minprevious_loans_AMT_ANNUITY_sumprevious_loans_AMT_APPLICATION_meanprevious_loans_AMT_APPLICATION_maxprevious_loans_AMT_APPLICATION_minprevious_loans_AMT_APPLICATION_sumprevious_loans_AMT_CREDIT_meanprevious_loans_AMT_CREDIT_maxprevious_loans_AMT_CREDIT_minprevious_loans_AMT_CREDIT_sumprevious_loans_AMT_DOWN_PAYMENT_meanprevious_loans_AMT_DOWN_PAYMENT_maxprevious_loans_AMT_DOWN_PAYMENT_minprevious_loans_AMT_DOWN_PAYMENT_sumprevious_loans_AMT_GOODS_PRICE_meanprevious_loans_AMT_GOODS_PRICE_maxprevious_loans_AMT_GOODS_PRICE_minprevious_loans_AMT_GOODS_PRICE_sumprevious_loans_HOUR_APPR_PROCESS_START_meanprevious_loans_HOUR_APPR_PROCESS_START_maxprevious_loans_HOUR_APPR_PROCESS_START_minprevious_loans_HOUR_APPR_PROCESS_START_sumprevious_loans_NFLAG_LAST_APPL_IN_DAY_meanprevious_loans_NFLAG_LAST_APPL_IN_DAY_maxprevious_loans_NFLAG_LAST_APPL_IN_DAY_minprevious_loans_NFLAG_LAST_APPL_IN_DAY_sumprevious_loans_RATE_DOWN_PAYMENT_meanprevious_loans_RATE_DOWN_PAYMENT_maxprevious_loans_RATE_DOWN_PAYMENT_minprevious_loans_RATE_DOWN_PAYMENT_sumprevious_loans_RATE_INTEREST_PRIMARY_meanprevious_loans_RATE_INTEREST_PRIMARY_maxprevious_loans_RATE_INTEREST_PRIMARY_minprevious_loans_RATE_INTEREST_PRIMARY_sumprevious_loans_RATE_INTEREST_PRIVILEGED_meanprevious_loans_RATE_INTEREST_PRIVILEGED_maxprevious_loans_RATE_INTEREST_PRIVILEGED_minprevious_loans_RATE_INTEREST_PRIVILEGED_sumprevious_loans_DAYS_DECISION_meanprevious_loans_DAYS_DECISION_maxprevious_loans_DAYS_DECISION_minprevious_loans_DAYS_DECISION_sumprevious_loans_SELLERPLACE_AREA_meanprevious_loans_SELLERPLACE_AREA_maxprevious_loans_SELLERPLACE_AREA_minprevious_loans_SELLERPLACE_AREA_sumprevious_loans_CNT_PAYMENT_meanprevious_loans_CNT_PAYMENT_maxprevious_loans_CNT_PAYMENT_minprevious_loans_CNT_PAYMENT_sumprevious_loans_DAYS_FIRST_DRAWING_meanprevious_loans_DAYS_FIRST_DRAWING_maxprevious_loans_DAYS_FIRST_DRAWING_minprevious_loans_DAYS_F
IRST_DRAWING_sumprevious_loans_DAYS_FIRST_DUE_meanprevious_loans_DAYS_FIRST_DUE_maxprevious_loans_DAYS_FIRST_DUE_minprevious_loans_DAYS_FIRST_DUE_sumprevious_loans_DAYS_LAST_DUE_1ST_VERSION_meanprevious_loans_DAYS_LAST_DUE_1ST_VERSION_maxprevious_loans_DAYS_LAST_DUE_1ST_VERSION_minprevious_loans_DAYS_LAST_DUE_1ST_VERSION_sumprevious_loans_DAYS_LAST_DUE_meanprevious_loans_DAYS_LAST_DUE_maxprevious_loans_DAYS_LAST_DUE_minprevious_loans_DAYS_LAST_DUE_sumprevious_loans_DAYS_TERMINATION_meanprevious_loans_DAYS_TERMINATION_maxprevious_loans_DAYS_TERMINATION_minprevious_loans_DAYS_TERMINATION_sumprevious_loans_NFLAG_INSURED_ON_APPROVAL_meanprevious_loans_NFLAG_INSURED_ON_APPROVAL_maxprevious_loans_NFLAG_INSURED_ON_APPROVAL_minprevious_loans_NFLAG_INSURED_ON_APPROVAL_sumprevious_loans_counts
01000013951.0003951.0003951.0003951.00024835.5024835.524835.524835.523787.0023787.023787.023787.02520.02520.02520.02520.024835.524835.524835.524835.513.0000001313131.01110.1043260.1043260.1043260.104326NaNNaNNaN0.0NaNNaNNaN0.0-1740.0-1740-1740-174023.02323238.08.08.08.0365243.0365243.0365243.0365243.0-1709.000000-1709.0-1709.0-1709.0-1499.000000-1499.0-1499.0-1499.0-1619.000000-1619.0-1619.0-1619.0-1612.000000-1612.0-1612.0-1612.00.0000000.00.00.01
11000029251.7759251.7759251.7759251.775179055.00179055.0179055.0179055.0179055.00179055.0179055.0179055.00.00.00.00.0179055.0179055.0179055.0179055.09.0000009991.01110.0000000.0000000.0000000.000000NaNNaNNaN0.0NaNNaNNaN0.0-606.0-606-606-606500.050050050024.024.024.024.0365243.0365243.0365243.0365243.0-565.000000-565.0-565.0-565.0125.000000125.0125.0125.0-25.000000-25.0-25.0-25.0-17.000000-17.0-17.0-17.00.0000000.00.00.01
210000356553.99098356.9956737.310169661.970435436.50900000.068809.51306309.5484191.001035882.068053.51452573.03442.56885.00.06885.0435436.5900000.068809.51306309.514.6666671712441.01130.0500300.1000610.0000000.100061NaNNaNNaN0.0NaNNaNNaN0.0-1305.0-746-2341-3915533.01400-1159910.012.06.030.0365243.0365243.0365243.01095729.0-1274.333333-716.0-2310.0-3823.0-1004.333333-386.0-1980.0-3013.0-1054.333333-536.0-1980.0-3163.0-1047.333333-527.0-1976.0-3142.00.6666671.00.02.03
31000045357.2505357.2505357.2505357.25024282.0024282.024282.024282.020106.0020106.020106.020106.04860.04860.04860.04860.024282.024282.024282.024282.05.0000005551.01110.2120080.2120080.2120080.212008NaNNaNNaN0.0NaNNaNNaN0.0-815.0-815-815-81530.03030304.04.04.04.0365243.0365243.0365243.0365243.0-784.000000-784.0-784.0-784.0-694.000000-694.0-694.0-694.0-724.000000-724.0-724.0-724.0-714.000000-714.0-714.0-714.00.0000000.00.00.01
41000054813.2004813.2004813.2004813.20022308.7544617.50.044617.520076.7540153.50.040153.54464.04464.04464.04464.044617.544617.544617.544617.510.5000001110211.01120.1089640.1089640.1089640.108964NaNNaNNaN0.0NaNNaNNaN0.0-536.0-315-757-107218.037-13612.012.012.012.0365243.0365243.0365243.0365243.0-706.000000-706.0-706.0-706.0-376.000000-376.0-376.0-376.0-466.000000-466.0-466.0-466.0-460.000000-460.0-460.0-460.00.0000000.00.00.02
\n
" 131 | }, 132 | "metadata": {} 133 | } 134 | ] 135 | }, 136 | { 137 | "metadata": { 138 | "trusted": true, 139 | "_uuid": "267a4f0a8097c4b13bbcb67faa25fafb7dee0af5" 140 | }, 141 | "cell_type": "code", 142 | "source": "# Calculate value counts for each categorical column\nprevious_counts = count_categorical(previous, group_var = 'SK_ID_CURR', df_name = 'previous_loans')\nprevious_counts.head()", 143 | "execution_count": 7, 144 | "outputs": [ 145 | { 146 | "output_type": "execute_result", 147 | "execution_count": 7, 148 | "data": { 149 | "text/plain": " previous_loans_NAME_CONTRACT_TYPE_Cash loans_count previous_loans_NAME_CONTRACT_TYPE_Cash loans_count_norm previous_loans_NAME_CONTRACT_TYPE_Consumer loans_count previous_loans_NAME_CONTRACT_TYPE_Consumer loans_count_norm previous_loans_NAME_CONTRACT_TYPE_Revolving loans_count previous_loans_NAME_CONTRACT_TYPE_Revolving loans_count_norm previous_loans_NAME_CONTRACT_TYPE_XNA_count previous_loans_NAME_CONTRACT_TYPE_XNA_count_norm previous_loans_WEEKDAY_APPR_PROCESS_START_FRIDAY_count previous_loans_WEEKDAY_APPR_PROCESS_START_FRIDAY_count_norm previous_loans_WEEKDAY_APPR_PROCESS_START_MONDAY_count previous_loans_WEEKDAY_APPR_PROCESS_START_MONDAY_count_norm previous_loans_WEEKDAY_APPR_PROCESS_START_SATURDAY_count previous_loans_WEEKDAY_APPR_PROCESS_START_SATURDAY_count_norm previous_loans_WEEKDAY_APPR_PROCESS_START_SUNDAY_count previous_loans_WEEKDAY_APPR_PROCESS_START_SUNDAY_count_norm previous_loans_WEEKDAY_APPR_PROCESS_START_THURSDAY_count previous_loans_WEEKDAY_APPR_PROCESS_START_THURSDAY_count_norm previous_loans_WEEKDAY_APPR_PROCESS_START_TUESDAY_count previous_loans_WEEKDAY_APPR_PROCESS_START_TUESDAY_count_norm previous_loans_WEEKDAY_APPR_PROCESS_START_WEDNESDAY_count previous_loans_WEEKDAY_APPR_PROCESS_START_WEDNESDAY_count_norm previous_loans_FLAG_LAST_APPL_PER_CONTRACT_N_count previous_loans_FLAG_LAST_APPL_PER_CONTRACT_N_count_norm previous_loans_FLAG_LAST_APPL_PER_CONTRACT_Y_count 
previous_loans_FLAG_LAST_APPL_PER_CONTRACT_Y_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_Building a house or an annex_count previous_loans_NAME_CASH_LOAN_PURPOSE_Building a house or an annex_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_Business development_count previous_loans_NAME_CASH_LOAN_PURPOSE_Business development_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_Buying a garage_count previous_loans_NAME_CASH_LOAN_PURPOSE_Buying a garage_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_Buying a holiday home / land_count previous_loans_NAME_CASH_LOAN_PURPOSE_Buying a holiday home / land_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_Buying a home_count previous_loans_NAME_CASH_LOAN_PURPOSE_Buying a home_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_Buying a new car_count previous_loans_NAME_CASH_LOAN_PURPOSE_Buying a new car_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_Buying a used car_count previous_loans_NAME_CASH_LOAN_PURPOSE_Buying a used car_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_Car repairs_count previous_loans_NAME_CASH_LOAN_PURPOSE_Car repairs_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_Education_count previous_loans_NAME_CASH_LOAN_PURPOSE_Education_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_Everyday expenses_count previous_loans_NAME_CASH_LOAN_PURPOSE_Everyday expenses_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_Furniture_count previous_loans_NAME_CASH_LOAN_PURPOSE_Furniture_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_Gasification / water supply_count previous_loans_NAME_CASH_LOAN_PURPOSE_Gasification / water supply_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_Hobby_count previous_loans_NAME_CASH_LOAN_PURPOSE_Hobby_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_Journey_count previous_loans_NAME_CASH_LOAN_PURPOSE_Journey_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_Medicine_count previous_loans_NAME_CASH_LOAN_PURPOSE_Medicine_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_Money for a third 
person_count previous_loans_NAME_CASH_LOAN_PURPOSE_Money for a third person_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_Other_count previous_loans_NAME_CASH_LOAN_PURPOSE_Other_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_Payments on other loans_count previous_loans_NAME_CASH_LOAN_PURPOSE_Payments on other loans_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_Purchase of electronic equipment_count previous_loans_NAME_CASH_LOAN_PURPOSE_Purchase of electronic equipment_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_Refusal to name the goal_count previous_loans_NAME_CASH_LOAN_PURPOSE_Refusal to name the goal_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_Repairs_count previous_loans_NAME_CASH_LOAN_PURPOSE_Repairs_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_Urgent needs_count previous_loans_NAME_CASH_LOAN_PURPOSE_Urgent needs_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_Wedding / gift / holiday_count previous_loans_NAME_CASH_LOAN_PURPOSE_Wedding / gift / holiday_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_XAP_count previous_loans_NAME_CASH_LOAN_PURPOSE_XAP_count_norm previous_loans_NAME_CASH_LOAN_PURPOSE_XNA_count previous_loans_NAME_CASH_LOAN_PURPOSE_XNA_count_norm previous_loans_NAME_CONTRACT_STATUS_Approved_count previous_loans_NAME_CONTRACT_STATUS_Approved_count_norm previous_loans_NAME_CONTRACT_STATUS_Canceled_count previous_loans_NAME_CONTRACT_STATUS_Canceled_count_norm previous_loans_NAME_CONTRACT_STATUS_Refused_count previous_loans_NAME_CONTRACT_STATUS_Refused_count_norm previous_loans_NAME_CONTRACT_STATUS_Unused offer_count previous_loans_NAME_CONTRACT_STATUS_Unused offer_count_norm previous_loans_NAME_PAYMENT_TYPE_Cash through the bank_count previous_loans_NAME_PAYMENT_TYPE_Cash through the bank_count_norm previous_loans_NAME_PAYMENT_TYPE_Cashless from the account of the employer_count previous_loans_NAME_PAYMENT_TYPE_Cashless from the account of the employer_count_norm previous_loans_NAME_PAYMENT_TYPE_Non-cash from your 
account_count previous_loans_NAME_PAYMENT_TYPE_Non-cash from your account_count_norm previous_loans_NAME_PAYMENT_TYPE_XNA_count previous_loans_NAME_PAYMENT_TYPE_XNA_count_norm previous_loans_CODE_REJECT_REASON_CLIENT_count previous_loans_CODE_REJECT_REASON_CLIENT_count_norm previous_loans_CODE_REJECT_REASON_HC_count previous_loans_CODE_REJECT_REASON_HC_count_norm previous_loans_CODE_REJECT_REASON_LIMIT_count previous_loans_CODE_REJECT_REASON_LIMIT_count_norm previous_loans_CODE_REJECT_REASON_SCO_count previous_loans_CODE_REJECT_REASON_SCO_count_norm previous_loans_CODE_REJECT_REASON_SCOFR_count previous_loans_CODE_REJECT_REASON_SCOFR_count_norm previous_loans_CODE_REJECT_REASON_SYSTEM_count previous_loans_CODE_REJECT_REASON_SYSTEM_count_norm previous_loans_CODE_REJECT_REASON_VERIF_count previous_loans_CODE_REJECT_REASON_VERIF_count_norm previous_loans_CODE_REJECT_REASON_XAP_count previous_loans_CODE_REJECT_REASON_XAP_count_norm previous_loans_CODE_REJECT_REASON_XNA_count previous_loans_CODE_REJECT_REASON_XNA_count_norm previous_loans_NAME_TYPE_SUITE_Children_count previous_loans_NAME_TYPE_SUITE_Children_count_norm previous_loans_NAME_TYPE_SUITE_Family_count previous_loans_NAME_TYPE_SUITE_Family_count_norm previous_loans_NAME_TYPE_SUITE_Group of people_count previous_loans_NAME_TYPE_SUITE_Group of people_count_norm previous_loans_NAME_TYPE_SUITE_Other_A_count previous_loans_NAME_TYPE_SUITE_Other_A_count_norm previous_loans_NAME_TYPE_SUITE_Other_B_count previous_loans_NAME_TYPE_SUITE_Other_B_count_norm previous_loans_NAME_TYPE_SUITE_Spouse, partner_count previous_loans_NAME_TYPE_SUITE_Spouse, partner_count_norm previous_loans_NAME_TYPE_SUITE_Unaccompanied_count previous_loans_NAME_TYPE_SUITE_Unaccompanied_count_norm previous_loans_NAME_CLIENT_TYPE_New_count previous_loans_NAME_CLIENT_TYPE_New_count_norm previous_loans_NAME_CLIENT_TYPE_Refreshed_count previous_loans_NAME_CLIENT_TYPE_Refreshed_count_norm previous_loans_NAME_CLIENT_TYPE_Repeater_count 
previous_loans_NAME_CLIENT_TYPE_Repeater_count_norm previous_loans_NAME_CLIENT_TYPE_XNA_count previous_loans_NAME_CLIENT_TYPE_XNA_count_norm previous_loans_NAME_GOODS_CATEGORY_Additional Service_count previous_loans_NAME_GOODS_CATEGORY_Additional Service_count_norm previous_loans_NAME_GOODS_CATEGORY_Animals_count previous_loans_NAME_GOODS_CATEGORY_Animals_count_norm previous_loans_NAME_GOODS_CATEGORY_Audio/Video_count previous_loans_NAME_GOODS_CATEGORY_Audio/Video_count_norm previous_loans_NAME_GOODS_CATEGORY_Auto Accessories_count previous_loans_NAME_GOODS_CATEGORY_Auto Accessories_count_norm previous_loans_NAME_GOODS_CATEGORY_Clothing and Accessories_count previous_loans_NAME_GOODS_CATEGORY_Clothing and Accessories_count_norm previous_loans_NAME_GOODS_CATEGORY_Computers_count previous_loans_NAME_GOODS_CATEGORY_Computers_count_norm previous_loans_NAME_GOODS_CATEGORY_Construction Materials_count previous_loans_NAME_GOODS_CATEGORY_Construction Materials_count_norm previous_loans_NAME_GOODS_CATEGORY_Consumer Electronics_count previous_loans_NAME_GOODS_CATEGORY_Consumer Electronics_count_norm previous_loans_NAME_GOODS_CATEGORY_Direct Sales_count previous_loans_NAME_GOODS_CATEGORY_Direct Sales_count_norm previous_loans_NAME_GOODS_CATEGORY_Education_count previous_loans_NAME_GOODS_CATEGORY_Education_count_norm previous_loans_NAME_GOODS_CATEGORY_Fitness_count previous_loans_NAME_GOODS_CATEGORY_Fitness_count_norm previous_loans_NAME_GOODS_CATEGORY_Furniture_count previous_loans_NAME_GOODS_CATEGORY_Furniture_count_norm previous_loans_NAME_GOODS_CATEGORY_Gardening_count previous_loans_NAME_GOODS_CATEGORY_Gardening_count_norm previous_loans_NAME_GOODS_CATEGORY_Homewares_count previous_loans_NAME_GOODS_CATEGORY_Homewares_count_norm previous_loans_NAME_GOODS_CATEGORY_House Construction_count previous_loans_NAME_GOODS_CATEGORY_House Construction_count_norm previous_loans_NAME_GOODS_CATEGORY_Insurance_count previous_loans_NAME_GOODS_CATEGORY_Insurance_count_norm 
previous_loans_NAME_GOODS_CATEGORY_Jewelry_count previous_loans_NAME_GOODS_CATEGORY_Jewelry_count_norm previous_loans_NAME_GOODS_CATEGORY_Medical Supplies_count previous_loans_NAME_GOODS_CATEGORY_Medical Supplies_count_norm previous_loans_NAME_GOODS_CATEGORY_Medicine_count previous_loans_NAME_GOODS_CATEGORY_Medicine_count_norm previous_loans_NAME_GOODS_CATEGORY_Mobile_count previous_loans_NAME_GOODS_CATEGORY_Mobile_count_norm previous_loans_NAME_GOODS_CATEGORY_Office Appliances_count previous_loans_NAME_GOODS_CATEGORY_Office Appliances_count_norm previous_loans_NAME_GOODS_CATEGORY_Other_count previous_loans_NAME_GOODS_CATEGORY_Other_count_norm previous_loans_NAME_GOODS_CATEGORY_Photo / Cinema Equipment_count previous_loans_NAME_GOODS_CATEGORY_Photo / Cinema Equipment_count_norm previous_loans_NAME_GOODS_CATEGORY_Sport and Leisure_count previous_loans_NAME_GOODS_CATEGORY_Sport and Leisure_count_norm previous_loans_NAME_GOODS_CATEGORY_Tourism_count previous_loans_NAME_GOODS_CATEGORY_Tourism_count_norm previous_loans_NAME_GOODS_CATEGORY_Vehicles_count previous_loans_NAME_GOODS_CATEGORY_Vehicles_count_norm previous_loans_NAME_GOODS_CATEGORY_Weapon_count previous_loans_NAME_GOODS_CATEGORY_Weapon_count_norm previous_loans_NAME_GOODS_CATEGORY_XNA_count previous_loans_NAME_GOODS_CATEGORY_XNA_count_norm previous_loans_NAME_PORTFOLIO_Cards_count previous_loans_NAME_PORTFOLIO_Cards_count_norm previous_loans_NAME_PORTFOLIO_Cars_count previous_loans_NAME_PORTFOLIO_Cars_count_norm previous_loans_NAME_PORTFOLIO_Cash_count previous_loans_NAME_PORTFOLIO_Cash_count_norm previous_loans_NAME_PORTFOLIO_POS_count previous_loans_NAME_PORTFOLIO_POS_count_norm previous_loans_NAME_PORTFOLIO_XNA_count previous_loans_NAME_PORTFOLIO_XNA_count_norm previous_loans_NAME_PRODUCT_TYPE_XNA_count previous_loans_NAME_PRODUCT_TYPE_XNA_count_norm previous_loans_NAME_PRODUCT_TYPE_walk-in_count previous_loans_NAME_PRODUCT_TYPE_walk-in_count_norm previous_loans_NAME_PRODUCT_TYPE_x-sell_count 
previous_loans_NAME_PRODUCT_TYPE_x-sell_count_norm previous_loans_CHANNEL_TYPE_AP+ (Cash loan)_count previous_loans_CHANNEL_TYPE_AP+ (Cash loan)_count_norm previous_loans_CHANNEL_TYPE_Car dealer_count previous_loans_CHANNEL_TYPE_Car dealer_count_norm previous_loans_CHANNEL_TYPE_Channel of corporate sales_count previous_loans_CHANNEL_TYPE_Channel of corporate sales_count_norm previous_loans_CHANNEL_TYPE_Contact center_count previous_loans_CHANNEL_TYPE_Contact center_count_norm previous_loans_CHANNEL_TYPE_Country-wide_count previous_loans_CHANNEL_TYPE_Country-wide_count_norm previous_loans_CHANNEL_TYPE_Credit and cash offices_count previous_loans_CHANNEL_TYPE_Credit and cash offices_count_norm previous_loans_CHANNEL_TYPE_Regional / Local_count previous_loans_CHANNEL_TYPE_Regional / Local_count_norm previous_loans_CHANNEL_TYPE_Stone_count previous_loans_CHANNEL_TYPE_Stone_count_norm previous_loans_NAME_SELLER_INDUSTRY_Auto technology_count previous_loans_NAME_SELLER_INDUSTRY_Auto technology_count_norm previous_loans_NAME_SELLER_INDUSTRY_Clothing_count previous_loans_NAME_SELLER_INDUSTRY_Clothing_count_norm previous_loans_NAME_SELLER_INDUSTRY_Connectivity_count previous_loans_NAME_SELLER_INDUSTRY_Connectivity_count_norm previous_loans_NAME_SELLER_INDUSTRY_Construction_count previous_loans_NAME_SELLER_INDUSTRY_Construction_count_norm previous_loans_NAME_SELLER_INDUSTRY_Consumer electronics_count previous_loans_NAME_SELLER_INDUSTRY_Consumer electronics_count_norm previous_loans_NAME_SELLER_INDUSTRY_Furniture_count previous_loans_NAME_SELLER_INDUSTRY_Furniture_count_norm previous_loans_NAME_SELLER_INDUSTRY_Industry_count previous_loans_NAME_SELLER_INDUSTRY_Industry_count_norm previous_loans_NAME_SELLER_INDUSTRY_Jewelry_count previous_loans_NAME_SELLER_INDUSTRY_Jewelry_count_norm previous_loans_NAME_SELLER_INDUSTRY_MLM partners_count previous_loans_NAME_SELLER_INDUSTRY_MLM partners_count_norm previous_loans_NAME_SELLER_INDUSTRY_Tourism_count 
previous_loans_NAME_SELLER_INDUSTRY_Tourism_count_norm previous_loans_NAME_SELLER_INDUSTRY_XNA_count previous_loans_NAME_SELLER_INDUSTRY_XNA_count_norm previous_loans_NAME_YIELD_GROUP_XNA_count previous_loans_NAME_YIELD_GROUP_XNA_count_norm previous_loans_NAME_YIELD_GROUP_high_count previous_loans_NAME_YIELD_GROUP_high_count_norm previous_loans_NAME_YIELD_GROUP_low_action_count previous_loans_NAME_YIELD_GROUP_low_action_count_norm previous_loans_NAME_YIELD_GROUP_low_normal_count previous_loans_NAME_YIELD_GROUP_low_normal_count_norm previous_loans_NAME_YIELD_GROUP_middle_count previous_loans_NAME_YIELD_GROUP_middle_count_norm previous_loans_PRODUCT_COMBINATION_Card Street_count previous_loans_PRODUCT_COMBINATION_Card Street_count_norm previous_loans_PRODUCT_COMBINATION_Card X-Sell_count previous_loans_PRODUCT_COMBINATION_Card X-Sell_count_norm previous_loans_PRODUCT_COMBINATION_Cash_count previous_loans_PRODUCT_COMBINATION_Cash_count_norm previous_loans_PRODUCT_COMBINATION_Cash Street: high_count previous_loans_PRODUCT_COMBINATION_Cash Street: high_count_norm previous_loans_PRODUCT_COMBINATION_Cash Street: low_count previous_loans_PRODUCT_COMBINATION_Cash Street: low_count_norm previous_loans_PRODUCT_COMBINATION_Cash Street: middle_count previous_loans_PRODUCT_COMBINATION_Cash Street: middle_count_norm previous_loans_PRODUCT_COMBINATION_Cash X-Sell: high_count previous_loans_PRODUCT_COMBINATION_Cash X-Sell: high_count_norm previous_loans_PRODUCT_COMBINATION_Cash X-Sell: low_count previous_loans_PRODUCT_COMBINATION_Cash X-Sell: low_count_norm previous_loans_PRODUCT_COMBINATION_Cash X-Sell: middle_count previous_loans_PRODUCT_COMBINATION_Cash X-Sell: middle_count_norm previous_loans_PRODUCT_COMBINATION_POS household with interest_count previous_loans_PRODUCT_COMBINATION_POS household with interest_count_norm previous_loans_PRODUCT_COMBINATION_POS household without interest_count previous_loans_PRODUCT_COMBINATION_POS household without interest_count_norm 
previous_loans_PRODUCT_COMBINATION_POS industry with interest_count previous_loans_PRODUCT_COMBINATION_POS industry with interest_count_norm previous_loans_PRODUCT_COMBINATION_POS industry without interest_count previous_loans_PRODUCT_COMBINATION_POS industry without interest_count_norm previous_loans_PRODUCT_COMBINATION_POS mobile with interest_count previous_loans_PRODUCT_COMBINATION_POS mobile with interest_count_norm previous_loans_PRODUCT_COMBINATION_POS mobile without interest_count previous_loans_PRODUCT_COMBINATION_POS mobile without interest_count_norm previous_loans_PRODUCT_COMBINATION_POS other with interest_count previous_loans_PRODUCT_COMBINATION_POS other with interest_count_norm previous_loans_PRODUCT_COMBINATION_POS others without interest_count previous_loans_PRODUCT_COMBINATION_POS others without interest_count_norm\nSK_ID_CURR \n100001 0 0.000000 1 1.000000 0 0.0 0 0.0 1 1.000000 0 0.0 0 0.000000 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 1 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 1 1.000000 0 0.000000 1 1.0 0 0.0 0 0.0 0 0.0 1 1.000000 0 0.0 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 1 1.0 0 0.0 0 0.0 1 1.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 0 0.0 1 1.000000 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 1 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.000000 1 1.000000 0 0.0 1 1.000000 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 1 1.000000 0 0.000000 0 0.0 0 0.000000 0 0.0 0 0.0 1 1.0 0 0.0 0 0.000000 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 0 0.0 1 1.0 0 0.0 0 0.000000 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 0 0.0 0 0.000000 0 0.0 0 0.000000 0 0.0 1 1.0 0 0.0 0 0.0 0 0.0 \n100002 0 0.000000 1 1.000000 0 0.0 0 0.0 0 0.000000 0 0.0 1 1.000000 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 1 1.0 0 0.0 0 
0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 1 1.000000 0 0.000000 1 1.0 0 0.0 0 0.0 0 0.0 0 0.000000 0 0.0 0 0.0 1 1.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 1 1.0 0 0.0 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 1 1.0 0 0.000000 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 1 1.0 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.000000 1 1.000000 0 0.0 1 1.000000 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 0 0.000000 0 0.0 1 1.000000 1 1.0 0 0.0 0 0.0 0 0.0 0 0.000000 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 1 1.000000 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 0 0.0 0 0.000000 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 1 1.0 0 0.0 \n100003 1 0.333333 2 0.666667 0 0.0 0 0.0 1 0.333333 0 0.0 1 0.333333 1 0.333333 0 0.0 0 0.0 0 0.0 0 0.0 3 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 2 0.666667 1 0.333333 3 1.0 0 0.0 0 0.0 0 0.0 2 0.666667 0 0.0 0 0.0 1 0.333333 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 3 1.0 0 0.0 0 0.0 2 0.666667 0 0.0 0 0.0 0 0.0 0 0.0 1 0.333333 0 0.0 2 0.666667 1 0.333333 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 1 0.333333 0 0.0 0 0.0 0 0.0 1 0.333333 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 1 0.333333 0 0.0 0 0.0 1 0.333333 2 0.666667 0 0.0 2 0.666667 0 0.0 1 0.333333 0 0.0 0 0.0 0 0.0 0 0.0 1 0.333333 1 0.333333 0 0.0 1 0.333333 0 0.0 0 0.0 0 0.0 0 0.0 1 0.333333 1 0.333333 0 0.0 0 0.0 0 0.0 0 0.0 1 0.333333 0 0.0 0 0.0 0 0.0 1 0.333333 2 0.666667 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 1 0.333333 0 0.0 1 0.333333 0 0.0 1 0.333333 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 \n100004 0 0.000000 1 1.000000 0 0.0 0 0.0 1 1.000000 0 0.0 0 0.000000 0 
0.000000 0 0.0 0 0.0 0 0.0 0 0.0 1 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 1 1.000000 0 0.000000 1 1.0 0 0.0 0 0.0 0 0.0 1 1.000000 0 0.0 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 1 1.0 0 0.0 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 1 1.000000 1 1.0 0 0.000000 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 1 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.000000 1 1.000000 0 0.0 1 1.000000 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 0 0.000000 1 1.0 0 0.000000 0 0.0 0 0.0 1 1.0 0 0.0 0 0.000000 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.000000 1 1.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 0 0.0 0 0.000000 0 0.0 0 0.000000 0 0.0 0 0.0 1 1.0 0 0.0 0 0.0 \n100005 1 0.500000 1 0.500000 0 0.0 0 0.0 1 0.500000 0 0.0 0 0.000000 0 0.000000 1 0.5 0 0.0 0 0.0 0 0.0 2 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 1 0.500000 1 0.500000 1 0.5 1 0.5 0 0.0 0 0.0 1 0.500000 0 0.0 0 0.0 1 0.500000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 2 1.0 0 0.0 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 1 0.5 0 0.000000 1 0.500000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 1 0.5 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 1 0.500000 0 0.0 0 0.0 0 0.000000 1 0.500000 1 0.5 2 1.000000 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 1 0.500000 1 0.500000 0 0.0 0 0.000000 0 0.0 0 0.0 1 0.5 0 0.0 0 0.000000 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 1 0.500000 1 0.5 1 0.5 0 0.0 0 0.000000 0 0.000000 0 0.0 0 0.0 1 0.5 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 0 0.0 0 0.000000 0 0.0 0 0.000000 0 0.0 1 0.5 0 0.0 0 0.0 0 0.0 ", 150 | "text/html": "
" 151 | }, 152 | "metadata": {} 153 | } 154 | ] 155 | }, 156 | { 157 | "metadata": { 158 | "trusted": true, 159 | "_uuid": "39ccf84d56774e155a8cf62cb1c516d7dc97c6c8" 160 | }, 161 | "cell_type": "code", 162 | "source": "print('Previous aggregated shape: ', previous_agg.shape)\nprint('Previous categorical counts shape: ', previous_counts.shape)", 163 | "execution_count": 8, 164 | "outputs": [ 165 | { 166 | "output_type": "stream", 167 | "text": "Previous aggregated shape: (338857, 78)\nPrevious categorical counts shape: (338857, 286)\n", 168 | "name": "stdout" 169 | } 170 | ] 171 | }, 172 | { 173 | "metadata": { 174 | "_uuid": "cc0032e7c0f6f591854d31d484bd14f709f4ff13" 175 | }, 176 | "cell_type": "markdown", 177 | "source": "We can join the calculated dataframe to the main training dataframe using a merge. Then we should delete the calculated dataframes to avoid using too much of the kernel memory." 178 | }, 179 | { 180 | "metadata": { 181 | "trusted": true, 182 | "_uuid": "7b7651f8ad471a23e8dac04c5ee8ee3c42b214ea" 183 | }, 184 | "cell_type": "code", 185 | "source": "train = pd.read_csv('../input/application_train.csv')\ntest = pd.read_csv('../input/application_test.csv')\n\n# Merge in the previous information\ntrain = train.merge(previous_counts, on ='SK_ID_CURR', how = 'left')\ntrain = train.merge(previous_agg, on = 'SK_ID_CURR', how = 'left')\n\ntest = test.merge(previous_counts, on ='SK_ID_CURR', how = 'left')\ntest = test.merge(previous_agg, on = 'SK_ID_CURR', how = 'left')\n\n# Remove variables to free memory\ngc.enable()\ndel previous, previous_agg, previous_counts\ngc.collect()", 186 | "execution_count": 9, 187 | "outputs": [ 188 | { 189 | "output_type": "execute_result", 190 | "execution_count": 9, 191 | "data": { 192 | "text/plain": "56" 193 | }, 194 | "metadata": {} 195 | } 196 | ] 197 | }, 198 | { 199 | "metadata": { 200 | "_uuid": "cec09d0de0aeaed65b63ed59889b3c121cade929" 201 | }, 202 | "cell_type": "markdown", 203 | "source": "We are going to have to 
be careful about calculating too many features. We don't want to overwhelm the model with irrelevant features or with features that are mostly missing values. In the previous notebook, we removed any features with more than 75% missing values. We apply the same approach here, although note that the `remove_missing_columns` function defined below uses a default threshold of 90%." 204 | }, 205 | { 206 | "metadata": { 207 | "_uuid": "d4c38cd12402823c091e81dbcf42ae56e108c055" 208 | }, 209 | "cell_type": "markdown", 210 | "source": "## Function to Calculate Missing Values" 211 | }, 212 | { 213 | "metadata": { 214 | "trusted": true, 215 | "collapsed": true, 216 | "_uuid": "836b21cd4edfa344efc9bed804bfb889fc33b5ff" 217 | }, 218 | "cell_type": "code", 219 | "source": "# Function to calculate missing values by column\ndef missing_values_table(df, print_info = False):\n    # Total missing values\n    mis_val = df.isnull().sum()\n\n    # Percentage of missing values\n    mis_val_percent = 100 * mis_val / len(df)\n\n    # Make a table with the results\n    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)\n\n    # Rename the columns\n    mis_val_table_ren_columns = mis_val_table.rename(\n        columns = {0 : 'Missing Values', 1 : '% of Total Values'})\n\n    # Keep only columns with missing values, sorted by percentage missing (descending)\n    mis_val_table_ren_columns = mis_val_table_ren_columns[\n        mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(\n        '% of Total Values', ascending=False).round(1)\n\n    if print_info:\n        # Print some summary information\n        print(\"Your selected dataframe has \" + str(df.shape[1]) + \" columns.\\n\"\n              \"There are \" + str(mis_val_table_ren_columns.shape[0]) +\n              \" columns that have missing values.\")\n\n    # Return the dataframe with missing information\n    return mis_val_table_ren_columns", 220 | "execution_count": 10, 221 | "outputs": [] 222 | }, 223 | { 224 | "metadata": { 225 | "trusted": true, 226 | "collapsed": true, 227 | "_uuid": "d6a1c60994bd2be69b842c5393dbe2799c4ff74c" 228 | }, 229 | "cell_type": "code", 230 | "source": "def
remove_missing_columns(train, test, threshold = 90):\n    \"\"\"Drop columns with more than `threshold` percent missing values in either train or test.\"\"\"\n    # Calculate missing stats for train and test (remember to calculate a percent!)\n    train_miss = pd.DataFrame(train.isnull().sum())\n    train_miss['percent'] = 100 * train_miss[0] / len(train)\n\n    test_miss = pd.DataFrame(test.isnull().sum())\n    test_miss['percent'] = 100 * test_miss[0] / len(test)\n\n    # List of columns above the threshold for train and test\n    missing_train_columns = list(train_miss.index[train_miss['percent'] > threshold])\n    missing_test_columns = list(test_miss.index[test_miss['percent'] > threshold])\n\n    # Combine the two lists together\n    missing_columns = list(set(missing_train_columns + missing_test_columns))\n\n    # Print information\n    print('There are %d columns with greater than %d%% missing values.' % (len(missing_columns), threshold))\n\n    # Drop the missing columns and return\n    train = train.drop(columns = missing_columns)\n    test = test.drop(columns = missing_columns)\n\n    return train, test", 231 | "execution_count": 11, 232 | "outputs": [] 233 | }, 234 | { 235 | "metadata": { 236 | "trusted": true, 237 | "_uuid": "6d90c583647b95c11fa72273931ef51d584e9795" 238 | }, 239 | "cell_type": "code", 240 | "source": "train, test = remove_missing_columns(train, test)", 241 | "execution_count": 13, 242 | "outputs": [ 243 | { 244 | "output_type": "stream", 245 | "text": "There are 6 columns with greater than 90% missing values.\n", 246 | "name": "stdout" 247 | } 248 | ] 249 | }, 250 | { 251 | "metadata": { 252 | "_uuid": "d8b6b99fcfc81e3965213e2e98480365a9c25db1" 253 | }, 254 | "cell_type": "markdown", 255 | "source": "# Applying to More Data" 256 | }, 257 | { 258 | "metadata": { 259 | "_uuid": "c9ff11c7e3fbd97f328c40217a902f2fb3dcad73" 260 | }, 261 | "cell_type": "markdown", 262 | "source": "### Function to Aggregate Stats at the Client Level" 263 | }, 264 | { 265 | "metadata": { 266 | "trusted": true, 267 | "collapsed": true, 268 | "_uuid": "2041ca7741921a3508258b616db39101e1e4d722" 269 | }, 270 |
"cell_type": "code", 271 | "source": "def aggregate_client(df, group_vars, df_names):\n \"\"\"Aggregate a dataframe with data at the loan level \n at the client level\n \n Args:\n df (dataframe): data at the loan level\n group_vars (list of two strings): grouping variables for the loan \n and then the client (example ['SK_ID_PREV', 'SK_ID_CURR'])\n names (list of two strings): names to call the resulting columns\n (example ['cash', 'client'])\n \n Returns:\n df_client (dataframe): aggregated numeric stats at the client level. \n Each client will have a single row with all the numeric data aggregated\n \"\"\"\n \n # Aggregate the numeric columns\n df_agg = agg_numeric(df, group_var = group_vars[0], df_name = df_names[0])\n \n # If there are categorical variables\n if any(df.dtypes == 'object'):\n \n # Count the categorical columns\n df_counts = count_categorical(df, group_var = group_vars[0], df_name = df_names[0])\n\n # Merge the numeric and categorical\n df_by_loan = df_counts.merge(df_agg, on = group_vars[0], how = 'outer')\n\n gc.enable()\n del df_agg, df_counts\n gc.collect()\n\n # Merge to get the client id in dataframe\n df_by_loan = df_by_loan.merge(df[[group_vars[0], group_vars[1]]], on = group_vars[0], how = 'left')\n\n # Remove the loan id\n df_by_loan = df_by_loan.drop(columns = [group_vars[0]])\n\n # Aggregate numeric stats by column\n df_by_client = agg_numeric(df_by_loan, group_var = group_vars[1], df_name = df_names[1])\n\n \n # No categorical variables\n else:\n # Merge to get the client id in dataframe\n df_by_loan = df_agg.merge(df[[group_vars[0], group_vars[1]]], on = group_vars[0], how = 'left')\n \n gc.enable()\n del df_agg\n gc.collect()\n \n # Remove the loan id\n df_by_loan = df_by_loan.drop(columns = [group_vars[0]])\n \n # Aggregate numeric stats by column\n df_by_client = agg_numeric(df_by_loan, group_var = group_vars[1], df_name = df_names[1])\n \n # Memory management\n gc.enable()\n del df, df_by_loan\n gc.collect()\n\n return 
df_by_client", 272 | "execution_count": 14, 273 | "outputs": [] 274 | }, 275 | { 276 | "metadata": { 277 | "_uuid": "65c0196b37488c09ec7205edcb6726d365b5a96a" 278 | }, 279 | "cell_type": "markdown", 280 | "source": "## Monthly Cash Data" 281 | }, 282 | { 283 | "metadata": { 284 | "trusted": true, 285 | "_uuid": "e2e13040abf2af99ef9dcaea9b29f8128d2e4f07" 286 | }, 287 | "cell_type": "code", 288 | "source": "cash = pd.read_csv('../input/POS_CASH_balance.csv')\ncash.head()", 289 | "execution_count": 15, 290 | "outputs": [ 291 | { 292 | "output_type": "execute_result", 293 | "execution_count": 15, 294 | "data": { 295 | "text/plain": " SK_ID_PREV SK_ID_CURR ... SK_DPD SK_DPD_DEF\n0 1803195 182943 ... 0 0\n1 1715348 367990 ... 0 0\n2 1784872 397406 ... 0 0\n3 1903291 269225 ... 0 0\n4 2341044 334279 ... 0 0\n\n[5 rows x 8 columns]", 296 | "text/html": "
   SK_ID_PREV  SK_ID_CURR  MONTHS_BALANCE  CNT_INSTALMENT  CNT_INSTALMENT_FUTURE  NAME_CONTRACT_STATUS  SK_DPD  SK_DPD_DEF
0     1803195      182943             -31            48.0                   45.0                Active       0           0
1     1715348      367990             -33            36.0                   35.0                Active       0           0
2     1784872      397406             -32            12.0                    9.0                Active       0           0
3     1903291      269225             -35            48.0                   42.0                Active       0           0
4     2341044      334279             -35            36.0                   35.0                Active       0           0
" 297 | }, 298 | "metadata": {} 299 | } 300 | ] 301 | }, 302 | { 303 | "metadata": { 304 | "trusted": true, 305 | "_uuid": "e261b4a55167cf017eb5a190b8c83571b5e1398d" 306 | }, 307 | "cell_type": "code", 308 | "source": "cash_by_client = aggregate_client(cash, group_vars = ['SK_ID_PREV', 'SK_ID_CURR'], df_names = ['cash', 'client'])\ncash_by_client.head()", 309 | "execution_count": 16, 310 | "outputs": [ 311 | { 312 | "output_type": "execute_result", 313 | "execution_count": 16, 314 | "data": { 315 | "text/plain": " SK_ID_CURR ... client_counts\n0 100001 ... 9\n1 100002 ... 19\n2 100003 ... 28\n3 100004 ... 4\n4 100005 ... 11\n\n[5 rows x 174 columns]", 316 | "text/html": "
SK_ID_CURRclient_cash_NAME_CONTRACT_STATUS_Active_count_meanclient_cash_NAME_CONTRACT_STATUS_Active_count_maxclient_cash_NAME_CONTRACT_STATUS_Active_count_minclient_cash_NAME_CONTRACT_STATUS_Active_count_sumclient_cash_NAME_CONTRACT_STATUS_Active_count_norm_meanclient_cash_NAME_CONTRACT_STATUS_Active_count_norm_maxclient_cash_NAME_CONTRACT_STATUS_Active_count_norm_minclient_cash_NAME_CONTRACT_STATUS_Active_count_norm_sumclient_cash_NAME_CONTRACT_STATUS_Amortized debt_count_meanclient_cash_NAME_CONTRACT_STATUS_Amortized debt_count_maxclient_cash_NAME_CONTRACT_STATUS_Amortized debt_count_minclient_cash_NAME_CONTRACT_STATUS_Amortized debt_count_sumclient_cash_NAME_CONTRACT_STATUS_Amortized debt_count_norm_meanclient_cash_NAME_CONTRACT_STATUS_Amortized debt_count_norm_maxclient_cash_NAME_CONTRACT_STATUS_Amortized debt_count_norm_minclient_cash_NAME_CONTRACT_STATUS_Amortized debt_count_norm_sumclient_cash_NAME_CONTRACT_STATUS_Approved_count_meanclient_cash_NAME_CONTRACT_STATUS_Approved_count_maxclient_cash_NAME_CONTRACT_STATUS_Approved_count_minclient_cash_NAME_CONTRACT_STATUS_Approved_count_sumclient_cash_NAME_CONTRACT_STATUS_Approved_count_norm_meanclient_cash_NAME_CONTRACT_STATUS_Approved_count_norm_maxclient_cash_NAME_CONTRACT_STATUS_Approved_count_norm_minclient_cash_NAME_CONTRACT_STATUS_Approved_count_norm_sumclient_cash_NAME_CONTRACT_STATUS_Canceled_count_meanclient_cash_NAME_CONTRACT_STATUS_Canceled_count_maxclient_cash_NAME_CONTRACT_STATUS_Canceled_count_minclient_cash_NAME_CONTRACT_STATUS_Canceled_count_sumclient_cash_NAME_CONTRACT_STATUS_Canceled_count_norm_meanclient_cash_NAME_CONTRACT_STATUS_Canceled_count_norm_maxclient_cash_NAME_CONTRACT_STATUS_Canceled_count_norm_minclient_cash_NAME_CONTRACT_STATUS_Canceled_count_norm_sumclient_cash_NAME_CONTRACT_STATUS_Completed_count_meanclient_cash_NAME_CONTRACT_STATUS_Completed_count_maxclient_cash_NAME_CONTRACT_STATUS_Completed_count_minclient_cash_NAME_CONTRACT_STATUS_Completed_count_sumclient_cash_NAME_CONTRACT_STA
TUS_Completed_count_norm_meanclient_cash_NAME_CONTRACT_STATUS_Completed_count_norm_maxclient_cash_NAME_CONTRACT_STATUS_Completed_count_norm_min...client_cash_CNT_INSTALMENT_FUTURE_sum_maxclient_cash_CNT_INSTALMENT_FUTURE_sum_minclient_cash_CNT_INSTALMENT_FUTURE_sum_sumclient_cash_SK_DPD_mean_meanclient_cash_SK_DPD_mean_maxclient_cash_SK_DPD_mean_minclient_cash_SK_DPD_mean_sumclient_cash_SK_DPD_max_meanclient_cash_SK_DPD_max_maxclient_cash_SK_DPD_max_minclient_cash_SK_DPD_max_sumclient_cash_SK_DPD_min_meanclient_cash_SK_DPD_min_maxclient_cash_SK_DPD_min_minclient_cash_SK_DPD_min_sumclient_cash_SK_DPD_sum_meanclient_cash_SK_DPD_sum_maxclient_cash_SK_DPD_sum_minclient_cash_SK_DPD_sum_sumclient_cash_SK_DPD_DEF_mean_meanclient_cash_SK_DPD_DEF_mean_maxclient_cash_SK_DPD_DEF_mean_minclient_cash_SK_DPD_DEF_mean_sumclient_cash_SK_DPD_DEF_max_meanclient_cash_SK_DPD_DEF_max_maxclient_cash_SK_DPD_DEF_max_minclient_cash_SK_DPD_DEF_max_sumclient_cash_SK_DPD_DEF_min_meanclient_cash_SK_DPD_DEF_min_maxclient_cash_SK_DPD_DEF_min_minclient_cash_SK_DPD_DEF_min_sumclient_cash_SK_DPD_DEF_sum_meanclient_cash_SK_DPD_DEF_sum_maxclient_cash_SK_DPD_DEF_sum_minclient_cash_SK_DPD_DEF_sum_sumclient_cash_counts_meanclient_cash_counts_maxclient_cash_counts_minclient_cash_counts_sumclient_counts
01000013.5555564332.00.7777780.8000000.7500007.00.0000.00.00.00.00.00.0000.00.00.00.00.00.00000.00.00.00.01.000000119.00.2222220.2500000.200000...10.03.062.00.7777781.750.07.03.11111170280.00003.11111170280.7777781.750.07.03.11111170280.00003.11111170284.55555654419
110000219.0000001919361.01.0000001.0000001.00000019.00.0000.00.00.00.00.00.0000.00.00.00.00.00.00000.00.00.00.00.000000000.00.0000000.0000000.000000...285.0285.05415.00.0000000.000.00.00.0000000000.00000.0000000000.0000000.000.00.00.0000000000.00000.00000000019.000000191936119
21000039.142857127256.00.9285711.0000000.87500026.00.0000.00.00.00.00.00.0000.00.00.00.00.00.00000.00.00.00.00.5714291016.00.0714290.1250000.000000...78.021.01608.00.0000000.000.00.00.0000000000.00000.0000000000.0000000.000.00.00.0000000000.00000.0000000009.71428612827228
31000043.0000003312.00.7500000.7500000.7500003.00.0000.00.00.00.00.00.0000.00.00.00.00.00.00000.00.00.00.01.000000114.00.2500000.2500000.250000...9.09.036.00.0000000.000.00.00.0000000000.00000.0000000000.0000000.000.00.00.0000000000.00000.0000000004.00000044164
41000059.0000009999.00.8181820.8181820.8181829.00.0000.00.00.00.00.00.0000.00.00.00.00.00.00000.00.00.00.01.0000001111.00.0909090.0909090.090909...72.072.0792.00.0000000.000.00.00.0000000000.00000.0000000000.0000000.000.00.00.0000000000.00000.00000000011.000000111112111
\n
" 317 | }, 318 | "metadata": {} 319 | } 320 | ] 321 | }, 322 | { 323 | "metadata": { 324 | "trusted": true, 325 | "_uuid": "aa1c17e2ea0742c9bb3acdf7e7e218b0a8fc4fea" 326 | }, 327 | "cell_type": "code", 328 | "source": "print('Cash by Client Shape: ', cash_by_client.shape)\ntrain = train.merge(cash_by_client, on = 'SK_ID_CURR', how = 'left')\ntest = test.merge(cash_by_client, on = 'SK_ID_CURR', how = 'left')\n\ngc.enable()\ndel cash, cash_by_client\ngc.collect()", 329 | "execution_count": 17, 330 | "outputs": [ 331 | { 332 | "output_type": "stream", 333 | "text": "Cash by Client Shape: (337252, 174)\n", 334 | "name": "stdout" 335 | }, 336 | { 337 | "output_type": "execute_result", 338 | "execution_count": 17, 339 | "data": { 340 | "text/plain": "35" 341 | }, 342 | "metadata": {} 343 | } 344 | ] 345 | }, 346 | { 347 | "metadata": { 348 | "trusted": true, 349 | "_uuid": "03b30744261f48c0515e1d1601654a780eaf1d51" 350 | }, 351 | "cell_type": "code", 352 | "source": "train, test = remove_missing_columns(train, test)", 353 | "execution_count": 18, 354 | "outputs": [ 355 | { 356 | "output_type": "stream", 357 | "text": "There are 0 columns with greater than 90% missing values.\n", 358 | "name": "stdout" 359 | } 360 | ] 361 | }, 362 | { 363 | "metadata": { 364 | "_uuid": "f15d0b5811e29892393f643cba60f407a24084a4" 365 | }, 366 | "cell_type": "markdown", 367 | "source": "## Monthly Credit Data" 368 | }, 369 | { 370 | "metadata": { 371 | "trusted": true, 372 | "_uuid": "0bb1f02376642b2bb50ff35620a482312ecbf258" 373 | }, 374 | "cell_type": "code", 375 | "source": "credit = pd.read_csv('../input/credit_card_balance.csv')\ncredit.head()", 376 | "execution_count": null, 377 | "outputs": [ 378 | { 379 | "output_type": "execute_result", 380 | "execution_count": 19, 381 | "data": { 382 | "text/plain": " SK_ID_PREV SK_ID_CURR ... SK_DPD SK_DPD_DEF\n0 2562384 378907 ... 0 0\n1 2582071 363914 ... 0 0\n2 1740877 371185 ... 0 0\n3 1389973 337855 ... 0 0\n4 1891521 126868 ... 
0 0\n\n[5 rows x 23 columns]", 383 | "text/html": "
| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | AMT_PAYMENT_CURRENT | AMT_PAYMENT_TOTAL_CURRENT | AMT_RECEIVABLE_PRINCIPAL | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2562384 | 378907 | -6 | 56.970 | 135000 | 0.0 | 877.5 | 0.0 | 877.5 | 1700.325 | 1800.0 | 1800.0 | 0.000 | 0.000 | 0.000 | 0.0 | 1 | 0.0 | 1.0 | 35.0 | Active | 0 | 0 |
| 1 | 2582071 | 363914 | -1 | 63975.555 | 45000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 2250.000 | 2250.0 | 2250.0 | 60175.080 | 64875.555 | 64875.555 | 1.0 | 1 | 0.0 | 0.0 | 69.0 | Active | 0 | 0 |
| 2 | 1740877 | 371185 | -7 | 31815.225 | 450000 | 0.0 | 0.0 | 0.0 | 0.0 | 2250.000 | 2250.0 | 2250.0 | 26926.425 | 31460.085 | 31460.085 | 0.0 | 0 | 0.0 | 0.0 | 30.0 | Active | 0 | 0 |
| 3 | 1389973 | 337855 | -4 | 236572.110 | 225000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 11795.760 | 11925.0 | 11925.0 | 224949.285 | 233048.970 | 233048.970 | 1.0 | 1 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 |
| 4 | 1891521 | 126868 | -1 | 453919.455 | 450000 | 0.0 | 11547.0 | 0.0 | 11547.0 | 22924.890 | 27000.0 | 27000.0 | 443044.395 | 453919.455 | 453919.455 | 0.0 | 1 | 0.0 | 1.0 | 101.0 | Active | 0 | 0 |
" 384 | }, 385 | "metadata": {} 386 | } 387 | ] 388 | }, 389 | { 390 | "metadata": { 391 | "trusted": true, 392 | "_uuid": "f0a8e270271e7037d893386550a091e24f10a5cd", 393 | "collapsed": true 394 | }, 395 | "cell_type": "code", 396 | "source": "credit_by_client = aggregate_client(credit, group_vars = ['SK_ID_PREV', 'SK_ID_CURR'], df_names = ['credit', 'client'])\ncredit_by_client.head()", 397 | "execution_count": null, 398 | "outputs": [] 399 | }, 400 | { 401 | "metadata": { 402 | "trusted": true, 403 | "_uuid": "9a6501414046b9c6c1b7b72c13d3227efde93435", 404 | "collapsed": true 405 | }, 406 | "cell_type": "code", 407 | "source": "print('Credit by client shape: ', credit_by_client.shape)\n\ntrain = train.merge(credit_by_client, on = 'SK_ID_CURR', how = 'left')\ntest = test.merge(credit_by_client, on = 'SK_ID_CURR', how = 'left')\n\ngc.enable()\ndel credit, credit_by_client\ngc.collect()", 408 | "execution_count": null, 409 | "outputs": [] 410 | }, 411 | { 412 | "metadata": { 413 | "trusted": true, 414 | "_uuid": "7c4d9012f8c6d426a40949c2f96883480193d680", 415 | "collapsed": true 416 | }, 417 | "cell_type": "code", 418 | "source": "train, test = remove_missing_columns(train, test)", 419 | "execution_count": null, 420 | "outputs": [] 421 | }, 422 | { 423 | "metadata": { 424 | "_uuid": "13a46dfc971d3010b340636879697b5bba0f2255" 425 | }, 426 | "cell_type": "markdown", 427 | "source": "### Installment Payments" 428 | }, 429 | { 430 | "metadata": { 431 | "trusted": true, 432 | "_uuid": "fe3bef4aaa73791c63e02ec2435a884984408ace", 433 | "collapsed": true 434 | }, 435 | "cell_type": "code", 436 | "source": "installments = pd.read_csv('../input/installments_payments.csv')\ninstallments.head()", 437 | "execution_count": null, 438 | "outputs": [] 439 | }, 440 | { 441 | "metadata": { 442 | "trusted": true, 443 | "_uuid": "e68d671fc3fe08df21b81404ee1c3c4030ebafff", 444 | "collapsed": true 445 | }, 446 | "cell_type": "code", 447 | "source": "installments_by_client = 
aggregate_client(installments, group_vars = ['SK_ID_PREV', 'SK_ID_CURR'], df_names = ['installments', 'client'])\ninstallments_by_client.head()", 448 | "execution_count": null, 449 | "outputs": [] 450 | }, 451 | { 452 | "metadata": { 453 | "trusted": true, 454 | "_uuid": "e70b057d5f1ec2ba66738ec1ef260e02133a5de9", 455 | "collapsed": true 456 | }, 457 | "cell_type": "code", 458 | "source": "print('Installments by client shape: ', installments_by_client.shape)\n\ntrain = train.merge(installments_by_client, on = 'SK_ID_CURR', how = 'left')\ntest = test.merge(installments_by_client, on = 'SK_ID_CURR', how = 'left')\n\ngc.enable()\ndel installments, installments_by_client\ngc.collect()", 459 | "execution_count": null, 460 | "outputs": [] 461 | }, 462 | { 463 | "metadata": { 464 | "trusted": true, 465 | "_uuid": "478cdd9f3387297c88cca15fe725425d4910b776", 466 | "collapsed": true 467 | }, 468 | "cell_type": "code", 469 | "source": "train, test = remove_missing_columns(train, test)", 470 | "execution_count": null, 471 | "outputs": [] 472 | }, 473 | { 474 | "metadata": { 475 | "trusted": true, 476 | "_uuid": "9bfb6d762bbdbfadf2ac7aac8da7d835e53f5679", 477 | "collapsed": true 478 | }, 479 | "cell_type": "code", 480 | "source": "print('Final Training Shape: ', train.shape)\nprint('Final Testing Shape: ', test.shape)", 481 | "execution_count": null, 482 | "outputs": [] 483 | }, 484 | { 485 | "metadata": { 486 | "_uuid": "b91556063fd2fc8b53ed8b7e674822f96e65d6cd" 487 | }, 488 | "cell_type": "markdown", 489 | "source": " #### Save All Newly Calculated Features\n \n Unfortunately, saving all the created features does not work in a Kaggle notebook. You will have to run the code on your personal machine. I have run the code and uploaded the [entire datasets here](https://www.kaggle.com/willkoehrsen/home-credit-manual-engineered-features). I plan on doing some feature selection and uploading reduced versions of the datasets. 
Right now, they are slightly too big to handle in Kaggle notebooks or scripts." 490 | }, 491 | { 492 | "metadata": { 493 | "trusted": true, 494 | "_uuid": "33e37b48ab79b6773cbbb4331aedc6117fecef23", 495 | "collapsed": true 496 | }, 497 | "cell_type": "code", 498 | "source": "# train.to_csv('train_previous_raw.csv', index = False, chunksize = 500)\n# test.to_csv('test_previous_raw.csv', index = False)", 499 | "execution_count": null, 500 | "outputs": [] 501 | }, 502 | { 503 | "metadata": { 504 | "trusted": true, 505 | "collapsed": true, 506 | "_uuid": "7bd7f24dabe54d068e48b90ac1e42291bed6bf4d" 507 | }, 508 | "cell_type": "code", 509 | "source": "", 510 | "execution_count": null, 511 | "outputs": [] 512 | } 513 | ], 514 | "metadata": { 515 | "kernelspec": { 516 | "display_name": "Python 3", 517 | "language": "python", 518 | "name": "python3" 519 | }, 520 | "language_info": { 521 | "name": "python", 522 | "version": "3.6.5", 523 | "mimetype": "text/x-python", 524 | "codemirror_mode": { 525 | "name": "ipython", 526 | "version": 3 527 | }, 528 | "pygments_lexer": "ipython3", 529 | "nbconvert_exporter": "python", 530 | "file_extension": ".py" 531 | } 532 | }, 533 | "nbformat": 4, 534 | "nbformat_minor": 1 535 | } --------------------------------------------------------------------------------