├── .ipynb_checkpoints ├── Untitled-checkpoint.ipynb └── customer-journey-analytics-checkpoint.ipynb ├── README.md ├── __pycache__ ├── bigquery_.cpython-37.pyc ├── custom_pca.cpython-37.pyc └── helper.cpython-37.pyc ├── bigquery_.py ├── custom_pca.py ├── customer-journey-analytics.html ├── customer-journey-analytics.ipynb ├── google_analytics_schema.json ├── google_analytics_schema.xlsx ├── helper.py ├── product_categories.xlsx └── temp_data.h5 /.ipynb_checkpoints/Untitled-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 2 6 | } 7 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Customer Journey Analytics 2 | 3 | Do you want to increase revenue and marketing ROI from your e-commerce platform? If yes, this is the project for you, so read further. 4 | 5 | ### Table of Contents 6 | 1. [Project Motivation](#motivation) 7 | 2. [Results](#results) 8 | 4. [Installation](#installation) 9 | 3. [File Descriptions](#files) 10 | 5. [Licensing, Authors, and Acknowledgements](#licensing) 11 | 12 | ## Project Motivation 13 | 14 | Customer journey mapping has become very popular in recent years. Making it right can upgrade marketing strategy, boost personalized branding and offerings and result in increased revenue and marketing ROI. To bring value to the business, it requires a healthy balance of qualitative knowledge of customer-facing functions about market and customers and quantitative insights, which can be gained from an e-commerce platform, CRM and other market related external sources. 15 | 16 | The purpose of this project is to share with fellow sales, marketing professionals and data scientists how to approach quantitative part of customer journey mapping, namely answer questions: 17 | 1. How many buyer personas do we have? 18 | 2. What are their unique characteristics? 19 | 3. How accurately can we predict buyer persona from the first customer purchase transaction? 20 | 4. How can we adapt the marketing strategy concerning buyer personas to increase ROI? 21 | 22 | From a data science perspective it means: 23 | 1. How to use hierarchical and non-hierarchical clustering to identify buyer personas? 24 | 2. How to use ensemble and linear-based models to profile buyer personas characteristics? 25 | 26 | ## Results 27 | 28 | The main findings of the code can be found at the post available [here](https://medium.com/@alfred.sasko/customer-journey-analytics-will-make-you-more-money-ba7a11cda063). 29 | 30 | ## Installation 31 | 32 | There are several necessary 3rd party libraries beyond the Anaconda distribution of Python which needs to be installed and imported to run code. These are: 33 | - [scikit_posthocs](https://scikit-posthocs.readthedocs.io/en/latest/) providing posthoc tests for multiple comparison 34 | - [google cloud SDK](https://anaconda.org/conda-forge/google-cloud-sdk) providing access to BigQuerry and [Google Analytics Sample Dataset](https://support.google.com/analytics/answer/7586738?hl=en) 35 | 36 | ## File Descriptions 37 | 38 | There is 1 notebook available here to showcase work related to the above questions. Markdown cells were used to assist in walking through the thought process for individual steps. 39 | 40 | There are additional files: 41 | - `bigquery_.py` provides custom classes of BigqueryTable and BiqqueryDataset to query data to [Google Merchandise Store](https://www.googlemerchandisestore.com/) sample dataset. 42 | - `helper_py` provides custom functions for various analyses, to keep notebook manageable to read. 43 | - `custom_pca.py` holds adaptation of [scikit-learn PCA class](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) including _Varimax Rotation_ and _Latent Root Criterion_ 44 | - `google_analytics_schema.xlsx` contains an analysis of variables in [Big Query Export Schema](https://support.google.com/analytics/answer/3437719?hl=en) used as a schema for Google Analytics Sample. 45 | - `product_categories.xlsx` ensures encoding of broken product category variables in the dataset 46 | - `temp.data.h5` stores codes/levels of each variable in the dataset 47 | 48 | ## Licensing, Authors, Acknowledgements 49 | 50 | Must give credit to Google for the data. @alexisbcook for a nice introduction to [Nested and Repeated Data](https://www.kaggle.com/alexisbcook/nested-and-repeated-data). Daqing Chen, Sai Laing Sain & Kun Guo for their technical article [Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining](https://link.springer.com/article/10.1057/dbm.2012.17). 51 | -------------------------------------------------------------------------------- /__pycache__/bigquery_.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alfredsasko/Customer-Journey-Analytics/c0cc8c28167c837f257753a8cb6870adf017b9aa/__pycache__/bigquery_.cpython-37.pyc -------------------------------------------------------------------------------- /__pycache__/custom_pca.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alfredsasko/Customer-Journey-Analytics/c0cc8c28167c837f257753a8cb6870adf017b9aa/__pycache__/custom_pca.cpython-37.pyc -------------------------------------------------------------------------------- /__pycache__/helper.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alfredsasko/Customer-Journey-Analytics/c0cc8c28167c837f257753a8cb6870adf017b9aa/__pycache__/helper.cpython-37.pyc -------------------------------------------------------------------------------- /bigquery_.py: -------------------------------------------------------------------------------- 1 | """Bigquery Custom Module 2 | 3 | 4 | Module serves for custome classes and functions to access Google's bigquery 5 | """ 6 | 7 | # IMPORTS 8 | # ------- 9 | 10 | # internal constants 11 | __ENV_VAR_NAME__ = "GOOGLE_APPLICATION_CREDENTIALS" 12 | 13 | 14 | # Standard libraries 15 | import os 16 | import numpy as np 17 | import pandas as pd 18 | import ipdb 19 | 20 | # 3rd party libraries 21 | from google.cloud import bigquery 22 | from tqdm import tqdm_notebook as tqdm 23 | 24 | def authenticate_service_account(service_account_key): 25 | """Set Windows environment variable for bigquery authentification 26 | 27 | Args: 28 | service_account_key: path to service account key 29 | """ 30 | if not __ENV_VAR_NAME__ in os.environ: 31 | os.environ[__ENV_VAR_NAME__] = service_account_key 32 | else: 33 | raise Exception( 34 | 'Environment variable {} = {} already exists.' 35 | .format(__ENV_VAR_NAME__, os.environ[__ENV_VAR_NAME__]) 36 | ) 37 | 38 | 39 | class BigqueryTable (bigquery.table.Table): 40 | '''Bigquery table object customized to work with Pandas''' 41 | 42 | def _append_schema_field( 43 | self, schema_list, schema_field, parent_field_name=''): 44 | '''Explodes schema field and append it to list of exploded schema fields 45 | 46 | Args: 47 | schema_list: list of schema fields 48 | schema_field: bigquery.schema.SchemaField object 49 | parent_field_name: field name of patent schema, None by defualt 50 | 51 | Returns: 52 | schema_list: updated schema list with appened schema_field record 53 | ''' 54 | 55 | # explode schema field 56 | schema_field = (schema_field.name if parent_field_name == '' \ 57 | else parent_field_name + '.' + schema_field.name, 58 | schema_field.field_type, 59 | schema_field.mode, 60 | schema_field.description) 61 | 62 | # append exploded schema to list of exploded schemas 63 | schema_list.append(schema_field) 64 | return schema_list 65 | 66 | def _traverse_schema(self, schema_list, schema_field, 67 | parent_field_name = ''): 68 | '''Traverses table schema, unnest schema field objects, explodes their 69 | attributes to list of tuples 70 | 71 | Args: 72 | schema_list: unnnested and exploded schema list 73 | schema_field: bigquery.schema.SchemaField object 74 | parent_field_name: field name of patent schema, empty string by default 75 | 76 | Returns: 77 | schema_list: updated schema list by schema_field 78 | ''' 79 | 80 | # for nested schema field go deeper 81 | if schema_field.field_type == 'RECORD': 82 | 83 | # explode and append parent schema field to schema list 84 | schema_list = self._append_schema_field( 85 | schema_list, schema_field, parent_field_name) 86 | 87 | # explode and append children schema fields 88 | for child_field in schema_field.fields: 89 | schema_list = self._traverse_schema( 90 | schema_list, child_field, 91 | schema_field.name if parent_field_name == '' 92 | else parent_field_name + '.' + schema_field.name 93 | ) 94 | 95 | # explode and append not nested schema field 96 | else: 97 | schema_list = self._append_schema_field( 98 | schema_list, schema_field, parent_field_name) 99 | 100 | return schema_list 101 | 102 | def schema_to_dataframe(self, bq_exp_schema=None): 103 | '''Unnest schema and recast it to dataframe 104 | 105 | Args: 106 | bq_exp_schema: BigQuery export schema as dataframe with descriptions, 107 | None by default 108 | 109 | Returns: 110 | schema_df: Table schema as dataframe 111 | ''' 112 | 113 | # traverse and explode table schema to list of tuples 114 | schema_list = [] 115 | for schema_field in self.schema: 116 | schema_list = self._traverse_schema(schema_list, schema_field) 117 | 118 | # transform exploded schema to dataframe 119 | schema_df = pd.DataFrame( 120 | np.array(schema_list), 121 | columns=['Field Name', 'Data Type', 'Mode', 'Description'] 122 | ).set_index('Field Name') 123 | 124 | # merge BigQuery export schema with table schema to get field 125 | # description 126 | if bq_exp_schema is not None: 127 | 128 | # merge schemas 129 | bq_exp_schema = bq_exp_schema.set_index('Field Name') 130 | schema_df = ( 131 | pd.merge(schema_df.drop(columns=['Description']), 132 | bq_exp_schema['Description'], 133 | left_index=True, right_index=True, 134 | how='left') 135 | .fillna('Not specified in BigQuerry Export Schema') 136 | ) 137 | 138 | # match depreciated fields 139 | depreciated_fields = { 140 | 'hits.appInfo.name': 'hits.appInfo.appName', 141 | 'hits.appInfo.version': 'hits.appInfo.appVersion', 142 | 'hits.appInfo.id': 'hits.appInfo.appId', 143 | 'hits.appInfo.installerId': 'hits.appInfo.appInstallerId', 144 | 145 | 'hits.publisher.adxClicks': 146 | 'hits.publisher.adxBackfillDfpClicks', 147 | 'hits.publisher.adxImpressions': 148 | 'hits.publisher.adxBackfillDfpImpressions', 149 | 'hits.publisher.adxMatchedQueries': 150 | 'hits.publisher.adxBackfillDfpMatchedQueries', 151 | 'hits.publisher.adxMeasurableImpressions': 152 | 'hits.publisher.adxBackfillDfpMeasurableImpressions', 153 | 'hits.publisher.adxQueries': 154 | 'hits.publisher.adxBackfillDfpQueries', 155 | 'hits.publisher.adxViewableImpressions': 156 | 'hits.publisher.adxBackfillDfpViewableImpressions', 157 | 'hits.publisher.adxPagesViewed': 158 | 'hits.publisher.adxBackfillDfpPagesViewed' 159 | } 160 | for ga_field, exp_field in depreciated_fields.items(): 161 | schema_df['Description'][ga_field] = ( 162 | bq_exp_schema['Description'][exp_field]) 163 | 164 | # match multiple fields 165 | multiple_fields = { 166 | 'totals.uniqueScreenviews': 'totals.UniqueScreenViews', 167 | 'hits.contentGroup.contentGroup': 168 | 'hits.contentGroup.contentGroupX', 169 | 'hits.contentGroup.previousContentGroup': 170 | 'hits.contentGroup.previousContentGroupX', 171 | 'hits.contentGroup.contentGroupUniqueViews': 172 | 'hits.contentGroup.contentGroupUniqueViewsX'} 173 | 174 | fields = schema_df.index 175 | for ga_field, exp_field in multiple_fields.items(): 176 | schema_df['Description'][fields.str.contains(ga_field)] = ( 177 | bq_exp_schema['Description'][exp_field] 178 | ) 179 | 180 | return schema_df.reset_index() 181 | 182 | def display_schema(self, schema_df): 183 | '''Displays left justified schema to full cell width 184 | 185 | Args: 186 | schema_df: Exploded bigquerry schema as dataframe 187 | ''' 188 | with pd.option_context('display.max_colwidth', 999): 189 | display( 190 | schema_df.style.set_table_styles( 191 | [dict(selector='th', props=[('text-align', 'left')]), 192 | dict(selector='td', props=[('text-align', 'left')])]) 193 | ) 194 | 195 | 196 | def to_dataframe(self, client): 197 | '''Expands and Converts table to DataFrame 198 | 199 | Args: 200 | client: instatiated Bigquery client 201 | 202 | Returns: 203 | df: Dataframe with expanded table fields 204 | ''' 205 | 206 | # convert table schema to DataFrame 207 | schema = self.schema_to_dataframe(bq_exp_schema=None) 208 | 209 | # extract SELECT aliases from schema 210 | repeated_fields = schema[schema['Mode'] == 'REPEATED'] 211 | select_aliases = schema.apply(self._get_select_aliases, 212 | axis=1, 213 | args=(repeated_fields,)) 214 | 215 | select_aliases = select_aliases[select_aliases['Data Type'] != 'RECORD'] 216 | select = ',\n'.join( 217 | (select_aliases['Field Name'] 218 | + ' AS ' 219 | + select_aliases['Field Alias']) 220 | .to_list() 221 | ) 222 | 223 | # extract FROM aliases from schema 224 | from_aliases = repeated_fields.apply(self._get_from_aliases, 225 | axis=1, 226 | args=(repeated_fields,)) 227 | from_ = '\n'.join( 228 | ('LEFT JOIN UNNEST(' + from_aliases['Field Name'] 229 | + ') AS ' 230 | + from_aliases['Field Alias']) 231 | .to_list() 232 | ) 233 | 234 | # query table and return as dataframe 235 | query = ''' 236 | SELECT 237 | {} 238 | FROM 239 | {} 240 | {} 241 | '''.format(select, 242 | '`' + self.full_table_id.replace(':', '.') + '`', 243 | from_) 244 | df = client.query(query).to_dataframe() 245 | return df 246 | 247 | def _get_select_aliases(self, field, repeated_fields): 248 | '''Create aliases for nested fields for SELECT statment''' 249 | 250 | if '.' in field['Field Name'] and field['Mode'] != 'REPEATED': 251 | parent = '.'.join(field['Field Name'].split('.')[:-1]) 252 | child = field['Field Name'].split('.')[-1] 253 | 254 | if parent in repeated_fields.values: 255 | field['Field Name'] = parent.replace('.', '_') + '.' + child 256 | 257 | field['Field Alias'] = field['Field Name'].replace('.', '_') 258 | 259 | return field 260 | 261 | def _get_from_aliases(self, field, repeated_fields): 262 | '''Create aliases for nested fields for FROM statment''' 263 | 264 | if '.' in field['Field Name']: 265 | parent = '.'.join(field['Field Name'].split('.')[:-1]) 266 | child = field['Field Name'].split('.')[-1] 267 | 268 | if '.' in parent: 269 | field['Field Name'] = parent.replace('.', '_') + '.' + child 270 | 271 | field['Field Alias'] = field['Field Name'].replace('.', '_') 272 | 273 | return field 274 | 275 | 276 | class BigqueryDataset(bigquery.dataset.Dataset): 277 | '''Customize biqquery Dataset Class''' 278 | 279 | def __init__(self, client, *args, **kwg): 280 | super().__init__(*args, **kwg) 281 | self.schema = self.get_information_schema(client) 282 | 283 | def get_information_schema(self, client): 284 | '''Returns information schema including table names in dataset 285 | as dataframe 286 | 287 | Args: 288 | client: gigquery.client.Client object 289 | 290 | Returns: 291 | schema_df: dataset information schema as DataFrame 292 | ''' 293 | 294 | query = ''' 295 | SELECT 296 | * EXCEPT(is_typed) 297 | FROM 298 | `{project}.{dataset_id}`.INFORMATION_SCHEMA.TABLES 299 | '''.format(project=self.project, dataset_id=self.dataset_id) 300 | schema = (client.query(query) 301 | .to_dataframe() 302 | .sort_values('table_name')) 303 | return schema 304 | 305 | def get_levels(self, client, bq_exp_schema=None): 306 | '''Add to table schema two columns: 'Levels' which contains 307 | unique values of each variable as set and 'Num of Levels' 308 | determining number of unique values. Purpose is to get 309 | fealing what values are hold by variable and scale of the variable 310 | (Ex: Unary, Binary, Multilevel etc...) 311 | 312 | Note: Functions may run long time for big datasets as sequentialy 313 | loads all tables in dataset to safe memory. 314 | 315 | Args: 316 | client: bigquery.client.Client object 317 | bq_query_exp_schema: BigQuery export schema including field names 318 | Description as dataframe 319 | Returns: 320 | schema: updated schema with level charateristics as 321 | dataframe 322 | ''' 323 | 324 | # union variable levels of consecutive queried tables 325 | def level_union(var, var_levels): 326 | try: 327 | var['Levels'] = var['Levels'] | var_levels[var.name] 328 | except: 329 | pass 330 | return var['Levels'] 331 | 332 | # get last table of dataset 333 | dataset_ref = client.dataset(self.dataset_id, self.project) 334 | table_id = self.schema['table_name'].values[-1] 335 | table_ref = dataset_ref.table(table_id) 336 | table = client.get_table(table_ref) 337 | table.__class__ = BigqueryTable 338 | 339 | # get last table schema 340 | schema = table.schema_to_dataframe(bq_exp_schema) 341 | 342 | # keep only non record fields in schema 343 | schema = schema[schema['Data Type'] != 'RECORD'].copy() 344 | 345 | # ad variable names 346 | var_name = (schema['Field Name'] 347 | .apply(lambda field: field.replace('.', '_'))) 348 | schema.insert(0, 'Variable Name', var_name) 349 | schema.insert(4, 'Levels', [set()] * schema.index.size) 350 | schema = schema.set_index('Variable Name') 351 | 352 | # analyze variable codes/levels 353 | for i, table_id in tqdm(enumerate(self.schema['table_name'])): 354 | 355 | # load table 356 | table_ref = dataset_ref.table(table_id) 357 | table = client.get_table(table_ref) 358 | table.__class__ = BigqueryTable 359 | df = table.to_dataframe(client) 360 | 361 | # get and update variable levels 362 | var_levels = df.apply(lambda var: set(var.unique())) 363 | schema['Levels'] = ( 364 | schema.apply(level_union, args=(var_levels,), 365 | axis = 1) 366 | ) 367 | 368 | if i == 2: break 369 | 370 | # calculate number of variable levels 371 | schema.insert(4, 'Num of Levels', 372 | schema.apply(lambda var: len(var['Levels']), 373 | axis=1) 374 | ) 375 | return schema 376 | -------------------------------------------------------------------------------- /custom_pca.py: -------------------------------------------------------------------------------- 1 | """Custom principle component analysis module. 2 | 3 | Iplemented customizations 4 | - varimax rotation for better interpretation of principal components 5 | - communalities calculation for selecting significant features 6 | - loading significant threshold comparison as function of sample size 7 | for selecting significant features 8 | - 'surrogate' feature selection used for dimensionality reduction - 9 | features with maximum laoding instead of principal components are selected 10 | - ... 11 | """ 12 | 13 | # IMPORTS 14 | # ------- 15 | 16 | # Standard libraries 17 | 18 | import ipdb 19 | from inspect import isclass 20 | from inspect import currentframe 21 | import numbers 22 | 23 | 24 | # 3rd party libraries 25 | import numpy as np 26 | from numpy import linalg 27 | import pandas as pd 28 | 29 | from scipy.sparse import issparse 30 | 31 | import rpy2 32 | import rpy2.rlike.container as rlc 33 | from rpy2 import robjects 34 | from rpy2.robjects.vectors import FloatVector 35 | from rpy2.robjects.vectors import ListVector 36 | from rpy2.robjects.vectors import StrVector 37 | from rpy2.robjects import pandas2ri 38 | 39 | from sklearn.preprocessing import StandardScaler 40 | from sklearn.model_selection import train_test_split 41 | from sklearn.pipeline import Pipeline 42 | from sklearn.feature_extraction.text import TfidfVectorizer 43 | from sklearn.feature_extraction.text import CountVectorizer 44 | from sklearn.model_selection import GridSearchCV 45 | from sklearn.naive_bayes import MultinomialNB 46 | from sklearn.naive_bayes import ComplementNB 47 | from sklearn.metrics import f1_score 48 | from sklearn.decomposition import PCA 49 | 50 | from sklearn.decomposition import PCA 51 | from sklearn.utils import check_array 52 | from sklearn.utils.extmath import svd_flip 53 | from sklearn.utils.extmath import stable_cumsum 54 | from sklearn.utils.validation import check_is_fitted 55 | 56 | from matplotlib import pyplot as plt 57 | import seaborn as sns 58 | 59 | class CustomPCA(PCA): 60 | '''Customized PCA with following options 61 | - varimax rotation 62 | - diffrent feature selection methods 63 | 64 | and calculated communalities. 65 | ''' 66 | 67 | def __init__(self, rotation=None, feature_selection='all', **kws): 68 | ''' 69 | rotation: string, name of the rotation method. 70 | 'varimax': varimax rotation 71 | None: no rotation by by default 72 | 73 | feature_selection: string, Features selection method 74 | 'all': All features are selected, means principal 75 | components are output of tranformation 76 | 'surrogate': Only features with highest loading 77 | with principal components are used 78 | 'summated scales': Summated scales consturcted as 79 | sum of features having significant 80 | loadings onn principal components, 81 | not implemented yet. 82 | ''' 83 | 84 | super().__init__(**kws) 85 | self.rotation = rotation 86 | self.feature_selection = feature_selection 87 | 88 | 89 | def _df2mtr(self, df): 90 | '''Convert pandas dataframe to r matrix. Category dtype is casted as 91 | factorVector considering missing values 92 | (original py2ri function of rpy2 can't handle this properly so far) 93 | 94 | Args: 95 | data: pandas dataframe of shape (# samples, # features) 96 | with numeric dtype 97 | 98 | Returns: 99 | mtr: r matrix of shape (# samples # features) 100 | ''' 101 | # check arguments 102 | assert isinstance(df, pd.DataFrame), 'Argument df need to be a pd.Dataframe.' 103 | 104 | # select only numeric columns 105 | df = df.select_dtypes('number') 106 | 107 | # create and return r matrix 108 | values = FloatVector(df.values.flatten()) 109 | dimnames = ListVector( 110 | rlc.OrdDict([('index', StrVector(tuple(df.index))), 111 | ('columns', StrVector(tuple(df.columns)))]) 112 | ) 113 | 114 | return robjects.r.matrix(values, nrow=len(df.index), ncol=len(df.columns), 115 | dimnames = dimnames, byrow=True) 116 | 117 | def _varimax(self, factor_df, **kwargs): 118 | ''' 119 | varimax rotation of factor matrix 120 | 121 | Args: 122 | factor_df: factor matrix as pd.DataFrame with shape 123 | (# features, # principal components) 124 | 125 | Return: 126 | rot_factor_df: rotated factor matrix as pd.DataFrame 127 | ''' 128 | factor_mtr = self._df2mtr(factor_df) 129 | varimax = robjects.r['varimax'] 130 | rot_factor_mtr = varimax(factor_mtr) 131 | return pandas2ri.ri2py(rot_factor_mtr.rx2('loadings')) 132 | 133 | def _fit(self, X): 134 | """Dispatch to the right submethod depending on the chosen solver.""" 135 | # Raise an error for sparse input. 136 | # This is more informative than the generic one raised by check_array. 137 | if issparse(X): 138 | raise TypeError('PCA does not support sparse input. See ' 139 | 'TruncatedSVD for a possible alternative.') 140 | 141 | X = check_array(X, dtype=[np.float64, np.float32], ensure_2d=True, 142 | copy=self.copy) 143 | 144 | # Handle n_components==None 145 | if self.n_components is None: 146 | if self.svd_solver != 'arpack': 147 | n_components = min(X.shape) 148 | else: 149 | n_components = min(X.shape) - 1 150 | else: 151 | n_components = self.n_components 152 | 153 | # Handle svd_solver 154 | self._fit_svd_solver = self.svd_solver 155 | if self._fit_svd_solver == 'auto': 156 | # Small problem or n_components == 'mle', just call full PCA 157 | if (max(X.shape) <= 500 158 | or n_components == 'mle' 159 | or n_components == 'latent_root'): 160 | self._fit_svd_solver = 'full' 161 | elif n_components >= 1 and n_components < .8 * min(X.shape): 162 | self._fit_svd_solver = 'randomized' 163 | # This is also the case of n_components in (0,1) 164 | else: 165 | self._fit_svd_solver = 'full' 166 | 167 | # Call different fits for either full or truncated SVD 168 | if self._fit_svd_solver == 'full': 169 | U, S , V = self._fit_full(X, n_components) 170 | elif self._fit_svd_solver in ['arpack', 'randomized']: 171 | U, S, V = self._fit_truncated(X, n_components, self._fit_svd_solver) 172 | else: 173 | raise ValueError("Unrecognized svd_solver='{0}'" 174 | "".format(self._fit_svd_solver)) 175 | 176 | # implmentation of varimax rotation 177 | if self.rotation == 'varimax': 178 | if self.n_samples_ > self.n_components_: 179 | 180 | factor_matrix = ( 181 | self.components_.T 182 | * (self.explained_variance_.reshape(1, -1) ** (1/2)) 183 | ) 184 | 185 | rot_factor_matrix = self._varimax(pd.DataFrame(factor_matrix)) 186 | 187 | self.explained_variance_ = (rot_factor_matrix ** 2).sum(axis=0) 188 | 189 | self.components_ = ( 190 | rot_factor_matrix 191 | / (self.explained_variance_.reshape(1, -1) ** (1/2)) 192 | ).T 193 | 194 | # sort components by explained variance in descanding order 195 | self.components_ = self.components_[ 196 | np.argsort(self.explained_variance_)[::-1], : 197 | ] 198 | 199 | self.explained_variance_ = np.sort( 200 | self.explained_variance_ 201 | )[::-1] 202 | 203 | total_var = self.n_features_ 204 | self.explained_variance_ratio_ = ( 205 | self.explained_variance_ / total_var 206 | ) 207 | 208 | self.singular_values_ = None 209 | 210 | if self._fit_svd_solver == 'full': 211 | if self.n_components_ < min(self.n_features_, self.n_samples_): 212 | self.noise_variance_ = ( 213 | (total_var - self.explained_variance_.sum()) 214 | / (self.n_features_ - self.n_components_) 215 | ) 216 | else: 217 | self.noise_variance_ = 0. 218 | 219 | elif self._fit_svd_solver in ['arpack', 'randomized']: 220 | if self.n_components_ < min(self.n_features_, self.n_samples_): 221 | 222 | total_var = np.var(X, ddof=1, axis=0) 223 | 224 | self.noise_variance_ = ( 225 | total_var.sum() - self.explained_variance_.sum() 226 | ) 227 | 228 | self.noise_variance_ /= ( 229 | min(self.n_features_, self.n_samples_) 230 | - self.n_components_ 231 | ) 232 | else: 233 | self.noise_variance_ = 0. 234 | 235 | else: 236 | raise ValueError("Unrecognized svd_solver='{0}'" 237 | "".format(self._fit_svd_solver)) 238 | else: 239 | raise ValueError( 240 | "Varimax rotation requires n_samples > n_components") 241 | 242 | U, S, V = None, None, None 243 | 244 | # implmentation of communalties 245 | self.communalities_ = ( 246 | ((self.components_.T 247 | * (self.explained_variance_.reshape(1, -1) ** (1/2))) 248 | ** 2) 249 | .sum(axis=1) 250 | ) 251 | 252 | return U, S, V 253 | 254 | def _fit_full(self, X, n_components): 255 | """Fit the model by computing full SVD on X""" 256 | n_samples, n_features = X.shape 257 | 258 | if n_components == 'mle': 259 | if n_samples < n_features: 260 | raise ValueError("n_components='mle' is only supported " 261 | "if n_samples >= n_features") 262 | elif n_components == 'latent_root': 263 | pass 264 | elif not 0 <= n_components <= min(n_samples, n_features): 265 | raise ValueError("n_components=%r must be between 0 and " 266 | "min(n_samples, n_features)=%r with " 267 | "svd_solver='full'" 268 | % (n_components, min(n_samples, n_features))) 269 | elif n_components >= 1: 270 | if not isinstance(n_components, numbers.Integral): 271 | raise ValueError("n_components=%r must be of type int " 272 | "when greater than or equal to 1, " 273 | "was of type=%r" 274 | % (n_components, type(n_components))) 275 | 276 | # Center data 277 | self.mean_ = np.mean(X, axis=0) 278 | X -= self.mean_ 279 | 280 | U, S, V = linalg.svd(X, full_matrices=False) 281 | # flip eigenvectors' sign to enforce deterministic output 282 | U, V = svd_flip(U, V) 283 | 284 | components_ = V 285 | 286 | # Get variance explained by singular values 287 | explained_variance_ = (S ** 2) / (n_samples - 1) 288 | total_var = explained_variance_.sum() 289 | explained_variance_ratio_ = explained_variance_ / total_var 290 | singular_values_ = S.copy() # Store the singular values. 291 | 292 | # Postprocess the number of components required 293 | if n_components == 'mle': 294 | n_components = \ 295 | _infer_dimension_(explained_variance_, n_samples, n_features) 296 | elif n_components == 'latent_root': 297 | n_components = (explained_variance_ > 1).sum() 298 | elif 0 < n_components < 1.0: 299 | # number of components for which the cumulated explained 300 | # variance percentage is superior to the desired threshold 301 | ratio_cumsum = stable_cumsum(explained_variance_ratio_) 302 | n_components = np.searchsorted(ratio_cumsum, n_components) + 1 303 | 304 | # Compute noise covariance using Probabilistic PCA model 305 | # The sigma2 maximum likelihood (cf. eq. 12.46) 306 | if n_components < min(n_features, n_samples): 307 | self.noise_variance_ = explained_variance_[n_components:].mean() 308 | 309 | else: 310 | self.noise_variance_ = 0. 311 | 312 | self.n_samples_, self.n_features_ = n_samples, n_features 313 | self.components_ = components_[:n_components] 314 | self.n_components_ = n_components 315 | self.explained_variance_ = explained_variance_[:n_components] 316 | self.explained_variance_ratio_ = \ 317 | explained_variance_ratio_[:n_components] 318 | self.singular_values_ = singular_values_[:n_components] 319 | 320 | return U, S, V 321 | 322 | def get_support(self, indices=False): 323 | """ 324 | Get a mask, or integer index, of the features selected 325 | Parameters 326 | ---------- 327 | indices : boolean (default False) 328 | If True, the return value will be an array of integers, rather 329 | than a boolean mask. 330 | Returns 331 | ------- 332 | support : array 333 | An index that selects the retained features from a feature vector. 334 | If `indices` is False, this is a boolean array of shape 335 | [# input features], in which an element is True iff its 336 | corresponding feature is selected for retention. If `indices` is 337 | True, this is an integer array of shape [# output features] whose 338 | values are indices into the input feature vector. 339 | """ 340 | mask = self._get_support_mask() 341 | return mask if not indices else np.where(mask)[0] 342 | 343 | def _get_support_mask(self): 344 | """ 345 | Get the boolean mask indicating which features are selected 346 | Returns 347 | ------- 348 | support : boolean array of shape [# input features] 349 | An element is True iff its corresponding feature is selected for 350 | retention. 351 | """ 352 | 353 | attrs = [v for v in vars(self) 354 | if (v.endswith("_") or v.startswith("_")) 355 | and not v.startswith("__")] 356 | check_is_fitted(self, attributes=attrs, 357 | all_or_any=all) 358 | 359 | # Keep only features with at least one of the significant loading 360 | # and communality > 0.5 361 | significant_features_mask = ( 362 | ((np.absolute(self.components_) 363 | >= self.significance_threshold()) 364 | .any(axis=0)) 365 | & (self.communalities_ >= 0.5) 366 | ) 367 | 368 | if self.feature_selection == 'all': 369 | mask = significant_features_mask 370 | 371 | elif self.feature_selection == 'surrogate': 372 | # Select significant feature with maximum loading 373 | # on each principal component 374 | mask = np.full(self.n_features_, False, dtype=bool) 375 | surrogate_features_idx = np.unique( 376 | np.absolute(np.argmax(self.components_, axis=1)) 377 | ) 378 | mask[surrogate_features_idx] = True 379 | mask = (mask & significant_features_mask) 380 | 381 | elif self.feature_selection == 'summated scales': 382 | raise Exception('Not implemented yet.') 383 | else: 384 | raise ValueError('Not valid selection method.') 385 | return mask 386 | 387 | def significance_threshold(self): 388 | sample_sizes = np.array([50, 60, 70, 85, 100, 389 | 120, 150, 200, 250, 300]) 390 | thresholds = np.array([0.75, 0.70, 0.65, 0.60, 0.55, 391 | 0.50, 0.45, 0.40, 0.35, 0.30]) 392 | return min(thresholds[sample_sizes <= self.n_samples_]) 393 | 394 | def transform(self, X): 395 | """Apply dimensionality reduction to X. 396 | X is projected on the first principal components previously extracted 397 | from a training set. 398 | Parameters 399 | ---------- 400 | X : array-like, shape (n_samples, n_features) 401 | New data, where n_samples is the number of samples 402 | and n_features is the number of features. 403 | 404 | Returns 405 | ------- 406 | X_new : array-like, shape (n_samples, n_components) 407 | Examples 408 | -------- 409 | >>> import numpy as np 410 | >>> from sklearn.decomposition import IncrementalPCA 411 | >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]) 412 | >>> ipca = IncrementalPCA(n_components=2, batch_size=3) 413 | >>> ipca.fit(X) 414 | IncrementalPCA(batch_size=3, n_components=2) 415 | >>> ipca.transform(X) # doctest: +SKIP 416 | """ 417 | attrs = [v for v in vars(self) 418 | if (v.endswith("_") or v.startswith("_")) 419 | and not v.startswith("__")] 420 | check_is_fitted(self, attributes=attrs, 421 | all_or_any=all) 422 | 423 | X = check_array(X) 424 | if self.mean_ is not None: 425 | X = X - self.mean_ 426 | 427 | if self.feature_selection == 'all': 428 | X_transformed = np.dot(X, self.components_.T) 429 | if self.whiten: 430 | X_transformed /= np.sqrt(self.explained_variance_) 431 | 432 | else: 433 | X_transformed = X[:, self._get_support_mask()] 434 | 435 | return X_transformed 436 | 437 | def fit_transform(self, X, y=None): 438 | """Fit the model with X and apply the dimensionality reduction on X. 439 | Parameters 440 | ---------- 441 | X : array-like, shape (n_samples, n_features) 442 | Training data, where n_samples is the number of samples 443 | and n_features is the number of features. 444 | y : None 445 | Ignored variable. 446 | Returns 447 | ------- 448 | X_new : array-like, shape (n_samples, n_components) 449 | Transformed values. 450 | Notes 451 | ----- 452 | This method returns a Fortran-ordered array. To convert it to a 453 | C-ordered array, use 'np.ascontiguousarray'. 454 | """ 455 | 456 | U, S, V = self._fit(X) 457 | X_transformed = self.transform(X) 458 | 459 | return X_transformed 460 | -------------------------------------------------------------------------------- /google_analytics_schema.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "visitorId": null, 4 | "visitNumber": "1", 5 | "visitId": "1501583974", 6 | "visitStartTime": "1501583974", 7 | "date": "20170801", 8 | "totals": { 9 | "visits": "1", 10 | "hits": "1", 11 | "pageviews": "1", 12 | "timeOnSite": null, 13 | "bounces": "1", 14 | "transactions": null, 15 | "transactionRevenue": null, 16 | "newVisits": "1", 17 | "screenviews": null, 18 | "uniqueScreenviews": null, 19 | "timeOnScreen": null, 20 | "totalTransactionRevenue": null, 21 | "sessionQualityDim": "1" 22 | }, 23 | "trafficSource": { 24 | "referralPath": null, 25 | "campaign": "(not set)", 26 | "source": "(direct)", 27 | "medium": "(none)", 28 | "keyword": null, 29 | "adContent": null, 30 | "adwordsClickInfo": { 31 | "campaignId": null, 32 | "adGroupId": null, 33 | "creativeId": null, 34 | "criteriaId": null, 35 | "page": null, 36 | "slot": null, 37 | "criteriaParameters": "not available in demo dataset", 38 | "gclId": null, 39 | "customerId": null, 40 | "adNetworkType": null, 41 | "targetingCriteria": null, 42 | "isVideoAd": null 43 | }, 44 | "isTrueDirect": null, 45 | "campaignCode": null 46 | }, 47 | "device": { 48 | "browser": "Chrome", 49 | "browserVersion": "not available in demo dataset", 50 | "browserSize": "not available in demo dataset", 51 | "operatingSystem": "Android", 52 | "operatingSystemVersion": "not available in demo dataset", 53 | "isMobile": "true", 54 | "mobileDeviceBranding": "not available in demo dataset", 55 | "mobileDeviceModel": "not available in demo dataset", 56 | "mobileInputSelector": "not available in demo dataset", 57 | "mobileDeviceInfo": "not available in demo dataset", 58 | "mobileDeviceMarketingName": "not available in demo dataset", 59 | "flashVersion": "not available in demo dataset", 60 | "javaEnabled": null, 61 | "language": "not available in demo dataset", 62 | "screenColors": "not available in demo dataset", 63 | "screenResolution": "not available in demo dataset", 64 | "deviceCategory": "mobile" 65 | }, 66 | "geoNetwork": { 67 | "continent": "Americas", 68 | "subContinent": "Caribbean", 69 | "country": "St. Lucia", 70 | "region": "(not set)", 71 | "metro": "(not set)", 72 | "city": "(not set)", 73 | "cityId": "not available in demo dataset", 74 | "networkDomain": "unknown.unknown", 75 | "latitude": "not available in demo dataset", 76 | "longitude": "not available in demo dataset", 77 | "networkLocation": "not available in demo dataset" 78 | }, 79 | "hits": [ 80 | { 81 | "hitNumber": "1", 82 | "time": "0", 83 | "hour": "3", 84 | "minute": "39", 85 | "isSecure": null, 86 | "isInteraction": "true", 87 | "isEntrance": "true", 88 | "isExit": "true", 89 | "referer": "http://www.google.com/", 90 | "page": { 91 | "pagePath": "/google+redesign/electronics", 92 | "hostname": "shop.googlemerchandisestore.com", 93 | "pageTitle": "Electronics | Google Merchandise Store", 94 | "searchKeyword": null, 95 | "searchCategory": null, 96 | "pagePathLevel1": "/google+redesign/", 97 | "pagePathLevel2": "/electronics", 98 | "pagePathLevel3": "", 99 | "pagePathLevel4": "" 100 | }, 101 | "transaction": { 102 | "transactionId": null, 103 | "transactionRevenue": null, 104 | "transactionTax": null, 105 | "transactionShipping": null, 106 | "affiliation": null, 107 | "currencyCode": "USD", 108 | "localTransactionRevenue": null, 109 | "localTransactionTax": null, 110 | "localTransactionShipping": null, 111 | "transactionCoupon": null 112 | }, 113 | "item": { 114 | "transactionId": null, 115 | "productName": null, 116 | "productCategory": null, 117 | "productSku": null, 118 | "itemQuantity": null, 119 | "itemRevenue": null, 120 | "currencyCode": "USD", 121 | "localItemRevenue": null 122 | }, 123 | "contentInfo": null, 124 | "appInfo": { 125 | "name": null, 126 | "version": null, 127 | "id": null, 128 | "installerId": null, 129 | "appInstallerId": null, 130 | "appName": null, 131 | "appVersion": null, 132 | "appId": null, 133 | "screenName": "shop.googlemerchandisestore.com/google+redesign/electronics", 134 | "landingScreenName": "shop.googlemerchandisestore.com/google+redesign/electronics", 135 | "exitScreenName": "shop.googlemerchandisestore.com/google+redesign/electronics", 136 | "screenDepth": "0" 137 | }, 138 | "exceptionInfo": { 139 | "description": null, 140 | "isFatal": "true", 141 | "exceptions": null, 142 | "fatalExceptions": null 143 | }, 144 | "eventInfo": null, 145 | "product": [ 146 | { 147 | "productSKU": "GGOEGBFC018799", 148 | "v2ProductName": "Electronics Accessory Pouch", 149 | "v2ProductCategory": "Home/Electronics/", 150 | "productVariant": "(not set)", 151 | "productBrand": "(not set)", 152 | "productRevenue": null, 153 | "localProductRevenue": null, 154 | "productPrice": "4990000", 155 | "localProductPrice": "4990000", 156 | "productQuantity": null, 157 | "productRefundAmount": null, 158 | "localProductRefundAmount": null, 159 | "isImpression": "true", 160 | "isClick": null, 161 | "customDimensions": [], 162 | "customMetrics": [], 163 | "productListName": "Category", 164 | "productListPosition": "1", 165 | "productCouponCode": null 166 | }, 167 | { 168 | "productSKU": "GGOEGESB015199", 169 | "v2ProductName": "Google Flashlight", 170 | "v2ProductCategory": "Home/Electronics/", 171 | "productVariant": "(not set)", 172 | "productBrand": "(not set)", 173 | "productRevenue": null, 174 | "localProductRevenue": null, 175 | "productPrice": "59990000", 176 | "localProductPrice": "59990000", 177 | "productQuantity": null, 178 | "productRefundAmount": null, 179 | "localProductRefundAmount": null, 180 | "isImpression": "true", 181 | "isClick": null, 182 | "customDimensions": [], 183 | "customMetrics": [], 184 | "productListName": "Category", 185 | "productListPosition": "2", 186 | "productCouponCode": null 187 | }, 188 | { 189 | "productSKU": "GGOEGEVA022399", 190 | "v2ProductName": "Micro Wireless Earbud", 191 | "v2ProductCategory": "Home/Electronics/", 192 | "productVariant": "(not set)", 193 | "productBrand": "(not set)", 194 | "productRevenue": null, 195 | "localProductRevenue": null, 196 | "productPrice": "39990000", 197 | "localProductPrice": "39990000", 198 | "productQuantity": null, 199 | "productRefundAmount": null, 200 | "localProductRefundAmount": null, 201 | "isImpression": "true", 202 | "isClick": null, 203 | "customDimensions": [], 204 | "customMetrics": [], 205 | "productListName": "Category", 206 | "productListPosition": "3", 207 | "productCouponCode": null 208 | }, 209 | { 210 | "productSKU": "GGOEGCBB074199", 211 | "v2ProductName": "Google Car Clip Phone Holder", 212 | "v2ProductCategory": "Home/Electronics/", 213 | "productVariant": "(not set)", 214 | "productBrand": "(not set)", 215 | "productRevenue": null, 216 | "localProductRevenue": null, 217 | "productPrice": "6990000", 218 | "localProductPrice": "6990000", 219 | "productQuantity": null, 220 | "productRefundAmount": null, 221 | "localProductRefundAmount": null, 222 | "isImpression": "true", 223 | "isClick": null, 224 | "customDimensions": [], 225 | "customMetrics": [], 226 | "productListName": "Category", 227 | "productListPosition": "4", 228 | "productCouponCode": null 229 | }, 230 | { 231 | "productSKU": "GGOEGFKA022299", 232 | "v2ProductName": "Keyboard DOT Sticker", 233 | "v2ProductCategory": "Home/Electronics/", 234 | "productVariant": "(not set)", 235 | "productBrand": "(not set)", 236 | "productRevenue": null, 237 | "localProductRevenue": null, 238 | "productPrice": "1500000", 239 | "localProductPrice": "1500000", 240 | "productQuantity": null, 241 | "productRefundAmount": null, 242 | "localProductRefundAmount": null, 243 | "isImpression": "true", 244 | "isClick": null, 245 | "customDimensions": [], 246 | "customMetrics": [], 247 | "productListName": "Category", 248 | "productListPosition": "5", 249 | "productCouponCode": null 250 | }, 251 | { 252 | "productSKU": "GGOEGCBB074399", 253 | "v2ProductName": "Google Device Holder Sticky Pad", 254 | "v2ProductCategory": "Home/Electronics/", 255 | "productVariant": "(not set)", 256 | "productBrand": "(not set)", 257 | "productRevenue": null, 258 | "localProductRevenue": null, 259 | "productPrice": "4990000", 260 | "localProductPrice": "4990000", 261 | "productQuantity": null, 262 | "productRefundAmount": null, 263 | "localProductRefundAmount": null, 264 | "isImpression": "true", 265 | "isClick": null, 266 | "customDimensions": [], 267 | "customMetrics": [], 268 | "productListName": "Category", 269 | "productListPosition": "6", 270 | "productCouponCode": null 271 | }, 272 | { 273 | "productSKU": "GGOEGCBC074299", 274 | "v2ProductName": "Google Device Stand", 275 | "v2ProductCategory": "Home/Electronics/", 276 | "productVariant": "(not set)", 277 | "productBrand": "(not set)", 278 | "productRevenue": null, 279 | "localProductRevenue": null, 280 | "productPrice": "4990000", 281 | "localProductPrice": "4990000", 282 | "productQuantity": null, 283 | "productRefundAmount": null, 284 | "localProductRefundAmount": null, 285 | "isImpression": "true", 286 | "isClick": null, 287 | "customDimensions": [], 288 | "customMetrics": [], 289 | "productListName": "Category", 290 | "productListPosition": "7", 291 | "productCouponCode": null 292 | }, 293 | { 294 | "productSKU": "GGOEGEHQ072499", 295 | "v2ProductName": "Google 2200mAh Micro Charger", 296 | "v2ProductCategory": "Home/Electronics/", 297 | "productVariant": "(not set)", 298 | "productBrand": "(not set)", 299 | "productRevenue": null, 300 | "localProductRevenue": null, 301 | "productPrice": "22990000", 302 | "localProductPrice": "22990000", 303 | "productQuantity": null, 304 | "productRefundAmount": null, 305 | "localProductRefundAmount": null, 306 | "isImpression": "true", 307 | "isClick": null, 308 | "customDimensions": [], 309 | "customMetrics": [], 310 | "productListName": "Category", 311 | "productListPosition": "8", 312 | "productCouponCode": null 313 | }, 314 | { 315 | "productSKU": "GGOEGEHQ072599", 316 | "v2ProductName": "Google 4400mAh Power Bank", 317 | "v2ProductCategory": "Home/Electronics/", 318 | "productVariant": "(not set)", 319 | "productBrand": "(not set)", 320 | "productRevenue": null, 321 | "localProductRevenue": null, 322 | "productPrice": "37990000", 323 | "localProductPrice": "37990000", 324 | "productQuantity": null, 325 | "productRefundAmount": null, 326 | "localProductRefundAmount": null, 327 | "isImpression": "true", 328 | "isClick": null, 329 | "customDimensions": [], 330 | "customMetrics": [], 331 | "productListName": "Category", 332 | "productListPosition": "9", 333 | "productCouponCode": null 334 | }, 335 | { 336 | "productSKU": "GGOEGESB015099", 337 | "v2ProductName": "Basecamp Explorer Powerbank Flashlight", 338 | "v2ProductCategory": "Home/Electronics/", 339 | "productVariant": "(not set)", 340 | "productBrand": "(not set)", 341 | "productRevenue": null, 342 | "localProductRevenue": null, 343 | "productPrice": "22990000", 344 | "localProductPrice": "22990000", 345 | "productQuantity": null, 346 | "productRefundAmount": null, 347 | "localProductRefundAmount": null, 348 | "isImpression": "true", 349 | "isClick": null, 350 | "customDimensions": [], 351 | "customMetrics": [], 352 | "productListName": "Category", 353 | "productListPosition": "10", 354 | "productCouponCode": null 355 | }, 356 | { 357 | "productSKU": "GGOEGESC014099", 358 | "v2ProductName": "Rocket Flashlight", 359 | "v2ProductCategory": "Home/Electronics/", 360 | "productVariant": "(not set)", 361 | "productBrand": "(not set)", 362 | "productRevenue": null, 363 | "localProductRevenue": null, 364 | "productPrice": "4990000", 365 | "localProductPrice": "4990000", 366 | "productQuantity": null, 367 | "productRefundAmount": null, 368 | "localProductRefundAmount": null, 369 | "isImpression": "true", 370 | "isClick": null, 371 | "customDimensions": [], 372 | "customMetrics": [], 373 | "productListName": "Category", 374 | "productListPosition": "11", 375 | "productCouponCode": null 376 | }, 377 | { 378 | "productSKU": "GGOEGESQ016799", 379 | "v2ProductName": "Plastic Sliding Flashlight", 380 | "v2ProductCategory": "Home/Electronics/", 381 | "productVariant": "(not set)", 382 | "productBrand": "(not set)", 383 | "productRevenue": null, 384 | "localProductRevenue": null, 385 | "productPrice": "12990000", 386 | "localProductPrice": "12990000", 387 | "productQuantity": null, 388 | "productRefundAmount": null, 389 | "localProductRefundAmount": null, 390 | "isImpression": "true", 391 | "isClick": null, 392 | "customDimensions": [], 393 | "customMetrics": [], 394 | "productListName": "Category", 395 | "productListPosition": "12", 396 | "productCouponCode": null 397 | } 398 | ], 399 | "promotion": [], 400 | "promotionActionInfo": null, 401 | "refund": null, 402 | "eCommerceAction": { 403 | "action_type": "0", 404 | "step": "1", 405 | "option": null 406 | }, 407 | "experiment": [], 408 | "publisher": null, 409 | "customVariables": [], 410 | "customDimensions": [], 411 | "customMetrics": [], 412 | "type": "PAGE", 413 | "social": { 414 | "socialInteractionNetwork": null, 415 | "socialInteractionAction": null, 416 | "socialInteractions": null, 417 | "socialInteractionTarget": null, 418 | "socialNetwork": "(not set)", 419 | "uniqueSocialInteractions": null, 420 | "hasSocialSourceReferral": "No", 421 | "socialInteractionNetworkAction": " : " 422 | }, 423 | "latencyTracking": null, 424 | "sourcePropertyInfo": null, 425 | "contentGroup": { 426 | "contentGroup1": "(not set)", 427 | "contentGroup2": "Electronics", 428 | "contentGroup3": "(not set)", 429 | "contentGroup4": "(not set)", 430 | "contentGroup5": "(not set)", 431 | "previousContentGroup1": "(entrance)", 432 | "previousContentGroup2": "(entrance)", 433 | "previousContentGroup3": "(entrance)", 434 | "previousContentGroup4": "(entrance)", 435 | "previousContentGroup5": "(entrance)", 436 | "contentGroupUniqueViews1": null, 437 | "contentGroupUniqueViews2": "1", 438 | "contentGroupUniqueViews3": null, 439 | "contentGroupUniqueViews4": null, 440 | "contentGroupUniqueViews5": null 441 | }, 442 | "dataSource": "web", 443 | "publisher_infos": [] 444 | } 445 | ], 446 | "fullVisitorId": "2248281639583218707", 447 | "userId": null, 448 | "clientId": null, 449 | "channelGrouping": "Organic Search", 450 | "socialEngagementType": "Not Socially Engaged" 451 | } 452 | ] -------------------------------------------------------------------------------- /google_analytics_schema.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alfredsasko/Customer-Journey-Analytics/c0cc8c28167c837f257753a8cb6870adf017b9aa/google_analytics_schema.xlsx -------------------------------------------------------------------------------- /helper.py: -------------------------------------------------------------------------------- 1 | """Helper scintific module 2 | Module serves for custom methods to support Customer Journey Analytics Project 3 | """ 4 | 5 | # IMPORTS 6 | # ------- 7 | 8 | # Standard libraries 9 | import re 10 | import ipdb 11 | import string 12 | import math 13 | 14 | # 3rd party libraries 15 | from google.cloud import bigquery 16 | 17 | import numpy as np 18 | import pandas as pd 19 | 20 | import nltk 21 | nltk.download(['wordnet', 'stopwords']) 22 | STOPWORDS = nltk.corpus.stopwords.words('english') 23 | 24 | from scipy import stats 25 | import statsmodels.api as sm 26 | from statsmodels.formula.api import ols 27 | import scikit_posthocs as sp 28 | 29 | from sklearn.preprocessing import StandardScaler 30 | from sklearn.model_selection import train_test_split 31 | from sklearn.pipeline import Pipeline 32 | from sklearn.feature_extraction.text import TfidfVectorizer 33 | from sklearn.feature_extraction.text import CountVectorizer 34 | from sklearn.model_selection import GridSearchCV 35 | from sklearn.naive_bayes import MultinomialNB 36 | from sklearn.naive_bayes import ComplementNB 37 | from sklearn.metrics import f1_score 38 | from sklearn.decomposition import PCA 39 | 40 | import rpy2 41 | import rpy2.rlike.container as rlc 42 | from rpy2 import robjects 43 | from rpy2.robjects.vectors import FloatVector 44 | from rpy2.robjects.vectors import ListVector 45 | from rpy2.robjects.vectors import StrVector 46 | from rpy2.robjects import pandas2ri 47 | 48 | from matplotlib import pyplot as plt 49 | import seaborn as sns 50 | 51 | 52 | # MODULE FUNCTIONS 53 | # ---------------- 54 | 55 | def get_dissimilarity(df, normalize=True): 56 | '''Calculates dissimilarity of observations from average 57 | observation. 58 | 59 | Args: 60 | df: Data as dataframe of shape (# observations, # variables) 61 | 62 | Returns: 63 | ser: Calculated dissimilrity as series of size (# observations) 64 | ''' 65 | 66 | # normalize data 67 | if normalize: 68 | df_scaled = StandardScaler().fit_transform(df) 69 | df = pd.DataFrame(df_scaled, columns=df.columns, index=df.index) 70 | else: 71 | raise Exception('Not implemented') 72 | 73 | # calculate multivariate dissimilarity 74 | diss = ((df - df.mean())**2).sum(axis=1)**(1/2) 75 | return diss 76 | 77 | 78 | def split_data(df, diss_var, dataset_names, threshold, dis_kws={}, **split_kws): 79 | '''Function randomly splits data into two sets, calates multivariate 80 | dissimilarity and keep all oultiers determined by dissimilarity 81 | treshold in each set. 82 | 83 | Args: 84 | df: Data as dataframe of shape (# samles, # features) 85 | diss_var: Names of variables to calculate dissimilarity measure 86 | as list of strings 87 | dataset_names: Names of datasets as list of strings 88 | threshold: Threshold for dissimilarity measure 89 | to determine outliers as float 90 | dis_kws: Key word arguments of dissimilarity function as dictionary 91 | split_kws: Key word arguents of train_test_split function 92 | 93 | Returns: 94 | datasets: Dictionary of splitted datasets as dataframe 95 | ''' 96 | 97 | # calculate dissimilarity series 98 | dis_kws['normalize'] = (True if 'normalize' not in dis_kws 99 | else dis_kws['normalize']) 100 | 101 | dissimilarity = get_dissimilarity(df[diss_var], dis_kws['normalize']) 102 | 103 | # Pop outlier customers 104 | ext_mask = (dissimilarity > threshold) 105 | X_ext = df.loc[ext_mask] 106 | X = df.loc[~ext_mask] 107 | 108 | # drop one random sample to keep even samples in dataset 109 | # for purpose of having same number of samples after splitting 110 | if X.shape[0] % 2 != 0: 111 | split_kws['random_state'] = (1 if 'random_state' not in split_kws 112 | else split_kws['random_state']) 113 | remove_n = 1 114 | drop_indices = (X.sample(remove_n, 115 | random_state=split_kws['random_state']) 116 | .index) 117 | X = X.drop(drop_indices) 118 | 119 | # Random split of sample in two groups 120 | Xa, Xb = train_test_split(X, **split_kws) 121 | datasets = [Xa, Xb] 122 | 123 | # add outliers to each group 124 | datasets = {dataset_name: dataset 125 | for dataset_name, dataset in zip(dataset_names, datasets)} 126 | 127 | for name, dataset in datasets.items(): 128 | datasets[name] = dataset.append(X_ext) 129 | 130 | return datasets 131 | 132 | 133 | def analyze_cluster_solution(df, vars_, labels, **kws): 134 | '''Analyzes cluster solution. Following analyses are done: 135 | 1) Hypothesis testing of clusters averages difference 136 | a) One way ANOVA 137 | b) ANOVA assumptions 138 | - residuals normality test: Shapiro-Wilk test 139 | - equal variances test: Leven's test 140 | c) Kruskal-Wallis non parametric test 141 | d) All-Pair non parametric test, Conover test by default 142 | 2) Cluster profile vizualization 143 | 3) Cluster scatterplot vizualization 144 | 145 | Args: 146 | df: Dataset as pandas dataframe 147 | of shape(# observations, # variables) 148 | vars_: Clustering variables as list of strings 149 | labels: Variable holding cluster labels as string 150 | kws: Key words arguments of post-hoc test 151 | 152 | Returns: 153 | summary: Dataframe of hypothesis tests 154 | post_hoc: List of post_hoc test for each clustering variable 155 | prof_ax: Axes of profile vizualization 156 | clst_pg: PairGrid of cluster vizulization 157 | ''' 158 | 159 | def color_not_significant_red(val, signf=0.05): 160 | '''Takes a scalar and returns a string withthe css property 161 | `'color: red'` for non significant p_value 162 | ''' 163 | color = 'red' if val > signf else 'black' 164 | return 'color: %s' % color 165 | 166 | # get number of seeds 167 | num_seeds = len(df.groupby(labels).groups) 168 | 169 | # run tests 170 | kws['post_hoc_fnc'] = (sp.posthoc_conover if 'post_hoc_fnc' not in kws 171 | else kws['post_hoc_fnc']) 172 | 173 | summary, post_hoc = profile_cluster_labels( 174 | df, labels, vars_, **kws) 175 | 176 | # print hypothesis tests 177 | str_ = 'PROFILE SUMMARY FOR {}'.format(labels.upper()) 178 | print(str_ + '\n' + '-' * len(str_) + '\n') 179 | 180 | str_ = 'Hypothesis testing of clusters averages difference' 181 | print(str_ + '\n' + '-' * len(str_)) 182 | 183 | display(summary.round(2)) 184 | 185 | # print post-hoc tests 186 | str_ = '\nPost-hoc test: {}'.format(kws['post_hoc_fnc'].__name__) 187 | print(str_ + '\n' + '-' * len(str_) + '\n') 188 | 189 | for var in post_hoc: 190 | print('\nclustering variable:', var) 191 | display(post_hoc[var].round(2) 192 | .style.applymap(color_not_significant_red)) 193 | 194 | # print profiles 195 | str_ = '\nProfile vizualization' 196 | print(str_ + '\n' + '-' * len(str_)) 197 | 198 | prof_ax = (df 199 | .groupby(labels) 200 | [vars_] 201 | .mean() 202 | .transpose() 203 | .plot(title='Cluster Profile') 204 | ) 205 | plt.ylabel('Standardized scale') 206 | plt.xlabel('Clustering variables') 207 | plt.show() 208 | 209 | # print scatterplots 210 | str_ = '\nClusters vizualization' 211 | print(str_ + '\n' + '-' * len(str_)) 212 | clst_pg = sns.pairplot(x_vars=['recency', 'monetary'], 213 | y_vars=['frequency', 'monetary'], 214 | hue=labels, data=df, height=3.5) 215 | clst_pg.set(yscale='log') 216 | clst_pg.axes[0, 1].set_xscale('log') 217 | clst_pg.fig.suptitle('Candidate Solution: {} seeds' 218 | .format(num_seeds), y=1.01) 219 | plt.show() 220 | 221 | return summary, post_hoc, prof_ax, clst_pg 222 | 223 | 224 | def profile_cluster_labels(df, group, outputs, post_hoc_fnc=sp.posthoc_conover): 225 | '''Test distinctiveness of cluster (group) labes across clustering (output) 226 | variables using one way ANOVA, shapiro_wilk normality test, 227 | leven's test of equal variances, Kruskla-Wallis non parametric tests and 228 | selected all-pairs post hoc test for each output variables. 229 | 230 | Args: 231 | df: Data with clustering variables and candidate solutions 232 | as dataframe of shape (# samples, # of variables + 233 | candidate solutions) 234 | 235 | group: group variables for hypothesis testing as string 236 | output: output variables for hypothesis testing as list of string 237 | Returns: 238 | results: Dataframe of hypothesis tests for each output 239 | ''' 240 | 241 | # initiate summmary dataframe 242 | summary = (df.groupby(group)[outputs] 243 | .agg(['mean', 'median']) 244 | .T.unstack(level=-1) 245 | .swaplevel(axis=1) 246 | .sort_index(level=0, axis=1)) 247 | # initiate posthoc dictionary 248 | post_hoc = {} 249 | 250 | # cycle over ouptputs 251 | for i, output in enumerate(outputs): 252 | 253 | # split group levels 254 | levels = [df[output][df[group] == level] 255 | for level in df[group].unique()] 256 | 257 | # calculate F statistics and p-value 258 | _, summary.loc[output, 'anova_p'] = stats.f_oneway(*levels) 259 | 260 | # calculate leven's test for equal variances 261 | _, summary.loc[output, 'levene_p'] = stats.levene(*levels) 262 | 263 | # check if residuals are normally distributed by shapiro wilk test 264 | model = ols('{} ~ C({})'.format(output, group), data=df).fit() 265 | _, summary.loc[output, 'shapiro_wilk_p'] = stats.shapiro(model.resid) 266 | 267 | # calculate H statistics and p-value for Kruskal Wallis test 268 | _, summary.loc[output, 'kruskal_wallis_p'] = stats.kruskal(*levels) 269 | 270 | # multiple comparison Conover's test 271 | post_hoc[output] = post_hoc_fnc( 272 | df, val_col=output, group_col=group) #, p_adjust ='holm') 273 | 274 | return summary, post_hoc 275 | 276 | def get_missmatch(**kws): 277 | ''' 278 | Cross tabulates dataframe on 2 selected columns and 279 | calculates missmatch proportion of rows and total 280 | 281 | Args: 282 | kws: Key word arguments to pd.crosstab function 283 | 284 | Returns: 285 | crosst_tab: result of cross tabulation as dataframe 286 | missmatch_rows: missmatch proportion by rows as series 287 | total_missmatch: total missmatch proportion as float 288 | 289 | ''' 290 | 291 | cross_tab = pd.crosstab(**kws) 292 | missmatch_rows = (cross_tab.sum(axis=1) - cross_tab.max(axis=1)) 293 | total_missmatch = missmatch_rows.sum() / cross_tab.sum().sum() 294 | missmatch_rows = missmatch_rows / cross_tab.sum(axis=1) 295 | missmatch_rows.name = 'missmatch_proportion' 296 | 297 | return cross_tab, missmatch_rows, total_missmatch 298 | 299 | def query_product_info(client, query_params): 300 | '''Query product information from bigquery database. 301 | Distinct records of product_sku, product_name, 302 | product_brand, product_brand_grp, 303 | product_category, product_category_grp, 304 | Args: 305 | client: Instatiated bigquery.Client to query distinct product 306 | description(product_sku, product_name, product_category, 307 | product_category_grp) 308 | query_params: Query parameters for client 309 | Returns: 310 | product_df: product information as distict records 311 | as pandas dataframe (# records, # variables) 312 | ''' 313 | 314 | # Check arguments 315 | # ---------------- 316 | assert isinstance(client, bigquery.Client) 317 | assert isinstance(query_params, list) 318 | 319 | # Query distinct products descriptions 320 | # ------------------------------------ 321 | query=''' 322 | SELECT DISTINCT 323 | hits_product.productSku AS product_sku, 324 | hits_product.v2productName AS product_name, 325 | hits_product.productBrand AS product_brand, 326 | hits.contentGroup.contentGroup1 AS product_brand_grp, 327 | hits_product.v2productCategory AS product_category, 328 | hits.contentGroup.contentGroup2 AS product_category_grp 329 | FROM 330 | `bigquery-public-data.google_analytics_sample.ga_sessions_*` 331 | LEFT JOIN UNNEST(hits) AS hits 332 | LEFT JOIN UNNEST(hits.product) AS hits_product 333 | WHERE 334 | _TABLE_SUFFIX BETWEEN @start_date AND @end_date 335 | AND hits_product.productSku IS NOT NULL 336 | ORDER BY 337 | product_sku 338 | ''' 339 | 340 | job_config = bigquery.QueryJobConfig() 341 | job_config.query_parameters = query_params 342 | df = client.query(query, job_config=job_config).to_dataframe() 343 | 344 | return df 345 | 346 | def reconstruct_brand(product_sku, df): 347 | '''Reconstructs brand from product name and brand variables 348 | Args: 349 | product_sku: product_sku as of transaction records on product level 350 | of size # transactions on produc level 351 | df: Product information as output of 352 | helper.query_product_info in form of dataframe 353 | of shape (# of distinct records, # of variables) 354 | 355 | Returns: 356 | recon_brand: reconstructed brand column as pandas series 357 | of size # of transactions 358 | ''' 359 | 360 | # predict brand name from product name for each sku 361 | # ------------------------------------------------- 362 | 363 | # valid brands 364 | brands = ['Android', 365 | 'Chrome', 366 | r'\bGo\b', 367 | 'Google', 368 | 'Google Now', 369 | 'YouTube', 370 | 'Waze'] 371 | 372 | # concatenate different product names for each sku 373 | brand_df = (df[['product_sku', 'product_name']] 374 | .drop_duplicates() 375 | .groupby('product_sku') 376 | ['product_name'] 377 | .apply(lambda product_name: ' '.join(product_name)) 378 | .reset_index() 379 | ) 380 | 381 | # drop (no set) sku's 382 | brand_df = brand_df.drop( 383 | index=brand_df.index[brand_df['product_sku'] == '(not set)']) 384 | 385 | 386 | # predict brand name from product name for each sku 387 | brand_df['recon_brand'] = ( 388 | brand_df['product_name'] 389 | .str.extract(r'({})'.format('|'.join(set(brands)), 390 | flags=re.IGNORECASE)) 391 | ) 392 | 393 | # adjust brand taking account spelling errors in product names 394 | brand_df.loc[ 395 | brand_df['product_name'].str.contains('You Tube', case=False), 396 | 'recon_brand' 397 | ] = 'YouTube' 398 | 399 | 400 | # predict brand name from brand variables for sku's where 401 | # brand couldn't be predected from product name 402 | # -------------------------------------------------------- 403 | 404 | # get distinct product_sku and brand variables associations 405 | brand_vars = ['product_brand', 'product_brand_grp'] 406 | brand_var = dict() 407 | for brand in brand_vars: 408 | brand_var[brand] = (df[['product_sku', brand]] 409 | .drop(index=df.index[(df['product_sku'] == '(not set)') 410 | | df['product_sku'].isna() 411 | | (df[brand] == '(not set)') 412 | | df[brand].isna()]) 413 | .drop_duplicates() 414 | .drop_duplicates(subset='product_sku', keep=False)) 415 | 416 | # check for brand abiguity at sku level 417 | old_brand = brand_var['product_brand'].set_index('product_sku') 418 | new_brand = brand_var['product_brand_grp'].set_index('product_sku') 419 | shared_sku = old_brand.index.intersection(new_brand.index) 420 | 421 | if not shared_sku.empty: 422 | 423 | # delete sku's with abigious brands 424 | ambigious_sku = shared_sku[ 425 | old_brand[shared_sku].squeeze().values 426 | != new_brand[shared_sku].squeeze().values 427 | ] 428 | 429 | old_brand = old_brand.drop(index=ambigious_sku, errors='ignore') 430 | new_brand = new_brand.drop(index=ambigious_sku, errors='ignore') 431 | 432 | # delete sku's with multiple brands in new_brand 433 | multiple_sku = shared_sku[ 434 | old_brand[shared_sku].squeeze().values 435 | == new_brand[shared_sku].squeeze().values 436 | ] 437 | 438 | new_brand = new_brand.drop(index=multiple_sku, errors='ignore') 439 | 440 | # concatenate all associations of brand variables and product sku's 441 | brand_var = pd.concat([old_brand.rename(columns={'product_brand': 442 | 'recon_brand_var'}), 443 | new_brand.rename(columns={'product_brand_grp': 444 | 'recon_brand_var'})]) 445 | 446 | # predict brand name from brand variables 447 | brand_df.loc[brand_df['recon_brand'].isna(), 'recon_brand'] = ( 448 | pd.merge(brand_df['product_sku'], brand_var, on='product_sku', how='left') 449 | ['recon_brand_var'] 450 | ) 451 | 452 | # recode remaining missing (not set) brands by Google brand 453 | # --------------------------------------------------------- 454 | brand_df['recon_brand'] = brand_df['recon_brand'].fillna('Google') 455 | 456 | # predict brand from brand names and variables on transaction data 457 | # ---------------------------------------------------------------- 458 | recon_brand = (pd.merge(product_sku.to_frame(), 459 | brand_df[['product_sku', 'recon_brand']], 460 | on='product_sku', 461 | how='left') 462 | .reindex(product_sku.index) 463 | ['recon_brand']) 464 | 465 | return recon_brand 466 | 467 | def reconstruct_category(product_sku, df, category_spec): 468 | '''Reconstructs category from category variables and product names. 469 | 470 | Args: 471 | product_sku: product_sku from transaction records on product level 472 | of size # transactions on product level 473 | df: Product information as output of 474 | helper.query_product_info in form of dataframe 475 | of shape (# of distinct records, # of variables) 476 | 477 | category_spec: Dictionary with keys as category variable names 478 | and values as mappings between category variable levels 479 | to category labels in form of dataframe 480 | 481 | Returns: 482 | recon_category: reconstructed category column as pandas series 483 | of size # of trasactions on product level 484 | category_df: mappings of unique sku to category labels 485 | ''' 486 | 487 | # Check arguments 488 | # ---------------- 489 | assert isinstance(product_sku, pd.Series) 490 | assert isinstance(df, pd.DataFrame) 491 | assert isinstance(category_spec, dict) 492 | 493 | # reconstruct category name from product name for each sku 494 | # -------------------------------------------------------- 495 | 496 | def get_category_representation(category_label, valid_categories): 497 | '''Handle multiple categories assigned to one sku. 498 | For ambigious categories returns missing value. 499 | 500 | Args: 501 | category_label: Series of category labels for 502 | particular sku 503 | valid_categories: Index of valid unique categories 504 | Returns: 505 | label: valid category label or missing value 506 | ''' 507 | label = valid_categories[valid_categories.isin(category_label)] 508 | if label.empty or label.size > 1: 509 | return np.nan 510 | else: 511 | return label[0] 512 | 513 | 514 | def label_category_variable(df, category_var, label_spec): 515 | '''reconstruct category labels from category variable. 516 | 517 | Args: 518 | df: Product information dataframe. 519 | category_var: Name of category variabel to reconstruct labels 520 | label_spec: Label mapping between category variable levels 521 | and labels. 522 | 523 | Returns: 524 | var_label: Label mapping to sku as dataframe 525 | 526 | ''' 527 | 528 | valid_categories = pd.Index(label_spec 529 | .groupby(['category_label']) 530 | .groups 531 | .keys()) 532 | 533 | var_label = (pd.merge(df[['product_name', category_var]] 534 | .drop_duplicates(), 535 | label_spec, 536 | how='left', 537 | on=category_var) 538 | [['product_name', 'category_label']] 539 | .groupby('product_name') 540 | ['category_label'] 541 | .apply(get_category_representation, 542 | valid_categories=valid_categories) 543 | .reset_index()) 544 | 545 | return var_label 546 | 547 | def screen_fit_model(data): 548 | '''Screens Naive Bayes Classifiers and selects best model 549 | based on f1 weigted score. Returns fitted model and score. 550 | 551 | Args: 552 | data: Text and respective class labels as dataframe 553 | of shape (# samples, [text, labels]) 554 | 555 | Returns: 556 | model: Best fitted sklearn model 557 | f1_weighted_score: Test f1 weighted score 558 | 559 | Note: Following hyperparameters are tested 560 | Algorithm: MultinomialNB, ComplementNB 561 | ngrams range: (1, 1), (1, 2), (1, 3) 562 | binarization: False, True 563 | ''' 564 | 565 | # vectorize text inforomation in product_name 566 | def preprocessor(text): 567 | # not relevant words 568 | not_relevant_words = ['google', 569 | 'youtube', 570 | 'waze', 571 | 'android'] 572 | 573 | # transform text to lower case and remove punctuation 574 | text = ''.join([word.lower() for word in text 575 | if word not in string.punctuation]) 576 | 577 | # tokenize words 578 | tokens = re.split('\W+', text) 579 | 580 | # Drop not relevant words and lemmatize words 581 | wn = nltk.WordNetLemmatizer() 582 | text = ' '.join([wn.lemmatize(word) for word in tokens 583 | if word not in not_relevant_words + STOPWORDS]) 584 | 585 | return text 586 | 587 | # define pipeline 588 | pipe = Pipeline([('vectorizer', CountVectorizer()), 589 | ('classifier', None)]) 590 | 591 | # define hyperparameters 592 | param_grid = dict(vectorizer__ngram_range=[(1, 1), (1, 2), (1, 3)], 593 | vectorizer__binary=[False, True], 594 | classifier=[MultinomialNB(), 595 | ComplementNB()]) 596 | 597 | # screen naive buyes models 598 | grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=5, 599 | scoring='f1_weighted', n_jobs=-1) 600 | 601 | # devide dataset to train and test set using stratification 602 | # due to high imbalance of lables frequencies 603 | x_train, x_test, y_train, y_test = train_test_split( 604 | data['product_name'], 605 | data['recon_category'], 606 | test_size=0.25, 607 | stratify=data['recon_category'], 608 | random_state=1) 609 | 610 | # execute screening and select best model 611 | grid_search.fit(x_train, y_train) 612 | 613 | # calculate f1 weighted test score 614 | y_pred = grid_search.predict(x_test) 615 | f1_weigted_score = f1_score(y_test, y_pred, average='weighted') 616 | 617 | return grid_search.best_estimator_, f1_weigted_score 618 | 619 | # reconstruct category label from cateogry variables 620 | recon_labels = dict() 621 | for var, label_spec in category_spec.items(): 622 | recon_labels[var] = (label_category_variable(df, var, label_spec) 623 | .set_index('product_name')) 624 | 625 | recon_labels['product_category'][ 626 | recon_labels['product_category'].isna() 627 | ] = recon_labels['product_category_grp'][ 628 | recon_labels['product_category'].isna() 629 | ] 630 | 631 | # reconstruct category label from produc names 632 | valid_categories = pd.Index(category_spec['product_category_grp'] 633 | .groupby(['category_label']) 634 | .groups 635 | .keys()) 636 | 637 | category_df = (pd.merge(df[['product_sku', 'product_name']] 638 | .drop_duplicates(), 639 | recon_labels['product_category'], 640 | how='left', 641 | on = 'product_name') 642 | [['product_sku', 'product_name', 'category_label']] 643 | .groupby('product_sku') 644 | .agg({'product_name': lambda name: name.str.cat(sep=' '), 645 | 'category_label': lambda label: 646 | get_category_representation(label, valid_categories)}) 647 | .reset_index()) 648 | 649 | category_df.rename(columns={'category_label': 'recon_category'}, 650 | inplace=True) 651 | 652 | # associate category from category names and variables on transaction data 653 | # ------------------------------------------------------------------------ 654 | recon_category = (pd.merge(product_sku.to_frame(), 655 | category_df[['product_sku', 'recon_category']], 656 | on='product_sku', 657 | how='left') 658 | ) 659 | 660 | # predict category of transactions where category is unknown 661 | # Multinomial and Complement Naive Bayes model is screened 662 | # and finetuned using 1-grams, 2-grams and 3-grams 663 | # as well as binarization (Tru or False) 664 | # best model is selected based on maximizing test f1 weigted score 665 | # ---------------------------------------------------------------- 666 | 667 | # screen best model and fit it on training data 668 | model, f1_weighted_score = screen_fit_model( 669 | category_df[['product_name', 'recon_category']] 670 | .dropna() 671 | ) 672 | 673 | # predict category labels if model has f1_weighted_score > threshold 674 | f1_weighted_score_threshold = 0.8 675 | if f1_weighted_score < f1_weighted_score_threshold: 676 | raise Exception( 677 | 'Accuracy of category prediction below threshold {:.2f}' 678 | .format(f1_weighted_score_threshold)) 679 | else: 680 | product_name = (pd.merge(recon_category 681 | .loc[recon_category['recon_category'].isna(), 682 | ['product_sku']], 683 | category_df[['product_sku', 'product_name']], 684 | how='left', 685 | on='product_sku') 686 | ['product_name']) 687 | 688 | category_label = model.predict(product_name) 689 | recon_category.loc[recon_category['recon_category'].isna(), 690 | 'recon_category'] = category_label 691 | 692 | return recon_category['recon_category'] 693 | 694 | def reconstruct_sales_region(subcontinent): 695 | '''Reconstruct sales region from subcontinent''' 696 | 697 | if (pd.isna(subcontinent) 698 | or subcontinent.lower() == '(not set)'): 699 | sales_region = np.nan 700 | 701 | elif ('africa' in subcontinent.lower() 702 | or 'europe' in subcontinent.lower()): 703 | sales_region = 'EMEA' 704 | 705 | elif ('caribbean' in subcontinent.lower() 706 | or subcontinent.lower() == 'central america'): 707 | sales_region = 'Central America' 708 | 709 | elif subcontinent.lower() == 'northern america': 710 | sales_region = 'North America' 711 | 712 | elif subcontinent.lower() == 'south america': 713 | sales_region = 'South America' 714 | 715 | elif ('asia' in subcontinent.lower() 716 | or subcontinent.lower() == 'australasia'): 717 | sales_region = 'APAC' 718 | 719 | else: 720 | raise Exception( 721 | 'Can not assign sales region to {} subcontinent' 722 | .format(subcontinent)) 723 | 724 | return sales_region 725 | 726 | def reconstruct_traffic_keyword(text): 727 | '''Reconstructs traffic keywords to more simple representation''' 728 | 729 | # if empty rename to not applicable 730 | if pd.isna(text): 731 | text = '(not applicable)' 732 | 733 | # if one word with mixed numbers & letters rename to (not relevant) 734 | elif re.search(r'(?=.*\d)(?=.*[A-Z=\-])(?=.*[a-z])([\w=-]+)', text): 735 | text = '(not relevant)' 736 | 737 | elif ((text != '(not provided)') 738 | and (re.search('(\s+)', text) is not None)): 739 | 740 | # transform text to lower case and remove punctuation 741 | text = ''.join([word.lower() for word in text 742 | if word not in string.punctuation.replace('/', '')]) 743 | 744 | # tokenize words 745 | tokens = re.split('\W+|/', text) 746 | 747 | # Drop not relevant words and lemmatize words 748 | wn = nltk.WordNetLemmatizer() 749 | text = ' '.join([wn.lemmatize(word) for word in tokens 750 | if word not in STOPWORDS]) 751 | 752 | return text 753 | 754 | 755 | def aggregate_data(df): 756 | '''Encode and aggregate engineered and missing value free data 757 | on client level 758 | 759 | Args: 760 | df: engineered and missing value free data as 761 | pandas dataframe of shape (# transaction items, # variables) 762 | 763 | agg_df: encoded and aggregated dataframe 764 | of shape(# clients, # encoded & engineered variables) 765 | with client_id index 766 | ''' 767 | # identifiers 768 | id_vars = pd.Index( 769 | ['client_id', 770 | 'session_id', 771 | 'transaction_id', 772 | 'product_sku'] 773 | ) 774 | 775 | # session variables 776 | session_vars = pd.Index( 777 | ['visit_number', # avg_visits 778 | 'date', # month, week, week_day + one hot encode + sum 779 | 'pageviews', # avg_pageviews 780 | 'time_on_site', # avg_time_on_site 781 | 'ad_campaign', # sum 782 | 'source', # one hot encode + sum 783 | 'browser', # one hot encode + sum 784 | 'operating_system', # one hot encode + sum 785 | 'device_category', # one hot encode + sum 786 | 'continent', # one hot encode + sum 787 | 'subcontinent', # one hot encode + sum 788 | 'country', # one hot encode + sum 789 | 'sales_region', # one hot encode + sum 790 | 'social_referral', # sum 791 | 'social_network', # one hot encode + sum 792 | 'channel_group'] # one hot encode + sum 793 | ) 794 | 795 | # group session variables from item to session level 796 | session_df = (df[['client_id', 797 | 'session_id', 798 | *session_vars.to_list()]] 799 | .drop_duplicates() 800 | 801 | # drop ambigious region 1 case 802 | .drop_duplicates(subset='session_id')) 803 | 804 | # reconstruct month, weeek and week day variables 805 | # session_df['month'] = session_df['date'].dt.month 806 | # session_df['week'] = session_df['date'].dt.week 807 | session_df['week_day'] = session_df['date'].dt.weekday + 1 808 | session_df = session_df.drop(columns='date') 809 | 810 | # encode variables on session level 811 | keep_vars = [ 812 | 'client_id', 813 | 'session_id', 814 | 'visit_number', 815 | 'pageviews', 816 | 'time_on_site', 817 | 'social_referral', 818 | 'ad_campaign' 819 | ] 820 | 821 | encode_vars = session_df.columns.drop(keep_vars) 822 | enc_session_df = pd.get_dummies(session_df, 823 | columns=encode_vars.to_list(), 824 | prefix_sep='*') 825 | 826 | # remove not relevant encoded variables 827 | enc_session_df = enc_session_df.drop( 828 | columns=enc_session_df.columns[ 829 | enc_session_df.columns.str.contains('not set|other') 830 | ] 831 | ) 832 | 833 | # summarize session level variables on customer level 834 | sum_vars = (pd.Index(['social_referral', 'ad_campaign']) 835 | .append(enc_session_df 836 | .columns 837 | .drop(keep_vars))) 838 | 839 | client_session_sum_df = (enc_session_df 840 | .groupby('client_id') 841 | [sum_vars] 842 | .sum()) 843 | 844 | client_session_avg_df = ( 845 | enc_session_df 846 | .groupby('client_id') 847 | .agg(avg_visits=('visit_number', 'mean'), 848 | avg_pageviews=('pageviews', 'mean'), 849 | avg_time_on_site=('time_on_site', 'mean')) 850 | ) 851 | 852 | client_session_df = pd.concat([client_session_avg_df, 853 | client_session_sum_df], 854 | axis=1) 855 | 856 | 857 | # product level variables 858 | product_vars = pd.Index([ 859 | # 'product_name', # one hot encode + sum 860 | 'product_category', # one hot encode + sum 861 | 'product_price', # avg_product_revenue 862 | 'product_quantity', # avg_product_revenue 863 | 'hour'] # one hot encoded + sum 864 | ) 865 | 866 | avg_vars = pd.Index([ 867 | 'product_price', 868 | 'product_quantity' 869 | ]) 870 | 871 | sum_vars = pd.Index([ 872 | # 'product_name', 873 | 'product_category', 874 | 'hour' 875 | ]) 876 | 877 | enc_product_df = pd.get_dummies(df[id_vars.union(product_vars)], 878 | columns=sum_vars, 879 | prefix_sep='*') 880 | 881 | # summarize product level variables on customer level 882 | client_product_sum_df = (enc_product_df 883 | .groupby('client_id') 884 | [enc_product_df.columns.drop(avg_vars)] 885 | .sum()) 886 | 887 | def average_product_vars(client): 888 | d = {} 889 | d['avg_product_revenue'] = ((client['product_price'] 890 | * client['product_quantity']) 891 | .sum() 892 | / client['product_quantity'].sum()) 893 | 894 | # ipdb.set_trace(context=15) 895 | d['avg_unique_products'] = (client 896 | .groupby('transaction_id') 897 | ['product_sku'] 898 | .apply(lambda sku: len(sku.unique())) 899 | .mean()) 900 | 901 | return pd.Series(d, index=['avg_product_revenue', 902 | 'avg_unique_products']) 903 | 904 | client_product_avg_df = (enc_product_df 905 | .groupby('client_id') 906 | .apply(average_product_vars)) 907 | 908 | client_product_df = pd.concat([client_product_avg_df, 909 | client_product_sum_df] 910 | , axis=1) 911 | 912 | agg_df = pd.concat([client_session_df, 913 | client_product_df], 914 | axis=1) 915 | 916 | return agg_df 917 | 918 | 919 | def do_pca(X_std, **kwargs): 920 | '''# Apply PCA to the data.''' 921 | pca = PCA(**kwargs) 922 | model = pca.fit(X_std) 923 | X_pca = model.transform(X_std) 924 | return pca, X_pca 925 | 926 | def scree_pca(pca, plot=False, **kwargs): 927 | '''Investigate the variance accounted for by each principal component.''' 928 | # PCA components 929 | n_pcs = len(pca.components_) 930 | pcs = pd.Index(range(1, n_pcs+1), name='principal component') 931 | 932 | # Eigen Values 933 | eig = pca.explained_variance_.reshape(n_pcs, 1) 934 | eig_df = pd.DataFrame(np.round(eig, 2), columns=['eigen_value'], index=pcs) 935 | eig_df['cum_eigen_value'] = np.round(eig_df['eigen_value'].cumsum(), 2) 936 | 937 | # Explained Variance % 938 | var = pca.explained_variance_ratio_.reshape(n_pcs, 1) 939 | var_df = pd.DataFrame(np.round(var, 4), 940 | columns=['explained_var'], 941 | index=pcs) 942 | var_df['cum_explained_var'] = (np.round(var_df['explained_var'].cumsum() 943 | / var_df['explained_var'].sum(), 4)) 944 | 945 | df = pd.concat([eig_df, var_df], axis=1) 946 | 947 | if plot: 948 | # scree plot limit 949 | limit = pd.DataFrame(np.ones((n_pcs, 1)), 950 | columns=['scree_plot_limit'], index=pcs) 951 | 952 | ax = (pd.concat([df, limit], axis=1) 953 | .plot(y=['eigen_value', 'explained_var', 'scree_plot_limit'], 954 | title='PCA: Scree test & Variance Analysis', **kwargs) 955 | ) 956 | df.plot(y=['cum_explained_var'], secondary_y=True, ax=ax) 957 | 958 | return df 959 | 960 | def get_pc_num(scree_df, pc_num = None, exp_var_threshold=None, 961 | eig_val_threshold=1): 962 | ''' 963 | Selects optimum number of prinipal components according specified ojectives 964 | wheter % of explained variance or eig_val criterion 965 | 966 | Args: 967 | scree_df: Dataframe as ouptu of scree_pca function 968 | exp_var_threshold: threshold for cumulative % of epxlained variance 969 | eig_val_threshold: min eigen value, 1 by default 970 | 971 | Returns: 972 | pc_num: Number of selelected principal components 973 | exp_var: Explained variance by selected components 974 | sum_eig: Sum of eigen values of selected components 975 | ''' 976 | # check arguments 977 | assert pc_num is None or pc_num <= scree_df.index.size 978 | assert exp_var_threshold is None or (0 < exp_var_threshold <= 1) 979 | assert 0 < eig_val_threshold < scree_df.index.size 980 | 981 | assert (pc_num is None or exp_var_threshold is not None) or \ 982 | (pc_num is not None or exp_var_threshold is None), \ 983 | ('''Either number of principal components or minimum variance 984 | explained should be selected''') 985 | 986 | if exp_var_threshold: 987 | pcs = scree_df.index[scree_df['cum_explained_var'] <= exp_var_threshold] 988 | 989 | elif pc_num: 990 | pcs = scree_df.index[range(1, pc_num+1)] 991 | 992 | elif exp_var_threshold is None: 993 | pcs = scree_df.index[scree_df['eigen_value'] > eig_val_threshold] 994 | 995 | pc_num = len(pcs) 996 | exp_var = scree_df.loc[pc_num, 'cum_explained_var'] 997 | sum_eig = scree_df.loc[[*pcs], 'eigen_value'].sum() 998 | 999 | return pc_num, exp_var, sum_eig 1000 | 1001 | def varimax(factor_df, **kwargs): 1002 | ''' 1003 | varimax rotation of factor matrix 1004 | 1005 | Args: 1006 | factor_df: factor matrix as pd.DataFrame with shape 1007 | (# features, # principal components) 1008 | 1009 | Return: 1010 | rot_factor_df: rotated factor matrix as pd.DataFrame 1011 | ''' 1012 | factor_mtr = df2mtr(factor_df) 1013 | varimax = robjects.r['varimax'] 1014 | rot_factor_mtr = varimax(factor_mtr) 1015 | return pandas2ri.ri2py(rot_factor_mtr.rx2('loadings')) 1016 | 1017 | def get_components(df, pca, rotation=None, sort_by='sig_ld', 1018 | feat_details=None, plot='None', **kwargs): 1019 | ''' 1020 | Show significant factor loadings depending on sample size 1021 | 1022 | Args: 1023 | df: data used for pca as pd.DataFrame 1024 | pca: fitted pca object 1025 | rotation: if to apply factor matrix rotation, by default None. 1026 | sort_by: sort sequence of components, by default accoring 1027 | number of significant loadings 'sig_load' 1028 | feat_details: Dictionary of mapped feature detials, by default None 1029 | plot: 'discrete' plots heatmap enhancing sifinigicant laodings 1030 | 'continuous' plots continous heatmap, 1031 | by default None 1032 | 1033 | Returns: 1034 | factor_df: factor matrix as pd.DataFrame 1035 | of shape (# features, # components) 1036 | sig_ld: number of significant loadings across components as 1037 | pd. Series of size # components 1038 | cross_ld: number of significant loadings across features 1039 | (cross loadings) as pd. Series of size # features 1040 | 1041 | ''' 1042 | # constants 1043 | # --------- 1044 | maxstr = 100 # amount of the characters to print 1045 | 1046 | # guidelines for indentifying significant factor loadings 1047 | # based on sample size. Source: Multivariate Data Analysis. 7th Edition. 1048 | factor_ld = np.linspace(0.3, 0.75, 10) 1049 | signif_sz = np.array([350, 250, 200, 150, 120, 100, 85, 70, 60, 50]) 1050 | 1051 | # loadings significant treshold 1052 | ld_sig = factor_ld[len(factor_ld) - (signif_sz <= df.index.size).sum()] 1053 | 1054 | if rotation == 'varimax': 1055 | components = varimax(pd.DataFrame(pca.components_.T)) 1056 | else: 1057 | components = pca.components_.T 1058 | 1059 | # annotate factor matrix 1060 | index = pd.Index([]) 1061 | for feat in df.columns: 1062 | try: 1063 | index = index.append( 1064 | pd.Index([feat]) if feat_details is None else \ 1065 | pd.Index([feat_details[feat]['long_name'][:maxstr]])) 1066 | except KeyError: 1067 | index = index.append(pd.Index([feat])) 1068 | 1069 | factor_df = pd.DataFrame( 1070 | np.round(components, 2), 1071 | columns = pd.Index(range(1, components.shape[1]+1), 1072 | name='principal_components'), 1073 | index = index.rename('features') 1074 | ) 1075 | 1076 | # select significant loadings 1077 | sig_mask = (factor_df.abs() >= ld_sig) 1078 | 1079 | # calculate cross loadings 1080 | cross_ld = (sig_mask.sum(axis=1) 1081 | .sort_values(ascending=False) 1082 | .rename('cross_loadings')) 1083 | 1084 | # calculate number of significant loadings per component 1085 | sig_ld = (sig_mask.sum() 1086 | .sort_values(ascending=False) 1087 | .rename('significant_loadings')) 1088 | 1089 | # sort vactor matrix by loadings in components 1090 | sort_by = [*sig_ld.index] if sort_by == 'sig_ld' else sort_by 1091 | factor_df.sort_values(sort_by, ascending=False, inplace=True) 1092 | 1093 | if plot == 'continuous': 1094 | plt.figure(**kwargs) 1095 | sns.heatmap( 1096 | factor_df.sort_values(sort_by, ascending=False).T, 1097 | cmap='RdYlBu', vmin=-1, vmax=1, square=True 1098 | ) 1099 | plt.title('Factor matrix') 1100 | 1101 | elif plot == 'discrete': 1102 | # loadings limits 1103 | ld_min, ld_sig_low, ld_sig_high, ld_max = -1, -ld_sig, ld_sig, 1 1104 | vmin, vmax = ld_min, ld_max 1105 | 1106 | # create descrete scale to distingish sifnificant diffrence categories 1107 | data = factor_df.apply( 1108 | lambda col: pd.to_numeric(pd.cut(col, 1109 | [ld_min, -ld_sig, ld_sig, ld_max], 1110 | labels=[-ld_sig, 0, ld_sig]))) 1111 | 1112 | # plot heat map 1113 | fig = plt.figure(**kwargs) 1114 | sns.heatmap(data.T, cmap='viridis', vmin=vmin, vmax=vmax, square=True) 1115 | plt.title('Factor matrix with significant laodings: {} > loading > {}' 1116 | .format(-ld_sig, ld_sig)); 1117 | 1118 | return factor_df, sig_ld, cross_ld 1119 | 1120 | def df2mtr(df): 1121 | ''' 1122 | Convert pandas dataframe to r matrix. Category dtype is casted as 1123 | factorVector considering missing values 1124 | (original py2ri function of rpy2 can't handle this properly so far) 1125 | 1126 | Args: 1127 | data: pandas dataframe of shape (# samples, # features) 1128 | with numeric dtype 1129 | 1130 | Returns: 1131 | mtr: r matrix of shape (# samples # features) 1132 | ''' 1133 | # check arguments 1134 | assert isinstance(df, pd.DataFrame), 'Argument df need to be a pd.Dataframe.' 1135 | 1136 | # select only numeric columns 1137 | df = df.select_dtypes('number') 1138 | 1139 | # create and return r matrix 1140 | values = FloatVector(df.values.flatten()) 1141 | dimnames = ListVector( 1142 | rlc.OrdDict([('index', StrVector(tuple(df.index))), 1143 | ('columns', StrVector(tuple(df.columns)))]) 1144 | ) 1145 | 1146 | return robjects.r.matrix(values, nrow=len(df.index), ncol=len(df.columns), 1147 | dimnames = dimnames, byrow=True) 1148 | 1149 | def screen_model(X_train, X_test, y_train, y_test, grid_search, fine_param=None, 1150 | title='MODEL SCREENING EVALUATION', verbose='text'): 1151 | '''Screen pipeline with diffrent hyperparameters. 1152 | 1153 | Args: 1154 | X_train, X_test: Pandas DataFrame of shape (# samples, # features) 1155 | _train - training set, _test - test set 1156 | y_train, y_test: Pandas Series of size (# of samples, label) 1157 | grid_search: GridSearchCV object 1158 | verbose: 'text' shows grid_search results DataFrame 1159 | 'plot' shows scores run chart with fine_param 1160 | fine_param: name of the parameter to fine tune model. Used only with 1161 | 'plot' option 1162 | 1163 | Returns: 1164 | grid_search: fitted the grid_search object 1165 | 1166 | ''' 1167 | # screen models 1168 | grid_search.fit(X_train, y_train) 1169 | 1170 | # print output 1171 | if verbose == 'text': 1172 | # screen results 1173 | screen_results = (pd.DataFrame(grid_search.cv_results_) 1174 | .sort_values('rank_test_score')) 1175 | 1176 | hyper_params = screen_results.columns[ 1177 | screen_results.columns.str.contains('param_') 1178 | ] 1179 | 1180 | if 'param_classifier' in screen_results: 1181 | screen_results['param_classifier'] = ( 1182 | screen_results['param_classifier'] 1183 | .apply(lambda cls_: type(cls_).__name__) 1184 | ) 1185 | 1186 | screen_results['overfitting'] = ( 1187 | screen_results['mean_train_score'] 1188 | - screen_results['mean_test_score'] 1189 | ) 1190 | 1191 | # calculate f1 weighted test score 1192 | y_pred = grid_search.predict(X_test) 1193 | 1194 | f1_weighted_score = f1_score(y_test, y_pred, average='weighted') 1195 | 1196 | # print results 1197 | print(title + '\n' + '-' * len(title)) 1198 | 1199 | display(screen_results 1200 | [hyper_params.union(pd.Index(['mean_train_score', 1201 | 'std_train_score', 1202 | 'mean_test_score', 1203 | 'std_test_score', 1204 | 'mean_fit_time', 1205 | 'mean_score_time', 1206 | 'overfitting']))]) 1207 | 1208 | print('Best model is {} with F1 test weighted score {:.3f}\n' 1209 | .format(type(grid_search 1210 | .best_estimator_ 1211 | .named_steps 1212 | .classifier) 1213 | .__name__, 1214 | f1_weighted_score)) 1215 | 1216 | elif verbose == 'plot': 1217 | if fine_param in grid_search.cv_results_: 1218 | 1219 | # screen results 1220 | screen_results = pd.DataFrame(grid_search.cv_results_) 1221 | 1222 | # plot results 1223 | screen_results = pd.melt( 1224 | screen_results, 1225 | id_vars=fine_param, 1226 | value_vars=(screen_results.columns[ 1227 | screen_results.columns 1228 | .str.contains(r'split\d_\w{4,5}_score', regex=True) 1229 | ]), 1230 | var_name='score_type', 1231 | value_name=grid_search.scoring 1232 | ) 1233 | 1234 | screen_results['score_type'] = screen_results['score_type'].replace( 1235 | regex=r'split\d_(\w{4,5}_score)', value=r'\1' 1236 | ) 1237 | 1238 | sns.lineplot(x=fine_param, 1239 | y=grid_search.scoring, 1240 | hue='score_type', 1241 | data=screen_results, 1242 | err_style='bars', 1243 | ax=plt.gca(), 1244 | marker='o', 1245 | linestyle='dashed') 1246 | plt.gca().set_title(title); 1247 | 1248 | return grid_search 1249 | 1250 | def plot_features_significance(estimator, X_std, y, feature_names, class_names, 1251 | threshold = -np.inf, title=''): 1252 | '''Analyzes features significance of the estimator. 1253 | 1254 | Args: 1255 | estimator: Sklearn estimator with coef_ or feature_improtances 1256 | attribute 1257 | X_std: Standardized inputs as dataframe 1258 | of shape (# of samples, # of features) 1259 | y: Class labels as Series of size # of samples 1260 | feature_names: list/index of features names 1261 | class_names: list of class names 1262 | threshold: Filters only significant coeficients following 1263 | |coeficient| <= threshold 1264 | title: title to put on each plot + class name will be added. 1265 | ''' 1266 | 1267 | assert ('coef_' in dir(estimator) 1268 | or 'feature_importances' in dir(estimator)) 1269 | 1270 | # get factor matrix 1271 | factor_matrix = pd.DataFrame(estimator.coef_ if 'coef_' in dir(estimator) 1272 | else estimator.feature_importances_, 1273 | index=estimator.classes_, 1274 | columns=feature_names) 1275 | 1276 | 1277 | cols = 2 1278 | rows = math.ceil(len(estimator.classes_) / cols) 1279 | fig, axes = plt.subplots(rows, cols, 1280 | figsize=(10*cols, 10*rows), 1281 | sharex=True); 1282 | plt.subplots_adjust(hspace=0.07, wspace=0.4) 1283 | 1284 | for i, (ax, class_idx, class_name) in enumerate( 1285 | zip(axes.flatten(), estimator.classes_, class_names)): 1286 | 1287 | # sort feature weigths and select features 1288 | sorted_coef = (factor_matrix 1289 | .loc[class_idx] 1290 | .abs() 1291 | .sort_values(ascending=False)) 1292 | 1293 | selected_feats = sorted_coef[sorted_coef >= threshold].index 1294 | 1295 | selected_coef = (factor_matrix 1296 | .loc[class_idx, selected_feats] 1297 | .rename('feature weights')) 1298 | 1299 | # calculate one-to-rest standardized average differences 1300 | selected_diff = ( 1301 | (X_std.loc[y == class_idx, selected_feats].mean() 1302 | - X_std.loc[y != class_idx, selected_feats].mean()) 1303 | .rename('standardized difference of one-to-rest everages') 1304 | ) 1305 | 1306 | # print bar chars 1307 | selected_df = (pd.concat([selected_coef, selected_diff], axis=1) 1308 | .sort_values('feature weights')) 1309 | 1310 | selected_df.plot.barh(ax=ax, legend=True if i==0 else False) 1311 | ax.set_title(title + ' ' + class_name) 1312 | -------------------------------------------------------------------------------- /product_categories.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alfredsasko/Customer-Journey-Analytics/c0cc8c28167c837f257753a8cb6870adf017b9aa/product_categories.xlsx -------------------------------------------------------------------------------- /temp_data.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alfredsasko/Customer-Journey-Analytics/c0cc8c28167c837f257753a8cb6870adf017b9aa/temp_data.h5 --------------------------------------------------------------------------------