├── .ipynb_checkpoints
│   ├── Untitled-checkpoint.ipynb
│   └── customer-journey-analytics-checkpoint.ipynb
├── README.md
├── __pycache__
│   ├── bigquery_.cpython-37.pyc
│   ├── custom_pca.cpython-37.pyc
│   └── helper.cpython-37.pyc
├── bigquery_.py
├── custom_pca.py
├── customer-journey-analytics.html
├── customer-journey-analytics.ipynb
├── google_analytics_schema.json
├── google_analytics_schema.xlsx
├── helper.py
├── product_categories.xlsx
└── temp_data.h5
/.ipynb_checkpoints/Untitled-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [],
3 | "metadata": {},
4 | "nbformat": 4,
5 | "nbformat_minor": 2
6 | }
7 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Customer Journey Analytics
2 |
3 | Do you want to increase revenue and marketing ROI from your e-commerce platform? If yes, this is the project for you, so read further.
4 |
5 | ### Table of Contents
6 | 1. [Project Motivation](#motivation)
7 | 2. [Results](#results)
8 | 3. [Installation](#installation)
9 | 4. [File Descriptions](#files)
10 | 5. [Licensing, Authors, and Acknowledgements](#licensing)
11 |
12 | ## Project Motivation
13 |
14 | Customer journey mapping has become very popular in recent years. Getting it right can upgrade the marketing strategy, boost personalized branding and offerings, and increase revenue and marketing ROI. To bring value to the business, it requires a healthy balance between the qualitative market and customer knowledge held by customer-facing functions and the quantitative insights that can be gained from an e-commerce platform, CRM and other market-related external sources.
15 |
16 | The purpose of this project is to share with fellow sales and marketing professionals and data scientists how to approach the quantitative part of customer journey mapping, namely to answer the following questions:
17 | 1. How many buyer personas do we have?
18 | 2. What are their unique characteristics?
19 | 3. How accurately can we predict buyer persona from the first customer purchase transaction?
20 | 4. How can we adapt the marketing strategy to buyer personas to increase ROI?
21 |
22 | From a data science perspective it means:
23 | 1. How to use hierarchical and non-hierarchical clustering to identify buyer personas?
24 | 2. How to use ensemble and linear-based models to profile buyer personas' characteristics?
25 |
26 | ## Results
27 |
28 | The main findings of the code can be found at the post available [here](https://medium.com/@alfred.sasko/customer-journey-analytics-will-make-you-more-money-ba7a11cda063).
29 |
30 | ## Installation
31 |
32 | There are several necessary 3rd party libraries beyond the Anaconda distribution of Python which need to be installed and imported to run the code. These are:
33 | - [scikit_posthocs](https://scikit-posthocs.readthedocs.io/en/latest/) providing post-hoc tests for multiple comparisons
34 | - [Google Cloud SDK](https://anaconda.org/conda-forge/google-cloud-sdk) providing access to BigQuery and the [Google Analytics Sample Dataset](https://support.google.com/analytics/answer/7586738?hl=en)
35 |
36 | ## File Descriptions
37 |
38 | There is 1 notebook available here to showcase work related to the above questions. Markdown cells were used to assist in walking through the thought process for individual steps.
39 |
40 | There are additional files:
41 | - `bigquery_.py` provides the custom classes `BigqueryTable` and `BigqueryDataset` to query data from the [Google Merchandise Store](https://www.googlemerchandisestore.com/) sample dataset.
42 | - `helper.py` provides custom functions for various analyses, to keep the notebook manageable to read.
43 | - `custom_pca.py` holds an adaptation of the [scikit-learn PCA class](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) including _Varimax Rotation_ and the _Latent Root Criterion_.
44 | - `google_analytics_schema.xlsx` contains an analysis of the variables in the [BigQuery Export Schema](https://support.google.com/analytics/answer/3437719?hl=en) used as the schema for the Google Analytics Sample dataset.
45 | - `product_categories.xlsx` provides the encoding of broken product category variables in the dataset.
46 | - `temp_data.h5` stores codes/levels of each variable in the dataset.
47 |
48 | ## Licensing, Authors, and Acknowledgements
49 |
50 | Credit goes to Google for the data, to @alexisbcook for a nice introduction to [Nested and Repeated Data](https://www.kaggle.com/alexisbcook/nested-and-repeated-data), and to Daqing Chen, Sai Laing Sain & Kun Guo for their technical article [Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining](https://link.springer.com/article/10.1057/dbm.2012.17).
51 |
--------------------------------------------------------------------------------
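A minimal usage sketch of how the files described in the README fit together. It is not part of the repository; the service-account key path is a placeholder and the dataset reference points to the public Google Analytics sample.

```python
# Illustrative sketch: authenticate, create a client and wrap the public
# Google Analytics sample dataset with BigqueryDataset.
from google.cloud import bigquery
import bigquery_ as bq

# sets GOOGLE_APPLICATION_CREDENTIALS; the key path is a placeholder
bq.authenticate_service_account('service_account_key.json')
client = bigquery.Client()

dataset = bq.BigqueryDataset(
    client,
    bigquery.DatasetReference('bigquery-public-data', 'google_analytics_sample')
)
print(dataset.schema['table_name'].head())  # daily ga_sessions_* tables
```
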
/__pycache__/bigquery_.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alfredsasko/Customer-Journey-Analytics/c0cc8c28167c837f257753a8cb6870adf017b9aa/__pycache__/bigquery_.cpython-37.pyc
--------------------------------------------------------------------------------
/__pycache__/custom_pca.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alfredsasko/Customer-Journey-Analytics/c0cc8c28167c837f257753a8cb6870adf017b9aa/__pycache__/custom_pca.cpython-37.pyc
--------------------------------------------------------------------------------
/__pycache__/helper.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alfredsasko/Customer-Journey-Analytics/c0cc8c28167c837f257753a8cb6870adf017b9aa/__pycache__/helper.cpython-37.pyc
--------------------------------------------------------------------------------
/bigquery_.py:
--------------------------------------------------------------------------------
1 | """Bigquery Custom Module
2 |
3 |
4 | Module provides custom classes and functions to access Google BigQuery.
5 | """
6 |
7 | # IMPORTS
8 | # -------
9 |
10 | # internal constants
11 | __ENV_VAR_NAME__ = "GOOGLE_APPLICATION_CREDENTIALS"
12 |
13 |
14 | # Standard libraries
15 | import os
16 | import numpy as np
17 | import pandas as pd
18 | import ipdb
19 |
20 | # 3rd party libraries
21 | from google.cloud import bigquery
22 | from tqdm import tqdm_notebook as tqdm
23 | from IPython.display import display  # display() is used in display_schema
24 | def authenticate_service_account(service_account_key):
25 |     """Set Windows environment variable for bigquery authentication
26 |
27 | Args:
28 | service_account_key: path to service account key
29 | """
30 | if not __ENV_VAR_NAME__ in os.environ:
31 | os.environ[__ENV_VAR_NAME__] = service_account_key
32 | else:
33 | raise Exception(
34 | 'Environment variable {} = {} already exists.'
35 | .format(__ENV_VAR_NAME__, os.environ[__ENV_VAR_NAME__])
36 | )
37 |
38 |
39 | class BigqueryTable (bigquery.table.Table):
40 | '''Bigquery table object customized to work with Pandas'''
41 |
42 | def _append_schema_field(
43 | self, schema_list, schema_field, parent_field_name=''):
44 |         '''Explodes schema field and appends it to the list of exploded schema fields
45 |
46 |         Args:
47 |             schema_list: list of schema fields
48 |             schema_field: bigquery.schema.SchemaField object
49 |             parent_field_name: field name of parent schema, empty string by default
50 |
51 |         Returns:
52 |             schema_list: updated schema list with appended schema_field record
53 |         '''
54 |
55 | # explode schema field
56 | schema_field = (schema_field.name if parent_field_name == '' \
57 | else parent_field_name + '.' + schema_field.name,
58 | schema_field.field_type,
59 | schema_field.mode,
60 | schema_field.description)
61 |
62 | # append exploded schema to list of exploded schemas
63 | schema_list.append(schema_field)
64 | return schema_list
65 |
66 | def _traverse_schema(self, schema_list, schema_field,
67 | parent_field_name = ''):
68 |         '''Traverses table schema, unnests schema field objects, and explodes
69 |         their attributes to a list of tuples
70 |
71 |         Args:
72 |             schema_list: unnested and exploded schema list
73 |             schema_field: bigquery.schema.SchemaField object
74 |             parent_field_name: field name of parent schema, empty string by default
75 |
76 | Returns:
77 | schema_list: updated schema list by schema_field
78 | '''
79 |
80 | # for nested schema field go deeper
81 | if schema_field.field_type == 'RECORD':
82 |
83 | # explode and append parent schema field to schema list
84 | schema_list = self._append_schema_field(
85 | schema_list, schema_field, parent_field_name)
86 |
87 | # explode and append children schema fields
88 | for child_field in schema_field.fields:
89 | schema_list = self._traverse_schema(
90 | schema_list, child_field,
91 | schema_field.name if parent_field_name == ''
92 | else parent_field_name + '.' + schema_field.name
93 | )
94 |
95 | # explode and append not nested schema field
96 | else:
97 | schema_list = self._append_schema_field(
98 | schema_list, schema_field, parent_field_name)
99 |
100 | return schema_list
101 |
102 | def schema_to_dataframe(self, bq_exp_schema=None):
103 |         '''Unnests schema and recasts it to a dataframe
104 |
105 | Args:
106 | bq_exp_schema: BigQuery export schema as dataframe with descriptions,
107 | None by default
108 |
109 | Returns:
110 | schema_df: Table schema as dataframe
111 | '''
112 |
113 | # traverse and explode table schema to list of tuples
114 | schema_list = []
115 | for schema_field in self.schema:
116 | schema_list = self._traverse_schema(schema_list, schema_field)
117 |
118 | # transform exploded schema to dataframe
119 | schema_df = pd.DataFrame(
120 | np.array(schema_list),
121 | columns=['Field Name', 'Data Type', 'Mode', 'Description']
122 | ).set_index('Field Name')
123 |
124 | # merge BigQuery export schema with table schema to get field
125 | # description
126 | if bq_exp_schema is not None:
127 |
128 | # merge schemas
129 | bq_exp_schema = bq_exp_schema.set_index('Field Name')
130 | schema_df = (
131 | pd.merge(schema_df.drop(columns=['Description']),
132 | bq_exp_schema['Description'],
133 | left_index=True, right_index=True,
134 | how='left')
135 |                 .fillna('Not specified in BigQuery Export Schema')
136 | )
137 |
138 |             # match deprecated fields
139 | depreciated_fields = {
140 | 'hits.appInfo.name': 'hits.appInfo.appName',
141 | 'hits.appInfo.version': 'hits.appInfo.appVersion',
142 | 'hits.appInfo.id': 'hits.appInfo.appId',
143 | 'hits.appInfo.installerId': 'hits.appInfo.appInstallerId',
144 |
145 | 'hits.publisher.adxClicks':
146 | 'hits.publisher.adxBackfillDfpClicks',
147 | 'hits.publisher.adxImpressions':
148 | 'hits.publisher.adxBackfillDfpImpressions',
149 | 'hits.publisher.adxMatchedQueries':
150 | 'hits.publisher.adxBackfillDfpMatchedQueries',
151 | 'hits.publisher.adxMeasurableImpressions':
152 | 'hits.publisher.adxBackfillDfpMeasurableImpressions',
153 | 'hits.publisher.adxQueries':
154 | 'hits.publisher.adxBackfillDfpQueries',
155 | 'hits.publisher.adxViewableImpressions':
156 | 'hits.publisher.adxBackfillDfpViewableImpressions',
157 | 'hits.publisher.adxPagesViewed':
158 | 'hits.publisher.adxBackfillDfpPagesViewed'
159 | }
160 | for ga_field, exp_field in depreciated_fields.items():
161 | schema_df['Description'][ga_field] = (
162 | bq_exp_schema['Description'][exp_field])
163 |
164 | # match multiple fields
165 | multiple_fields = {
166 | 'totals.uniqueScreenviews': 'totals.UniqueScreenViews',
167 | 'hits.contentGroup.contentGroup':
168 | 'hits.contentGroup.contentGroupX',
169 | 'hits.contentGroup.previousContentGroup':
170 | 'hits.contentGroup.previousContentGroupX',
171 | 'hits.contentGroup.contentGroupUniqueViews':
172 | 'hits.contentGroup.contentGroupUniqueViewsX'}
173 |
174 | fields = schema_df.index
175 | for ga_field, exp_field in multiple_fields.items():
176 | schema_df['Description'][fields.str.contains(ga_field)] = (
177 | bq_exp_schema['Description'][exp_field]
178 | )
179 |
180 | return schema_df.reset_index()
181 |
182 | def display_schema(self, schema_df):
183 | '''Displays left justified schema to full cell width
184 |
185 | Args:
186 |             schema_df: Exploded bigquery schema as dataframe
187 | '''
188 | with pd.option_context('display.max_colwidth', 999):
189 | display(
190 | schema_df.style.set_table_styles(
191 | [dict(selector='th', props=[('text-align', 'left')]),
192 | dict(selector='td', props=[('text-align', 'left')])])
193 | )
194 |
195 |
196 | def to_dataframe(self, client):
197 |         '''Expands and converts table to a DataFrame
198 |
199 |         Args:
200 |             client: instantiated BigQuery client
201 |
202 | Returns:
203 | df: Dataframe with expanded table fields
204 | '''
205 |
206 | # convert table schema to DataFrame
207 | schema = self.schema_to_dataframe(bq_exp_schema=None)
208 |
209 | # extract SELECT aliases from schema
210 | repeated_fields = schema[schema['Mode'] == 'REPEATED']
211 | select_aliases = schema.apply(self._get_select_aliases,
212 | axis=1,
213 | args=(repeated_fields,))
214 |
215 | select_aliases = select_aliases[select_aliases['Data Type'] != 'RECORD']
216 | select = ',\n'.join(
217 | (select_aliases['Field Name']
218 | + ' AS '
219 | + select_aliases['Field Alias'])
220 | .to_list()
221 | )
222 |
223 | # extract FROM aliases from schema
224 | from_aliases = repeated_fields.apply(self._get_from_aliases,
225 | axis=1,
226 | args=(repeated_fields,))
227 | from_ = '\n'.join(
228 | ('LEFT JOIN UNNEST(' + from_aliases['Field Name']
229 | + ') AS '
230 | + from_aliases['Field Alias'])
231 | .to_list()
232 | )
233 |
234 | # query table and return as dataframe
235 | query = '''
236 | SELECT
237 | {}
238 | FROM
239 | {}
240 | {}
241 | '''.format(select,
242 | '`' + self.full_table_id.replace(':', '.') + '`',
243 | from_)
244 | df = client.query(query).to_dataframe()
245 | return df
246 |
247 | def _get_select_aliases(self, field, repeated_fields):
248 |         '''Create aliases for nested fields for SELECT statement'''
249 |
250 | if '.' in field['Field Name'] and field['Mode'] != 'REPEATED':
251 | parent = '.'.join(field['Field Name'].split('.')[:-1])
252 | child = field['Field Name'].split('.')[-1]
253 |
254 | if parent in repeated_fields.values:
255 | field['Field Name'] = parent.replace('.', '_') + '.' + child
256 |
257 | field['Field Alias'] = field['Field Name'].replace('.', '_')
258 |
259 | return field
260 |
261 | def _get_from_aliases(self, field, repeated_fields):
262 |         '''Create aliases for nested fields for FROM statement'''
263 |
264 | if '.' in field['Field Name']:
265 | parent = '.'.join(field['Field Name'].split('.')[:-1])
266 | child = field['Field Name'].split('.')[-1]
267 |
268 | if '.' in parent:
269 | field['Field Name'] = parent.replace('.', '_') + '.' + child
270 |
271 | field['Field Alias'] = field['Field Name'].replace('.', '_')
272 |
273 | return field
274 |
275 |
276 | class BigqueryDataset(bigquery.dataset.Dataset):
277 |     '''Customized bigquery Dataset class'''
278 |
279 | def __init__(self, client, *args, **kwg):
280 | super().__init__(*args, **kwg)
281 | self.schema = self.get_information_schema(client)
282 |
283 | def get_information_schema(self, client):
284 | '''Returns information schema including table names in dataset
285 | as dataframe
286 |
287 | Args:
288 |             client: bigquery.client.Client object
289 |
290 | Returns:
291 | schema_df: dataset information schema as DataFrame
292 | '''
293 |
294 | query = '''
295 | SELECT
296 | * EXCEPT(is_typed)
297 | FROM
298 | `{project}.{dataset_id}`.INFORMATION_SCHEMA.TABLES
299 | '''.format(project=self.project, dataset_id=self.dataset_id)
300 | schema = (client.query(query)
301 | .to_dataframe()
302 | .sort_values('table_name'))
303 | return schema
304 |
305 | def get_levels(self, client, bq_exp_schema=None):
306 |         '''Adds two columns to the table schema: 'Levels', which contains
307 |         the unique values of each variable as a set, and 'Num of Levels',
308 |         the number of unique values. The purpose is to get a feeling for
309 |         what values a variable holds and for the scale of the variable
310 |         (e.g. unary, binary, multilevel, ...)
311 |
312 |         Note: The function may run for a long time on big datasets as it
313 |         sequentially loads all tables in the dataset to save memory.
314 |
315 |         Args:
316 |             client: bigquery.client.Client object
317 |             bq_exp_schema: BigQuery export schema including field
318 |                 descriptions as dataframe
319 |         Returns:
320 |             schema: updated schema with level characteristics as
321 |                 dataframe
322 |         '''
323 |
324 | # union variable levels of consecutive queried tables
325 | def level_union(var, var_levels):
326 |             try:
327 |                 var['Levels'] = var['Levels'] | var_levels[var.name]
328 |             except KeyError:  # variable missing in the queried table
329 |                 pass
330 | return var['Levels']
331 |
332 | # get last table of dataset
333 | dataset_ref = client.dataset(self.dataset_id, self.project)
334 | table_id = self.schema['table_name'].values[-1]
335 | table_ref = dataset_ref.table(table_id)
336 | table = client.get_table(table_ref)
337 | table.__class__ = BigqueryTable
338 |
339 | # get last table schema
340 | schema = table.schema_to_dataframe(bq_exp_schema)
341 |
342 | # keep only non record fields in schema
343 | schema = schema[schema['Data Type'] != 'RECORD'].copy()
344 |
345 |         # add variable names
346 | var_name = (schema['Field Name']
347 | .apply(lambda field: field.replace('.', '_')))
348 | schema.insert(0, 'Variable Name', var_name)
349 | schema.insert(4, 'Levels', [set()] * schema.index.size)
350 | schema = schema.set_index('Variable Name')
351 |
352 | # analyze variable codes/levels
353 | for i, table_id in tqdm(enumerate(self.schema['table_name'])):
354 |
355 | # load table
356 | table_ref = dataset_ref.table(table_id)
357 | table = client.get_table(table_ref)
358 | table.__class__ = BigqueryTable
359 | df = table.to_dataframe(client)
360 |
361 | # get and update variable levels
362 | var_levels = df.apply(lambda var: set(var.unique()))
363 | schema['Levels'] = (
364 | schema.apply(level_union, args=(var_levels,),
365 | axis = 1)
366 | )
367 |
368 |             if i == 2: break    # NOTE: stops after the first 3 tables to limit runtime
369 |
370 | # calculate number of variable levels
371 | schema.insert(4, 'Num of Levels',
372 | schema.apply(lambda var: len(var['Levels']),
373 | axis=1)
374 | )
375 | return schema
376 |
--------------------------------------------------------------------------------
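For reference, a minimal usage sketch of `BigqueryTable` (illustrative only, not part of the repository; it assumes a service-account key and uses the public Google Analytics sample table IDs):

```python
# Fetch one Google Analytics sample table, flatten its nested schema and pull
# it as a pandas DataFrame via the UNNEST query generated by to_dataframe().
from google.cloud import bigquery
from bigquery_ import BigqueryTable, authenticate_service_account

authenticate_service_account('service_account_key.json')  # placeholder key path
client = bigquery.Client()

table_ref = (client
             .dataset('google_analytics_sample', project='bigquery-public-data')
             .table('ga_sessions_20170801'))
table = client.get_table(table_ref)
table.__class__ = BigqueryTable           # same re-casting trick as in get_levels()

schema_df = table.schema_to_dataframe()   # unnested schema as a dataframe
df = table.to_dataframe(client)           # flattened session/hit/product rows
print(schema_df.shape, df.shape)
```
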
/custom_pca.py:
--------------------------------------------------------------------------------
1 | """Custom principal component analysis module.
2 |
3 | Implemented customizations:
4 |     - varimax rotation for better interpretation of principal components
5 |     - communalities calculation for selecting significant features
6 |     - loading significance threshold comparison as a function of sample size
7 |       for selecting significant features
8 |     - 'surrogate' feature selection used for dimensionality reduction -
9 |       features with maximum loading instead of principal components are selected
10 |     - ...
11 | """
12 |
13 | # IMPORTS
14 | # -------
15 |
16 | # Standard libraries
17 |
18 | import ipdb
19 | from inspect import isclass
20 | from inspect import currentframe
21 | import numbers
22 |
23 |
24 | # 3rd party libraries
25 | import numpy as np
26 | from numpy import linalg
27 | import pandas as pd
28 |
29 | from scipy.sparse import issparse
30 |
31 | import rpy2
32 | import rpy2.rlike.container as rlc
33 | from rpy2 import robjects
34 | from rpy2.robjects.vectors import FloatVector
35 | from rpy2.robjects.vectors import ListVector
36 | from rpy2.robjects.vectors import StrVector
37 | from rpy2.robjects import pandas2ri
38 |
39 | from sklearn.preprocessing import StandardScaler
40 | from sklearn.model_selection import train_test_split
41 | from sklearn.pipeline import Pipeline
42 | from sklearn.feature_extraction.text import TfidfVectorizer
43 | from sklearn.feature_extraction.text import CountVectorizer
44 | from sklearn.model_selection import GridSearchCV
45 | from sklearn.naive_bayes import MultinomialNB
46 | from sklearn.naive_bayes import ComplementNB
47 | from sklearn.metrics import f1_score
48 | from sklearn.decomposition import PCA
49 |
50 | from sklearn.decomposition.pca import _infer_dimension_  # private helper needed for n_components='mle' (path valid for sklearn <= 0.22)
51 | from sklearn.utils import check_array
52 | from sklearn.utils.extmath import svd_flip
53 | from sklearn.utils.extmath import stable_cumsum
54 | from sklearn.utils.validation import check_is_fitted
55 |
56 | from matplotlib import pyplot as plt
57 | import seaborn as sns
58 |
59 | class CustomPCA(PCA):
60 |     '''Customized PCA with the following options:
61 |         - varimax rotation
62 |         - different feature selection methods
63 |
64 |     and calculated communalities.
65 |     '''
66 |
67 | def __init__(self, rotation=None, feature_selection='all', **kws):
68 | '''
69 |         rotation: string, name of the rotation method.
70 |             'varimax': varimax rotation
71 |             None: no rotation by default
72 |
73 |         feature_selection: string, feature selection method
74 |             'all': all features are selected, meaning principal
75 |                    components are the output of the transformation
76 |             'surrogate': only the feature with the highest loading
77 |                    on each principal component is used
78 |             'summated scales': summated scales constructed as the
79 |                    sum of features having significant
80 |                    loadings on principal components,
81 |                    not implemented yet.
82 | '''
83 |
84 | super().__init__(**kws)
85 | self.rotation = rotation
86 | self.feature_selection = feature_selection
87 |
88 |
89 | def _df2mtr(self, df):
90 |         '''Converts pandas dataframe to R matrix. Category dtype is cast as
91 |         factorVector considering missing values
92 |         (original py2ri function of rpy2 can't handle this properly so far)
93 |
94 |         Args:
95 |             df: pandas dataframe of shape (# samples, # features)
96 |                 with numeric dtype
97 |
98 |         Returns:
99 |             mtr: R matrix of shape (# samples, # features)
100 |         '''
101 | # check arguments
102 | assert isinstance(df, pd.DataFrame), 'Argument df need to be a pd.Dataframe.'
103 |
104 | # select only numeric columns
105 | df = df.select_dtypes('number')
106 |
107 | # create and return r matrix
108 | values = FloatVector(df.values.flatten())
109 | dimnames = ListVector(
110 | rlc.OrdDict([('index', StrVector(tuple(df.index))),
111 | ('columns', StrVector(tuple(df.columns)))])
112 | )
113 |
114 | return robjects.r.matrix(values, nrow=len(df.index), ncol=len(df.columns),
115 | dimnames = dimnames, byrow=True)
116 |
117 | def _varimax(self, factor_df, **kwargs):
118 | '''
119 | varimax rotation of factor matrix
120 |
121 | Args:
122 | factor_df: factor matrix as pd.DataFrame with shape
123 | (# features, # principal components)
124 |
125 | Return:
126 | rot_factor_df: rotated factor matrix as pd.DataFrame
127 | '''
128 | factor_mtr = self._df2mtr(factor_df)
129 | varimax = robjects.r['varimax']
130 | rot_factor_mtr = varimax(factor_mtr)
131 | return pandas2ri.ri2py(rot_factor_mtr.rx2('loadings'))
132 |
133 | def _fit(self, X):
134 | """Dispatch to the right submethod depending on the chosen solver."""
135 | # Raise an error for sparse input.
136 | # This is more informative than the generic one raised by check_array.
137 | if issparse(X):
138 | raise TypeError('PCA does not support sparse input. See '
139 | 'TruncatedSVD for a possible alternative.')
140 |
141 | X = check_array(X, dtype=[np.float64, np.float32], ensure_2d=True,
142 | copy=self.copy)
143 |
144 | # Handle n_components==None
145 | if self.n_components is None:
146 | if self.svd_solver != 'arpack':
147 | n_components = min(X.shape)
148 | else:
149 | n_components = min(X.shape) - 1
150 | else:
151 | n_components = self.n_components
152 |
153 | # Handle svd_solver
154 | self._fit_svd_solver = self.svd_solver
155 | if self._fit_svd_solver == 'auto':
156 | # Small problem or n_components == 'mle', just call full PCA
157 | if (max(X.shape) <= 500
158 | or n_components == 'mle'
159 | or n_components == 'latent_root'):
160 | self._fit_svd_solver = 'full'
161 | elif n_components >= 1 and n_components < .8 * min(X.shape):
162 | self._fit_svd_solver = 'randomized'
163 | # This is also the case of n_components in (0,1)
164 | else:
165 | self._fit_svd_solver = 'full'
166 |
167 | # Call different fits for either full or truncated SVD
168 | if self._fit_svd_solver == 'full':
169 | U, S , V = self._fit_full(X, n_components)
170 | elif self._fit_svd_solver in ['arpack', 'randomized']:
171 | U, S, V = self._fit_truncated(X, n_components, self._fit_svd_solver)
172 | else:
173 | raise ValueError("Unrecognized svd_solver='{0}'"
174 | "".format(self._fit_svd_solver))
175 |
176 |         # implementation of varimax rotation
177 | if self.rotation == 'varimax':
178 | if self.n_samples_ > self.n_components_:
179 |
180 | factor_matrix = (
181 | self.components_.T
182 | * (self.explained_variance_.reshape(1, -1) ** (1/2))
183 | )
184 |
185 | rot_factor_matrix = self._varimax(pd.DataFrame(factor_matrix))
186 |
187 | self.explained_variance_ = (rot_factor_matrix ** 2).sum(axis=0)
188 |
189 | self.components_ = (
190 | rot_factor_matrix
191 | / (self.explained_variance_.reshape(1, -1) ** (1/2))
192 | ).T
193 |
194 |                 # sort components by explained variance in descending order
195 | self.components_ = self.components_[
196 | np.argsort(self.explained_variance_)[::-1], :
197 | ]
198 |
199 | self.explained_variance_ = np.sort(
200 | self.explained_variance_
201 | )[::-1]
202 |
203 |                 total_var = self.n_features_  # assumes standardized input: total variance == # features
204 | self.explained_variance_ratio_ = (
205 | self.explained_variance_ / total_var
206 | )
207 |
208 | self.singular_values_ = None
209 |
210 | if self._fit_svd_solver == 'full':
211 | if self.n_components_ < min(self.n_features_, self.n_samples_):
212 | self.noise_variance_ = (
213 | (total_var - self.explained_variance_.sum())
214 | / (self.n_features_ - self.n_components_)
215 | )
216 | else:
217 | self.noise_variance_ = 0.
218 |
219 | elif self._fit_svd_solver in ['arpack', 'randomized']:
220 | if self.n_components_ < min(self.n_features_, self.n_samples_):
221 |
222 | total_var = np.var(X, ddof=1, axis=0)
223 |
224 | self.noise_variance_ = (
225 | total_var.sum() - self.explained_variance_.sum()
226 | )
227 |
228 | self.noise_variance_ /= (
229 | min(self.n_features_, self.n_samples_)
230 | - self.n_components_
231 | )
232 | else:
233 | self.noise_variance_ = 0.
234 |
235 | else:
236 | raise ValueError("Unrecognized svd_solver='{0}'"
237 | "".format(self._fit_svd_solver))
238 | else:
239 | raise ValueError(
240 | "Varimax rotation requires n_samples > n_components")
241 |
242 | U, S, V = None, None, None
243 |
244 |         # implementation of communalities
245 | self.communalities_ = (
246 | ((self.components_.T
247 | * (self.explained_variance_.reshape(1, -1) ** (1/2)))
248 | ** 2)
249 | .sum(axis=1)
250 | )
251 |
252 | return U, S, V
253 |
254 | def _fit_full(self, X, n_components):
255 | """Fit the model by computing full SVD on X"""
256 | n_samples, n_features = X.shape
257 |
258 | if n_components == 'mle':
259 | if n_samples < n_features:
260 | raise ValueError("n_components='mle' is only supported "
261 | "if n_samples >= n_features")
262 | elif n_components == 'latent_root':
263 | pass
264 | elif not 0 <= n_components <= min(n_samples, n_features):
265 | raise ValueError("n_components=%r must be between 0 and "
266 | "min(n_samples, n_features)=%r with "
267 | "svd_solver='full'"
268 | % (n_components, min(n_samples, n_features)))
269 | elif n_components >= 1:
270 | if not isinstance(n_components, numbers.Integral):
271 | raise ValueError("n_components=%r must be of type int "
272 | "when greater than or equal to 1, "
273 | "was of type=%r"
274 | % (n_components, type(n_components)))
275 |
276 | # Center data
277 | self.mean_ = np.mean(X, axis=0)
278 | X -= self.mean_
279 |
280 | U, S, V = linalg.svd(X, full_matrices=False)
281 | # flip eigenvectors' sign to enforce deterministic output
282 | U, V = svd_flip(U, V)
283 |
284 | components_ = V
285 |
286 | # Get variance explained by singular values
287 | explained_variance_ = (S ** 2) / (n_samples - 1)
288 | total_var = explained_variance_.sum()
289 | explained_variance_ratio_ = explained_variance_ / total_var
290 | singular_values_ = S.copy() # Store the singular values.
291 |
292 | # Postprocess the number of components required
293 | if n_components == 'mle':
294 | n_components = \
295 | _infer_dimension_(explained_variance_, n_samples, n_features)
296 | elif n_components == 'latent_root':
297 |             n_components = (explained_variance_ > 1).sum()  # latent root criterion: keep eigenvalues > 1
298 | elif 0 < n_components < 1.0:
299 | # number of components for which the cumulated explained
300 | # variance percentage is superior to the desired threshold
301 | ratio_cumsum = stable_cumsum(explained_variance_ratio_)
302 | n_components = np.searchsorted(ratio_cumsum, n_components) + 1
303 |
304 | # Compute noise covariance using Probabilistic PCA model
305 | # The sigma2 maximum likelihood (cf. eq. 12.46)
306 | if n_components < min(n_features, n_samples):
307 | self.noise_variance_ = explained_variance_[n_components:].mean()
308 |
309 | else:
310 | self.noise_variance_ = 0.
311 |
312 | self.n_samples_, self.n_features_ = n_samples, n_features
313 | self.components_ = components_[:n_components]
314 | self.n_components_ = n_components
315 | self.explained_variance_ = explained_variance_[:n_components]
316 | self.explained_variance_ratio_ = \
317 | explained_variance_ratio_[:n_components]
318 | self.singular_values_ = singular_values_[:n_components]
319 |
320 | return U, S, V
321 |
322 | def get_support(self, indices=False):
323 | """
324 | Get a mask, or integer index, of the features selected
325 | Parameters
326 | ----------
327 | indices : boolean (default False)
328 | If True, the return value will be an array of integers, rather
329 | than a boolean mask.
330 | Returns
331 | -------
332 | support : array
333 | An index that selects the retained features from a feature vector.
334 | If `indices` is False, this is a boolean array of shape
335 | [# input features], in which an element is True iff its
336 | corresponding feature is selected for retention. If `indices` is
337 | True, this is an integer array of shape [# output features] whose
338 | values are indices into the input feature vector.
339 | """
340 | mask = self._get_support_mask()
341 | return mask if not indices else np.where(mask)[0]
342 |
343 | def _get_support_mask(self):
344 | """
345 | Get the boolean mask indicating which features are selected
346 | Returns
347 | -------
348 | support : boolean array of shape [# input features]
349 | An element is True iff its corresponding feature is selected for
350 | retention.
351 | """
352 |
353 | attrs = [v for v in vars(self)
354 | if (v.endswith("_") or v.startswith("_"))
355 | and not v.startswith("__")]
356 | check_is_fitted(self, attributes=attrs,
357 | all_or_any=all)
358 |
359 | # Keep only features with at least one of the significant loading
360 | # and communality > 0.5
361 | significant_features_mask = (
362 | ((np.absolute(self.components_)
363 | >= self.significance_threshold())
364 | .any(axis=0))
365 | & (self.communalities_ >= 0.5)
366 | )
367 |
368 | if self.feature_selection == 'all':
369 | mask = significant_features_mask
370 |
371 | elif self.feature_selection == 'surrogate':
372 | # Select significant feature with maximum loading
373 | # on each principal component
374 | mask = np.full(self.n_features_, False, dtype=bool)
375 |             surrogate_features_idx = np.unique(
376 |                 np.argmax(np.absolute(self.components_), axis=1)
377 |             )
378 | mask[surrogate_features_idx] = True
379 | mask = (mask & significant_features_mask)
380 |
381 | elif self.feature_selection == 'summated scales':
382 | raise Exception('Not implemented yet.')
383 | else:
384 | raise ValueError('Not valid selection method.')
385 | return mask
386 |
387 | def significance_threshold(self):
388 | sample_sizes = np.array([50, 60, 70, 85, 100,
389 | 120, 150, 200, 250, 300])
390 | thresholds = np.array([0.75, 0.70, 0.65, 0.60, 0.55,
391 | 0.50, 0.45, 0.40, 0.35, 0.30])
392 | return min(thresholds[sample_sizes <= self.n_samples_])
393 |
394 | def transform(self, X):
395 | """Apply dimensionality reduction to X.
396 | X is projected on the first principal components previously extracted
397 | from a training set.
398 | Parameters
399 | ----------
400 | X : array-like, shape (n_samples, n_features)
401 | New data, where n_samples is the number of samples
402 | and n_features is the number of features.
403 |
404 | Returns
405 | -------
406 | X_new : array-like, shape (n_samples, n_components)
407 | Examples
408 | --------
409 | >>> import numpy as np
410 | >>> from sklearn.decomposition import IncrementalPCA
411 | >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
412 | >>> ipca = IncrementalPCA(n_components=2, batch_size=3)
413 | >>> ipca.fit(X)
414 | IncrementalPCA(batch_size=3, n_components=2)
415 | >>> ipca.transform(X) # doctest: +SKIP
416 | """
417 | attrs = [v for v in vars(self)
418 | if (v.endswith("_") or v.startswith("_"))
419 | and not v.startswith("__")]
420 | check_is_fitted(self, attributes=attrs,
421 | all_or_any=all)
422 |
423 | X = check_array(X)
424 | if self.mean_ is not None:
425 | X = X - self.mean_
426 |
427 | if self.feature_selection == 'all':
428 | X_transformed = np.dot(X, self.components_.T)
429 | if self.whiten:
430 | X_transformed /= np.sqrt(self.explained_variance_)
431 |
432 | else:
433 | X_transformed = X[:, self._get_support_mask()]
434 |
435 | return X_transformed
436 |
437 | def fit_transform(self, X, y=None):
438 | """Fit the model with X and apply the dimensionality reduction on X.
439 | Parameters
440 | ----------
441 | X : array-like, shape (n_samples, n_features)
442 | Training data, where n_samples is the number of samples
443 | and n_features is the number of features.
444 | y : None
445 | Ignored variable.
446 | Returns
447 | -------
448 | X_new : array-like, shape (n_samples, n_components)
449 | Transformed values.
450 | Notes
451 | -----
452 | This method returns a Fortran-ordered array. To convert it to a
453 | C-ordered array, use 'np.ascontiguousarray'.
454 | """
455 |
456 | U, S, V = self._fit(X)
457 | X_transformed = self.transform(X)
458 |
459 | return X_transformed
460 |
--------------------------------------------------------------------------------
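Below is a minimal, illustrative sketch of how `CustomPCA` can be used. It is not part of the repository; the data is synthetic, and the varimax rotation requires an R installation reachable by rpy2.

```python
# Synthetic demo: latent-root retention, varimax rotation and surrogate
# feature selection on standardized data.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from custom_pca import CustomPCA

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(200, 6)),
                 columns=['var_{}'.format(i) for i in range(6)])
X_std = StandardScaler().fit_transform(X)    # total variance == # features

pca = CustomPCA(n_components='latent_root',  # keep components with eigenvalue > 1
                rotation='varimax',
                feature_selection='surrogate')
X_reduced = pca.fit_transform(X_std)

print(pca.n_components_, pca.explained_variance_ratio_.round(2))
print('communalities:', pca.communalities_.round(2))
print('surrogate features:', X.columns[pca.get_support()].tolist())
```
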
/google_analytics_schema.json:
--------------------------------------------------------------------------------
1 | [
2 | {
3 | "visitorId": null,
4 | "visitNumber": "1",
5 | "visitId": "1501583974",
6 | "visitStartTime": "1501583974",
7 | "date": "20170801",
8 | "totals": {
9 | "visits": "1",
10 | "hits": "1",
11 | "pageviews": "1",
12 | "timeOnSite": null,
13 | "bounces": "1",
14 | "transactions": null,
15 | "transactionRevenue": null,
16 | "newVisits": "1",
17 | "screenviews": null,
18 | "uniqueScreenviews": null,
19 | "timeOnScreen": null,
20 | "totalTransactionRevenue": null,
21 | "sessionQualityDim": "1"
22 | },
23 | "trafficSource": {
24 | "referralPath": null,
25 | "campaign": "(not set)",
26 | "source": "(direct)",
27 | "medium": "(none)",
28 | "keyword": null,
29 | "adContent": null,
30 | "adwordsClickInfo": {
31 | "campaignId": null,
32 | "adGroupId": null,
33 | "creativeId": null,
34 | "criteriaId": null,
35 | "page": null,
36 | "slot": null,
37 | "criteriaParameters": "not available in demo dataset",
38 | "gclId": null,
39 | "customerId": null,
40 | "adNetworkType": null,
41 | "targetingCriteria": null,
42 | "isVideoAd": null
43 | },
44 | "isTrueDirect": null,
45 | "campaignCode": null
46 | },
47 | "device": {
48 | "browser": "Chrome",
49 | "browserVersion": "not available in demo dataset",
50 | "browserSize": "not available in demo dataset",
51 | "operatingSystem": "Android",
52 | "operatingSystemVersion": "not available in demo dataset",
53 | "isMobile": "true",
54 | "mobileDeviceBranding": "not available in demo dataset",
55 | "mobileDeviceModel": "not available in demo dataset",
56 | "mobileInputSelector": "not available in demo dataset",
57 | "mobileDeviceInfo": "not available in demo dataset",
58 | "mobileDeviceMarketingName": "not available in demo dataset",
59 | "flashVersion": "not available in demo dataset",
60 | "javaEnabled": null,
61 | "language": "not available in demo dataset",
62 | "screenColors": "not available in demo dataset",
63 | "screenResolution": "not available in demo dataset",
64 | "deviceCategory": "mobile"
65 | },
66 | "geoNetwork": {
67 | "continent": "Americas",
68 | "subContinent": "Caribbean",
69 | "country": "St. Lucia",
70 | "region": "(not set)",
71 | "metro": "(not set)",
72 | "city": "(not set)",
73 | "cityId": "not available in demo dataset",
74 | "networkDomain": "unknown.unknown",
75 | "latitude": "not available in demo dataset",
76 | "longitude": "not available in demo dataset",
77 | "networkLocation": "not available in demo dataset"
78 | },
79 | "hits": [
80 | {
81 | "hitNumber": "1",
82 | "time": "0",
83 | "hour": "3",
84 | "minute": "39",
85 | "isSecure": null,
86 | "isInteraction": "true",
87 | "isEntrance": "true",
88 | "isExit": "true",
89 | "referer": "http://www.google.com/",
90 | "page": {
91 | "pagePath": "/google+redesign/electronics",
92 | "hostname": "shop.googlemerchandisestore.com",
93 | "pageTitle": "Electronics | Google Merchandise Store",
94 | "searchKeyword": null,
95 | "searchCategory": null,
96 | "pagePathLevel1": "/google+redesign/",
97 | "pagePathLevel2": "/electronics",
98 | "pagePathLevel3": "",
99 | "pagePathLevel4": ""
100 | },
101 | "transaction": {
102 | "transactionId": null,
103 | "transactionRevenue": null,
104 | "transactionTax": null,
105 | "transactionShipping": null,
106 | "affiliation": null,
107 | "currencyCode": "USD",
108 | "localTransactionRevenue": null,
109 | "localTransactionTax": null,
110 | "localTransactionShipping": null,
111 | "transactionCoupon": null
112 | },
113 | "item": {
114 | "transactionId": null,
115 | "productName": null,
116 | "productCategory": null,
117 | "productSku": null,
118 | "itemQuantity": null,
119 | "itemRevenue": null,
120 | "currencyCode": "USD",
121 | "localItemRevenue": null
122 | },
123 | "contentInfo": null,
124 | "appInfo": {
125 | "name": null,
126 | "version": null,
127 | "id": null,
128 | "installerId": null,
129 | "appInstallerId": null,
130 | "appName": null,
131 | "appVersion": null,
132 | "appId": null,
133 | "screenName": "shop.googlemerchandisestore.com/google+redesign/electronics",
134 | "landingScreenName": "shop.googlemerchandisestore.com/google+redesign/electronics",
135 | "exitScreenName": "shop.googlemerchandisestore.com/google+redesign/electronics",
136 | "screenDepth": "0"
137 | },
138 | "exceptionInfo": {
139 | "description": null,
140 | "isFatal": "true",
141 | "exceptions": null,
142 | "fatalExceptions": null
143 | },
144 | "eventInfo": null,
145 | "product": [
146 | {
147 | "productSKU": "GGOEGBFC018799",
148 | "v2ProductName": "Electronics Accessory Pouch",
149 | "v2ProductCategory": "Home/Electronics/",
150 | "productVariant": "(not set)",
151 | "productBrand": "(not set)",
152 | "productRevenue": null,
153 | "localProductRevenue": null,
154 | "productPrice": "4990000",
155 | "localProductPrice": "4990000",
156 | "productQuantity": null,
157 | "productRefundAmount": null,
158 | "localProductRefundAmount": null,
159 | "isImpression": "true",
160 | "isClick": null,
161 | "customDimensions": [],
162 | "customMetrics": [],
163 | "productListName": "Category",
164 | "productListPosition": "1",
165 | "productCouponCode": null
166 | },
167 | {
168 | "productSKU": "GGOEGESB015199",
169 | "v2ProductName": "Google Flashlight",
170 | "v2ProductCategory": "Home/Electronics/",
171 | "productVariant": "(not set)",
172 | "productBrand": "(not set)",
173 | "productRevenue": null,
174 | "localProductRevenue": null,
175 | "productPrice": "59990000",
176 | "localProductPrice": "59990000",
177 | "productQuantity": null,
178 | "productRefundAmount": null,
179 | "localProductRefundAmount": null,
180 | "isImpression": "true",
181 | "isClick": null,
182 | "customDimensions": [],
183 | "customMetrics": [],
184 | "productListName": "Category",
185 | "productListPosition": "2",
186 | "productCouponCode": null
187 | },
188 | {
189 | "productSKU": "GGOEGEVA022399",
190 | "v2ProductName": "Micro Wireless Earbud",
191 | "v2ProductCategory": "Home/Electronics/",
192 | "productVariant": "(not set)",
193 | "productBrand": "(not set)",
194 | "productRevenue": null,
195 | "localProductRevenue": null,
196 | "productPrice": "39990000",
197 | "localProductPrice": "39990000",
198 | "productQuantity": null,
199 | "productRefundAmount": null,
200 | "localProductRefundAmount": null,
201 | "isImpression": "true",
202 | "isClick": null,
203 | "customDimensions": [],
204 | "customMetrics": [],
205 | "productListName": "Category",
206 | "productListPosition": "3",
207 | "productCouponCode": null
208 | },
209 | {
210 | "productSKU": "GGOEGCBB074199",
211 | "v2ProductName": "Google Car Clip Phone Holder",
212 | "v2ProductCategory": "Home/Electronics/",
213 | "productVariant": "(not set)",
214 | "productBrand": "(not set)",
215 | "productRevenue": null,
216 | "localProductRevenue": null,
217 | "productPrice": "6990000",
218 | "localProductPrice": "6990000",
219 | "productQuantity": null,
220 | "productRefundAmount": null,
221 | "localProductRefundAmount": null,
222 | "isImpression": "true",
223 | "isClick": null,
224 | "customDimensions": [],
225 | "customMetrics": [],
226 | "productListName": "Category",
227 | "productListPosition": "4",
228 | "productCouponCode": null
229 | },
230 | {
231 | "productSKU": "GGOEGFKA022299",
232 | "v2ProductName": "Keyboard DOT Sticker",
233 | "v2ProductCategory": "Home/Electronics/",
234 | "productVariant": "(not set)",
235 | "productBrand": "(not set)",
236 | "productRevenue": null,
237 | "localProductRevenue": null,
238 | "productPrice": "1500000",
239 | "localProductPrice": "1500000",
240 | "productQuantity": null,
241 | "productRefundAmount": null,
242 | "localProductRefundAmount": null,
243 | "isImpression": "true",
244 | "isClick": null,
245 | "customDimensions": [],
246 | "customMetrics": [],
247 | "productListName": "Category",
248 | "productListPosition": "5",
249 | "productCouponCode": null
250 | },
251 | {
252 | "productSKU": "GGOEGCBB074399",
253 | "v2ProductName": "Google Device Holder Sticky Pad",
254 | "v2ProductCategory": "Home/Electronics/",
255 | "productVariant": "(not set)",
256 | "productBrand": "(not set)",
257 | "productRevenue": null,
258 | "localProductRevenue": null,
259 | "productPrice": "4990000",
260 | "localProductPrice": "4990000",
261 | "productQuantity": null,
262 | "productRefundAmount": null,
263 | "localProductRefundAmount": null,
264 | "isImpression": "true",
265 | "isClick": null,
266 | "customDimensions": [],
267 | "customMetrics": [],
268 | "productListName": "Category",
269 | "productListPosition": "6",
270 | "productCouponCode": null
271 | },
272 | {
273 | "productSKU": "GGOEGCBC074299",
274 | "v2ProductName": "Google Device Stand",
275 | "v2ProductCategory": "Home/Electronics/",
276 | "productVariant": "(not set)",
277 | "productBrand": "(not set)",
278 | "productRevenue": null,
279 | "localProductRevenue": null,
280 | "productPrice": "4990000",
281 | "localProductPrice": "4990000",
282 | "productQuantity": null,
283 | "productRefundAmount": null,
284 | "localProductRefundAmount": null,
285 | "isImpression": "true",
286 | "isClick": null,
287 | "customDimensions": [],
288 | "customMetrics": [],
289 | "productListName": "Category",
290 | "productListPosition": "7",
291 | "productCouponCode": null
292 | },
293 | {
294 | "productSKU": "GGOEGEHQ072499",
295 | "v2ProductName": "Google 2200mAh Micro Charger",
296 | "v2ProductCategory": "Home/Electronics/",
297 | "productVariant": "(not set)",
298 | "productBrand": "(not set)",
299 | "productRevenue": null,
300 | "localProductRevenue": null,
301 | "productPrice": "22990000",
302 | "localProductPrice": "22990000",
303 | "productQuantity": null,
304 | "productRefundAmount": null,
305 | "localProductRefundAmount": null,
306 | "isImpression": "true",
307 | "isClick": null,
308 | "customDimensions": [],
309 | "customMetrics": [],
310 | "productListName": "Category",
311 | "productListPosition": "8",
312 | "productCouponCode": null
313 | },
314 | {
315 | "productSKU": "GGOEGEHQ072599",
316 | "v2ProductName": "Google 4400mAh Power Bank",
317 | "v2ProductCategory": "Home/Electronics/",
318 | "productVariant": "(not set)",
319 | "productBrand": "(not set)",
320 | "productRevenue": null,
321 | "localProductRevenue": null,
322 | "productPrice": "37990000",
323 | "localProductPrice": "37990000",
324 | "productQuantity": null,
325 | "productRefundAmount": null,
326 | "localProductRefundAmount": null,
327 | "isImpression": "true",
328 | "isClick": null,
329 | "customDimensions": [],
330 | "customMetrics": [],
331 | "productListName": "Category",
332 | "productListPosition": "9",
333 | "productCouponCode": null
334 | },
335 | {
336 | "productSKU": "GGOEGESB015099",
337 | "v2ProductName": "Basecamp Explorer Powerbank Flashlight",
338 | "v2ProductCategory": "Home/Electronics/",
339 | "productVariant": "(not set)",
340 | "productBrand": "(not set)",
341 | "productRevenue": null,
342 | "localProductRevenue": null,
343 | "productPrice": "22990000",
344 | "localProductPrice": "22990000",
345 | "productQuantity": null,
346 | "productRefundAmount": null,
347 | "localProductRefundAmount": null,
348 | "isImpression": "true",
349 | "isClick": null,
350 | "customDimensions": [],
351 | "customMetrics": [],
352 | "productListName": "Category",
353 | "productListPosition": "10",
354 | "productCouponCode": null
355 | },
356 | {
357 | "productSKU": "GGOEGESC014099",
358 | "v2ProductName": "Rocket Flashlight",
359 | "v2ProductCategory": "Home/Electronics/",
360 | "productVariant": "(not set)",
361 | "productBrand": "(not set)",
362 | "productRevenue": null,
363 | "localProductRevenue": null,
364 | "productPrice": "4990000",
365 | "localProductPrice": "4990000",
366 | "productQuantity": null,
367 | "productRefundAmount": null,
368 | "localProductRefundAmount": null,
369 | "isImpression": "true",
370 | "isClick": null,
371 | "customDimensions": [],
372 | "customMetrics": [],
373 | "productListName": "Category",
374 | "productListPosition": "11",
375 | "productCouponCode": null
376 | },
377 | {
378 | "productSKU": "GGOEGESQ016799",
379 | "v2ProductName": "Plastic Sliding Flashlight",
380 | "v2ProductCategory": "Home/Electronics/",
381 | "productVariant": "(not set)",
382 | "productBrand": "(not set)",
383 | "productRevenue": null,
384 | "localProductRevenue": null,
385 | "productPrice": "12990000",
386 | "localProductPrice": "12990000",
387 | "productQuantity": null,
388 | "productRefundAmount": null,
389 | "localProductRefundAmount": null,
390 | "isImpression": "true",
391 | "isClick": null,
392 | "customDimensions": [],
393 | "customMetrics": [],
394 | "productListName": "Category",
395 | "productListPosition": "12",
396 | "productCouponCode": null
397 | }
398 | ],
399 | "promotion": [],
400 | "promotionActionInfo": null,
401 | "refund": null,
402 | "eCommerceAction": {
403 | "action_type": "0",
404 | "step": "1",
405 | "option": null
406 | },
407 | "experiment": [],
408 | "publisher": null,
409 | "customVariables": [],
410 | "customDimensions": [],
411 | "customMetrics": [],
412 | "type": "PAGE",
413 | "social": {
414 | "socialInteractionNetwork": null,
415 | "socialInteractionAction": null,
416 | "socialInteractions": null,
417 | "socialInteractionTarget": null,
418 | "socialNetwork": "(not set)",
419 | "uniqueSocialInteractions": null,
420 | "hasSocialSourceReferral": "No",
421 | "socialInteractionNetworkAction": " : "
422 | },
423 | "latencyTracking": null,
424 | "sourcePropertyInfo": null,
425 | "contentGroup": {
426 | "contentGroup1": "(not set)",
427 | "contentGroup2": "Electronics",
428 | "contentGroup3": "(not set)",
429 | "contentGroup4": "(not set)",
430 | "contentGroup5": "(not set)",
431 | "previousContentGroup1": "(entrance)",
432 | "previousContentGroup2": "(entrance)",
433 | "previousContentGroup3": "(entrance)",
434 | "previousContentGroup4": "(entrance)",
435 | "previousContentGroup5": "(entrance)",
436 | "contentGroupUniqueViews1": null,
437 | "contentGroupUniqueViews2": "1",
438 | "contentGroupUniqueViews3": null,
439 | "contentGroupUniqueViews4": null,
440 | "contentGroupUniqueViews5": null
441 | },
442 | "dataSource": "web",
443 | "publisher_infos": []
444 | }
445 | ],
446 | "fullVisitorId": "2248281639583218707",
447 | "userId": null,
448 | "clientId": null,
449 | "channelGrouping": "Organic Search",
450 | "socialEngagementType": "Not Socially Engaged"
451 | }
452 | ]
--------------------------------------------------------------------------------
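The JSON above is a single exported session with nested and repeated fields (`totals`, `hits`, `hits.product`). As a local complement to the SQL `UNNEST` approach in `bigquery_.py`, such a record can be flattened with pandas; this is an illustrative sketch, not part of the repository, and `pd.json_normalize` requires pandas >= 1.0.

```python
# Flatten the nested Google Analytics record into one row per product
# impression, repeating selected session-level fields.
import json
import pandas as pd

with open('google_analytics_schema.json') as f:
    sessions = json.load(f)

products = pd.json_normalize(
    sessions,
    record_path=['hits', 'product'],
    meta=['fullVisitorId', 'visitId', 'channelGrouping'],
    sep='_'
)
print(products[['fullVisitorId', 'productSKU', 'v2ProductName']].head())
```
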
/google_analytics_schema.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alfredsasko/Customer-Journey-Analytics/c0cc8c28167c837f257753a8cb6870adf017b9aa/google_analytics_schema.xlsx
--------------------------------------------------------------------------------
/helper.py:
--------------------------------------------------------------------------------
1 | """Helper scientific module
2 | Module provides custom methods to support the Customer Journey Analytics project
3 | """
4 |
5 | # IMPORTS
6 | # -------
7 |
8 | # Standard libraries
9 | import re
10 | import ipdb
11 | import string
12 | import math
13 |
14 | # 3rd party libraries
15 | from google.cloud import bigquery
16 |
17 | import numpy as np
18 | import pandas as pd
19 |
20 | import nltk
21 | nltk.download(['wordnet', 'stopwords'])
22 | STOPWORDS = nltk.corpus.stopwords.words('english')
23 |
24 | from scipy import stats
25 | import statsmodels.api as sm
26 | from statsmodels.formula.api import ols
27 | import scikit_posthocs as sp
28 |
29 | from sklearn.preprocessing import StandardScaler
30 | from sklearn.model_selection import train_test_split
31 | from sklearn.pipeline import Pipeline
32 | from sklearn.feature_extraction.text import TfidfVectorizer
33 | from sklearn.feature_extraction.text import CountVectorizer
34 | from sklearn.model_selection import GridSearchCV
35 | from sklearn.naive_bayes import MultinomialNB
36 | from sklearn.naive_bayes import ComplementNB
37 | from sklearn.metrics import f1_score
38 | from sklearn.decomposition import PCA
39 |
40 | import rpy2
41 | import rpy2.rlike.container as rlc
42 | from rpy2 import robjects
43 | from rpy2.robjects.vectors import FloatVector
44 | from rpy2.robjects.vectors import ListVector
45 | from rpy2.robjects.vectors import StrVector
46 | from rpy2.robjects import pandas2ri
47 |
48 | from matplotlib import pyplot as plt
49 | import seaborn as sns
50 | from IPython.display import display  # display() is used in analyze_cluster_solution
51 |
52 | # MODULE FUNCTIONS
53 | # ----------------
54 |
55 | def get_dissimilarity(df, normalize=True):
56 | '''Calculates dissimilarity of observations from average
57 | observation.
58 |
59 | Args:
60 |         df: Data as dataframe of shape (# observations, # variables)
61 |         normalize: If True, variables are standardized before the calculation
62 |     Returns:
63 |         diss: Calculated dissimilarity as series of size (# observations)
64 | '''
65 |
66 | # normalize data
67 | if normalize:
68 | df_scaled = StandardScaler().fit_transform(df)
69 | df = pd.DataFrame(df_scaled, columns=df.columns, index=df.index)
70 | else:
71 | raise Exception('Not implemented')
72 |
73 | # calculate multivariate dissimilarity
74 | diss = ((df - df.mean())**2).sum(axis=1)**(1/2)
75 | return diss
76 |
77 |
78 | def split_data(df, diss_var, dataset_names, threshold, dis_kws={}, **split_kws):
79 |     '''Randomly splits data into two sets, calculates multivariate
80 |     dissimilarity and keeps all outliers determined by the dissimilarity
81 |     threshold in each set.
82 |
83 |     Args:
84 |         df: Data as dataframe of shape (# samples, # features)
85 |         diss_var: Names of variables to calculate dissimilarity measure
86 |             as list of strings
87 |         dataset_names: Names of datasets as list of strings
88 |         threshold: Threshold for dissimilarity measure
89 |             to determine outliers as float
90 |         dis_kws: Keyword arguments of dissimilarity function as dictionary
91 |         split_kws: Keyword arguments of train_test_split function
92 |
93 |     Returns:
94 |         datasets: Dictionary of split datasets as dataframes
95 | '''
96 |
97 | # calculate dissimilarity series
98 | dis_kws['normalize'] = (True if 'normalize' not in dis_kws
99 | else dis_kws['normalize'])
100 |
101 | dissimilarity = get_dissimilarity(df[diss_var], dis_kws['normalize'])
102 |
103 | # Pop outlier customers
104 | ext_mask = (dissimilarity > threshold)
105 | X_ext = df.loc[ext_mask]
106 | X = df.loc[~ext_mask]
107 |
108 | # drop one random sample to keep even samples in dataset
109 | # for purpose of having same number of samples after splitting
110 | if X.shape[0] % 2 != 0:
111 | split_kws['random_state'] = (1 if 'random_state' not in split_kws
112 | else split_kws['random_state'])
113 | remove_n = 1
114 | drop_indices = (X.sample(remove_n,
115 | random_state=split_kws['random_state'])
116 | .index)
117 | X = X.drop(drop_indices)
118 |
119 | # Random split of sample in two groups
120 | Xa, Xb = train_test_split(X, **split_kws)
121 | datasets = [Xa, Xb]
122 |
123 | # add outliers to each group
124 | datasets = {dataset_name: dataset
125 | for dataset_name, dataset in zip(dataset_names, datasets)}
126 |
127 | for name, dataset in datasets.items():
128 | datasets[name] = dataset.append(X_ext)
129 |
130 | return datasets
131 |
132 |
133 | def analyze_cluster_solution(df, vars_, labels, **kws):
134 |     '''Analyzes a cluster solution. The following analyses are done:
135 |         1) Hypothesis testing of cluster averages difference
136 |             a) One-way ANOVA
137 |             b) ANOVA assumptions
138 |                 - residuals normality test: Shapiro-Wilk test
139 |                 - equal variances test: Levene's test
140 |             c) Kruskal-Wallis non-parametric test
141 |             d) All-pairs non-parametric test, Conover test by default
142 |         2) Cluster profile visualization
143 |         3) Cluster scatterplot visualization
144 |
145 |     Args:
146 |         df: Dataset as pandas dataframe
147 |             of shape (# observations, # variables)
148 |         vars_: Clustering variables as list of strings
149 |         labels: Variable holding cluster labels as string
150 |         kws: Keyword arguments of post-hoc test
151 |
152 |     Returns:
153 |         summary: Dataframe of hypothesis tests
154 |         post_hoc: Dict of post-hoc tests for each clustering variable
155 |         prof_ax: Axes of profile visualization
156 |         clst_pg: PairGrid of cluster visualization
157 |     '''
158 |
159 | def color_not_significant_red(val, signf=0.05):
160 |         '''Takes a scalar and returns a string with the css property
161 | `'color: red'` for non significant p_value
162 | '''
163 | color = 'red' if val > signf else 'black'
164 | return 'color: %s' % color
165 |
166 | # get number of seeds
167 | num_seeds = len(df.groupby(labels).groups)
168 |
169 | # run tests
170 | kws['post_hoc_fnc'] = (sp.posthoc_conover if 'post_hoc_fnc' not in kws
171 | else kws['post_hoc_fnc'])
172 |
173 | summary, post_hoc = profile_cluster_labels(
174 | df, labels, vars_, **kws)
175 |
176 | # print hypothesis tests
177 | str_ = 'PROFILE SUMMARY FOR {}'.format(labels.upper())
178 | print(str_ + '\n' + '-' * len(str_) + '\n')
179 |
180 | str_ = 'Hypothesis testing of clusters averages difference'
181 | print(str_ + '\n' + '-' * len(str_))
182 |
183 | display(summary.round(2))
184 |
185 | # print post-hoc tests
186 | str_ = '\nPost-hoc test: {}'.format(kws['post_hoc_fnc'].__name__)
187 | print(str_ + '\n' + '-' * len(str_) + '\n')
188 |
189 | for var in post_hoc:
190 | print('\nclustering variable:', var)
191 | display(post_hoc[var].round(2)
192 | .style.applymap(color_not_significant_red))
193 |
194 | # print profiles
195 |     str_ = '\nProfile visualization'
196 | print(str_ + '\n' + '-' * len(str_))
197 |
198 | prof_ax = (df
199 | .groupby(labels)
200 | [vars_]
201 | .mean()
202 | .transpose()
203 | .plot(title='Cluster Profile')
204 | )
205 | plt.ylabel('Standardized scale')
206 | plt.xlabel('Clustering variables')
207 | plt.show()
208 |
209 | # print scatterplots
210 |     str_ = '\nClusters visualization'
211 | print(str_ + '\n' + '-' * len(str_))
212 | clst_pg = sns.pairplot(x_vars=['recency', 'monetary'],
213 | y_vars=['frequency', 'monetary'],
214 | hue=labels, data=df, height=3.5)
215 | clst_pg.set(yscale='log')
216 | clst_pg.axes[0, 1].set_xscale('log')
217 | clst_pg.fig.suptitle('Candidate Solution: {} seeds'
218 | .format(num_seeds), y=1.01)
219 | plt.show()
220 |
221 | return summary, post_hoc, prof_ax, clst_pg
222 |
223 |
224 | def profile_cluster_labels(df, group, outputs, post_hoc_fnc=sp.posthoc_conover):
225 |     '''Tests distinctiveness of cluster (group) labels across clustering (output)
226 |     variables using one-way ANOVA, Shapiro-Wilk normality test,
227 |     Levene's test of equal variances, Kruskal-Wallis non-parametric test and
228 |     a selected all-pairs post-hoc test for each output variable.
229 |
230 |     Args:
231 |         df: Data with clustering variables and candidate solutions
232 |             as dataframe of shape (# samples, # of variables +
233 |             candidate solutions)
234 |
235 |         group: group variable for hypothesis testing as string
236 |         outputs: output variables for hypothesis testing as list of strings
237 |     Returns:
238 |         summary, post_hoc: dataframe of hypothesis tests and post-hoc results
239 |     '''
240 |
241 |     # initiate summary dataframe
242 | summary = (df.groupby(group)[outputs]
243 | .agg(['mean', 'median'])
244 | .T.unstack(level=-1)
245 | .swaplevel(axis=1)
246 | .sort_index(level=0, axis=1))
247 | # initiate posthoc dictionary
248 | post_hoc = {}
249 |
250 |     # cycle over outputs
251 | for i, output in enumerate(outputs):
252 |
253 | # split group levels
254 | levels = [df[output][df[group] == level]
255 | for level in df[group].unique()]
256 |
257 | # calculate F statistics and p-value
258 | _, summary.loc[output, 'anova_p'] = stats.f_oneway(*levels)
259 |
260 |         # calculate Levene's test for equal variances
261 | _, summary.loc[output, 'levene_p'] = stats.levene(*levels)
262 |
263 |         # check if residuals are normally distributed with the Shapiro-Wilk test
264 | model = ols('{} ~ C({})'.format(output, group), data=df).fit()
265 | _, summary.loc[output, 'shapiro_wilk_p'] = stats.shapiro(model.resid)
266 |
267 | # calculate H statistics and p-value for Kruskal Wallis test
268 | _, summary.loc[output, 'kruskal_wallis_p'] = stats.kruskal(*levels)
269 |
270 | # multiple comparison Conover's test
271 | post_hoc[output] = post_hoc_fnc(
272 | df, val_col=output, group_col=group) #, p_adjust ='holm')
273 |
274 | return summary, post_hoc
275 |
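# Illustrative usage sketch (added for clarity, not part of the original
# module): profiles clusters on a small synthetic, standardized RFM table.
# The _demo_* name, the random data and the quantile-based 'cluster' labels
# are hypothetical; the module-level scipy/statsmodels/scikit_posthocs
# imports used by profile_cluster_labels are assumed to be available.
def _demo_profile_cluster_labels():
    rng = np.random.default_rng(0)
    rfm_df = pd.DataFrame({'recency': rng.normal(size=300),
                           'frequency': rng.normal(size=300),
                           'monetary': rng.normal(size=300)})

    # crude stand-in for a clustering solution: tertiles of monetary value
    rfm_df['cluster'] = pd.qcut(rfm_df['monetary'], 3, labels=False)

    summary, post_hoc = profile_cluster_labels(
        rfm_df, group='cluster', outputs=['recency', 'frequency', 'monetary'])
    return summary, post_hoc
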
276 | def get_missmatch(**kws):
277 |     '''
278 |     Cross-tabulates a dataframe on 2 selected columns and
279 |     calculates the mismatch proportion per row and in total
280 | 
281 |     Args:
282 |         kws: Keyword arguments to the pd.crosstab function
283 | 
284 |     Returns:
285 |         cross_tab: result of cross tabulation as dataframe
286 |         missmatch_rows: mismatch proportion by rows as series
287 |         total_missmatch: total mismatch proportion as float
288 | 
289 |     '''
290 |
291 | cross_tab = pd.crosstab(**kws)
292 | missmatch_rows = (cross_tab.sum(axis=1) - cross_tab.max(axis=1))
293 | total_missmatch = missmatch_rows.sum() / cross_tab.sum().sum()
294 | missmatch_rows = missmatch_rows / cross_tab.sum(axis=1)
295 | missmatch_rows.name = 'missmatch_proportion'
296 |
297 | return cross_tab, missmatch_rows, total_missmatch
298 |
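# Illustrative usage sketch (added for clarity, not part of the original
# module): compares two hypothetical cluster label columns. The _demo_* name
# and the toy labels are made up; any pd.crosstab keyword arguments can be
# passed through **kws.
def _demo_get_missmatch():
    labels_df = pd.DataFrame({
        'kmeans_label': ['A', 'A', 'B', 'B', 'B', 'C'],
        'ward_label':   ['A', 'B', 'B', 'B', 'C', 'C'],
    })
    cross_tab, missmatch_rows, total_missmatch = get_missmatch(
        index=labels_df['kmeans_label'], columns=labels_df['ward_label'])
    return cross_tab, missmatch_rows, total_missmatch
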
299 | def query_product_info(client, query_params):
300 |     '''Queries product information from the BigQuery database:
301 |     distinct records of product_sku, product_name,
302 |     product_brand, product_brand_grp,
303 |     product_category and product_category_grp.
304 |     Args:
305 |         client: Instantiated bigquery.Client used to query distinct product
306 |             descriptions (product_sku, product_name, product_category,
307 |             product_category_grp)
308 |         query_params: Query parameters for the client
309 |     Returns:
310 |         product_df: product information as distinct records
311 |             as pandas dataframe of shape (# records, # variables)
312 |     '''
313 |
314 | # Check arguments
315 | # ----------------
316 | assert isinstance(client, bigquery.Client)
317 | assert isinstance(query_params, list)
318 |
319 | # Query distinct products descriptions
320 | # ------------------------------------
321 | query='''
322 | SELECT DISTINCT
323 | hits_product.productSku AS product_sku,
324 | hits_product.v2productName AS product_name,
325 | hits_product.productBrand AS product_brand,
326 | hits.contentGroup.contentGroup1 AS product_brand_grp,
327 | hits_product.v2productCategory AS product_category,
328 | hits.contentGroup.contentGroup2 AS product_category_grp
329 | FROM
330 | `bigquery-public-data.google_analytics_sample.ga_sessions_*`
331 | LEFT JOIN UNNEST(hits) AS hits
332 | LEFT JOIN UNNEST(hits.product) AS hits_product
333 | WHERE
334 | _TABLE_SUFFIX BETWEEN @start_date AND @end_date
335 | AND hits_product.productSku IS NOT NULL
336 | ORDER BY
337 | product_sku
338 | '''
339 |
340 | job_config = bigquery.QueryJobConfig()
341 | job_config.query_parameters = query_params
342 | df = client.query(query, job_config=job_config).to_dataframe()
343 |
344 | return df
345 |
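# Illustrative usage sketch (added for clarity, not part of the original
# module): queries the public Google Analytics sample for one month of
# product descriptions. The _demo_* name and the 'my-gcp-project' id are
# placeholders; running it requires authenticated Google Cloud credentials
# with BigQuery access.
def _demo_query_product_info():
    client = bigquery.Client(project='my-gcp-project')
    query_params = [
        bigquery.ScalarQueryParameter('start_date', 'STRING', '20160801'),
        bigquery.ScalarQueryParameter('end_date', 'STRING', '20160831'),
    ]
    return query_product_info(client, query_params)
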
346 | def reconstruct_brand(product_sku, df):
347 | '''Reconstructs brand from product name and brand variables
348 | Args:
349 |         product_sku: product_sku column of transaction records on product level
350 |             as series of size # transactions on product level
351 | df: Product information as output of
352 | helper.query_product_info in form of dataframe
353 | of shape (# of distinct records, # of variables)
354 |
355 | Returns:
356 | recon_brand: reconstructed brand column as pandas series
357 | of size # of transactions
358 | '''
359 |
360 | # predict brand name from product name for each sku
361 | # -------------------------------------------------
362 |
363 | # valid brands
364 | brands = ['Android',
365 | 'Chrome',
366 | r'\bGo\b',
367 | 'Google',
368 | 'Google Now',
369 | 'YouTube',
370 | 'Waze']
371 |
372 | # concatenate different product names for each sku
373 | brand_df = (df[['product_sku', 'product_name']]
374 | .drop_duplicates()
375 | .groupby('product_sku')
376 | ['product_name']
377 | .apply(lambda product_name: ' '.join(product_name))
378 | .reset_index()
379 | )
380 |
381 |     # drop (not set) sku's
382 | brand_df = brand_df.drop(
383 | index=brand_df.index[brand_df['product_sku'] == '(not set)'])
384 |
385 |
386 |     # predict brand name from product name for each sku
387 |     brand_df['recon_brand'] = (
388 |         brand_df['product_name']
389 |         .str.extract(r'({})'.format('|'.join(set(brands))),
390 |                      flags=re.IGNORECASE)
391 |     )
392 |
393 |     # adjust brand taking into account spelling errors in product names
394 | brand_df.loc[
395 | brand_df['product_name'].str.contains('You Tube', case=False),
396 | 'recon_brand'
397 | ] = 'YouTube'
398 |
399 |
400 |     # predict brand name from brand variables for sku's where
401 |     # brand couldn't be predicted from product name
402 | # --------------------------------------------------------
403 |
404 | # get distinct product_sku and brand variables associations
405 | brand_vars = ['product_brand', 'product_brand_grp']
406 | brand_var = dict()
407 | for brand in brand_vars:
408 | brand_var[brand] = (df[['product_sku', brand]]
409 | .drop(index=df.index[(df['product_sku'] == '(not set)')
410 | | df['product_sku'].isna()
411 | | (df[brand] == '(not set)')
412 | | df[brand].isna()])
413 | .drop_duplicates()
414 | .drop_duplicates(subset='product_sku', keep=False))
415 |
416 |     # check for brand ambiguity at sku level
417 | old_brand = brand_var['product_brand'].set_index('product_sku')
418 | new_brand = brand_var['product_brand_grp'].set_index('product_sku')
419 | shared_sku = old_brand.index.intersection(new_brand.index)
420 |
421 |     if not shared_sku.empty:
422 | 
423 |         # compare brand values from both sources for the shared sku's
424 |         # (row selection needs .loc; plain [] would select columns)
425 |         old_values = old_brand.loc[shared_sku].squeeze(axis=1).values
426 |         new_values = new_brand.loc[shared_sku].squeeze(axis=1).values
427 | 
428 |         # delete sku's with ambiguous brands
429 |         ambiguous_sku = shared_sku[old_values != new_values]
430 | 
431 |         old_brand = old_brand.drop(index=ambiguous_sku, errors='ignore')
432 |         new_brand = new_brand.drop(index=ambiguous_sku, errors='ignore')
433 | 
434 |         # delete sku's duplicated in new_brand
435 |         # (same brand in both sources, keep only the old_brand record)
436 |         multiple_sku = shared_sku[old_values == new_values]
437 | 
438 |         new_brand = new_brand.drop(index=multiple_sku, errors='ignore')
439 |
440 | # concatenate all associations of brand variables and product sku's
441 | brand_var = pd.concat([old_brand.rename(columns={'product_brand':
442 | 'recon_brand_var'}),
443 | new_brand.rename(columns={'product_brand_grp':
444 | 'recon_brand_var'})])
445 |
446 | # predict brand name from brand variables
447 | brand_df.loc[brand_df['recon_brand'].isna(), 'recon_brand'] = (
448 | pd.merge(brand_df['product_sku'], brand_var, on='product_sku', how='left')
449 | ['recon_brand_var']
450 | )
451 |
452 | # recode remaining missing (not set) brands by Google brand
453 | # ---------------------------------------------------------
454 | brand_df['recon_brand'] = brand_df['recon_brand'].fillna('Google')
455 |
456 | # predict brand from brand names and variables on transaction data
457 | # ----------------------------------------------------------------
458 | recon_brand = (pd.merge(product_sku.to_frame(),
459 | brand_df[['product_sku', 'recon_brand']],
460 | on='product_sku',
461 | how='left')
462 | .reindex(product_sku.index)
463 | ['recon_brand'])
464 |
465 | return recon_brand
466 |
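# Illustrative usage sketch (added for clarity, not part of the original
# module): reconstructs brands for a tiny, made-up product table. The
# _demo_* name, SKUs and product names are hypothetical; in the notebook the
# inputs come from the transaction data and query_product_info.
def _demo_reconstruct_brand():
    product_info = pd.DataFrame({
        'product_sku':       ['SKU1', 'SKU2', 'SKU3'],
        'product_name':      ['Google Sunglasses', 'You Tube Cap', 'Classic Bottle'],
        'product_brand':     ['(not set)', '(not set)', '(not set)'],
        'product_brand_grp': ['(not set)', '(not set)', 'Waze'],
    })
    item_sku = pd.Series(['SKU1', 'SKU2', 'SKU3', 'SKU1'], name='product_sku')

    # should yield roughly: Google, YouTube, Waze, Google
    return reconstruct_brand(item_sku, product_info)
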
467 | def reconstruct_category(product_sku, df, category_spec):
468 | '''Reconstructs category from category variables and product names.
469 |
470 | Args:
471 | product_sku: product_sku from transaction records on product level
472 | of size # transactions on product level
473 | df: Product information as output of
474 | helper.query_product_info in form of dataframe
475 | of shape (# of distinct records, # of variables)
476 |
477 | category_spec: Dictionary with keys as category variable names
478 | and values as mappings between category variable levels
479 | to category labels in form of dataframe
480 |
481 | Returns:
482 | recon_category: reconstructed category column as pandas series
483 |             of size # of transactions on product level
484 | category_df: mappings of unique sku to category labels
485 | '''
486 |
487 | # Check arguments
488 | # ----------------
489 | assert isinstance(product_sku, pd.Series)
490 | assert isinstance(df, pd.DataFrame)
491 | assert isinstance(category_spec, dict)
492 |
493 | # reconstruct category name from product name for each sku
494 | # --------------------------------------------------------
495 |
496 | def get_category_representation(category_label, valid_categories):
497 | '''Handle multiple categories assigned to one sku.
498 |         For ambiguous categories returns missing value.
499 |
500 | Args:
501 | category_label: Series of category labels for
502 | particular sku
503 | valid_categories: Index of valid unique categories
504 | Returns:
505 | label: valid category label or missing value
506 | '''
507 | label = valid_categories[valid_categories.isin(category_label)]
508 | if label.empty or label.size > 1:
509 | return np.nan
510 | else:
511 | return label[0]
512 |
513 |
514 | def label_category_variable(df, category_var, label_spec):
515 | '''reconstruct category labels from category variable.
516 |
517 | Args:
518 | df: Product information dataframe.
519 |             category_var: Name of the category variable to reconstruct labels from
520 | label_spec: Label mapping between category variable levels
521 | and labels.
522 |
523 | Returns:
524 | var_label: Label mapping to sku as dataframe
525 |
526 | '''
527 |
528 | valid_categories = pd.Index(label_spec
529 | .groupby(['category_label'])
530 | .groups
531 | .keys())
532 |
533 | var_label = (pd.merge(df[['product_name', category_var]]
534 | .drop_duplicates(),
535 | label_spec,
536 | how='left',
537 | on=category_var)
538 | [['product_name', 'category_label']]
539 | .groupby('product_name')
540 | ['category_label']
541 | .apply(get_category_representation,
542 | valid_categories=valid_categories)
543 | .reset_index())
544 |
545 | return var_label
546 |
547 | def screen_fit_model(data):
548 |         '''Screens Naive Bayes classifiers and selects the best model
549 |         based on f1 weighted score. Returns fitted model and score.
550 |
551 | Args:
552 | data: Text and respective class labels as dataframe
553 | of shape (# samples, [text, labels])
554 |
555 | Returns:
556 | model: Best fitted sklearn model
557 | f1_weighted_score: Test f1 weighted score
558 |
559 | Note: Following hyperparameters are tested
560 | Algorithm: MultinomialNB, ComplementNB
561 | ngrams range: (1, 1), (1, 2), (1, 3)
562 | binarization: False, True
563 | '''
564 |
565 |         # vectorize text information in product_name
566 | def preprocessor(text):
567 | # not relevant words
568 | not_relevant_words = ['google',
569 | 'youtube',
570 | 'waze',
571 | 'android']
572 |
573 | # transform text to lower case and remove punctuation
574 | text = ''.join([word.lower() for word in text
575 | if word not in string.punctuation])
576 |
577 | # tokenize words
578 |             tokens = re.split(r'\W+', text)
579 |
580 | # Drop not relevant words and lemmatize words
581 | wn = nltk.WordNetLemmatizer()
582 | text = ' '.join([wn.lemmatize(word) for word in tokens
583 | if word not in not_relevant_words + STOPWORDS])
584 |
585 | return text
586 |
587 | # define pipeline
588 | pipe = Pipeline([('vectorizer', CountVectorizer()),
589 | ('classifier', None)])
590 |
591 | # define hyperparameters
592 | param_grid = dict(vectorizer__ngram_range=[(1, 1), (1, 2), (1, 3)],
593 | vectorizer__binary=[False, True],
594 | classifier=[MultinomialNB(),
595 | ComplementNB()])
596 |
597 |         # screen naive Bayes models
598 | grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=5,
599 | scoring='f1_weighted', n_jobs=-1)
600 |
601 |         # divide dataset into train and test set using stratification
602 |         # due to high imbalance of label frequencies
603 | x_train, x_test, y_train, y_test = train_test_split(
604 | data['product_name'],
605 | data['recon_category'],
606 | test_size=0.25,
607 | stratify=data['recon_category'],
608 | random_state=1)
609 |
610 | # execute screening and select best model
611 | grid_search.fit(x_train, y_train)
612 |
613 |         # calculate f1 weighted test score
614 |         y_pred = grid_search.predict(x_test)
615 |         f1_weighted_score = f1_score(y_test, y_pred, average='weighted')
616 | 
617 |         return grid_search.best_estimator_, f1_weighted_score
618 |
619 |     # reconstruct category label from category variables
620 | recon_labels = dict()
621 | for var, label_spec in category_spec.items():
622 | recon_labels[var] = (label_category_variable(df, var, label_spec)
623 | .set_index('product_name'))
624 |
625 | recon_labels['product_category'][
626 | recon_labels['product_category'].isna()
627 | ] = recon_labels['product_category_grp'][
628 | recon_labels['product_category'].isna()
629 | ]
630 |
631 |     # reconstruct category label from product names
632 | valid_categories = pd.Index(category_spec['product_category_grp']
633 | .groupby(['category_label'])
634 | .groups
635 | .keys())
636 |
637 | category_df = (pd.merge(df[['product_sku', 'product_name']]
638 | .drop_duplicates(),
639 | recon_labels['product_category'],
640 | how='left',
641 | on = 'product_name')
642 | [['product_sku', 'product_name', 'category_label']]
643 | .groupby('product_sku')
644 | .agg({'product_name': lambda name: name.str.cat(sep=' '),
645 | 'category_label': lambda label:
646 | get_category_representation(label, valid_categories)})
647 | .reset_index())
648 |
649 | category_df.rename(columns={'category_label': 'recon_category'},
650 | inplace=True)
651 |
652 | # associate category from category names and variables on transaction data
653 | # ------------------------------------------------------------------------
654 | recon_category = (pd.merge(product_sku.to_frame(),
655 | category_df[['product_sku', 'recon_category']],
656 | on='product_sku',
657 | how='left')
658 | )
659 |
660 |     # predict category of transactions where category is unknown:
661 |     # Multinomial and Complement Naive Bayes models are screened
662 |     # and fine-tuned using 1-grams, 2-grams and 3-grams
663 |     # as well as binarization (True or False); the best model is
664 |     # selected by maximizing the test f1 weighted score
665 |     # ----------------------------------------------------------------
666 |
667 | # screen best model and fit it on training data
668 | model, f1_weighted_score = screen_fit_model(
669 | category_df[['product_name', 'recon_category']]
670 | .dropna()
671 | )
672 |
673 | # predict category labels if model has f1_weighted_score > threshold
674 | f1_weighted_score_threshold = 0.8
675 | if f1_weighted_score < f1_weighted_score_threshold:
676 | raise Exception(
677 | 'Accuracy of category prediction below threshold {:.2f}'
678 | .format(f1_weighted_score_threshold))
679 | else:
680 | product_name = (pd.merge(recon_category
681 | .loc[recon_category['recon_category'].isna(),
682 | ['product_sku']],
683 | category_df[['product_sku', 'product_name']],
684 | how='left',
685 | on='product_sku')
686 | ['product_name'])
687 |
688 | category_label = model.predict(product_name)
689 | recon_category.loc[recon_category['recon_category'].isna(),
690 | 'recon_category'] = category_label
691 |
692 | return recon_category['recon_category']
693 |
694 | def reconstruct_sales_region(subcontinent):
695 | '''Reconstruct sales region from subcontinent'''
696 |
697 | if (pd.isna(subcontinent)
698 | or subcontinent.lower() == '(not set)'):
699 | sales_region = np.nan
700 |
701 | elif ('africa' in subcontinent.lower()
702 | or 'europe' in subcontinent.lower()):
703 | sales_region = 'EMEA'
704 |
705 | elif ('caribbean' in subcontinent.lower()
706 | or subcontinent.lower() == 'central america'):
707 | sales_region = 'Central America'
708 |
709 | elif subcontinent.lower() == 'northern america':
710 | sales_region = 'North America'
711 |
712 | elif subcontinent.lower() == 'south america':
713 | sales_region = 'South America'
714 |
715 | elif ('asia' in subcontinent.lower()
716 | or subcontinent.lower() == 'australasia'):
717 | sales_region = 'APAC'
718 |
719 | else:
720 | raise Exception(
721 | 'Can not assign sales region to {} subcontinent'
722 | .format(subcontinent))
723 |
724 | return sales_region
725 |
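# Illustrative usage sketch (added for clarity, not part of the original
# module): maps a few made-up subcontinent values to sales regions. The
# _demo_* name is hypothetical.
def _demo_reconstruct_sales_region():
    subcontinents = pd.Series(['Northern Europe', 'Caribbean', 'Northern America',
                               'Eastern Asia', 'South America', '(not set)'])
    # expected: EMEA, Central America, North America, APAC, South America, NaN
    return subcontinents.apply(reconstruct_sales_region)
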
726 | def reconstruct_traffic_keyword(text):
727 |     '''Reconstructs traffic keywords into a simpler representation'''
728 |
729 | # if empty rename to not applicable
730 | if pd.isna(text):
731 | text = '(not applicable)'
732 |
733 | # if one word with mixed numbers & letters rename to (not relevant)
734 | elif re.search(r'(?=.*\d)(?=.*[A-Z=\-])(?=.*[a-z])([\w=-]+)', text):
735 | text = '(not relevant)'
736 |
737 | elif ((text != '(not provided)')
738 | and (re.search('(\s+)', text) is not None)):
739 |
740 | # transform text to lower case and remove punctuation
741 | text = ''.join([word.lower() for word in text
742 | if word not in string.punctuation.replace('/', '')])
743 |
744 | # tokenize words
745 |         tokens = re.split(r'\W+|/', text)
746 |
747 | # Drop not relevant words and lemmatize words
748 | wn = nltk.WordNetLemmatizer()
749 | text = ' '.join([wn.lemmatize(word) for word in tokens
750 | if word not in STOPWORDS])
751 |
752 | return text
753 |
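# Illustrative usage sketch (added for clarity, not part of the original
# module): simplifies a few made-up traffic keywords. The _demo_* name and
# example keywords are hypothetical; the module-level STOPWORDS list is
# assumed, and the WordNet corpus must be available to the lemmatizer
# (e.g. via nltk.download('wordnet')).
def _demo_reconstruct_traffic_keyword():
    keywords = pd.Series([None,                         # -> '(not applicable)'
                          'XN23=abc-7',                 # -> '(not relevant)'
                          '(not provided)',             # returned unchanged
                          'google merchandise store'])  # cleaned and lemmatized
    return keywords.apply(reconstruct_traffic_keyword)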
754 |
755 | def aggregate_data(df):
756 |     '''Encodes and aggregates the engineered, missing-value-free data
757 |     on client level
758 | 
759 |     Args:
760 |         df: engineered, missing-value-free data as
761 |             pandas dataframe of shape (# transaction items, # variables)
762 | 
763 |     Returns:
764 |         agg_df: encoded and aggregated dataframe of shape
765 |             (# clients, # encoded & engineered variables) with client_id index
766 |     '''
767 | # identifiers
768 | id_vars = pd.Index(
769 | ['client_id',
770 | 'session_id',
771 | 'transaction_id',
772 | 'product_sku']
773 | )
774 |
775 | # session variables
776 | session_vars = pd.Index(
777 | ['visit_number', # avg_visits
778 | 'date', # month, week, week_day + one hot encode + sum
779 | 'pageviews', # avg_pageviews
780 | 'time_on_site', # avg_time_on_site
781 | 'ad_campaign', # sum
782 | 'source', # one hot encode + sum
783 | 'browser', # one hot encode + sum
784 | 'operating_system', # one hot encode + sum
785 | 'device_category', # one hot encode + sum
786 | 'continent', # one hot encode + sum
787 | 'subcontinent', # one hot encode + sum
788 | 'country', # one hot encode + sum
789 | 'sales_region', # one hot encode + sum
790 | 'social_referral', # sum
791 | 'social_network', # one hot encode + sum
792 | 'channel_group'] # one hot encode + sum
793 | )
794 |
795 | # group session variables from item to session level
796 | session_df = (df[['client_id',
797 | 'session_id',
798 | *session_vars.to_list()]]
799 | .drop_duplicates()
800 |
801 |                   # drop 1 ambiguous region case
802 | .drop_duplicates(subset='session_id'))
803 |
804 |     # reconstruct month, week and week day variables
805 | # session_df['month'] = session_df['date'].dt.month
806 | # session_df['week'] = session_df['date'].dt.week
807 | session_df['week_day'] = session_df['date'].dt.weekday + 1
808 | session_df = session_df.drop(columns='date')
809 |
810 | # encode variables on session level
811 | keep_vars = [
812 | 'client_id',
813 | 'session_id',
814 | 'visit_number',
815 | 'pageviews',
816 | 'time_on_site',
817 | 'social_referral',
818 | 'ad_campaign'
819 | ]
820 |
821 | encode_vars = session_df.columns.drop(keep_vars)
822 | enc_session_df = pd.get_dummies(session_df,
823 | columns=encode_vars.to_list(),
824 | prefix_sep='*')
825 |
826 | # remove not relevant encoded variables
827 | enc_session_df = enc_session_df.drop(
828 | columns=enc_session_df.columns[
829 | enc_session_df.columns.str.contains('not set|other')
830 | ]
831 | )
832 |
833 | # summarize session level variables on customer level
834 | sum_vars = (pd.Index(['social_referral', 'ad_campaign'])
835 | .append(enc_session_df
836 | .columns
837 | .drop(keep_vars)))
838 |
839 | client_session_sum_df = (enc_session_df
840 | .groupby('client_id')
841 | [sum_vars]
842 | .sum())
843 |
844 | client_session_avg_df = (
845 | enc_session_df
846 | .groupby('client_id')
847 | .agg(avg_visits=('visit_number', 'mean'),
848 | avg_pageviews=('pageviews', 'mean'),
849 | avg_time_on_site=('time_on_site', 'mean'))
850 | )
851 |
852 | client_session_df = pd.concat([client_session_avg_df,
853 | client_session_sum_df],
854 | axis=1)
855 |
856 |
857 | # product level variables
858 | product_vars = pd.Index([
859 | # 'product_name', # one hot encode + sum
860 | 'product_category', # one hot encode + sum
861 | 'product_price', # avg_product_revenue
862 | 'product_quantity', # avg_product_revenue
863 | 'hour'] # one hot encoded + sum
864 | )
865 |
866 | avg_vars = pd.Index([
867 | 'product_price',
868 | 'product_quantity'
869 | ])
870 |
871 | sum_vars = pd.Index([
872 | # 'product_name',
873 | 'product_category',
874 | 'hour'
875 | ])
876 |
877 | enc_product_df = pd.get_dummies(df[id_vars.union(product_vars)],
878 | columns=sum_vars,
879 | prefix_sep='*')
880 |
881 | # summarize product level variables on customer level
882 | client_product_sum_df = (enc_product_df
883 | .groupby('client_id')
884 | [enc_product_df.columns.drop(avg_vars)]
885 | .sum())
886 |
887 | def average_product_vars(client):
888 | d = {}
889 | d['avg_product_revenue'] = ((client['product_price']
890 | * client['product_quantity'])
891 | .sum()
892 | / client['product_quantity'].sum())
893 |
894 | # ipdb.set_trace(context=15)
895 | d['avg_unique_products'] = (client
896 | .groupby('transaction_id')
897 | ['product_sku']
898 | .apply(lambda sku: len(sku.unique()))
899 | .mean())
900 |
901 | return pd.Series(d, index=['avg_product_revenue',
902 | 'avg_unique_products'])
903 |
904 | client_product_avg_df = (enc_product_df
905 | .groupby('client_id')
906 | .apply(average_product_vars))
907 |
908 | client_product_df = pd.concat([client_product_avg_df,
909 | client_product_sum_df]
910 | , axis=1)
911 |
912 | agg_df = pd.concat([client_session_df,
913 | client_product_df],
914 | axis=1)
915 |
916 | return agg_df
917 |
918 |
919 | def do_pca(X_std, **kwargs):
920 |     '''Apply PCA to the data.'''
921 | pca = PCA(**kwargs)
922 | model = pca.fit(X_std)
923 | X_pca = model.transform(X_std)
924 | return pca, X_pca
925 |
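# Illustrative usage sketch (added for clarity, not part of the original
# module): runs PCA on small standardized random data. The _demo_* name and
# synthetic data are hypothetical.
def _demo_do_pca():
    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.normal(size=(100, 5)),
                     columns=['feat_{}'.format(i) for i in range(5)])
    X_std = (X - X.mean()) / X.std()
    pca, X_pca = do_pca(X_std, n_components=3, random_state=0)
    return pca, X_pca
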
926 | def scree_pca(pca, plot=False, **kwargs):
927 | '''Investigate the variance accounted for by each principal component.'''
928 | # PCA components
929 | n_pcs = len(pca.components_)
930 | pcs = pd.Index(range(1, n_pcs+1), name='principal component')
931 |
932 | # Eigen Values
933 | eig = pca.explained_variance_.reshape(n_pcs, 1)
934 | eig_df = pd.DataFrame(np.round(eig, 2), columns=['eigen_value'], index=pcs)
935 | eig_df['cum_eigen_value'] = np.round(eig_df['eigen_value'].cumsum(), 2)
936 |
937 | # Explained Variance %
938 | var = pca.explained_variance_ratio_.reshape(n_pcs, 1)
939 | var_df = pd.DataFrame(np.round(var, 4),
940 | columns=['explained_var'],
941 | index=pcs)
942 | var_df['cum_explained_var'] = (np.round(var_df['explained_var'].cumsum()
943 | / var_df['explained_var'].sum(), 4))
944 |
945 | df = pd.concat([eig_df, var_df], axis=1)
946 |
947 | if plot:
948 | # scree plot limit
949 | limit = pd.DataFrame(np.ones((n_pcs, 1)),
950 | columns=['scree_plot_limit'], index=pcs)
951 |
952 | ax = (pd.concat([df, limit], axis=1)
953 | .plot(y=['eigen_value', 'explained_var', 'scree_plot_limit'],
954 | title='PCA: Scree test & Variance Analysis', **kwargs)
955 | )
956 | df.plot(y=['cum_explained_var'], secondary_y=True, ax=ax)
957 |
958 | return df
959 |
960 | def get_pc_num(scree_df, pc_num = None, exp_var_threshold=None,
961 | eig_val_threshold=1):
962 |     '''
963 |     Selects the optimum number of principal components according to the
964 |     specified objective: fixed number, % of explained variance or eigenvalue
965 | 
966 |     Args:
967 |         scree_df: Dataframe as output of the scree_pca function
968 |         pc_num: fixed number of principal components, None by default
969 |         exp_var_threshold: threshold for cumulative % of explained variance
970 |         eig_val_threshold: min eigenvalue, 1 by default
971 |     Returns:
972 |         pc_num: Number of selected principal components
973 |         exp_var: Explained variance by selected components
974 |         sum_eig: Sum of eigenvalues of selected components
975 |     '''
976 | # check arguments
977 | assert pc_num is None or pc_num <= scree_df.index.size
978 | assert exp_var_threshold is None or (0 < exp_var_threshold <= 1)
979 | assert 0 < eig_val_threshold < scree_df.index.size
980 |
981 | assert (pc_num is None or exp_var_threshold is not None) or \
982 | (pc_num is not None or exp_var_threshold is None), \
983 | ('''Either number of principal components or minimum variance
984 | explained should be selected''')
985 |
986 | if exp_var_threshold:
987 | pcs = scree_df.index[scree_df['cum_explained_var'] <= exp_var_threshold]
988 |
989 | elif pc_num:
990 | pcs = scree_df.index[range(1, pc_num+1)]
991 |
992 | elif exp_var_threshold is None:
993 | pcs = scree_df.index[scree_df['eigen_value'] > eig_val_threshold]
994 |
995 | pc_num = len(pcs)
996 | exp_var = scree_df.loc[pc_num, 'cum_explained_var']
997 | sum_eig = scree_df.loc[[*pcs], 'eigen_value'].sum()
998 |
999 | return pc_num, exp_var, sum_eig
1000 |
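# Illustrative usage sketch (added for clarity, not part of the original
# module): chains do_pca, scree_pca and get_pc_num on synthetic standardized
# data and keeps components with eigenvalues above 1 (Kaiser criterion). The
# _demo_* name and data are hypothetical.
def _demo_select_components():
    rng = np.random.default_rng(1)
    X = pd.DataFrame(rng.normal(size=(150, 8)),
                     columns=['feat_{}'.format(i) for i in range(8)])
    X_std = (X - X.mean()) / X.std()

    pca, _ = do_pca(X_std)
    scree_df = scree_pca(pca)
    pc_num, exp_var, sum_eig = get_pc_num(scree_df, eig_val_threshold=1)
    return pc_num, exp_var, sum_eig
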
1001 | def varimax(factor_df, **kwargs):
1002 | '''
1003 | varimax rotation of factor matrix
1004 |
1005 | Args:
1006 | factor_df: factor matrix as pd.DataFrame with shape
1007 | (# features, # principal components)
1008 |
1009 | Return:
1010 | rot_factor_df: rotated factor matrix as pd.DataFrame
1011 | '''
1012 | factor_mtr = df2mtr(factor_df)
1013 | varimax = robjects.r['varimax']
1014 | rot_factor_mtr = varimax(factor_mtr)
1015 | return pandas2ri.ri2py(rot_factor_mtr.rx2('loadings'))
1016 |
1017 | def get_components(df, pca, rotation=None, sort_by='sig_ld',
1018 | feat_details=None, plot='None', **kwargs):
1019 | '''
1020 | Show significant factor loadings depending on sample size
1021 |
1022 | Args:
1023 | df: data used for pca as pd.DataFrame
1024 | pca: fitted pca object
1025 | rotation: if to apply factor matrix rotation, by default None.
1026 |         sort_by: sort sequence of components, by default according to the
1027 |             number of significant loadings ('sig_ld')
1028 |         feat_details: Dictionary of mapped feature details, by default None
1029 |         plot: 'discrete' plots a heatmap highlighting significant loadings,
1030 |             'continuous' plots a continuous heatmap,
1031 |             by default None
1032 |
1033 | Returns:
1034 | factor_df: factor matrix as pd.DataFrame
1035 | of shape (# features, # components)
1036 |         sig_ld: number of significant loadings across components as
1037 |             pd.Series of size # components
1038 |         cross_ld: number of significant loadings across features
1039 |             (cross loadings) as pd.Series of size # features
1040 |
1041 | '''
1042 | # constants
1043 | # ---------
1044 |     maxstr = 100  # number of characters to print
1045 |
1046 |     # guidelines for identifying significant factor loadings
1047 |     # based on sample size. Source: Multivariate Data Analysis, 7th Edition.
1048 | factor_ld = np.linspace(0.3, 0.75, 10)
1049 | signif_sz = np.array([350, 250, 200, 150, 120, 100, 85, 70, 60, 50])
1050 |
1051 |     # loadings significance threshold
1052 | ld_sig = factor_ld[len(factor_ld) - (signif_sz <= df.index.size).sum()]
1053 |
1054 | if rotation == 'varimax':
1055 | components = varimax(pd.DataFrame(pca.components_.T))
1056 | else:
1057 | components = pca.components_.T
1058 |
1059 | # annotate factor matrix
1060 | index = pd.Index([])
1061 | for feat in df.columns:
1062 | try:
1063 | index = index.append(
1064 | pd.Index([feat]) if feat_details is None else \
1065 | pd.Index([feat_details[feat]['long_name'][:maxstr]]))
1066 | except KeyError:
1067 | index = index.append(pd.Index([feat]))
1068 |
1069 | factor_df = pd.DataFrame(
1070 | np.round(components, 2),
1071 | columns = pd.Index(range(1, components.shape[1]+1),
1072 | name='principal_components'),
1073 | index = index.rename('features')
1074 | )
1075 |
1076 | # select significant loadings
1077 | sig_mask = (factor_df.abs() >= ld_sig)
1078 |
1079 | # calculate cross loadings
1080 | cross_ld = (sig_mask.sum(axis=1)
1081 | .sort_values(ascending=False)
1082 | .rename('cross_loadings'))
1083 |
1084 | # calculate number of significant loadings per component
1085 | sig_ld = (sig_mask.sum()
1086 | .sort_values(ascending=False)
1087 | .rename('significant_loadings'))
1088 |
1089 |     # sort factor matrix by loadings in components
1090 | sort_by = [*sig_ld.index] if sort_by == 'sig_ld' else sort_by
1091 | factor_df.sort_values(sort_by, ascending=False, inplace=True)
1092 |
1093 | if plot == 'continuous':
1094 | plt.figure(**kwargs)
1095 | sns.heatmap(
1096 | factor_df.sort_values(sort_by, ascending=False).T,
1097 | cmap='RdYlBu', vmin=-1, vmax=1, square=True
1098 | )
1099 | plt.title('Factor matrix')
1100 |
1101 | elif plot == 'discrete':
1102 | # loadings limits
1103 | ld_min, ld_sig_low, ld_sig_high, ld_max = -1, -ld_sig, ld_sig, 1
1104 | vmin, vmax = ld_min, ld_max
1105 |
1106 |         # create discrete scale to distinguish significant difference categories
1107 | data = factor_df.apply(
1108 | lambda col: pd.to_numeric(pd.cut(col,
1109 | [ld_min, -ld_sig, ld_sig, ld_max],
1110 | labels=[-ld_sig, 0, ld_sig])))
1111 |
1112 | # plot heat map
1113 | fig = plt.figure(**kwargs)
1114 | sns.heatmap(data.T, cmap='viridis', vmin=vmin, vmax=vmax, square=True)
1115 |         plt.title('Factor matrix with significant loadings: |loading| >= {}'
1116 |                   .format(ld_sig));
1117 |
1118 | return factor_df, sig_ld, cross_ld
1119 |
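# Illustrative usage sketch (added for clarity, not part of the original
# module): inspects unrotated loadings of a PCA fitted on synthetic
# standardized data (at least 50 samples are needed for the significance
# guideline lookup). The _demo_* name and data are hypothetical.
def _demo_get_components():
    rng = np.random.default_rng(2)
    X_std = pd.DataFrame(rng.normal(size=(200, 6)),
                         columns=['feat_{}'.format(i) for i in range(6)])
    pca, _ = do_pca(X_std, n_components=4)
    factor_df, sig_ld, cross_ld = get_components(X_std, pca)
    return factor_df, sig_ld, cross_ld
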
1120 | def df2mtr(df):
1121 |     '''
1122 |     Converts a pandas dataframe to an R matrix. Category dtype is cast as
1123 |     FactorVector considering missing values
1124 |     (the original py2ri function of rpy2 can't handle this properly so far)
1125 | 
1126 |     Args:
1127 |         df: pandas dataframe of shape (# samples, # features)
1128 |             with numeric dtype
1129 | 
1130 |     Returns:
1131 |         mtr: R matrix of shape (# samples, # features)
1132 |     '''
1133 | # check arguments
1134 |     assert isinstance(df, pd.DataFrame), 'Argument df needs to be a pd.DataFrame.'
1135 |
1136 | # select only numeric columns
1137 | df = df.select_dtypes('number')
1138 |
1139 | # create and return r matrix
1140 | values = FloatVector(df.values.flatten())
1141 | dimnames = ListVector(
1142 | rlc.OrdDict([('index', StrVector(tuple(df.index))),
1143 | ('columns', StrVector(tuple(df.columns)))])
1144 | )
1145 |
1146 | return robjects.r.matrix(values, nrow=len(df.index), ncol=len(df.columns),
1147 | dimnames = dimnames, byrow=True)
1148 |
1149 | def screen_model(X_train, X_test, y_train, y_test, grid_search, fine_param=None,
1150 | title='MODEL SCREENING EVALUATION', verbose='text'):
1151 |     '''Screens the pipeline with different hyperparameters.
1152 |
1153 | Args:
1154 | X_train, X_test: Pandas DataFrame of shape (# samples, # features)
1155 | _train - training set, _test - test set
1156 | y_train, y_test: Pandas Series of size (# of samples, label)
1157 | grid_search: GridSearchCV object
1158 |         verbose: 'text' shows the grid_search results DataFrame,
1159 |             'plot' shows a run chart of scores against fine_param
1160 |         fine_param: name of the parameter used to fine-tune the model. Used
1161 |             only with the 'plot' option
1162 | 
1163 |     Returns:
1164 |         grid_search: the fitted grid_search object
1165 |
1166 | '''
1167 | # screen models
1168 | grid_search.fit(X_train, y_train)
1169 |
1170 | # print output
1171 | if verbose == 'text':
1172 | # screen results
1173 | screen_results = (pd.DataFrame(grid_search.cv_results_)
1174 | .sort_values('rank_test_score'))
1175 |
1176 | hyper_params = screen_results.columns[
1177 | screen_results.columns.str.contains('param_')
1178 | ]
1179 |
1180 | if 'param_classifier' in screen_results:
1181 | screen_results['param_classifier'] = (
1182 | screen_results['param_classifier']
1183 | .apply(lambda cls_: type(cls_).__name__)
1184 | )
1185 |
1186 | screen_results['overfitting'] = (
1187 | screen_results['mean_train_score']
1188 | - screen_results['mean_test_score']
1189 | )
1190 |
1191 | # calculate f1 weighted test score
1192 | y_pred = grid_search.predict(X_test)
1193 |
1194 | f1_weighted_score = f1_score(y_test, y_pred, average='weighted')
1195 |
1196 | # print results
1197 | print(title + '\n' + '-' * len(title))
1198 |
1199 | display(screen_results
1200 | [hyper_params.union(pd.Index(['mean_train_score',
1201 | 'std_train_score',
1202 | 'mean_test_score',
1203 | 'std_test_score',
1204 | 'mean_fit_time',
1205 | 'mean_score_time',
1206 | 'overfitting']))])
1207 |
1208 | print('Best model is {} with F1 test weighted score {:.3f}\n'
1209 | .format(type(grid_search
1210 | .best_estimator_
1211 | .named_steps
1212 | .classifier)
1213 | .__name__,
1214 | f1_weighted_score))
1215 |
1216 | elif verbose == 'plot':
1217 | if fine_param in grid_search.cv_results_:
1218 |
1219 | # screen results
1220 | screen_results = pd.DataFrame(grid_search.cv_results_)
1221 |
1222 | # plot results
1223 | screen_results = pd.melt(
1224 | screen_results,
1225 | id_vars=fine_param,
1226 | value_vars=(screen_results.columns[
1227 | screen_results.columns
1228 | .str.contains(r'split\d_\w{4,5}_score', regex=True)
1229 | ]),
1230 | var_name='score_type',
1231 | value_name=grid_search.scoring
1232 | )
1233 |
1234 | screen_results['score_type'] = screen_results['score_type'].replace(
1235 | regex=r'split\d_(\w{4,5}_score)', value=r'\1'
1236 | )
1237 |
1238 | sns.lineplot(x=fine_param,
1239 | y=grid_search.scoring,
1240 | hue='score_type',
1241 | data=screen_results,
1242 | err_style='bars',
1243 | ax=plt.gca(),
1244 | marker='o',
1245 | linestyle='dashed')
1246 | plt.gca().set_title(title);
1247 |
1248 | return grid_search
1249 |
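# Illustrative usage sketch (added for clarity, not part of the original
# module): screens a small logistic-regression pipeline on synthetic data.
# The _demo_* name, data and hyperparameter grid are hypothetical; the grid
# search needs return_train_score=True and a step named 'classifier', and
# the textual report relies on IPython's display(), so it is meant to be run
# in a notebook.
def _demo_screen_model():
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                               n_classes=3, random_state=0)
    X = pd.DataFrame(X, columns=['feat_{}'.format(i) for i in range(10)])
    y = pd.Series(y, name='label')
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=0)

    pipe = Pipeline([('scaler', StandardScaler()),
                     ('classifier', LogisticRegression(max_iter=1000))])
    grid_search = GridSearchCV(pipe,
                               param_grid={'classifier__C': [0.1, 1.0, 10.0]},
                               scoring='f1_weighted', cv=3,
                               return_train_score=True)
    return screen_model(X_train, X_test, y_train, y_test, grid_search)
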
1250 | def plot_features_significance(estimator, X_std, y, feature_names, class_names,
1251 | threshold = -np.inf, title=''):
1252 |     '''Analyzes feature significance of the estimator.
1253 |
1254 | Args:
1255 |         estimator: Sklearn estimator with coef_ or feature_importances_
1256 |             attribute
1257 |         X_std: Standardized inputs as dataframe
1258 |             of shape (# of samples, # of features)
1259 |         y: Class labels as Series of size # of samples
1260 |         feature_names: list/index of feature names
1261 |         class_names: list of class names
1262 |         threshold: Keeps only significant coefficients satisfying
1263 |             |coefficient| >= threshold
1264 |         title: title to put on each plot; the class name will be appended.
1265 | '''
1266 |
1267 | assert ('coef_' in dir(estimator)
1268 |             or 'feature_importances_' in dir(estimator))
1269 |
1270 | # get factor matrix
1271 | factor_matrix = pd.DataFrame(estimator.coef_ if 'coef_' in dir(estimator)
1272 | else estimator.feature_importances_,
1273 | index=estimator.classes_,
1274 | columns=feature_names)
1275 |
1276 |
1277 | cols = 2
1278 | rows = math.ceil(len(estimator.classes_) / cols)
1279 | fig, axes = plt.subplots(rows, cols,
1280 | figsize=(10*cols, 10*rows),
1281 | sharex=True);
1282 | plt.subplots_adjust(hspace=0.07, wspace=0.4)
1283 |
1284 | for i, (ax, class_idx, class_name) in enumerate(
1285 | zip(axes.flatten(), estimator.classes_, class_names)):
1286 |
1287 |         # sort feature weights and select features
1288 | sorted_coef = (factor_matrix
1289 | .loc[class_idx]
1290 | .abs()
1291 | .sort_values(ascending=False))
1292 |
1293 | selected_feats = sorted_coef[sorted_coef >= threshold].index
1294 |
1295 | selected_coef = (factor_matrix
1296 | .loc[class_idx, selected_feats]
1297 | .rename('feature weights'))
1298 |
1299 | # calculate one-to-rest standardized average differences
1300 | selected_diff = (
1301 | (X_std.loc[y == class_idx, selected_feats].mean()
1302 | - X_std.loc[y != class_idx, selected_feats].mean())
1303 |             .rename('standardized difference of one-to-rest averages')
1304 | )
1305 |
1306 |         # plot bar charts
1307 | selected_df = (pd.concat([selected_coef, selected_diff], axis=1)
1308 | .sort_values('feature weights'))
1309 |
1310 | selected_df.plot.barh(ax=ax, legend=True if i==0 else False)
1311 | ax.set_title(title + ' ' + class_name)
1312 |
--------------------------------------------------------------------------------
/product_categories.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alfredsasko/Customer-Journey-Analytics/c0cc8c28167c837f257753a8cb6870adf017b9aa/product_categories.xlsx
--------------------------------------------------------------------------------
/temp_data.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alfredsasko/Customer-Journey-Analytics/c0cc8c28167c837f257753a8cb6870adf017b9aa/temp_data.h5
--------------------------------------------------------------------------------