├── .gitignore
├── README.md
├── bin
    ├── p1q1.sql
    ├── p1q2.sql
    └── p3.py
├── data
    ├── input
    │   └── ds_challenge_v2_1_data.csv
    └── output
    │   └── signups_enriched.csv
├── docs
    ├── instructions
    │   ├── DSInterviewChallengeV_2_4.pdf
    │   └── ds_challenge_v2_1_data.csv
    ├── modeling_notes.txt
    └── responses
    │   ├── q3_sample_output.log
    │   ├── responses.docx
    │   └── responses.pdf
└── environment.yml


/.gitignore:
--------------------------------------------------------------------------------
1 | # Compiled python code
2 | *.pyc
3 |
4 | # Mac folder info
5 | .DS_Store
6 |
7 | # Idea / Pycharm files
8 | .idea/*

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Uber-DS-Challenge
2 |
3 | A data challenge provided by Uber, for a Data Scientist position.
4 | ## Getting started
5 |
6 | ### Repo structure
7 | Data Challenge instructions and provided data are in `docs/instructions`, and responses to questions are in
8 | `docs/responses`. Code underlying responses to questions is in `bin/`.
9 |
10 | ### Python Environment
11 | Python code in this repo uses packages that are not part of the standard library. To make sure you have all of the
12 | required packages, please install [Anaconda](https://www.continuum.io/downloads), and install the environment
13 | described in `environment.yml` (instructions [here](http://conda.pydata.org/docs/using/envs.html), under *Use
14 | environment from file* and *Change environments (activate/deactivate)*).
15 |
16 | ### To run code
17 | Part 1 and Part 2 are example code. They have not been validated against the tables described in the instructions, and
18 | as such should not be considered ready to run.
19 |
20 | To run the Python code for Part 3, complete the following:
21 | ```bash
22 | # Install anaconda environment
23 | conda env create -f environment.yml
24 |
25 | # Activate environment
26 | source activate uber_env
27 |
28 | # Run script
29 | cd bin/
30 | python p3.py
31 | ```
32 |
33 |
34 | ## Contact
35 | Feel free to contact me at 13herger@gmail.com
36 |

--------------------------------------------------------------------------------
/bin/p1q1.sql:
--------------------------------------------------------------------------------
1 | SELECT
2 |     PERCENTILE_CONT(.9)
3 |     WITHIN GROUP (ORDER BY trips.actual_eta - predicted_eta)
4 |     AS "90th_percentile"
5 | FROM trips
6 | LEFT OUTER JOIN cities
7 |     ON trips.city_id = cities.city_id
8 | WHERE cities.city_name IN ('Qarth', 'Meereen')
9 |     AND trips.status = 'completed'
10 |     AND trips.request_at > (CURRENT_TIMESTAMP - INTERVAL '10 days');
11 |

--------------------------------------------------------------------------------
/bin/p1q2.sql:
--------------------------------------------------------------------------------
1 | SELECT signups_enhanced.day_of_week, AVG(rode_in_first_week::int)
2 | FROM
3 |
4 | -- Create sub-table with one row for every rider who signed up, with rode_in_first_week metric
5 | ( SELECT events.*,
6 |     EXTRACT( DOW FROM _ts) AS day_of_week,
7 |     -- Actually compute rode_in_first_week metric
8 |     -- Check if user has a ride
9 |     (first_completed_trips.request_at IS NOT NULL
10 |     -- First ride within 168 hours
11 |     AND first_completed_trips.request_at <= events._ts + INTERVAL '168 hours'
12 |     -- No rides before sign up
13 |     AND first_completed_trips.request_at >= events._ts)
14 |     AS rode_in_first_week
15 |     FROM events
16 |     LEFT OUTER JOIN
17 |
18 |     -- Create sub-table with every rider's first completed trip
19 |     (SELECT DISTINCT ON (trips.client_id) trips.client_id, request_at
20 |     FROM trips
21 |     WHERE trips.status = 'completed'
22 |     ORDER BY trips.client_id, trips.request_at ASC
23 |     ) AS first_completed_trips
24 |
25 |     WHERE 
events.rider_id = first_completed_trips.client_id
26 |     AND event_name = 'sign_up_success'
27 | ) AS signups_enhanced
28 |
29 | WHERE EXTRACT(WEEK FROM signup_ts) = 1
30 |     AND EXTRACT(YEAR FROM signup_ts) = 2016
31 |     AND city_name IN ('Qarth', 'Meereen')
32 | GROUP BY signups_enhanced.day_of_week;
33 |

--------------------------------------------------------------------------------
/bin/p3.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | """
3 | coding=utf-8
4 |
5 | Code supporting question 3 of Uber DS challenge (see docs/instructions)
6 |
7 | """
8 | # Imports
9 | import logging
10 |
11 | from sklearn.metrics import roc_auc_score
12 |
13 | logging.basicConfig(level=logging.DEBUG)
14 |
15 | import numpy as np
16 | import pandas as pd
17 | import statsmodels.api as sm
18 |
19 | from patsy.highlevel import dmatrices
20 |
21 |
22 |
23 |
24 | # Functions
25 | def main():
26 |     """
27 |     Main method for question 3 of Uber DS challenge.
28 |
29 |     This method reads, cleans and enriches the data set, and then runs models on that dataset.
30 |
31 |     Model output is printed to screen, though the accompanying write up should be the primary reference for reviewing
32 |     these results.
33 |     :return: None
34 |     :rtype: None
35 |     """
36 |
37 |     # Create enriched data set, save to file for external validation
38 |     signup_df = create_signup_df('../data/input/ds_challenge_v2_1_data.csv')
39 |     signup_df.to_csv('../data/output/signups_enriched.csv')
40 |
41 |     drive_percentage = signup_df['drove'].astype(int).mean()
42 |     print 'Q1: Percentage of drivers w/ completed ride: %s ' % drive_percentage
43 |     # Prepare data for statsmodels consumption
44 |     stats_df = statsmodel_prep_df(signup_df)
45 |
46 |     # Train test split (80% train, 20% holdout)
47 |     train = stats_df.sample(frac=0.8, random_state=200)
48 |     test = stats_df.drop(train.index)
49 |     # Run logistic regression models
50 |     run_statsmodels_models(train, test, 'drove ~ signup_channel_referral + city_Berton + signup_weekday + '
51 |                                         'vehicle_inspection_known')
52 |     run_statsmodels_models(train, test, 'drove ~ signup_channel_referral + city_Berton + '
53 |                                         'signup_to_vehicle_add + signup_to_bgc')
54 |
55 |
56 | def extract_days(input_delta):
57 |     """
58 |     Helper function to extract the number of days from a time delta.
Returns:
59 |      - Number of days, if valid time delta
60 |      - np.nan if time delta is null (NaT)
61 |     :param input_delta: value to coerce into a Pandas time delta
62 |     :return: number of days in time delta
63 |     :rtype: float
64 |     """
65 |
66 |     # Attempt to coerce into Pandas time delta
67 |     delta = pd.Timedelta(input_delta)
68 |
69 |     # Attempt to extract number of days
70 |     days = np.nan
71 |     if pd.notnull(delta):
72 |         days = delta.days
73 |
74 |     # Return result
75 |     return days
76 |
77 |
78 | def create_signup_df(data_path):
79 |     """
80 |     Read signup data in from path, clean, enrich and return
81 |     :param data_path: path to signup data
82 |     :type data_path: str
83 |     :return: Pandas data frame, containing original and enriched data
84 |     :rtype: pd.DataFrame
85 |     """
86 |     logging.info('Beginning import of signup data, from path: %s' % (data_path))
87 |
88 |     # Read in data
89 |     logging.info('Reading in data')
90 |     signup_df = pd.read_csv(data_path)
91 |     isinstance(signup_df, pd.DataFrame)
92 |     logging.info('Data shape: %s rows, %s columns' % signup_df.shape)
93 |     logging.info('Data columns: %s' % (list(signup_df.columns)))
94 |
95 |     # Create drove field
96 |     logging.info('Creating drove field')
97 |     signup_df['drove'] = signup_df['first_completed_date'].notnull()
98 |
99 |     # Data cleaning
100 |     logging.info('Cleaning data')
101 |     # Replace invalid vehicle_year value w/ NaN
102 |     signup_df['vehicle_year'] = signup_df['vehicle_year'].replace(to_replace=[0], value=np.nan)
103 |
104 |     # Convert data types
105 |     logging.info('Converting data types')
106 |     signup_df['signup_date'] = pd.to_datetime(signup_df['signup_date'], format='%m/%d/%y')
107 |     signup_df['bgc_date'] = pd.to_datetime(signup_df['bgc_date'], format='%m/%d/%y')
108 |     signup_df['vehicle_added_date'] = pd.to_datetime(signup_df['vehicle_added_date'], format='%m/%d/%y')
109 |
110 |     # Create simple enriched features
111 |     logging.info('Enriching data with simple derived features')
112 |     signup_df['bgc_known'] = signup_df['bgc_date'].notnull()
113 | 
signup_df['vehicle_inspection_known'] = signup_df['vehicle_added_date'].notnull()
114 |     signup_df['signup_os_known'] = signup_df['signup_os'].notnull()
115 |     signup_df['vehicle_make_known'] = signup_df['vehicle_make'].notnull()
116 |
117 |     # Create time series based enriched features
118 |     logging.info('Enriching data with timeseries derived features')
119 |     signup_df['signup_to_bgc'] = (signup_df['bgc_date'] - signup_df['signup_date']).apply(extract_days)
120 |     signup_df['signup_to_vehicle_add'] = (signup_df['vehicle_added_date'] - signup_df['signup_date']).apply(extract_days)
121 |
122 |     # Clip at 0 to prevent events before signups
123 |     signup_df['signup_to_bgc'] = signup_df['signup_to_bgc'].clip(lower=0)
124 |     signup_df['signup_to_vehicle_add'] = signup_df['signup_to_vehicle_add'].clip(lower=0)
125 |
126 |     # For weekday() description, see
127 |     # http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DatetimeIndex.weekday.html
128 |     signup_df['signup_weekday'] = signup_df['signup_date'].apply(lambda x: x.weekday() <= 4)
129 |
130 |     # Re-centering
131 |     signup_df['vehicle_year_past_2000'] = signup_df['vehicle_year'] - 2000
132 |
133 |     # Return enriched DF
134 |     logging.debug('Enriched data frame description: \n%s' % signup_df.describe())
135 |     return signup_df
136 |
137 |
138 | def statsmodel_prep_df(input_df):
139 |     """
140 |     Prepare the input data frame to be consumed by statsmodels.
This process includes:
141 |      - Zero filling columns where NaN logically means 0
142 |      - Smarter null filling
143 |     :param input_df: Raw data frame to be prepared
144 |     :type input_df: pd.DataFrame
145 |     :return: Prepared data frame
146 |     :rtype: pd.DataFrame
147 |     """
148 |
149 |     prepped_df = input_df.copy(deep=True)
150 |     isinstance(prepped_df, pd.DataFrame)
151 |
152 |     # Appropriately deal w/ NaN values
153 |     zero_fill_list = ['bgc_known', 'signup_os_known', 'vehicle_make_known', 'signup_weekday']
154 |     for col in zero_fill_list:
155 |         num_null = prepped_df[col].isnull().sum()
156 |         if num_null > 0:
157 |             logging.warning('%s null values found in column: %s' % (num_null, col))
158 |         prepped_df[col] = prepped_df[col].fillna(0)
159 |
160 |     isinstance(prepped_df, pd.DataFrame)
161 |
162 |     # Smarter null filling
163 |     prepped_df['signup_channel'] = prepped_df['signup_channel'].fillna('not_known')
164 |     prepped_df['city_name'] = prepped_df['city_name'].fillna('not_known')
165 |
166 |     # Manual dummy'ing
167 |     prepped_df['city_Berton'] = prepped_df['city_name'] == 'Berton'
168 |     prepped_df['signup_channel_organic'] = prepped_df['signup_channel'] == 'Organic'
169 |     prepped_df['signup_channel_referral'] = prepped_df['signup_channel'] == 'Referral'
170 |
171 |     # Remove instances missing drove field (copy to avoid chained-assignment warnings below)
172 |     prepped_df = prepped_df[prepped_df['drove'].notnull()].copy()
173 |
174 |     # Convert drove to int
175 |     prepped_df['drove'] = prepped_df['drove'].astype(int)
176 |
177 |     # Return formatted DF
178 |     logging.debug('Statsmodel prepped df: \n%s' % prepped_df.describe())
179 |     return prepped_df
180 |
181 |
182 | def run_statsmodels_models(train, test, model_description):
183 |     """
184 |     Run logistic regression model to predict whether a signed up driver ever actually drove.
185 |     :param train, test: Training and holdout data frames (pd.DataFrame) prepared for statsmodels regression
186 |     :param model_description: Patsy formula describing the model (str)
187 |     :return: AUC for model generated
188 |     :rtype: float
189 |     """
190 |     # Fit model on the training split, score it on the holdout
191 |     # Use dmatrices to format data
192 |     logging.info('Running model w/ description: %s' % model_description)
193 |     logging.debug('Train df: \n%s' % train.describe())
194 |     logging.debug('Test df: \n%s' % test.describe())
195 |     y_train, X_train = dmatrices(model_description, data=train, return_type='dataframe', NA_action='drop')
196 |     y_test, X_test = dmatrices(model_description, data=test, return_type='dataframe', NA_action='drop')
197 |
198 |     # Create, fit model
199 |     mod = sm.Logit(endog=y_train, exog=X_train)
200 |     res = mod.fit(method='bfgs', maxiter=100)
201 |
202 |     # Output model summary
203 |     print train['city_name'].value_counts()
204 |     print train['signup_channel'].value_counts()
205 |     print res.summary()
206 |
207 |     # Create, output AUC
208 |     predicted = res.predict(X_test)
209 |     auc = roc_auc_score(y_true=y_test, y_score=predicted)
210 |     print 'AUC for 20%% holdout: %s' % auc
211 |
212 |     # Return AUC for model generated
213 |     return auc
214 |
215 |
216 |
217 | # Main section
218 | if __name__ == '__main__':
219 |     main()
220 |

--------------------------------------------------------------------------------
/docs/instructions/DSInterviewChallengeV_2_4.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bjherger/Uber-DS-Challenge/7d904c62849a85b93707b075b2168240df1534ba/docs/instructions/DSInterviewChallengeV_2_4.pdf

--------------------------------------------------------------------------------
/docs/modeling_notes.txt:
--------------------------------------------------------------------------------
1 |
2 | Visual inspection in Tableau:
3 | bgc date - signup date: conversion percentage correlates negatively with difference between fields
4 | vehicle added date - signup date: 
conversion percentage correlates negatively with difference between fields
5 | vehicle make, model, year: do not appear to make a difference in conversion percentage
6 | vehicle year: More vehicles are 5+ years old than expected
7 |
8 | Data cleaning / enrichment:
9 | vehicle year: contains 0's. Replacing w/ NaN
10 | signup_to_bgc: some bgc seem to occur before signup. Setting min value of 0
11 | signup_to_vehicle_add: some vehicle_add seem to occur before signup. Setting min value of 0
12 |
13 | Modeling:
14 | No multicollinearity issues apparent
15 | No heteroskedasticity issues apparent
16 | background check is indistinguishable from noise
17 | vehicle make is indistinguishable from noise
18 | vehicle year statistically significant, recentering around 2000
19 |
20 | vehicle inspect and vehicle year both statistically significant. Possible multicollinearity? Is year gathered at
21 | inspection?
22 |

--------------------------------------------------------------------------------
/docs/responses/q3_sample_output.log:
--------------------------------------------------------------------------------
1 | /Users/bjherger/anaconda2/envs/uber_env/bin/python /Users/bjherger/Documents/Uber-DS-Challenge/bin/p3.py
2 | INFO:root:Beginning import of signup data, from path: ../data/input/ds_challenge_v2_1_data.csv
3 | INFO:root:Reading in data
4 | INFO:root:Data shape: 54681 rows, 11 columns
5 | INFO:root:Data columns: ['id', 'city_name', 'signup_os', 'signup_channel', 'signup_date', 'bgc_date', 'vehicle_added_date', 'vehicle_make', 'vehicle_model', 'vehicle_year', 'first_completed_date']
6 | INFO:root:Creating drove field
7 | INFO:root:Cleaning data
8 | INFO:root:Converting data types
9 | INFO:root:Enriching data with simple derived features
10 | INFO:root:Enriching data with timeseries derived features
11 | /Users/bjherger/anaconda2/envs/uber_env/lib/python2.7/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
12 | 
RuntimeWarning)
13 | DEBUG:root:Enriched data frame description:
14 |                  id  vehicle_year  signup_to_bgc  signup_to_vehicle_add  \
15 | count  54681.000000  13219.000000   32896.000000           12879.000000
16 | mean   27341.000000   2011.176413      10.046541               0.127106
17 | std    15785.189372      4.135149      10.519617               1.341945
18 | min        1.000000   1995.000000       0.000000               0.000000
19 | 25%    13671.000000           NaN            NaN                    NaN
20 | 50%    27341.000000           NaN            NaN                    NaN
21 | 75%    41011.000000           NaN            NaN                    NaN
22 | max    54681.000000   2017.000000      69.000000              30.000000
23 |
24 |        vehicle_year_past_2000
25 | count            13219.000000
26 | mean                11.176413
27 | std                  4.135149
28 | min                 -5.000000
29 | 25%                       NaN
30 | 50%                       NaN
31 | 75%                       NaN
32 | max                 17.000000
33 | Q1: Percentage of drivers w/ completed ride: 0.112232768238
34 | DEBUG:root:Statsmodel prepped df:
35 |                  id  vehicle_year         drove  signup_to_bgc  \
36 | count  54681.000000  13219.000000  54681.000000   32896.000000
37 | mean   27341.000000   2011.176413      0.112233      10.046541
38 | std    15785.189372      4.135149      0.315656      10.519617
39 | min        1.000000   1995.000000      0.000000       0.000000
40 | 25%    13671.000000           NaN      0.000000            NaN
41 | 50%    27341.000000           NaN      0.000000            NaN
42 | 75%    41011.000000           NaN      0.000000            NaN
43 | max    54681.000000   2017.000000      1.000000      69.000000
44 |
45 |        signup_to_vehicle_add  vehicle_year_past_2000
46 | count           12879.000000            13219.000000
47 | mean                0.127106               11.176413
48 | std                 1.341945                4.135149
49 | min                 0.000000               -5.000000
50 | 25%                      NaN                     NaN
51 | 50%                      NaN                     NaN
52 | 75%                      NaN                     NaN
53 | max                30.000000               17.000000
54 | INFO:root:Running model w/ description: drove ~ signup_channel_referral + city_Berton + signup_weekday + vehicle_inspection_known
55 | DEBUG:root:Train df:
56 |                  id  vehicle_year         drove  signup_to_bgc  \
57 | count  43745.000000  10570.000000  43745.000000   26318.000000
58 | mean   27352.690708   2011.212204      0.112996      10.003268
59 | std    15795.139353      4.104407      0.316591      10.497880
60 | min        1.000000   1995.000000      0.000000       0.000000
61 | 25%    13643.000000           NaN      0.000000            NaN
62 | 50%    27378.000000           NaN      0.000000            NaN
63 | 75%    41055.000000           NaN      0.000000            NaN
64 | max 
54681.000000   2017.000000      1.000000      69.000000
65 |
66 |        signup_to_vehicle_add  vehicle_year_past_2000
67 | count           10309.000000            10570.000000
68 | mean                0.126879               11.212204
69 | std                 1.339746                4.104407
70 | min                 0.000000               -5.000000
71 | 25%                      NaN                     NaN
72 | 50%                      NaN                     NaN
73 | 75%                      NaN                     NaN
74 | max                30.000000               17.000000
75 | DEBUG:root:Test df:
76 |                  id  vehicle_year         drove  signup_to_bgc  \
77 | count  10936.000000   2649.000000  10936.000000    6578.000000
78 | mean   27294.236101   2011.033598      0.109181      10.219672
79 | std    15745.959694      4.253398      0.311880      10.605176
80 | min        3.000000   1996.000000      0.000000       0.000000
81 | 25%    13731.750000           NaN      0.000000            NaN
82 | 50%    27193.000000           NaN      0.000000            NaN
83 | 75%    40854.750000           NaN      0.000000            NaN
84 | max    54676.000000   2017.000000      1.000000      57.000000
85 |
86 |        signup_to_vehicle_add  vehicle_year_past_2000
87 | count            2570.000000             2649.000000
88 | mean                0.128016               11.033598
89 | std                 1.350992                4.253398
90 | min                 0.000000               -4.000000
91 | 25%                      NaN                     NaN
92 | 50%                      NaN                     NaN
93 | 75%                      NaN                     NaN
94 | max                28.000000               17.000000
95 | Optimization terminated successfully.
96 |          Current function value: 0.192102
97 |          Iterations: 34
98 |          Function evaluations: 36
99 |          Gradient evaluations: 36
100 | Strark     23663
101 | Berton     16061
102 | Wrouver     4021
103 | Name: city_name, dtype: int64
104 | Paid        19135
105 | Referral    13915
106 | Organic     10695
107 | Name: signup_channel, dtype: int64
108 | DEBUG:root:Enter SimpleTable.data2rows.
109 | DEBUG:root:Exit SimpleTable.data2rows.
110 | DEBUG:root:Enter SimpleTable.data2rows.
111 | DEBUG:root:Exit SimpleTable.data2rows.
112 | DEBUG:root:Enter SimpleTable.data2rows.
113 | DEBUG:root:Exit SimpleTable.data2rows.
114 |                            Logit Regression Results
115 | ==============================================================================
116 | Dep. Variable:                  drove   No. 
Observations:                43745
117 | Model:                          Logit   Df Residuals:                    43740
118 | Method:                           MLE   Df Model:                            4
119 | Date:                Mon, 01 Aug 2016   Pseudo R-squ.:                  0.4554
120 | Time:                        22:56:30   Log-Likelihood:                -8403.5
121 | converged:                       True   LL-Null:                       -15430.
122 |                                         LLR p-value:                     0.000
123 | ====================================================================================================
124 |                                        coef    std err          z      P>|z|      [95.0% Conf. Int.]
125 | ----------------------------------------------------------------------------------------------------
126 | Intercept                           -5.3724      0.074    -72.934      0.000        -5.517    -5.228
127 | signup_channel_referral[T.True]      0.4938      0.038     12.906      0.000         0.419     0.569
128 | city_Berton[T.True]                  0.1165      0.039      2.966      0.003         0.040     0.193
129 | signup_weekday[T.True]               0.3623      0.042      8.619      0.000         0.280     0.445
130 | vehicle_inspection_known[T.True]     4.6163      0.073     63.663      0.000         4.474     4.758
131 | ====================================================================================================
132 | AUC for 20% holdout: 0.922580207546
133 | INFO:root:Running model w/ description: drove ~ signup_channel_referral + city_Berton + signup_to_vehicle_add + signup_to_bgc
134 | DEBUG:root:Train df:
135 |                  id  vehicle_year         drove  signup_to_bgc  \
136 | count  43745.000000  10570.000000  43745.000000   26318.000000
137 | mean   27352.690708   2011.212204      0.112996      10.003268
138 | std    15795.139353      4.104407      0.316591      10.497880
139 | min        1.000000   1995.000000      0.000000       0.000000
140 | 25%    13643.000000           NaN      0.000000            NaN
141 | 50%    27378.000000           NaN      0.000000            NaN
142 | 75%    41055.000000           NaN      0.000000            NaN
143 | max    54681.000000   2017.000000      1.000000      69.000000
144 |
145 |        signup_to_vehicle_add  vehicle_year_past_2000
146 | count           10309.000000            10570.000000
147 | mean                0.126879               11.212204
148 | std                 1.339746                4.104407
149 | min                 0.000000               -5.000000
150 | 25%                      NaN                     NaN
151 | 50%                      NaN                     NaN
152 | 75%                      NaN                     NaN
153 | max                30.000000               17.000000
154 | DEBUG:root:Test df:
155 |                  id  vehicle_year         drove  signup_to_bgc  \
156 | count  10936.000000   2649.000000  10936.000000    6578.000000
157 | 
mean   27294.236101   2011.033598      0.109181      10.219672
158 | std    15745.959694      4.253398      0.311880      10.605176
159 | min        3.000000   1996.000000      0.000000       0.000000
160 | 25%    13731.750000           NaN      0.000000            NaN
161 | 50%    27193.000000           NaN      0.000000            NaN
162 | 75%    40854.750000           NaN      0.000000            NaN
163 | max    54676.000000   2017.000000      1.000000      57.000000
164 |
165 |        signup_to_vehicle_add  vehicle_year_past_2000
166 | count            2570.000000             2649.000000
167 | mean                0.128016               11.033598
168 | std                 1.350992                4.253398
169 | min                 0.000000               -4.000000
170 | 25%                      NaN                     NaN
171 | 50%                      NaN                     NaN
172 | 75%                      NaN                     NaN
173 | max                28.000000               17.000000
174 | Optimization terminated successfully.
175 |          Current function value: 0.547824
176 |          Iterations: 24
177 |          Function evaluations: 26
178 |          Gradient evaluations: 26
179 | Strark     23663
180 | Berton     16061
181 | Wrouver     4021
182 | Name: city_name, dtype: int64
183 | Paid        19135
184 | Referral    13915
185 | Organic     10695
186 | Name: signup_channel, dtype: int64
187 | DEBUG:root:Enter SimpleTable.data2rows.
188 | DEBUG:root:Exit SimpleTable.data2rows.
189 | DEBUG:root:Enter SimpleTable.data2rows.
190 | DEBUG:root:Exit SimpleTable.data2rows.
191 | DEBUG:root:Enter SimpleTable.data2rows.
192 | DEBUG:root:Exit SimpleTable.data2rows.
193 |                            Logit Regression Results
194 | ==============================================================================
195 | Dep. Variable:                  drove   No. Observations:                10309
196 | Model:                          Logit   Df Residuals:                    10304
197 | Method:                           MLE   Df Model:                            4
198 | Date:                Mon, 01 Aug 2016   Pseudo R-squ.:                  0.2057
199 | Time:                        22:56:30   Log-Likelihood:                -5647.5
200 | converged:                       True   LL-Null:                       -7110.3
201 |                                         LLR p-value:                     0.000
202 | ===================================================================================================
203 |                                       coef    std err          z      P>|z|      [95.0% Conf. Int.]
204 | ---------------------------------------------------------------------------------------------------
205 | Intercept                           0.6030      0.041     14.599      0.000         0.522     0.684
206 | signup_channel_referral[T.True]     0.4867      0.046     10.673      0.000         0.397     0.576
207 | city_Berton[T.True]                 0.0860      0.047      1.829      0.067        -0.006     0.178
208 | signup_to_vehicle_add               0.1751      0.019      9.169      0.000         0.138     0.213
209 | signup_to_bgc                      -0.1758      0.005    -38.530      0.000        -0.185    -0.167
210 | ===================================================================================================
211 | AUC for 20% holdout: 0.794798031562
212 |
213 | Process finished with exit code 0
214 |

--------------------------------------------------------------------------------
/docs/responses/responses.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bjherger/Uber-DS-Challenge/7d904c62849a85b93707b075b2168240df1534ba/docs/responses/responses.docx

--------------------------------------------------------------------------------
/docs/responses/responses.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bjherger/Uber-DS-Challenge/7d904c62849a85b93707b075b2168240df1534ba/docs/responses/responses.pdf

--------------------------------------------------------------------------------
/environment.yml:
--------------------------------------------------------------------------------
1 | name: uber_env
2 | dependencies:
3 | - mkl=11.3.3=0
4 | - numpy=1.11.1=py27_0
5 | - openssl=1.0.2h=1
6 | - pandas=0.18.1=np111py27_0
7 | - patsy=0.4.1=py27_0
8 | - pip=8.1.2=py27_0
9 | - python=2.7.12=1
10 | - python-dateutil=2.5.3=py27_0
11 | - pytz=2016.6.1=py27_0
12 | - readline=6.2=2
13 | - scikit-learn=0.17.1=np111py27_2
14 | - scipy=0.18.0=np111py27_0
15 | - setuptools=23.0.0=py27_0
16 | - six=1.10.0=py27_0
17 | - sqlite=3.13.0=0
18 | - statsmodels=0.6.1=np111py27_1
19 | - tk=8.5.18=0
20 | - wheel=0.29.0=py27_0
21 | - 
zlib=1.2.8=3
22 | prefix: /Users/bjherger/anaconda2/envs/uber_env
23 |
24 |

--------------------------------------------------------------------------------
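
The `AUC for 20% holdout` figures in `docs/responses/q3_sample_output.log` come from `roc_auc_score` in `bin/p3.py`. As a rough illustration of what that number measures, the sketch below computes AUC as the share of (driver, non-driver) pairs in which the driver receives the higher predicted score, counting ties as half. It is a minimal standard-library sketch, not scikit-learn's implementation; `pairwise_auc`, `toy_labels` and `toy_scores` are hypothetical names and values introduced here for illustration only, and are not taken from the challenge data.

```python
from __future__ import division  # keeps true division under Python 2.7 as well


def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError('AUC needs at least one positive and one negative label')

    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0   # positive ranked above negative
            elif pos_score == neg_score:
                wins += 0.5   # tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical toy labels (1 = drove) and predicted scores, for illustration only
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)  # 3 of 4 pairs ranked correctly -> 0.75
```

The pairwise loop is quadratic in the number of rows, so for a holdout the size of the one in the sample log (10,936 rows) `roc_auc_score` remains the practical choice; the sketch is only meant to make the metric's meaning concrete.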