├── Ads Analysis ├── Ads_analysis.Rmd ├── Ads_analysis.md └── Ads_analysis_files │ ├── figure-html │ ├── unnamed-chunk-10-1.png │ ├── unnamed-chunk-11-1.png │ └── unnamed-chunk-14-1.png │ └── figure-markdown_github │ ├── unnamed-chunk-10-1.png │ ├── unnamed-chunk-11-1.png │ └── unnamed-chunk-14-1.png ├── Clustering Grocery Items ├── Clustering_Grocery_items.Rmd ├── Clustering_Grocery_items.md └── Clustering_Grocery_items_files │ └── figure-markdown_github │ └── unnamed-chunk-14-1.png ├── Diversity in Workplace ├── Diversity in the Workplace.Rmd ├── Diversity_in_the_Workplace.md └── Diversity_in_the_Workplace_files │ └── figure-gfm │ ├── unnamed-chunk-10-1.png │ ├── unnamed-chunk-11-1.png │ ├── unnamed-chunk-12-1.png │ ├── unnamed-chunk-13-1.png │ ├── unnamed-chunk-13-2.png │ ├── unnamed-chunk-13-3.png │ ├── unnamed-chunk-13-4.png │ ├── unnamed-chunk-13-5.png │ ├── unnamed-chunk-13-6.png │ ├── unnamed-chunk-13-7.png │ ├── unnamed-chunk-13-8.png │ ├── unnamed-chunk-5-1.png │ ├── unnamed-chunk-6-1.png │ ├── unnamed-chunk-7-1.png │ ├── unnamed-chunk-8-1.png │ └── unnamed-chunk-9-1.png ├── Engagement Test ├── Engagement Test.Rmd ├── Engagement_Test.md └── Engagement_Test_files │ └── figure-gfm │ ├── unnamed-chunk-11-1.png │ └── unnamed-chunk-13-1.png ├── Funnel Analysis ├── Funnel_analysis.Rmd ├── Funnel_analysis.md └── Funnel_analysis_files │ └── figure-gfm │ ├── unnamed-chunk-12-1.png │ ├── unnamed-chunk-14-1.png │ ├── unnamed-chunk-15-1.png │ ├── unnamed-chunk-15-2.png │ ├── unnamed-chunk-15-3.png │ ├── unnamed-chunk-15-4.png │ ├── unnamed-chunk-15-5.png │ └── unnamed-chunk-9-1.png ├── JSON City Similarities ├── JSON City Similarities.Rmd ├── JSON_City_Similarities.md ├── JSON_City_Similarities_files │ └── figure-gfm │ │ ├── unnamed-chunk-10-1.png │ │ ├── unnamed-chunk-4-1.png │ │ ├── unnamed-chunk-4-2.png │ │ ├── unnamed-chunk-4-3.png │ │ ├── unnamed-chunk-4-4.png │ │ ├── unnamed-chunk-4-5.png │ │ ├── unnamed-chunk-4-6.png │ │ └── unnamed-chunk-4-7.png └── JSON_City_Similarity_files │ └── figure-gfm │ ├── unnamed-chunk-10-1.png │ ├── unnamed-chunk-4-1.png │ ├── unnamed-chunk-4-2.png │ ├── unnamed-chunk-4-3.png │ ├── unnamed-chunk-4-4.png │ ├── unnamed-chunk-4-5.png │ ├── unnamed-chunk-4-6.png │ └── unnamed-chunk-4-7.png ├── Loan granting ├── Loan granting.Rmd ├── Loan_granting.md └── Loan_granting_files │ └── figure-gfm │ ├── unnamed-chunk-10-1.png │ ├── unnamed-chunk-11-1.png │ ├── unnamed-chunk-12-1.png │ ├── unnamed-chunk-12-10.png │ ├── unnamed-chunk-12-11.png │ ├── unnamed-chunk-12-12.png │ ├── unnamed-chunk-12-2.png │ ├── unnamed-chunk-12-3.png │ ├── unnamed-chunk-12-4.png │ ├── unnamed-chunk-12-5.png │ ├── unnamed-chunk-12-6.png │ ├── unnamed-chunk-12-7.png │ ├── unnamed-chunk-12-8.png │ ├── unnamed-chunk-12-9.png │ ├── unnamed-chunk-16-1.png │ ├── unnamed-chunk-16-2.png │ ├── unnamed-chunk-16-3.png │ ├── unnamed-chunk-16-4.png │ ├── unnamed-chunk-8-1.png │ ├── unnamed-chunk-9-1.png │ └── unnamed-chunk-9-2.png ├── Online Video Challenge ├── Online Video Challenge.Rmd ├── Online_Video_Challenge.md └── Online_Video_Challenge_files │ └── figure-gfm │ ├── unnamed-chunk-10-1.png │ ├── unnamed-chunk-11-1.png │ ├── unnamed-chunk-14-1.png │ ├── unnamed-chunk-14-2.png │ ├── unnamed-chunk-15-1.png │ ├── unnamed-chunk-16-1.png │ ├── unnamed-chunk-17-1.png │ ├── unnamed-chunk-18-1.png │ ├── unnamed-chunk-19-1.png │ ├── unnamed-chunk-20-1.png │ ├── unnamed-chunk-20-2.png │ ├── unnamed-chunk-20-3.png │ ├── unnamed-chunk-20-4.png │ ├── unnamed-chunk-20-5.png │ ├── unnamed-chunk-20-6.png │ ├── 
unnamed-chunk-20-7.png │ ├── unnamed-chunk-21-1.png │ ├── unnamed-chunk-7-1.png │ └── unnamed-chunk-8-1.png ├── README.md ├── Song Challenge ├── Song_Challenge.Rmd └── Song_Challenge.md ├── URL Parsing Challenge ├── URL Parsing Challenge.Rmd └── URL_Parsing_Challenge.md └── User Referral Program ├── User Referral Program.Rmd ├── User_Referral_Program.md └── User_Referral_Program_files └── figure-gfm ├── unnamed-chunk-4-1.png ├── unnamed-chunk-5-1.png └── unnamed-chunk-9-1.png /Ads Analysis/Ads_analysis.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Ads Analysis" 3 | author: "Siddhartha Jetti" 4 | date: "July 17, 2019" 5 | output: rmarkdown::github_document 6 | --- 7 | 8 | # Goal 9 | 10 | Maybe the first industry to heavily rely on data science was the online ads industry. Data Science is used to choose which ads to show, how much to pay, optimize the ad text and the position as well as in countless of other related applications. 11 | 12 | Optimizing ads is one of the most intellectually challenging jobs a data scientist can do. It is a really complex problem given the huge (really really huge) size of the datasets as well as number of features that can be used. 13 | Moreover, companies often spend huge amounts of money in ads and a small ad optimization improvement can be worth millions of dollars for the company. 14 | 15 | The goal of this project is to look at a few ad campaigns and analyze their current performance as well as predict their future performance. 16 | 17 | 18 | # Challenge Description 19 | 20 | Company XYZ is a food delivery company. Like pretty much any other site, in order to get customers, they have been relying significantly on online ads, such as those you see on Google or Facebook. 21 | 22 | At the moment, they are running 40 different ad campaigns and want you to help them understand their performance. 23 | 24 | Specifically, you are asked to: 25 | 26 | 1) If you had to identify the 5 best ad groups, which ones would be? Which metric did you choose to identify the best ones? Why? Explain the pros of your metric as well as the possible cons. From a business perspective, choosing that metric implies that you are focusing on what? 27 | 28 | 29 | 2) For each group, predict how many ads will be shown on Dec, 15 (assume each ad group keeps following its trend). 30 | 31 | 32 | 3) Cluster ads into 3 groups: the ones whose avg_cost_per_click is going up, the ones whose avg_cost_per_click is flat and the ones whose avg_cost_per_click is going down. 33 | 34 | 35 | ## Data 36 | 37 | We have 1 table downloadable by clicking here. 38 | 39 | The table is: 40 | 41 | ad_table - aggregate information about ads 42 | 43 | ### Columns: 44 | 45 | * date : all data are aggregated by date 46 | * shown : the number of ads shown on a given day all over the web. Impressions are free. That is, companies pay only if a user clicks on the ad, not to show it 47 | * clicked : the number of clicks on the ads. This is what companies pay for. By clicking on the ad, the user is brought to the site 48 | converted : the number of conversions on the site coming from ads. To be counted, a conversion as to happen on the same day as the ad click. 49 | * avg_cost_per_click : on an average, how much it cost each of those clicks 50 | * total_revenue : how much revenue came from the conversions 51 | * ad : we have several different ad versions with different text. 
This shows which ad group we are considering 52 | 53 | 54 | # Problem Setup 55 | 56 | ```{r} 57 | # Load required libraries 58 | library(dplyr) 59 | library(ggplot2) 60 | library(lubridate) 61 | library(randomForest) 62 | 63 | # Read in the input data into a dataframe 64 | ads <- read.csv("ad_table.csv", stringsAsFactors = F) 65 | ``` 66 | 67 | # Data Exploration 68 | 69 | Explore the ads dataset 70 | ```{r} 71 | # Transform variables in ads dataset 72 | ads <- ads %>% 73 | mutate(date = as.Date(date)) %>% 74 | arrange(ad, date) 75 | 76 | # Check data types of each of the columns 77 | summary(ads) 78 | 79 | # Take a peek at data 80 | head(ads) 81 | ``` 82 | 83 | Removing rows with shown = 0 or revenue < 0 84 | 85 | ```{r} 86 | ads_cleaned <- ads %>% 87 | filter(shown > 0, clicked > 0, total_revenue >= 0) 88 | ``` 89 | 90 | Now, checking the summary of dataset 91 | 92 | ```{r} 93 | summary(ads_cleaned) 94 | ``` 95 | 96 | Check for missing values in the data. 97 | ```{r} 98 | # count of missing values by column in views dataset 99 | colSums(is.na(ads_cleaned)) 100 | ``` 101 | No missing values exist anywhere in the data. 102 | 103 | Check if data exists for all the ads for all the dates. 104 | ```{r} 105 | table(ads_cleaned$date) 106 | ``` 107 | 108 | ```{r} 109 | table(ads_cleaned$ad) 110 | ``` 111 | 112 | Overall the data looks good. 113 | 114 | # Question 1: 115 | 116 | Here the goal is to choose the best 5 ads based on the provided data. 117 | I plan on using average return on advertising budget as a metric to choose the top ad campaigns. 118 | 119 | ```{r} 120 | # calculate average daily return for every ad 121 | best_ads_avg_revenue <- ads_cleaned %>% 122 | group_by(ad) %>% 123 | summarise(avg_return = sum(total_revenue) / sum(clicked * avg_cost_per_click)) %>% 124 | arrange(desc(avg_return)) 125 | 126 | head(best_ads_avg_revenue, 5) 127 | ``` 128 | 129 | The problem with using average return as a metric is, it does not identify the ads with low average return but are consistently trending up. Arguably, the ads that are trending up are equally important to marketing teams, if not more, than the ones with high average return and trending down. So, it makes sense to look both the average return and the trend to know the complete story. 130 | 131 | ```{r} 132 | # Function to extract trend in the data 133 | slope_trendline <- function(y, x) { 134 | trendline <- lm(formula = y ~ x) 135 | return(trendline$coefficients[2]) 136 | } 137 | 138 | # Function to compute the p-valuye of x coefficient 139 | slope_p_value <- function(y, x) { 140 | trendline <- lm(formula = y ~ x) 141 | df <- data.frame(summary(trendline)$coefficients) 142 | return(df[2,4]) 143 | } 144 | 145 | # Estimate the trend over time for every ad 146 | best_ads <- ads_cleaned %>% 147 | group_by(ad) %>% 148 | mutate(date = as.numeric(date), 149 | day_return = total_revenue / (clicked * avg_cost_per_click)) %>% 150 | summarise(trend = slope_trendline(day_return, date)) %>% 151 | inner_join(best_ads_avg_revenue, by = "ad") %>% 152 | arrange(desc(avg_return)) 153 | 154 | head(best_ads, 10) 155 | ``` 156 | 157 | All the top 5 ad campaigns by average return are making money (return > 1). 158 | Among them, The ad groups 31, 16 and 14 are having a positive trend. While groups 2 and 16 have high average return but are trending down. 159 | 160 | # Question 2: 161 | 162 | Here the goal is to predict how many times the ad will be shown on a future date. 
163 | 164 | The simplest method is to fit a straight line that is closest to the data points for each ad group and use it to predict future impressions. 165 | 166 | Let's pick a few arbitrary ad groups and visualize the time series of impressions. 167 | ```{r} 168 | ads_cleaned %>% 169 | filter(ad == "ad_group_31") %>% 170 | 171 | ggplot(aes(date, shown)) + 172 | geom_point() + 173 | geom_smooth(method = "lm", se = FALSE) 174 | ``` 175 | 176 | ```{r} 177 | ads_cleaned %>% 178 | filter(ad == "ad_group_38") %>% 179 | 180 | ggplot(aes(date, shown)) + 181 | geom_point() + 182 | geom_smooth(method = "lm", se = FALSE) 183 | ``` 184 | 185 | Fitting a line to the time series of all the ad campaigns 186 | 187 | ```{r} 188 | 189 | unique_ads <- unique(ads_cleaned$ad) 190 | new <- data.frame(x = as.numeric(as.Date("2015-12-15"))) 191 | prediction <- data.frame(ad = c(), prediction = c()) 192 | 193 | for(i in unique_ads) { 194 | x <- as.numeric(ads_cleaned$date[ads_cleaned$ad == i]) 195 | y <- ads_cleaned$shown[ads_cleaned$ad == i] 196 | predicted <- round(predict(lm(y ~ x), newdata = new), digits = 0) 197 | predicted_df <- data.frame(ad = i, prediction = predicted, stringsAsFactors = F) 198 | prediction <- bind_rows(prediction, predicted_df) 199 | } 200 | 201 | head(prediction) 202 | ``` 203 | 204 | # Question 3: 205 | 206 | First fit a linear regression line to extract the trend from the data. 207 | 208 | ```{r} 209 | # Function to normalize the variables 210 | normalize <- function(x) { 211 | return((x - min(x)) / (max(x) - min(x))) 212 | } 213 | 214 | # Normalize the variables, then group by ad and extract the slope 215 | ads_cpc_summary <- ads_cleaned %>% 216 | mutate(date = as.numeric(date), 217 | cpc_norm = normalize(avg_cost_per_click), 218 | date_norm = normalize(date)) %>% 219 | group_by(ad) %>% 220 | summarise(cpc_slope = slope_trendline(cpc_norm, date_norm), 221 | cpc_slope_p_value = slope_p_value(cpc_norm, date_norm)) 222 | 223 | # Take a look at the data 224 | head(ads_cpc_summary) 225 | ``` 226 | 227 | 228 | The distribution of slopes of the cost-per-click trendlines 229 | ```{r} 230 | quantile(ads_cpc_summary$cpc_slope, probs = seq(0, 1, by = 0.05)) 231 | ads_cpc_summary %>% 232 | ggplot() + 233 | geom_histogram(bins = 30, aes(x = cpc_slope)) 234 | ``` 235 | 236 | Trends are extracted by fitting a linear regression line to the time series of average cost per click. The following rules can be used to classify ads by whether average cost per click is going up, staying flat or going down. 237 | 238 | * If the coefficient of the x-term is statistically significant and the coefficient > 0, then average cost per click is going up. 239 | * If the coefficient is statistically significant and the coefficient < 0, then average cost per click is going down. 240 | * If the coefficient is NOT statistically significant, then average cost per click is flat. 241 | 242 | The criterion for statistical significance should not be simply p < 0.05, because we are effectively running 40 different tests. To reduce the risk of incorrectly rejecting a null hypothesis due to multiple comparisons, the significance threshold of each test needs to be adjusted using the Bonferroni correction (0.05 / number of comparisons).
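As a side note (not part of the original analysis), base R's `p.adjust()` applies the Bonferroni correction directly by scaling the p-values, which is equivalent to comparing the raw p-values against 0.05 / number of tests. A minimal sketch, assuming the `ads_cpc_summary` table built above:

```{r}
# Illustrative only: Bonferroni-adjust the slope p-values and count how many
# remain significant at the 5% level.
adjusted_p <- p.adjust(ads_cpc_summary$cpc_slope_p_value, method = "bonferroni")
table(adjusted_p < 0.05)
```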
243 | 244 | ```{r} 245 | # Classifying ads based on the stated rules 246 | ads_cpc_summary <- ads_cpc_summary %>% 247 | mutate(cpc_category = case_when( 248 | cpc_slope_p_value < (0.05/40) & cpc_slope > 0 ~ "Going up", 249 | cpc_slope_p_value < (0.05/40) & cpc_slope < 0 ~ "Going down", 250 | TRUE ~ "Flat") 251 | ) 252 | 253 | # Frequency of different video categories 254 | table(ads_cpc_summary$cpc_category) 255 | ``` 256 | 257 | Employing Boniferoni correction is classifying all the campaigns as average cost per click remaining Flat. The reason being Boniferoni correction generally imposes a very stringent condition for significance. 258 | -------------------------------------------------------------------------------- /Ads Analysis/Ads_analysis_files/figure-html/unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Ads Analysis/Ads_analysis_files/figure-html/unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /Ads Analysis/Ads_analysis_files/figure-html/unnamed-chunk-11-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Ads Analysis/Ads_analysis_files/figure-html/unnamed-chunk-11-1.png -------------------------------------------------------------------------------- /Ads Analysis/Ads_analysis_files/figure-html/unnamed-chunk-14-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Ads Analysis/Ads_analysis_files/figure-html/unnamed-chunk-14-1.png -------------------------------------------------------------------------------- /Ads Analysis/Ads_analysis_files/figure-markdown_github/unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Ads Analysis/Ads_analysis_files/figure-markdown_github/unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /Ads Analysis/Ads_analysis_files/figure-markdown_github/unnamed-chunk-11-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Ads Analysis/Ads_analysis_files/figure-markdown_github/unnamed-chunk-11-1.png -------------------------------------------------------------------------------- /Ads Analysis/Ads_analysis_files/figure-markdown_github/unnamed-chunk-14-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Ads Analysis/Ads_analysis_files/figure-markdown_github/unnamed-chunk-14-1.png -------------------------------------------------------------------------------- /Clustering Grocery Items/Clustering_Grocery_items.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Clustering Grocery Items' 3 | author: "Siddhartha Jetti" 4 | 
date: "7/25/2019" 5 | output: rmarkdown::github_document 6 | --- 7 | 8 | # Goal 9 | 10 | Online shops often sell tons of different items and this can become messy very quickly! 11 | 12 | Data science can be extremely useful to automatically organize the products in categories so that they can be easily found by the customers. 13 | 14 | The goal of this challenge is to look at user purchase history and create categories of items that are likely to be bought together and, therefore, should belong to the same cluster. 15 | 16 | 17 | # Challenge Description 18 | 19 | Company XYZ is an online grocery store. In the current version of the website, they have manually grouped the items into a few categories based on their experience. 20 | 21 | However, they now have a lot of data about user purchase history. Therefore, they would like to put the data into use! 22 | 23 | This is what they asked you to do: 24 | 25 | 1)The company founder wants to meet with some of the best customers to go through a focus group with them. You are asked to send the ID of the following customers to the founder: 26 | 27 | * the customer who bought the most items overall in her lifetime 28 | 29 | * for each item, the customer who bought that product the most 30 | 31 | 2)Cluster items based on user co-purchase history. That is, create clusters of products that have the highest probability of being bought together. The goal of this is to replace the old/manually created categories with these new ones. Each item can belong to just one cluster. 32 | 33 | 34 | # Data 35 | 36 | We have 2 table downloadable by clicking on here. 37 | 38 | The 2 tables are: 39 | 40 | item_to_id - for each item, it gives the corresponding id 41 | 42 | ## Columns: 43 | 44 | * Item_name : The name of the item 45 | * Item_id : the id of the item. Can be joined to the id in the other table. It is unique by item. 46 | 47 | purchase_history - for each user purchase, the items bought 48 | 49 | ## Columns: 50 | 51 | * user_id : The id of the user. 52 | * id : comma-separated list of items bought together in that transaction. 53 | 54 | 55 | # Problem Setup 56 | 57 | ```{r} 58 | # Load required libraries 59 | library(tidyverse) 60 | library(ggplot2) 61 | 62 | # Read and process input data into a dataframe 63 | items <- read.csv("item_to_id.csv", stringsAsFactors = F) 64 | purchases <- read.csv("purchase_history.csv", stringsAsFactors = F) 65 | ``` 66 | 67 | # Data Exploration 68 | 69 | Check data types of columns in items dataset 70 | ```{r} 71 | # Check data types of each of the columns 72 | str(items) 73 | summary(items) 74 | ``` 75 | 76 | Check data types of columns in purchases dataset. 77 | ```{r} 78 | # Check data types of each of the columns 79 | str(purchases) 80 | summary(purchases) 81 | ``` 82 | 83 | Check for missing values in the data 84 | ```{r} 85 | # Check if any missing values exist 86 | colSums(is.na(items)) 87 | 88 | colSums(is.na(purchases)) 89 | ``` 90 | 91 | 92 | Check for duplicates in the data 93 | 94 | ```{r} 95 | # check if any duplicate item id exist 96 | length(items$Item_id) == length(unique(items$Item_id)) 97 | 98 | ``` 99 | 100 | ```{r} 101 | # check if any duplicate user id exist 102 | length(purchases$user_id) == length(unique(purchases$user_id)) 103 | 104 | ``` 105 | 106 | Clearly, there are duplicate user ids in purchases dataset. This is OK because single user can make multiple transactions. 107 | 108 | It is important to have an unique id for each transaction. 
Now lets create transaction id using the user id in purchases dataset. 109 | 110 | ```{r} 111 | purchases <- purchases %>% 112 | group_by(user_id) %>% 113 | mutate(transaction_id = paste0(user_id, "_", row_number())) 114 | 115 | # Take a peek at the data 116 | head(purchases) 117 | 118 | # Check if created transaction ids are unique 119 | length(purchases$transaction_id) == length(unique(purchases$transaction_id)) 120 | ``` 121 | 122 | Overall the data looks good. 123 | 124 | # Question 1 125 | 126 | ```{r} 127 | # Get maximum number of items purchased in a single transaction 128 | # This is done by counting the occurences of "," + 1 129 | max_items <- max(str_count(purchases$id, ",")) + 1 130 | ``` 131 | 132 | Now transform purchases dataset into tidy format for future use. 133 | 134 | ```{r} 135 | purchases_tidy <- purchases %>% 136 | separate(col = id, into = paste0("item", 1:max_items), sep = ",") %>% 137 | gather(key = "value", value = "item_id", -user_id, -transaction_id) %>% 138 | filter(!is.na(item_id)) %>% 139 | mutate(item_id = as.integer(item_id)) 140 | 141 | head(purchases_tidy) 142 | ``` 143 | 144 | Get user id that made the most number of purchases in the life time. 145 | 146 | ```{r} 147 | most_units_bought <- purchases_tidy %>% 148 | group_by(user_id) %>% 149 | summarise(units_bought = n()) %>% 150 | arrange(desc(units_bought)) %>% 151 | filter(row_number() == 1) 152 | 153 | # user id with most number of purchases 154 | most_units_bought 155 | ``` 156 | 157 | Get user ids with most units bought by product. 158 | ```{r} 159 | most_units_by_item <- purchases_tidy %>% 160 | group_by(item_id, user_id) %>% 161 | summarise(units_bought = n()) %>% 162 | arrange(item_id, desc(units_bought)) %>% 163 | filter(row_number() == 1) %>% 164 | inner_join(items, by = c("item_id" = "Item_id")) 165 | 166 | # user id with most number of purchases by item 167 | most_units_by_item 168 | ``` 169 | 170 | # Question 2 171 | 172 | Now the goal is to create clusters of items that have highest probability of being purchased together. 173 | 174 | Each grocery item can be imagined as a point in the n-dimensional space spun by transactions. Each coordinate of the point(n-dimensional) would be the number of units of the item purchased in transaction corresponding to the coordinate. 175 | 176 | ```{r} 177 | # Build item-transaction matrix 178 | item_transaction_matrix <- purchases_tidy %>% 179 | group_by(transaction_id, item_id) %>% 180 | summarise(nunits = n()) %>% 181 | ungroup() %>% 182 | spread(transaction_id, nunits) %>% 183 | mutate_all(list(~replace_na(., 0))) 184 | 185 | head(item_transaction_matrix) 186 | ``` 187 | 188 | To cluster the items based on transaction history, I choose to use Kmeans clustering algorithm. Here the dataset has all the variables on the same scale and pretty much same meaning. So, I expect K-means to perform well. The advantage of using K-means is that it is highly interpretable and can easily be explained. 189 | 190 | K-means algorithm chooses the clusters such a way that within cluster variance is minimum for a given number of clusters. The optimal number of clusters is determined by running kmeans with different number of clusters and plotting the Elbow curve (within cluster variance vs number of clusters) and also results should make sense from UI standpoint without containing too many clusters. 191 | 192 | For stability, Kmeans algorithm is run multiple times for each configuration of clusters. The mean of variance is used for plotting the elbow curve. 
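As a side note (not part of the original workflow), `kmeans()` can also handle repeated random starts internally through its `nstart` argument, keeping the best of the runs for a given number of centers; the explicit loop below achieves a similar effect while also recording the average within-cluster variance. A small illustration:

```{r}
# Illustrative only: 10 random starts for a single choice of k, best run kept.
illustrative_fit <- kmeans(item_transaction_matrix[, -1], centers = 5, nstart = 10)
illustrative_fit$tot.withinss
```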
193 | 194 | ```{r} 195 | # Set seed 196 | set.seed(2019) 197 | # Place holder to store within variance for several configuration of clusters 198 | within_SS <- c() 199 | # Try different number of clusters 200 | nclusters <- 2:20 201 | # 10 tries for each configuration of clusters 202 | tries <- 10 203 | 204 | # Run Kmeans for different number of clusters 205 | for(i in nclusters){ 206 | tries_within_SS <- c() 207 | # Run the Kmeans 10 times for each configuration of clusters 208 | for(try in 1:tries){ 209 | clusters <- kmeans(item_transaction_matrix[,-1], centers = i) 210 | tries_within_SS <- c(tries_within_SS, clusters$tot.withinss) 211 | } 212 | within_SS <- c(within_SS, mean(tries_within_SS)) 213 | } 214 | ``` 215 | 216 | Plotting the Elbow curve. 217 | ```{r} 218 | data.frame(k = nclusters, within_SS = within_SS) %>% 219 | ggplot(aes(x = k, y = within_SS)) + 220 | geom_point() + 221 | geom_line() + 222 | ggtitle("Within Sum of Squares vs Number of Clusters") 223 | ``` 224 | 225 | Unfortunately, the above plot does not reveal an obvious "elbow" point. But there is a slight change in gradient after k=9 or 10. So, let us choose k = 9 as the optimal number of clusters. 226 | 227 | ```{r} 228 | set.seed(2019) 229 | clusters <- kmeans(item_transaction_matrix[,-1], centers = 9) 230 | 231 | item_clusters <- data.frame(item_id = item_transaction_matrix[,1], cluster = clusters$cluster) %>% 232 | mutate(item_id = as.integer(item_id)) %>% 233 | inner_join(items, by = c("item_id" = "Item_id")) %>% 234 | group_by(cluster) %>% 235 | summarise(item_count = n(), items = paste0(Item_name, collapse = ", ")) 236 | 237 | item_clusters 238 | ``` 239 | 240 | Looking at the item clusters. 241 | 242 | * All the vegetables except lettuce are together. 243 | * All fruits are grouped together. 244 | * All kinds of meat are together. 245 | * Beverages are clustered together. 246 | * Snacks are grouped. 247 | 248 | However, cluster 6 appears to have too many items. Lets try to re-run the algorithm by increasing the number of clusters. 249 | 250 | ```{r} 251 | set.seed(2019) 252 | clusters <- kmeans(item_transaction_matrix[,-1], centers = 12) 253 | 254 | item_clusters2 <- data.frame(item_id = item_transaction_matrix[,1], cluster = clusters$cluster) %>% 255 | mutate(item_id = as.integer(item_id)) %>% 256 | inner_join(items, by = c("item_id" = "Item_id")) %>% 257 | group_by(cluster) %>% 258 | summarise(item_count = n(), items = paste0(Item_name, collapse = ", ")) 259 | 260 | item_clusters2 261 | ``` 262 | 263 | Increasing the number of clusters to 12 resulted in decreasing the maximum number of items in cluster from 22 to 17. It also resulted in breaking out the milk products into different cluster and appears to be performing better than with 9 clusters. 
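One optional sanity check for the final grouping (not in the original analysis) is the average silhouette width, which summarizes how well separated the clusters are; values closer to 1 indicate tighter, better-separated clusters. A sketch using the `cluster` package:

```{r}
# Illustrative only: average silhouette width of the 12-cluster solution,
# computed on the same item-by-transaction representation.
library(cluster)
sil <- silhouette(clusters$cluster, dist(item_transaction_matrix[, -1]))
mean(sil[, "sil_width"])
```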
-------------------------------------------------------------------------------- /Clustering Grocery Items/Clustering_Grocery_items_files/figure-markdown_github/unnamed-chunk-14-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Clustering Grocery Items/Clustering_Grocery_items_files/figure-markdown_github/unnamed-chunk-14-1.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity in the Workplace.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Diversity in the Workplace' 3 | author: "Siddhartha Jetti" 4 | date: "5/30/2019" 5 | output: rmarkdown::github_document 6 | --- 7 | 8 | # Goal 9 | 10 | Diversity, unconscious bias in the workplace and, in general, the way companies treat their employees are a very important topic. 11 | Data science can help discover potential discriminations by looking at the data and see if there are segments of employees that are treated worse. 12 | 13 | # Challenge Description 14 | 15 | There has been lots of talking about diversity in the workplace, especially in technology. The Head of HR at your company is very concerned about that and has asked you to analyze internal data about employees and see whether results suggest that the company is treating all its employees fairly or not. 16 | 17 | Specifically, she gave you the following tasks: 18 | 19 | 1) In the company there are 6 levels (described below). Identify, for each employee, her corresponding level. 20 | * Individual Contributors(IC) - they don’t manage anyone 21 | * Middle Managers(MM) - they are the direct bosses of IC 22 | * Directors(D) - they are the direct bosses of MM 23 | * VP - D direct bosses 24 | * Executives (E) - VP direct bosses 25 | * CEO - The direct boss of E. 26 | 27 | 2) How many people each employee manages? Consider that if John directly manages 2 people and these two people manage 5 people each, then we conclude that John manages 12 people. 28 | 29 | 30 | 3) Build a model to predict the salary of each employee. 31 | Describe the main factors impacting employee salaries. Do you think the company has been treating all its employees fairly? What are the next steps you would suggest to the Head of HR? 32 | 33 | # Data 34 | 35 | We have 2 tables downloadable by clicking here. 36 | 37 | The 2 tables are: 38 | 39 | company_hierarchy - info about each employee direct boss and her dept 40 | 41 | ## Columns: 42 | 43 | * employee_id : the Id of the employee. It is unique by employee and can be joined to employee id in the other table 44 | * boss id : the id of the boss of employee id. It is unique by employee and can be joined to employee id in the other table. 45 | * dept : employee id dept. There are the following departments: 46 | + Engineering (data science is under engineering) 47 | + Marketing 48 | + Sales 49 | + HR 50 | + the “CEO” dept just for the CEO, since belongs to all the above dept. 51 | 52 | employee - info about each employee 53 | 54 | ## Columns: 55 | 56 | * employee_id : the Id of the employee. 
It is unique by employee and can be joined to employee id and/or boss id in the other table 57 | * signing_bonus : whether the employee got a signing bonus when she joined the company (1 -> yes, 0 -> no) 58 | * salary : the current salary of that employee in USD 59 | * degree level: the highest degree received by the employee. 60 | * sex: Male/Female 61 | * yrs_experience: employee years of work experience 62 | 63 | 64 | # Problem Setup 65 | 66 | ```{r} 67 | # Load required libraries 68 | library(dplyr) 69 | library(ggplot2) 70 | library(randomForest) 71 | 72 | # Read in the input data into a dataframe 73 | data_hierarchy <- read.csv("company_hierarchy.csv") 74 | data_employee <- read.csv("employee.csv") 75 | ``` 76 | 77 | 78 | # Question 1: 79 | 80 | Get the count of reports and level from hierarchy table. 81 | ```{r} 82 | hierarchy <- data_hierarchy %>% 83 | filter(!is.na(boss_id)) 84 | 85 | # Function to count direct/in-direct reports and get the hierarchy level for every employee 86 | countReports <- function(df,id) { 87 | reports <- df[df$boss_id == id,] 88 | reports1 <- reports 89 | len <- nrow(reports) 90 | level <- ifelse(len == 0, 1, 2) 91 | while(1){ 92 | reports <- df[df$boss_id %in% reports$employee_id,] 93 | reports1 <- rbind(reports1, reports) 94 | if(nrow(reports) == len){ 95 | return(data.frame(employee_id = id, level = level, report_count = nrow(reports1))) 96 | break 97 | } 98 | len <- nrow(reports) 99 | level <- level + 1 100 | } 101 | } 102 | 103 | data_reports <- data.frame(employee_id = NULL, level = NULL, report_count = NULL ) 104 | emps <- nrow(data_hierarchy) 105 | 106 | # Loop through all the employess to get level and reports count. 107 | for(i in 1:emps){ 108 | data_reports <- rbind(data_reports, countReports(hierarchy, data_hierarchy$employee_id[i])) 109 | } 110 | 111 | # Join hierarchy related information with employee information to form one table 112 | data <- data_reports %>% 113 | arrange(desc(report_count), employee_id) %>% 114 | left_join(data_hierarchy, by = "employee_id") %>% 115 | select(employee_id, dept, level, report_count) %>% 116 | right_join(data_employee, by = "employee_id") %>% 117 | mutate(signing_bonus = as.factor(signing_bonus), 118 | level = as.factor(case_when( level == 3 ~ "MM", 119 | level == 4 ~ "D", 120 | level == 5 ~ "VP", 121 | level == 6 ~ "E", 122 | level == 7 ~ "CEO", 123 | TRUE ~ "IC" )) 124 | ) 125 | ``` 126 | 127 | # Question 2: 128 | 129 | Run descriptive statistics 130 | ```{r} 131 | # Check datatypes of all the variables in the dataframe 132 | str(data) 133 | ``` 134 | 135 | ```{r} 136 | # Take a peek at the data 137 | head(data) 138 | summary(data) 139 | 140 | # check for duplicate user ids 141 | length(data$user_id) == length(unique(data$user_id)) 142 | ``` 143 | 144 | Here are some observations about data: 145 | 146 | * There are total of 10000 employees in dataset and each employee id is unique 147 | * The mean and median salary of employess across the company are USD 182k and 189k respectively. 148 | 149 | ```{r} 150 | # Median is more robust to outliers than mean 151 | salary_dept = data %>% 152 | filter(!dept %in% c("CEO")) %>% 153 | group_by(dept) %>% 154 | summarise(med_salary = median(salary)) 155 | ggplot(data = salary_dept, aes(x = dept, y = med_salary))+ 156 | geom_bar(stat = "identity") 157 | 158 | ``` 159 | 160 | It looks like folks in engineering are earning way more than HR and other depts. 
This could mean that there is a bias by dept, or it could just be a spurious correlation; we can only tell after building a model. 161 | 162 | ```{r} 163 | salary_level = data %>% 164 | group_by(level) %>% 165 | summarise(med_salary = median(salary)) %>% 166 | arrange(desc(med_salary)) 167 | 168 | ggplot(data = salary_level, aes(x = level, y = med_salary))+ 169 | geom_bar(stat = "identity") 170 | ``` 171 | The above plot indicates that median salaries follow CEO > Executive > VP > Director > MM > IC, which makes sense given the increasing scope and responsibility. 172 | 173 | ```{r} 174 | salary_degree = data %>% 175 | group_by(degree_level) %>% 176 | summarise(med_salary = median(salary)) %>% 177 | arrange(desc(med_salary)) 178 | 179 | ggplot(data = salary_degree, aes(x = degree_level, y = med_salary))+ 180 | geom_bar(stat = "identity") 181 | ``` 182 | 183 | Overall, the median salary for bachelor's degree graduates is slightly lower than for high school graduates and increases with advanced degrees; PhD graduates have the highest median salary. One possible explanation for bachelor's graduates earning less than high school graduates is that the salary gains from the higher degree are offset by the loss of several years of experience (4 or more). 184 | 185 | ```{r} 186 | salary_gender = data %>% 187 | group_by(sex) %>% 188 | summarise(med_salary = median(salary)) %>% 189 | arrange(desc(med_salary)) 190 | 191 | ggplot(data = salary_gender, aes(x = sex, y = med_salary))+ 192 | geom_bar(stat = "identity") 193 | ``` 194 | 195 | The plot shows that the median salary of males is noticeably higher than that of females. However, this plot alone doesn't confirm a bias, as gender could be a proxy for other variables, such as educational degree or years of experience, that are known to determine salary. 196 | 197 | ```{r} 198 | # Plot salary vs yrs of experience 199 | ggplot(data = data, aes(x = yrs_experience, y = salary)) + 200 | geom_point() + 201 | geom_smooth(method = "loess", se = FALSE) 202 | 203 | ``` 204 | 205 | The plot indicates that salaries increase with years of experience and grow faster after 10-15 years. 206 | 207 | ```{r} 208 | # Plot salary vs number of reports 209 | data %>% 210 | filter(report_count > 0) %>% 211 | ggplot(aes(x = report_count, y = salary)) + 212 | geom_point() + 213 | geom_smooth(method = 'loess', se = FALSE) 214 | ``` 215 | 216 | # Model Building 217 | 218 | Random forest would be a good choice for predicting salary, as it is robust to outliers and works well with correlated, continuous and discrete variables. Random forests are also relatively easy to tune. Partial dependence plots can be used to capture insights from the model. 219 | 220 | A standard 66/34% split can be used to generate training and test datasets.
If training dataset is not large enough then cross validation can be used to arrive at the optimum parameters 221 | 222 | ```{r} 223 | 224 | # Training and test set split 225 | train_sample = sample(nrow(data), size = round(nrow(data)*0.66)) 226 | train_data = data[train_sample,] 227 | test_data = data[-train_sample,] 228 | 229 | # Build Random forest model on the data with mostly default settings except for class weight and #trees 230 | set.seed(2019) 231 | rf.fit <- randomForest(y=train_data$salary, x = train_data[,-c(6)], ytest = test_data$salary, 232 | xtest = test_data[,-c(6)], ntree = 250, mtry = 3, keep.forest = TRUE) 233 | 234 | rf.fit 235 | # Visualize Important variables 236 | varImpPlot(rf.fit) 237 | ``` 238 | The training and test MSE are close and looks like model is not overfitting to data. 239 | Now, Building the model without the dominating variable to see if it changes importance of other variables. 240 | 241 | ```{r} 242 | rf.fit2 <- randomForest(y=train_data$salary, x = train_data[,-c(2,6)], ytest = test_data$salary, 243 | xtest = test_data[,-c(2,6)], ntree = 250, mtry = 3, keep.forest = TRUE) 244 | 245 | rf.fit2 246 | # Visualize Important variables 247 | varImpPlot(rf.fit2) 248 | ``` 249 | 250 | After dropping the dept variable, the variance explained drops dramatically. So rf.fit is indeed optimal model. 251 | 252 | # Question 3: 253 | 254 | ```{r} 255 | # Order variables by importance 256 | imp_vars <- importance(rf.fit) 257 | pdp_vars <- names(sort(imp_vars[,1], decreasing = T)) 258 | 259 | # Loop through variables and build PDP in the decreasing order of variable importance 260 | for (i in seq_along(pdp_vars)) { 261 | partialPlot(rf.fit, data[,-6], pdp_vars[i], xlab="", 262 | main=paste("Partial Dependence on", pdp_vars[i])) 263 | } 264 | 265 | ``` 266 | 267 | # Understanding Partial Dependence Plots 268 | 269 | * Looking at the variable importance and PD plots, the main factors that impact the salaries are dept, yrs of experience and level or number of direct/indirect reports. 270 | 271 | * Although the descriptive bar chart of Salary Vs Gender show that Females are underpaid compared to Males, after controlling for variables like department, yrs of experience and number of reports no such bias in salary based on gender exists. 272 | 273 | * The PDP on dept reveals that salaries in engineering are higher than any other department and salaries in HR are the lowest. 274 | 275 | * Interestingly after controlling for dept, yrs of experience and level, educational degree does not really affect the salary. 276 | 277 | * Signing bonus and gender does not seem to affect the salary. 278 | 279 | # Conclusions & next steps 280 | 281 | * Based on the data, the people working in engineering dept are likely to get higher salaries than other depts. However, the reason for higher salaries could be long work hours, high stress levels, scarcity of engineering talent or difficulty in retaining talent etc. 282 | 283 | * The next steps for the Head of HR would be to identify the most likely reason and try to reduce the salary gap across the depts. 
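A complementary check worth adding (a sketch only, not part of the original notebook): a linear model that controls for the structural drivers of pay gives a direct, interpretable estimate of any residual gender effect, alongside the partial dependence plots above.

```{r}
# Illustrative only: the coefficient on sex estimates the gender gap after
# controlling for department, level, experience and span of control.
fairness_check <- lm(salary ~ sex + dept + level + yrs_experience + report_count, data = data)
summary(fairness_check)$coefficients
```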
284 | 285 | -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-11-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-11-1.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-12-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-12-1.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-1.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-2.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-3.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-4.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-5.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-5.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-6.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-7.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-8.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-5-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-5-1.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-6-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-6-1.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-7-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-7-1.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-8-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-8-1.png 
-------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-9-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-9-1.png -------------------------------------------------------------------------------- /Engagement Test/Engagement Test.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Engagement Test" 3 | author: "Siddhartha Jetti" 4 | date: "October 05, 2019" 5 | output: rmarkdown::github_document 6 | --- 7 | 8 | # Goal 9 | 10 | Many sites make money by selling ads. For these sites, the number of pages visited by users on each session is one of the most important metric, if not the most important metric. 11 | 12 | Data science plays a huge role here, especially by building models to suggest personalized content. In order to check if the model is actually improving engagement, companies then run A/B tests. It is often data scientist responsibility to analyze test data and understand whether the model has been successful. The goal of this project is to look at A/B test results and draw conclusions. 13 | 14 | # Challenge Description 15 | 16 | The company of this exercise is a social network. They decided to add a feature called: Recommended Friends, i.e. they suggest people you may know. 17 | 18 | A data scientist has built a model to suggest 5 people to each user. These potential friends will be shown on the user newsfeed. At first, the model is tested just on a random subset of users to see how it performs compared to the newsfeed without the new feature. 19 | 20 | The test has been running for some time and your boss asks you to check the results. You are asked to check, for each user, the number of pages visited during their first session since the test started. If this number increased, the test is a success. 21 | 22 | Specifically, your boss wants to know: 23 | * Is the test winning? That is, should 100% of the users see the Recommended Friends feature? 24 | * Is the test performing similarly for all user segments or are there differences among different segments? 25 | * If you identified segments that responded differently to the test, can you guess the reason? Would this change your point 1 conclusions? 26 | 27 | # Data 28 | 29 | The 2 tables are: 30 | 31 | "user_table" - info about each user sign-up date 32 | 33 | ### Columns: 34 | * user_id : the Id of the user. It is unique by user and can be joined to user id in the other table 35 | * signup_date : when the user joined the social network 36 | 37 | "test_table" - data about the test results. For each user, we only consider about how many pages she visited on Jan, 2. The first session since the date when the test started. That is, if the test started on Jan 1, and user 1 visited the site on Jan, 2 and Jan, 3, we only care about how many pages she visited on Jan, 2. 38 | 39 | ### Columns: 40 | * user_id : the Id of the user 41 | * date : the date of the first session since the test started 42 | * browser : user browser during that session 43 | * test: 1 if the user saw the new feature, 0 otherwise 44 | * pages_visited: the metric we care about. 
# of pages visited in that session 45 | 46 | # Problem Setup 47 | ```{r} 48 | # Load required libraries 49 | library(tidyverse) 50 | library(ggplot2) 51 | 52 | # Read in the input data into a dataframe 53 | users <- read.csv("user_table.csv", stringsAsFactors = F) 54 | test <- read.csv("test_table.csv", stringsAsFactors = F) 55 | ``` 56 | 57 | 58 | # Data Exploration 59 | 60 | Explore users and test datasets 61 | ```{r} 62 | # Transform variables into right format 63 | users <- users %>% 64 | mutate(signup_date = as.Date(signup_date)) %>% 65 | arrange(user_id, signup_date) 66 | 67 | # Check datatypes of all the variables in the users dataset 68 | str(users) 69 | summary(users) 70 | ``` 71 | 72 | ```{r} 73 | # Transform variables into right format 74 | test <- test %>% 75 | mutate(date = as.Date(date)) 76 | 77 | # Check datatypes of all the variables in the test dataset 78 | str(test) 79 | summary(test) 80 | ``` 81 | 82 | All the columns appear to have legitimate values. 83 | 84 | ```{r} 85 | # Merge the two datasets 86 | data <- users %>% 87 | inner_join(test, by = "user_id") %>% 88 | arrange(date, user_id) 89 | 90 | # check for any missing values in the merged dataset 91 | colSums(is.na(data)) 92 | ``` 93 | 94 | No missing values exist anywhere in the data. 95 | 96 | ```{r} 97 | # Take a peek at the data 98 | head(data) 99 | 100 | # Check if duplicates of user id exist 101 | length(unique(data$user_id)) == length(data$user_id) 102 | ``` 103 | 104 | No duplicates in user id exist in the dataset. 105 | 106 | ```{r} 107 | unique(data$date) 108 | 109 | ``` 110 | 111 | Based on the sign-up date, we can know if the user is new or existing user. 112 | Let us classify the user as new, if he/she signed up after the start of test which is "2015-08-01". 113 | 114 | ```{r} 115 | data <- data %>% 116 | mutate(new_user = ifelse(signup_date >= as.Date("2015-08-01"), 1, 0)) 117 | 118 | table(data$new_user) 119 | ``` 120 | 121 | # Question 1: 122 | 123 | Is the test winning? That is, should 100% of the users see the Recommended Friends feature? 124 | 125 | First check the mean number of pages visited across test and control groups. 126 | ```{r} 127 | data_summary <- data %>% 128 | group_by(test) %>% 129 | summarise(avg_pages_visited = mean(pages_visited)) 130 | 131 | data_summary 132 | ``` 133 | 134 | ## Check randomization 135 | 136 | To decide if the new feature is resulting in higher number of pages visited, lets run an A/B test on the feature. 137 | The correctness of AB test depends hugely on assigning users to test and control groups at random. Now, Check if users of different browsers are randomly assigned to test and control groups. 138 | ```{r} 139 | data %>% 140 | group_by(browser, new_user) %>% 141 | summarise(prop_test = sum(test == 1)/n(), prop_control = sum(test == 0)/n()) 142 | 143 | ``` 144 | 145 | Based on the proportions of test and control groups, Users appears to be assigned almost randomly between test and control groups. 146 | 147 | Now, Run t.test on the test and control datasets 148 | 149 | ```{r} 150 | t.test(data$pages_visited[data$test == 0], data$pages_visited[data$test == 1]) 151 | ``` 152 | 153 | The mean of pages visited in test group is lower than the mean for the control group. 154 | The obtained p_value > 0.05 implies that the observed difference in sample means could have been happened out of random chance. Based on the above data, There is no reason to believe that the two groups are different and the feature should not be launched for all the users. 
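One more angle that may help when communicating this result (a sketch, not in the original write-up): the confidence interval of the difference in means indicates how large an effect, positive or negative, the data can rule out.

```{r}
# Illustrative only: 95% confidence interval for the difference in mean pages
# visited (control minus test).
t.test(data$pages_visited[data$test == 0], data$pages_visited[data$test == 1])$conf.int
```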
155 | 156 | # Question 2: 157 | 158 | Is the test performing similarly for all user segments or are there differences among different segments? 159 | 160 | First, let's plot the mean number of pages visited by user browser. 161 | ```{r} 162 | # Summarize data by test and control groups 163 | data_test_by_browser = data %>% 164 | group_by(browser) %>% 165 | summarize(Test = mean(pages_visited[test==1]), Control = mean(pages_visited[test==0])) 166 | 167 | data_test_by_browser 168 | 169 | # Plot the data 170 | data_test_by_browser %>% 171 | gather(key = treatment, value = mean_pages, -browser) %>% 172 | ggplot(aes(x = browser, y = mean_pages, group = treatment, color = treatment)) + 173 | geom_line() + 174 | geom_point() 175 | ``` 176 | 177 | The average number of pages visited for the test group among users coming from the Opera browser is 0. Something is wrong with the data. 178 | 179 | The average number of pages visited for new vs existing users 180 | ```{r} 181 | # Summarize data in test and control groups based on new user 182 | data %>% 183 | group_by(new_user) %>% 184 | summarize(Test = mean(pages_visited[test==1]), Control = mean(pages_visited[test==0])) 185 | ``` 186 | 187 | Check the test vs control means for different segments. 188 | 189 | ```{r} 190 | data %>% 191 | mutate(new_user = as.factor(new_user)) %>% 192 | group_by(browser, new_user) %>% 193 | summarize(test_control_ratio = mean(pages_visited[test == 1])/mean(pages_visited[test == 0])) %>% 194 | ggplot(aes(x = browser, y = test_control_ratio, color = new_user, group = new_user)) + 195 | geom_line() + 196 | geom_point() 197 | ``` 198 | 199 | The plot above reveals the following: 200 | 201 | * The average number of pages visited for the test group among users from the Opera browser is 0. 202 | 203 | * For new users, the test group consistently underperforms the control group (test-to-control ratio less than 1) for all browsers, with the exception of Opera. 204 | 205 | * For existing users, the test performs better than control (ratio greater than 1) for all browsers except Opera. 206 | 207 | Let's run the t-test by segment. 208 | 209 | ```{r} 210 | data %>% 211 | group_by(browser, new_user) %>% 212 | summarise(Test = mean(pages_visited[test == 1]), Control = mean(pages_visited[test == 0]), 213 | Diff = Test - Control, 214 | p_value = t.test(pages_visited[test == 1], pages_visited[test == 0])$p.value) 215 | ``` 216 | 217 | Based on the test-control differences and p-values, the test is clearly performing differently across segments. 218 | 219 | # Question 3: 220 | 221 | If you identified segments that responded differently to the test, can you guess the reason? Would this change your point 1 conclusions? 222 | 223 | 224 | The above results can be more or less reproduced with multiple linear regression. 225 | ```{r} 226 | 227 | # Simple linear regression 228 | model <- lm(pages_visited ~ test, data) 229 | summary(model) 230 | 231 | # Multiple linear regression controlling for browser and new user 232 | model <- lm(pages_visited ~ browser + test + new_user, data) 233 | summary(model) 234 | ``` 235 | 236 | The potential reasons for the observed differences across segments are: 237 | 238 | * The mean number of pages visited for the test group among users coming from the Opera browser being 0 doesn't make sense. Most likely there is a problem with the logging system, or a technical glitch that is preventing users from browsing after the friend recommendation feature was implemented for the Opera browser.
239 | 240 | * Clearly the friend recommendation feature is experiencing cold start problem. The new users donot have any friends and recommendation is pretty much based on a random guess. It is possible that these random friend recommendations are turning off new users and resulting in fewer number of pages visited. 241 | 242 | * The performance of feature on existing users reveals that after the user data is collected the recommendation feature is able to recommend relevant friends. 243 | 244 | The above findings would not change conclusion from question 1, as the test hypothesis when we designed the test was to test for all the users. The next step would be to fix the issue with the Opera browser and re-run the test to make decision whether or not to implement the feature for all the users. 245 | -------------------------------------------------------------------------------- /Engagement Test/Engagement_Test_files/figure-gfm/unnamed-chunk-11-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Engagement Test/Engagement_Test_files/figure-gfm/unnamed-chunk-11-1.png -------------------------------------------------------------------------------- /Engagement Test/Engagement_Test_files/figure-gfm/unnamed-chunk-13-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Engagement Test/Engagement_Test_files/figure-gfm/unnamed-chunk-13-1.png -------------------------------------------------------------------------------- /Funnel Analysis/Funnel_analysis.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Funnel Analysis' 3 | author: "Siddhartha Jetti" 4 | date: "6/2/2019" 5 | output: rmarkdown::github_document 6 | --- 7 | 8 | # Goal 9 | 10 | The goal is to perform funnel analysis for an e-commerce website. 11 | 12 | Typically, websites have a clear path to conversion: for instance, you land on the home page, then you search, select a product and buy it. At each of these steps, some users will drop off and leave the site. The sequence of pages that leads to conversion is called ‘funnel’ . 13 | 14 | Data Science can have a tremendous impact on funnel optimization. 15 | Funnel analysis allows to understand where/when our users abandon the website. It gives crucial insights on user behavior and on ways to improve the user experience as well as it often allows to discover bugs. 16 | 17 | # Challenge Description 18 | 19 | You are looking at data from an e-commerce website. The site is very simple and has just 4 pages. 20 | 21 | The first page is the home page. When you come to the site for the first time, you can only land on the home page as a first page. 22 | 23 | From the home page, the user can perform a search and land on the search page. 24 | 25 | From the search page, if the user clicks on a product, she will get to the payment page, where she is asked to provide payment information in order to buy that product. 26 | 27 | If she does decide to buy, she ends up on the confirmation page. 28 | 29 | The company CEO isn’t very happy with the company sales and, especially, sales coming from new users. 
Therefore, she asked you to investigate whether there is something wrong in the conversion funnel or, in general, if you can create hypotheses on how conversion rate could be improved. 30 | 31 | Specifically, she is interested in : 32 | 33 | 1) A full picture of funnel conversion rate for both desktop and mobile 34 | 35 | 2) Some insights on what the product team should focus on in order to improve conversion rate as well as any anything you might discover that could help improve conversion rate. 36 | 37 | # Data 38 | 39 | We have 5 tables downloadable by clicking on here. 40 | All the tables refer to only the user first experience on the site. The 5 tables are: 41 | 42 | user_table - info about the user 43 | 44 | Columns: 45 | 46 | * user_id : the Id of the user. It is unique by user and can be joined to user id in all other tables 47 | * date : the date when the user firstly landed on the site 48 | * device : user device. Can be mobile or desktop 49 | * sex : male/female 50 | 51 | home_page_table - Users who landed on the home page 52 | 53 | Columns: 54 | 55 | * user_id : the Id of the user. It is unique by user and can be joined to user id in all other tables 56 | * page : it is always home_page. 57 | 58 | search_page_table - Users who landed on the search_page 59 | 60 | Columns: 61 | 62 | * user_id : the Id of the user. It is unique by user and can be joined to user id in all other tables 63 | * page : it is always search_page 64 | 65 | payment_page_table - Users who landed on the payment_page 66 | 67 | Columns: 68 | 69 | * user_id : the Id of the user. It is unique by user and can be joined to user id in all other tables 70 | * page : it is always payment_page 71 | 72 | payment_confirmation_table - Users who landed on the payment_confirmation_table. That is, these are the users who bought the product. 73 | 74 | Columns: 75 | 76 | * user_id : the Id of the user. It is unique by user and can be joined to user id in all other tables 77 | * page : it is always payment_confirmation_page 78 | 79 | 80 | # Problem Setup 81 | 82 | ```{r} 83 | # Load required libraries 84 | library(tidyverse) 85 | library(randomForest) 86 | library(lubridate) 87 | 88 | # Read all the .csv files in the directory 89 | home_page <- read.csv("home_page_table.csv") 90 | search_page <- read.csv("search_page_table.csv") 91 | payment_page <- read.csv("payment_page_table.csv") 92 | payment_confirm_page <- read.csv("payment_confirmation_table.csv") 93 | users <- read.csv("user_table.csv") 94 | ``` 95 | 96 | # Question 1 97 | 98 | Before Merging the different datasets, its important to check the validity of data. 99 | One important check that could be done is if every user who hit search previously hit home page, every user who hit payment page should exist in previous pages and so on. 
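That funnel-ordering check is performed right below. In addition, a lightweight sketch (assuming the five data frames loaded above) to confirm that user_id is unique within each page table and to record the raw counts at each funnel step:

```{r}
# Additional sanity checks before merging (a sketch): user_id should be unique
# within every page table, and the raw counts should shrink down the funnel
page_tables <- list(home = home_page, search = search_page,
                    payment = payment_page, confirmation = payment_confirm_page)

sapply(page_tables, function(df) length(unique(df$user_id)) == nrow(df))
sapply(page_tables, nrow)
```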
100 | 101 | ```{r} 102 | # Check if all users who hit search page previously visited home page 103 | all(search_page$user_id %in% home_page$user_id) 104 | 105 | # Check if all users who hit payment page previously visited search and home pages 106 | all(payment_page$user_id %in% search_page$user_id) & all(search_page$user_id %in% home_page$user_id) 107 | 108 | # Check if all users who hit payment confirmation page previously visited payment, search and home pages 109 | all(payment_confirm_page$user_id %in% payment_page$user_id) & all(payment_page$user_id %in% search_page$user_id) & all(search_page$user_id %in% home_page$user_id) 110 | 111 | ``` 112 | 113 | Clearly all the users are going through the website in a funnel pattern with out any exceptions and data looks fine. 114 | 115 | ```{r} 116 | page_data <- home_page %>% 117 | left_join(search_page, by = "user_id") %>% 118 | rename(page1 = page.x, page2 = page.y) %>% 119 | left_join(payment_page, by = "user_id") %>% 120 | rename(page3 = page) %>% 121 | left_join(payment_confirm_page, by = "user_id") %>% 122 | rename(page4 = page) %>% 123 | left_join(users, by = "user_id") %>% 124 | mutate(week = week(date), 125 | week_day = wday(date, label = TRUE)) 126 | 127 | ``` 128 | 129 | Run descriptive statistics 130 | ```{r} 131 | head(page_data) 132 | table(page_data$device) 133 | ``` 134 | 135 | Almost two-thirds of traffic is coming from desktops and only one-third is from mobile. 136 | 137 | Overall conversion rates by page 138 | ```{r} 139 | non_conversion <- page_data %>% 140 | select(contains("page")) %>% 141 | mutate_all(is.na) %>% 142 | colMeans() 143 | 144 | conversions <- 1 - non_conversion 145 | names(conversions) <- names(non_conversion) 146 | conversions 147 | ``` 148 | 149 | * The overall conversion rates by page reveal that 50% of traffic coming to home page moved on to search page. 150 | * About 6.7% percent of home page traffic clicked on a product in search page and landed onto payment. 151 | * Only 0.5% of original traffic decided to buy and ended up on confirmation page. 152 | 153 | ```{r} 154 | page_data %>% 155 | group_by(device) %>% 156 | summarize(conversion_1_2 = mean(!is.na(page2)), conversion_2_3 = mean(!is.na(page3)), 157 | conversion_3_4 = mean(!is.na(page4)), conversion_0 = mean(!is.na(page1))) 158 | ``` 159 | 160 | The page wise conversion rates by device reveals that conversion from search to payment page is happening at a much lower rate on desktop than mobile. Also, conversion from payment page to confirmation on desktop is much lower than on mobile. 161 | 162 | # Conclusion 163 | 164 | The huge gap in search to payment conversion between desktop and mobile versions indicates that search is not working properly in desktop version. The problem could be that search in desktop version is not showing relevant results as good as mobile version and resulting in higher percentage of users leaving the site before purchasing. Given that two-thirds of traffic comes from desktop, fixing the search on desktop version would present a great opportunity for the e-commerce company. 165 | 166 | # Question 2 167 | 168 | ```{r} 169 | page_data %>% 170 | group_by(sex) %>% 171 | summarize(conversion_1_2 = mean(!is.na(page2)), conversion_2_3 = mean(!is.na(page3)), 172 | conversion_3_4 = mean(!is.na(page4))) 173 | ``` 174 | 175 | Although the conversion rate among females appears to be higher than males but its not anything shocking. 
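One caveat on the conversion tables above: conversion_2_3 and conversion_3_4 are computed relative to all users in the group, not relative to the users who actually reached the previous page. A sketch of step-conditional rates (assuming page_data from above), which can equally be grouped by sex instead of device:

```{r}
# Step-conditional funnel rates: each step divided by the users who reached the
# previous step (sketch; the tables above divide every step by all users)
page_data %>%
  group_by(device) %>%
  summarize(home_to_search     = sum(!is.na(page2)) / sum(!is.na(page1)),
            search_to_payment  = sum(!is.na(page3)) / sum(!is.na(page2)),
            payment_to_confirm = sum(!is.na(page4)) / sum(!is.na(page3)))
```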
176 | 177 | ```{r} 178 | page_data %>% 179 | group_by(week_day) %>% 180 | summarize(conversion_1_2 = mean(!is.na(page2)), conversion_2_3 = mean(!is.na(page3)), 181 | conversion_3_4 = mean(!is.na(page4))) 182 | ``` 183 | 184 | The conversion rate on Mondays is higher than on any other day of the week. 185 | 186 | ```{r} 187 | page_week_summary <- page_data %>% 188 | group_by(week) %>% 189 | summarize(conversion_1_2 = mean(!is.na(page2)), conversion_2_3 = mean(!is.na(page3)), 190 | conversion_3_4 = mean(!is.na(page4))) 191 | 192 | # Visualize by breaking down the different page conversions 193 | page_week_summary %>% 194 | gather(key = type, value = conversion, -week) %>% 195 | 196 | ggplot(aes(x = week, y = conversion, group = type)) + 197 | geom_line(aes(color = type)) 198 | ``` 199 | 200 | # Model Building 201 | 202 | Random forest would be a good choice for predicting conversion as it is robust to outliers and works well with correlated, continuous, and discrete variables. Random forest is also relatively easy to tune, and partial dependence plots can be used to extract insights from the model. 203 | 204 | A standard 66-33% split can be used to generate training and test datasets. If the training dataset is not large enough, cross-validation can be used to arrive at the optimum parameters. 205 | 206 | ```{r} 207 | 208 | # The response variable needs to be changed into a factor 209 | data <- page_data %>% 210 | mutate(converted = as.factor(ifelse(is.na(page4),0,1))) %>% 211 | select(user_id, device, sex, week_day, week, converted) 212 | 213 | # Training and test set split 214 | set.seed(2019) 215 | train_sample = sample(nrow(data), size = round(nrow(data)*0.66)) 216 | train_data = data[train_sample,] 217 | test_data = data[-train_sample,] 218 | 219 | # Build a random forest model with mostly default settings, except for the number of trees 220 | rf.fit <- randomForest(y=train_data$converted, x = train_data[, -c(6)], ytest = test_data$converted, xtest = test_data[, -c(6)], ntree = 50, mtry = 2, keep.forest = TRUE) 221 | 222 | rf.fit 223 | ``` 224 | 225 | The OOB and test error rates look good, but the model is useless because it classifies everything as class '0'. This happens because the data are hugely unbalanced. 226 | 227 | Now the strategy is to increase the weight of the minority class to push the model towards correctly predicting more class 1 events. The gain in class 1 accuracy comes at the expense of class 0, and the goal is to strike a balance between them.
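Before sweeping class weights below, it is worth noting an alternative that often works comparably well: balanced sampling within each tree via strata/sampsize. A sketch, assuming train_data and test_data as defined above:

```{r}
# Alternative to class weights: downsample the majority class within each tree
# (sketch; train_data / test_data as created above)
n_minority <- sum(train_data$converted == 1)

rf.balanced <- randomForest(y = train_data$converted, x = train_data[, -c(6)],
                            ytest = test_data$converted, xtest = test_data[, -c(6)],
                            ntree = 150, mtry = 2,
                            strata = train_data$converted,
                            sampsize = c(n_minority, n_minority),
                            keep.forest = TRUE)
rf.balanced
```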
228 | 229 | 230 | ```{r} 231 | class0_error <- c() 232 | class1_error <- c() 233 | overall_accuracy <- c() 234 | 235 | for (i in 1:9){ 236 | 237 | rf.fit <- randomForest(y=train_data$converted, x = train_data[, -c(6)], ytest = test_data$converted, xtest = test_data[, -c(6)], ntree = 150, classwt=c(i, 10-i), mtry = 2, keep.forest = TRUE) 238 | class0_error <- c(class0_error, rf.fit$test$confusion[1,3]) 239 | class1_error <- c(class1_error, rf.fit$test$confusion[2,3]) 240 | overall_accuracy <- c(overall_accuracy, (rf.fit$test$confusion[1,1]+rf.fit$test$confusion[2,2])/sum(rf.fit$test$confusion[,-3])) 241 | } 242 | 243 | errors_df <- data.frame(weights = paste0(1:9,"-",9:1), class0_error, class1_error, overall_accuracy) 244 | errors_df 245 | ``` 246 | 247 | The above table shows that the optimum class weight ratio is around 1:9. 248 | 249 | ```{r} 250 | rf.fit <- randomForest(y=train_data$converted, x = train_data[, -c(6)], ytest = test_data$converted, xtest = test_data[, -c(6)], ntree = 150, classwt=c(1,9), mtry = 2, keep.forest = TRUE) 251 | rf.fit 252 | # Visualize important variables 253 | varImpPlot(rf.fit) 254 | ``` 255 | 256 | Now, let's try to build the model without user_id. 257 | 258 | ```{r} 259 | rf.fit2 <- randomForest(y=train_data$converted, x = train_data[, -c(1,6)], ytest = test_data$converted, xtest = test_data[, -c(1,6)], ntree = 150, classwt=c(1,9), mtry = 2, keep.forest = TRUE) 260 | rf.fit2 261 | 262 | ``` 263 | 264 | There is a clear drop in overall test set accuracy when user_id is dropped. The optimum model should in fact be rf.fit. 265 | ```{r} 266 | # Visualize important variables 267 | varImpPlot(rf.fit) 268 | # Order variables by importance 269 | imp_vars <- importance(rf.fit) 270 | pdp_vars <- names(sort(imp_vars[,1], decreasing = T)) 271 | ``` 272 | 273 | After tweaking the class weights, the model is no longer predicting every event as the majority class. The improvement in class 1 error was accomplished without hurting class 0 and overall accuracy too much. The variable user_id dominates the other variables, which is why the model was also rebuilt without it above (rf.fit2) to compare the variable importance. 274 | 275 | # PDP Analysis 276 | 277 | ```{r} 278 | # Loop through variables and build PDPs in decreasing order of variable importance 279 | for (i in seq_along(pdp_vars)) { 280 | partialPlot(rf.fit, data[,-6], pdp_vars[i], xlab="", 281 | main=paste("Partial Dependence on", pdp_vars[i]), which.class=1) 282 | } 283 | 284 | ``` 285 | 286 | # Conclusion 287 | 288 | * The partial dependence plots reveal that the conversion rate is relatively higher during the first 8 weeks (January and February) of the year, goes down over the next 4 weeks, and then picks up in the later weeks. It is possible that there is some seasonality in user buying patterns. 289 | 290 | * The PD plots and descriptive charts reveal that the desktop version of the site is underperforming badly compared to the mobile version. More precisely, it is the search on the desktop version that is causing a greater proportion of users to leave without purchasing. One possible reason could be a bad UI/experience or the presence of bugs that put users off converting on desktop. The product team should immediately focus on identifying these issues with the desktop version.
291 | 292 | * Another possible reason for higher number of users and lower conversion on desktop version is that the mobile app didn't somehow become popular, so you just have very engaged users using mobile and they obviously convert at a higher rate. There are very low number of casual mobile visitors. Also it's possible that the site is spending a lot of money on ads on desktop. And these ads are attracting the wrong kind of people. 293 | 294 | * Tuesday appears to be slightly better than other days and introducing promotions could improve the overall conversion rate for the site. 295 | 296 | * The PD plots also reveal that after controlling for device and other important variables, males appear to be performing better than females and naturally marketing should focus on increasing the relative weight of this segment. 297 | -------------------------------------------------------------------------------- /Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-12-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-12-1.png -------------------------------------------------------------------------------- /Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-14-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-14-1.png -------------------------------------------------------------------------------- /Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-15-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-15-1.png -------------------------------------------------------------------------------- /Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-15-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-15-2.png -------------------------------------------------------------------------------- /Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-15-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-15-3.png -------------------------------------------------------------------------------- /Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-15-4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-15-4.png -------------------------------------------------------------------------------- /Funnel 
Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-15-5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-15-5.png -------------------------------------------------------------------------------- /Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-9-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-9-1.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON City Similarities.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'JSON City Similarities' 3 | author: "Siddhartha Jetti" 4 | date: "6/30/2019" 5 | output: rmarkdown::github_document 6 | --- 7 | 8 | # Goal 9 | 10 | This is another challenge where your data is stored in a JSON file. Each row in this JSON stores info about all user searches within a session. 11 | 12 | Analyzing user behavior within the same session is often crucial. Clustering users based on their browsing behavior is probably the most important step if you want to personalize your site. 13 | 14 | The goal of this challenge is to build the foundation of personalization by identifying searches likely to happen together and cluster users based on their session searches. 15 | 16 | # Challenge Description 17 | 18 | Company XYZ is a Online Travel Agent site, such as Expedia, Booking.com, etc. 19 | 20 | They store their data in JSON files. Each row in the json shows all different cities which have been searched for by a user within the same session (as well as some other info about the user). That is, if I go to expedia and look for hotels in NY and SF within the same session, the corresponding JSON row will show my user id, some basic info about myself and the two cities. 21 | 22 | You are given the following tasks: 23 | 24 | 1) There was a bug in the code and one country didn’t get logged. It just shows up as an empty field (“”). Can you guess which country was that? How? 25 | 26 | 2) For each city, find the most likely city to be also searched for within the same session. 27 | 28 | 3) Travel sites are browsed by two kinds of users. Users who are actually planning a trip and users who just dream about a vacation. The first ones have obviously a much higher purchasing intent. 29 | Users planning a trip often search for cities close to each other, while users who search for cities far away from each other are often just dreaming about a vacation. That is, a user searching for LA, SF and Las Vegas in the same session is much more likely to book a hotel than a user searching for NY, Paris, Kuala Lumpur (makes sense, right?). 30 | Based on this idea, come up with an algorithm that clusters sessions into two groups: high intent and low intent. Explain all assumptions you make along the way. 31 | 32 | 33 | # Data 34 | 35 | The file is 36 | 37 | city_searches - a list of searches happening within the same session 38 | 39 | ## Fields: 40 | 41 | * session_id : session id. 
Unique by row 42 | * unix_timestamp : unixtime stamp of when the session started 43 | * cities : the unique cities which were searched for within the same session by a user 44 | * user : it is has the following nested fields: 45 | * user_id: the id of the user 46 | * joining_date: when the user created the account 47 | * country: where the user is based 48 | 49 | 50 | # Problem Setup 51 | 52 | ```{r} 53 | # Load required libraries 54 | library(tidyverse) 55 | library(jsonlite) 56 | library(lubridate) 57 | library(ggplot2) 58 | 59 | # Read in the input data into a dataframe 60 | data_json <- fromJSON("city_searches.json") 61 | ``` 62 | 63 | 64 | # Question 1: 65 | 66 | Transforming the JSON data into a data frame 67 | ```{r} 68 | session_id <- c() 69 | unix_timestamp <- c() 70 | cities <- c() 71 | user_id <- c() 72 | joining_date <- c() 73 | country <- c() 74 | 75 | # Converting the data in the right format 76 | for(i in 1:length(data_json$user)){ 77 | session_id <- c(session_id, data_json$session_id[[i]]) 78 | unix_timestamp <- c(unix_timestamp, data_json$unix_timestamp[[i]]) 79 | cities <- c(cities, data_json$cities[[i]]) 80 | user_id <- c(user_id, data_json$user[[i]][[1]][[1]]) 81 | joining_date <- c(joining_date, data_json$user[[i]][[1]][[2]]) 82 | country <- c(country, data_json$user[[i]][[1]][[3]]) 83 | } 84 | 85 | # Convert to a data frame 86 | data <- data.frame(session_id = session_id, unix_timestamp = unix_timestamp, cities = cities, 87 | user_id = user_id, joining_date = joining_date, country = country, stringsAsFactors = F) %>% 88 | mutate(country = ifelse(country == "", "Missing", country)) 89 | 90 | # Check data types of each of the columns 91 | summary(data) 92 | 93 | # check if any duplicate session id exist 94 | length(data$session_id) == length(unique(data$session_id)) 95 | 96 | # Check if any missing values exist 97 | colSums(is.na(data) | data == "") 98 | 99 | ``` 100 | There are no missing values in input data. Also, session id appears to be unique. 101 | 102 | ```{r} 103 | # Obtain Time and Hour of day from the time stamp 104 | data <- data %>% 105 | mutate(time = as.POSIXct(unix_timestamp, origin = "1970-01-01"), hour = hour(time)) 106 | 107 | head(data) 108 | ``` 109 | 110 | Visualizing the data. Ploting the number of searches by hour of day for each of the user countries. 111 | ```{r} 112 | 113 | countries <- unique(data$country) 114 | 115 | for(i in countries){ 116 | data_country <- data %>% 117 | filter(country == i) %>% 118 | group_by(hour) %>% 119 | summarise(sessions = n()) 120 | 121 | plot <- ggplot(data = data_country, aes(x = hour, y = sessions)) + 122 | geom_bar(stat = "identity") + 123 | ggtitle(paste("Sessions by hour of day in", i, sep=' ')) 124 | 125 | print(plot) 126 | } 127 | 128 | ``` 129 | 130 | 131 | It looks like Unix time stamp is based on one of the time zones in US. 132 | 133 | From the sessions Vs Hour of day histogram in US, it is clear that peak traffic is between 10 AM and 2PM. 134 | By assuming similar distribution of sessions by hour of day in the missing country, The session Vs hour histogram for Missing country reveals that local time differs by about 11-12 hrs from US. This hints that the missing country could be in Asia and most likely India or China. 135 | 136 | # Question 2: 137 | 138 | Each city can be imagined as a point in the n-dimensional space spun by user sessions. Each coordinate of the point(n-dimensional) would be the number of searches of the city in the session corresponding to the cordinate. 
The goal her is to build city similarity matrix and extract the most similar city to each of the city searched. The most similar cities are more likely to be searched together in a session than ones that are not. 139 | 140 | ```{r} 141 | 142 | # Find the maximum number of cities in a given session 143 | # This is done by counting the occurences of "," + 1 144 | max_cities <- max(str_count(data$cities, ",")) + 1 145 | 146 | user_city_matrix <- data %>% 147 | separate(col = cities, into = paste0("city", 1:max_cities), sep = ", ") %>% 148 | select(-user_id, -joining_date, -country, -time,-hour, -unix_timestamp) %>% 149 | gather(key = "value", value = "cities", -session_id) %>% 150 | filter(!is.na(cities)) %>% 151 | group_by(session_id, cities) %>% 152 | summarise(nsearches = n()) %>% 153 | ungroup() %>% 154 | spread(cities, nsearches) %>% 155 | mutate_all(funs(replace_na(., 0))) 156 | 157 | # n-dimensional space 158 | dim(user_city_matrix) 159 | 160 | # Take a peek at data 161 | head(user_city_matrix) 162 | ``` 163 | Each city is a point in 20022 dimensions and each cordinate is number of searches on that city in that session. 164 | Cosine similarity is used to compute similarity between two cities. Most similar cities have cosine similarity close to 1 and least similar have similarity close to 0. 165 | 166 | ```{r} 167 | 168 | user_city_matrix <- user_city_matrix %>% 169 | select(-session_id) 170 | 171 | unique_cities <- colnames(user_city_matrix) 172 | 173 | # Define a function to compute the cosine similarity between two cities 174 | cosine_similarity <- function(x, y) { 175 | sum(x * y) / (sqrt(sum(x * x)) * sqrt(sum(y * y))) 176 | } 177 | 178 | # Define a place holder to hold similarity between each pair of cities 179 | # similarity between a city and itself is 1 180 | city_similarity <- diag(1, nrow = ncol(user_city_matrix), ncol = ncol(user_city_matrix)) 181 | rownames(city_similarity) <- unique_cities 182 | colnames(city_similarity) <- unique_cities 183 | 184 | ncity <- ncol(user_city_matrix) 185 | ``` 186 | 187 | Now, compute the pair-wise city smilarities and populate the city similarity matrix. 188 | ```{r} 189 | # Generate city similarity matrix 190 | # Loop through the columns 191 | for(i in 1:ncity) { 192 | # Loop through the columns for each column 193 | for(j in 1:ncity) { 194 | # Fill in placeholder with cosine similarities 195 | city_similarity[i, j] <- cosine_similarity(user_city_matrix[i], user_city_matrix[j]) 196 | } 197 | } 198 | 199 | # Take a peek at city 200 | head(city_similarity[, 1:10]) 201 | ``` 202 | 203 | Most likely city to be searched along with a given city is the city that has the highest similarity score after itself. 204 | 205 | ```{r} 206 | likely_searches <- data.frame(City = unique_cities, stringsAsFactors = FALSE) 207 | 208 | # We are interested in the most similar city after the city itself. 209 | for(i in 1:length(unique_cities)){ 210 | cities_sorted_similarity <- names(sort(city_similarity[unique_cities[i],], decreasing = TRUE)) 211 | similarity <- sort(city_similarity[unique_cities[i],], decreasing = TRUE) 212 | city <- cities_sorted_similarity[cities_sorted_similarity != unique_cities[i]][1] 213 | likely_searches$Most_Similar[i] <- city 214 | likely_searches$Similarity_score[i] <- city_similarity[unique_cities[i], city] 215 | } 216 | 217 | head(likely_searches) 218 | ``` 219 | 220 | # Question 3: 221 | 222 | The goal is to classify multi-city search sessions into high and low intent based on the distance between searched cities. 
The straight forward way to accomplish this is by finding the geographic distance between each pair of cities and then classify session based on the obtained distance between the cities. Due to lack of data on the geographic distance between cities, An in-direct method should be employed. 223 | 224 | The cosine similarity between a pair of cities, each represented by a vector in n-dimensional user session space, tends to be higher for the pair of cities that are often searched together. Conversely, the cities that are not searched together would have lower cosine similarity. 225 | 226 | If we assume that users of online travel site have a reasonable intent to travel then lower cosine similarity between the pair of cities can be viewed as a proxy for higher distance between them. The similarity score for multi-city search sessions can be calculated as the average of cosine similarities between each pair of cities. 227 | 228 | Now let us test the assumption using few examples. 229 | ```{r} 230 | # Top 5 least similar cities with San Jose CA 231 | names(sort(city_similarity["San Jose CA", ]))[1:5] 232 | 233 | # Top 5 least similar cities with Miami FL 234 | names(sort(city_similarity["Miami FL", ]))[1:5] 235 | 236 | # Top 5 least similar cities with New York NY 237 | names(sort(city_similarity["New York NY", ]))[1:5] 238 | ``` 239 | 240 | Clearly, the least similar cities are ones from a far away state and in some cases cities from a different coast. So, Similarity can be used as a proxy for distance with an inverse relationship. 241 | 242 | Now, Computing similarity score for the multi-city sessions. 243 | ```{r} 244 | 245 | # Define a function to compute the similarity score for the session 246 | session_similarity <- function(cities){ 247 | # Get all the cities searched 248 | searched_cities <- strsplit(cities, split = ", ")[[1]] 249 | # if only one city is searched then similarity is assigned 0 250 | if(length(searched_cities) > 1){ 251 | city_pairs <- t(combn(searched_cities, 2)) 252 | similarity <- mean(city_similarity[city_pairs]) 253 | } else { similarity <- NA } 254 | } 255 | 256 | # Loop through all the sessions and assign session similarity 257 | for(i in 1:nrow(data)){ 258 | data$session_similarity_score[i] <- session_similarity(data$cities[i]) 259 | } 260 | 261 | # distribution of session similarity score among sessions with more than one city being searched 262 | quantile(data$session_similarity_score[!is.na(data$session_similarity_score)], probs = seq(0, 1, by = 0.05)) 263 | 264 | data %>% 265 | filter(!is.na(session_similarity_score)) %>% 266 | ggplot()+ 267 | geom_histogram(bins = 50, aes(x = session_similarity_score, y = ..density..))+ 268 | geom_density(aes(x = session_similarity_score, y = ..density..)) 269 | ``` 270 | 271 | From the similarity quantiles and distribution, Looks like 0.06 is a reasonable cuttoff for session similarity that classifies 25% of multi-city sessions as low intent and 75% of them as high intent. 272 | 273 | ```{r} 274 | data <- data %>% 275 | filter(!is.na(session_similarity_score)) %>% 276 | mutate(Booking_Intent = ifelse(session_similarity_score > 0.06, "High Intent", "Low Intent")) 277 | 278 | table(data$Booking_Intent) 279 | ``` 280 | 281 | For sessions with one city search, Unfortunately The data provided is not sufficient to classify them into high or low intent. However, If variables like time spent on site or clickstream behavior are provided it would be possible to classify them. 
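Instead of hand-picking the 0.06 cutoff, the two intent groups could also be found directly with a simple one-dimensional k-means on the session similarity scores (a sketch, reusing the filtered `data` frame above); the split point it implies may land near, but not exactly at, 0.06.

```{r}
# Let k-means pick the split between low- and high-similarity sessions
# (sketch; `data` is already filtered to multi-city sessions above)
set.seed(2019)
km <- kmeans(data$session_similarity_score, centers = 2, nstart = 25)

# Label the cluster with the higher mean similarity score as "High Intent"
high_cluster <- which.max(km$centers)
data$Booking_Intent_km <- ifelse(km$cluster == high_cluster, "High Intent", "Low Intent")

# Compare with the quantile-based labels
table(data$Booking_Intent, data$Booking_Intent_km)
```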
-------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-1.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-2.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-3.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-4.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-5.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-6.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-7.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-7.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-1.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-2.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-3.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-4.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-5.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-6.png -------------------------------------------------------------------------------- /JSON City 
Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-7.png -------------------------------------------------------------------------------- /Loan granting/Loan granting.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Loan granting' 3 | author: "Siddhartha Jetti" 4 | date: "6/6/2019" 5 | output: rmarkdown::github_document 6 | --- 7 | 8 | # Goal 9 | 10 | Another area where data science and machine learning play a huge role is in choosing if granting a loan. This is a particularly hot field as many start-ups feel that bank loan models can be improved. Therefore, there is space to come up with better loaning strategies that can benefit both the lender and the borrower. 11 | 12 | In this challenge, you will have access to loan data from a bank and will have to improve their model. 13 | 14 | # Challenge Description 15 | 16 | We have access to a specific bank loan data. We have data about all loans asked to the bank, whether the bank decided to grant it and, finally, whether the borrower managed to repay it. We also have info about the person asking for the loan at the moment she is asking for the loan. 17 | 18 | You have to come up with a better strategy to grant loans. Specifically you should: 19 | 20 | Build a model which is better than the bank model. Assume that: 21 | If you grant the loan and the it doesn’t get repaid, you lose 1. 22 | If you grant the loan and the it does get repaid, you gain 1 23 | If you don’t grant the loan, you gain 0. 24 | 25 | Using the rules above, compare bank profitability vs your model profitability. 26 | 27 | * Describe the impact of the most important variables on the prediction. Also, focus on the variable “is_employed”, which describes whether the borrower is employed when she asks for the loan. How does this variable impact the model? Explain why. 28 | 29 | * Are there any other variables you’d like to include in the model? 30 | 31 | # Data 32 | 33 | We have 2 table downloadable by clicking here. 34 | 35 | The 2 tables are: 36 | 37 | loan_table - general information about the loan 38 | 39 | ### Columns: 40 | 41 | * loan_id : the id of the loan. Unique by loan. Can be joined to loan id in the other table 42 | * loan_purpose : the reason for asking the loan: investment, other, business, emergency_funds, home 43 | * date : when the loan was asked 44 | * loan_granted : whether the loan was granted 45 | * loan_repaid : whether the loan was repaid. NA means that the loan was not granted 46 | 47 | borrower_table - information about the borrower 48 | 49 | ### Columns: 50 | 51 | * loan_id : the id of the the loan. Unique by loan. Can be joined to loan id in the other table 52 | * is_first_loan : did she ask for any other loans in her lifetime? 53 | * fully_repaid_previous_loans : did she pay on time all of her previous loans? If this is the first loan, it is NA 54 | * currently_repaying_other_loans : is she currently repaying any other loans? If this is the first loan, it is NA 55 | * total_credit_card_limit : total credit card monthly limit 56 | * avg_percentage_credit_card_limit_used_last_year : on an average, how much did she use of her credit card limit in the previous 12 months. 
This number can be >1 since it is possible to go above the credit card limit 57 | * saving_amount : total saving amount balance when she asked for the loan 58 | * checking_amount : total checking amount balance when she asked for the loan 59 | * is_employed : whether she is employed (1) or not (0) 60 | * yearly_salary : how much she earned in the previous year 61 | * age : her age 62 | * dependent_number : number of people she claims as dependent 63 | 64 | 65 | # Problem Setup 66 | 67 | ```{r} 68 | # Load required libraries 69 | library(dplyr) 70 | library(ggplot2) 71 | library(randomForest) 72 | library(ROCR) 73 | 74 | # Read in the input data into a dataframe 75 | borrower <- read.csv("borrower_table.csv") 76 | loan <- read.csv("loan_table.csv") 77 | ``` 78 | 79 | 80 | # Question 1: 81 | 82 | Understanding and transforming the data 83 | ```{r} 84 | # Joining both loan and borrower tables. Transform variables to right format 85 | data <- loan %>% 86 | inner_join(borrower, by = "loan_id") 87 | 88 | # Take a peak at the data 89 | str(data) 90 | summary(data) 91 | ``` 92 | 93 | Check for missing values in the data. 94 | 95 | ```{r} 96 | # count of missing values by column in the merged data 97 | colSums(is.na(data)) 98 | 99 | # is there a pattern in missing values? 100 | data %>% 101 | filter(is.na(loan_repaid)) %>% 102 | select(loan_granted) %>% 103 | distinct() 104 | 105 | data %>% 106 | filter(is.na(fully_repaid_previous_loans) | is.na(currently_repaying_other_loans)) %>% 107 | select(is_first_loan) %>% 108 | distinct() 109 | 110 | data %>% 111 | filter(is.na(avg_percentage_credit_card_limit_used_last_year)) %>% 112 | select(total_credit_card_limit) %>% 113 | distinct() 114 | ``` 115 | 116 | Clearly, There is a pattern in missing values and are not occuring at random. 117 | 118 | Missing values exist in 'fully_repaid_previous_loans' and 'currently_repaying_other_loans' variables only for the customers who are applying for the first loan. The variable 'loan_repaid' is missing only for customers who are not granted a loan. Also, 'avg_percentage_credit_card_limit_used_last_year' is missing only for customers whose credit limit is zero. All this makes sense. 119 | 120 | Imputation for missing values is required. 121 | ```{r} 122 | data %>% 123 | select(loan_repaid, fully_repaid_previous_loans, currently_repaying_other_loans, avg_percentage_credit_card_limit_used_last_year) %>% 124 | head() 125 | ``` 126 | 127 | Three of the above variables are discrete and one is continuous. 128 | Due to the non-random nature of missing values, new category should be created for missing values in categorical variables and a very high number is imputed for missing values in continuous variables. 129 | 130 | ```{r} 131 | data <- data %>% 132 | mutate(fully_repaid_previous_loans = as.factor(ifelse(is.na(fully_repaid_previous_loans), -1, fully_repaid_previous_loans)), 133 | currently_repaying_other_loans = as.factor(ifelse(is.na(currently_repaying_other_loans), -1, currently_repaying_other_loans)), 134 | avg_percentage_credit_card_limit_used_last_year = ifelse(is.na(avg_percentage_credit_card_limit_used_last_year), 9999, avg_percentage_credit_card_limit_used_last_year), 135 | is_employed = as.factor(is_employed), 136 | is_first_loan = as.factor(is_first_loan)) 137 | ``` 138 | 139 | Estimating the bank profitability based on existing lending policy. 
140 | 141 | ```{r} 142 | data %>% 143 | transmute(profitability = ifelse(loan_granted == 0, 0, loan_repaid*(loan_granted + 1) - 1)) %>% 144 | summarize(profit = sum(as.numeric(profitability))) 145 | 146 | ``` 147 | 148 | The bank profit based on the gain/loss rules is 13758. 149 | 150 | ```{r} 151 | summary(data) 152 | ``` 153 | 154 | After Imputation, no missing values exist in the dataset other than loan_repaid variable. 155 | 156 | Run descriptive stats on the input data. Clearly, variables if employed, saving_amount and salary appear to matter to determine the probability of repayment of the loan. 157 | 158 | Check if_employed vs loan_repaid 159 | ```{r} 160 | 161 | is_employed_data <- data %>% 162 | filter(loan_granted == 1) %>% 163 | mutate(loan_repaid = as.factor(loan_repaid)) %>% 164 | group_by(is_employed,loan_repaid) %>% 165 | summarize(counts = n()) 166 | 167 | ggplot(data = is_employed_data, aes(x = is_employed, y = counts, fill = loan_repaid))+ 168 | geom_bar(stat = "identity") 169 | 170 | ``` 171 | 172 | Clearly, Among the loans taken by employed majority of them are repaid, which makes sense. 173 | Check how salary effects loan repaid variable. 174 | ```{r} 175 | employed_data <- data %>% 176 | group_by(loan_repaid) %>% 177 | summarise(mean_yearly_salary = mean(yearly_salary), 178 | mean_savings = mean(saving_amount)) 179 | 180 | 181 | ggplot(data = employed_data, aes(x = loan_repaid, y = mean_yearly_salary))+ 182 | geom_bar(stat = "identity") 183 | 184 | ggplot(data = employed_data, aes(x = loan_repaid, y = mean_savings))+ 185 | geom_bar(stat = "identity") 186 | ``` 187 | 188 | The above plots reveal that loans that are repaid, reported on an average higher savings balance and higher yearly salary than the loans that are not repaid. 189 | 190 | # Model Building 191 | 192 | The gain or loss rules are as follows 193 | * If loan is granted and does not get repaid, loss 1. 194 | * If loan is granted and gets repaid, gain 1. 195 | * If loan is not granted, gain 0. 196 | 197 | Lets build a model to estimate probability of repayment of loan and eventually predict if loan gets repaid or not. Random forest would be a good candidate as it is strong with outliers and works well with correlated, continuous and discrete variables. Also, random forest is easier to optimize parameters.Partial dependence plots can be used to capture insights from the model. 198 | 199 | Define the class 0 as loan not getting repaid and class 1 as loan getting repaid. The goal is to build random forest model and find optimal cuttoff probability that maximizes profit based on the above rules. The loan will be granted only if estimated probability of repayment is more than chosen cuttoff else the loan gets denied. 200 | 201 | * False positive, FP results in -1 202 | * True positive, TP results in +1 203 | * False negative, FN results in 0 204 | * True negative, TN results in 0 205 | 206 | The model should be built using only the rows where loans are granted as we know if loan is eventually repaid or not. The rows where loan not granted are excluded from the training sample. A standard 66-33% split can be used to generate training and test datasets. 
If training dataset is not large enough then cross validation can be used to arrive at the optimum parameters 207 | 208 | ```{r} 209 | 210 | # save loans that are denied by bank for future use 211 | loans_denied <- data %>% 212 | filter(loan_granted == 0) 213 | 214 | # Training and test set split 215 | data <- data %>% 216 | filter(loan_granted == 1) %>% 217 | select(-loan_id, -date, -loan_granted) %>% 218 | mutate(loan_repaid = as.factor(loan_repaid)) 219 | 220 | set.seed(2019) 221 | train_sample = sample(nrow(data), size = round(nrow(data)*0.66)) 222 | train_data = data[train_sample,] 223 | test_data = data[-train_sample,] 224 | 225 | # Build Random forest model on the data with mostly default settings except for class weight and #trees 226 | rf.fit <- randomForest(y = train_data$loan_repaid, x = train_data[,-c(2)], ytest = test_data$loan_repaid, 227 | xtest = test_data[,-c(2)], ntree = 250, mtry = 4, keep.forest = TRUE) 228 | 229 | rf.fit 230 | # Visualize Important variables 231 | varImpPlot(rf.fit) 232 | ``` 233 | 234 | From the variable importance plot, Variables 'saving_amount' and 'checking_amount' have too much importance compared with other variables. This is happening because those variables are acting as a proxy for other variables and have in them part of the information from other variables. Lets rebuild the model by removing these to see how it changes the variable importance. 235 | 236 | ```{r} 237 | # Build Random forest model on the data with mostly default settings except for class weight and #trees 238 | # without the top variables 239 | rf.fit2 <- randomForest(y = train_data$loan_repaid, x = train_data[,-c(2,8,9)], ytest = test_data$loan_repaid, 240 | xtest = test_data[,-c(2,8,9)], ntree = 250, keep.forest = TRUE) 241 | 242 | rf.fit2 243 | # Visualize Important variables 244 | varImpPlot(rf.fit2) 245 | ``` 246 | 247 | 248 | Overall error rate, class 0 error rate and class 1 error rate look decent. The model can be used for insights. Also, training and test error are close indicating that model is not overfitting to data. 249 | 250 | ```{r} 251 | # Order variables by importance 252 | imp_vars <- importance(rf.fit) 253 | pdp_vars <- names(sort(imp_vars[,1], decreasing = T)) 254 | 255 | # Loop through variables and build PDP in the decreasing order of variable importance 256 | for (i in seq_along(pdp_vars)) { 257 | partialPlot(rf.fit, data[,-2], pdp_vars[i], xlab="", 258 | main=paste("Partial Dependence on", pdp_vars[i]), which.class=1) 259 | } 260 | 261 | ``` 262 | 263 | # Question 2 264 | 265 | Now finding the cuttoff probability that maximizes the profit. 266 | ```{r} 267 | #Compare predictions (votes) vs actual results for the test set 268 | pred = prediction(rf.fit$test$votes[,2], test_data$loan_repaid) 269 | 270 | error_cutoff = data.frame (pred@cutoffs, 271 | pred@tn, 272 | pred@fn, 273 | pred@fp, 274 | pred@tp, 275 | row.names = NULL) 276 | colnames(error_cutoff) = c("cutoff", "tn", "fn", "fp", "tp") 277 | 278 | error_cutoff <- error_cutoff %>% 279 | mutate(profit = tp-fp) %>% 280 | arrange(desc(profit)) 281 | 282 | error_cutoff[1,] 283 | ``` 284 | 285 | The cuttoff probability of 0.52 results in maximum profit based on the gain/loss rules. 286 | Now use the model "rf.fit" and the above cuttoff to score the all the loans where bank originally granted loans and compare the profit with bank's profit. 
287 | 288 | ```{r} 289 | repayment_prob <- predict(rf.fit, data[,-2], type = "prob") 290 | loan_status <- ifelse(repayment_prob[,2] > 0.52, 'Grant', 'Deny') 291 | 292 | # Number of loans granted or denied 293 | table(loan_status) 294 | 295 | # Among the granted ones, how many would be repaid or not repaid 296 | repay_status <- data$loan_repaid[which(loan_status == 'Grant')] 297 | status <- table(repay_status) 298 | 299 | # Profit with the new model 300 | status[2] - status[1] 301 | ``` 302 | 303 | Among the 47654 loans granted by the bank, the new model would grant only 30402 loans and deny 17252. 304 | Out of the 30402 granted loans, 29899 would be repaid and 503 would not be repaid. This results in a total gain of 29396, much higher than the bank's profit of 13758. 305 | 306 | The new model reveals that to maximize profit, the bank should grant a loan only if the predicted repayment probability is more than 52%. 307 | 308 | Now use the new model to score the loans that were denied by the bank. 309 | ```{r} 310 | loans_denied_repayment <- predict(rf.fit, loans_denied[,-c(1,3,4,5)], type = "prob") 311 | loan_status <- ifelse(loans_denied_repayment[,2] > 0.52, 'Grant', 'Deny') 312 | 313 | table(loan_status) 314 | ``` 315 | The fact that the bank denied these loans suggests that they are bad loans with a low probability of repayment. Not surprisingly, the new model would deny 65% of these loans and grant only 35% of them. 316 | 317 | # Question 3 318 | 319 | The following conclusions can be drawn from the variable importance plot and partial dependence plots of the full model "rf.fit". 320 | 321 | * The variables savings amount and checking amount appear to dominate the other variables in terms of variable importance. It is likely that these variables are correlated with, and act as a proxy for, other variables, carrying part of their information as well. 322 | 323 | * The variable "is_employed", which appeared to be an important variable in the descriptive stats, has very low importance in the combined model. This is counterintuitive because people who are employed have steady cash flow and are expected to have a higher chance of repayment. It can be explained by the fact that is_employed is highly correlated with savings/checking amount and perfectly correlated with yearly salary. Also, banks offer higher credit limits to employed people. Generally, people who are employed tend to have a higher saving/checking balance, yearly salary and credit limit than people who are not. After controlling for the saving/checking amount, yearly salary and credit limit variables, there is not much information left to be extracted from is_employed, and its importance appears low. The following plots provide evidence in favor of the above hypothesis.
324 | 325 | ```{r} 326 | employed_data <- data %>% 327 | group_by(is_employed) %>% 328 | summarise(mean_yearly_salary = mean(yearly_salary), 329 | mean_savings = mean(saving_amount), 330 | mean_checking = mean(checking_amount), 331 | mean_credit_limit = mean(total_credit_card_limit)) 332 | employed_data 333 | 334 | # Relation between is_employed and yearly salary 335 | ggplot(data = employed_data, aes(x = is_employed, y = mean_yearly_salary))+ 336 | geom_bar(stat = "identity") 337 | 338 | # Relation between is_employed and mean savings amount 339 | ggplot(data = employed_data, aes(x = is_employed, y = mean_savings))+ 340 | geom_bar(stat = "identity") 341 | 342 | # Relation between is_employed and mean checking 343 | ggplot(data = employed_data, aes(x = is_employed, y = mean_checking))+ 344 | geom_bar(stat = "identity") 345 | 346 | # Relation between is_employed and mean credit limit 347 | ggplot(data = employed_data, aes(x = is_employed, y = mean_credit_limit))+ 348 | geom_bar(stat = "identity") 349 | 350 | ``` 351 | 352 | * People with higher saving/checking amount, higher credit limit and higher salary perform better in terms of loan repayment. 353 | 354 | * People who are currently repaying other loans appear to have lower chance of repayment than people who are not currently repaying as they have higher financial burden. 355 | 356 | * The chance of repayment appear to be highest in middle aged people roughly around 40 yrs and appears to drop after. 357 | 358 | * Loans for emergency funds appear to be less likely to be repaid than loans taken for business or investment purpose. 359 | 360 | * People with more number of dependents are less likely to repay than people with fewer dependents. 361 | 362 | * People who repaid loans fully in the past appear to have higher chances of repaying the loans again. 363 | 364 | * After controlling for all the other variables, is first loan variable does not appear to matter much. 365 | 366 | 367 | # Question 4 368 | 369 | Suprisingly, the data provided for this exercise lacks a critical variable like loan amount (amount they are seeking through the loan). If loan amount is very high then it could make even a credit worthy person to look bad and get denied on the loan. I would include this variable in the model. 370 | 371 | Another variable that is not provided is the total existing debt (all kinds) each person has. This variable is important because higher debt would cause higher financial burden which could affect the chance of repayment. 
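To make the last point concrete, the sketch below shows how such variables could enter the model. Note that `loan_amount` and `total_existing_debt` do not exist in the provided tables; they are hypothetical columns, simulated here purely for illustration.

```{r}
# Hypothetical illustration only: loan_amount and total_existing_debt are NOT
# in the provided data, so a tiny toy data frame is simulated to show how
# debt-burden ratios could be derived and appended to the model matrix.
toy <- data.frame(yearly_salary       = c(30000, 85000, 0),
                  saving_amount       = c(2000, 15000, 500),
                  loan_amount         = c(15000, 20000, 5000),   # hypothetical column
                  total_existing_debt = c(5000, 60000, 2000))    # hypothetical column

toy %>%
  mutate(debt_to_income = total_existing_debt / pmax(yearly_salary, 1),
         loan_to_income = loan_amount / pmax(yearly_salary, 1),
         loan_to_liquid = loan_amount / pmax(saving_amount, 1))
```

If the bank could supply these fields, the derived ratios would simply be added to the feature matrix passed to randomForest alongside the existing borrower variables.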
-------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-11-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-11-1.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-1.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-10.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-11.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-11.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-12.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-12.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-2.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-3.png -------------------------------------------------------------------------------- /Loan 
granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-4.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-5.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-6.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-7.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-8.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-9.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-16-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-16-1.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-16-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-16-2.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-16-3.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-16-3.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-16-4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-16-4.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-8-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-8-1.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-9-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-9-1.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-9-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-9-2.png -------------------------------------------------------------------------------- /Online Video Challenge/Online Video Challenge.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Online Video Challenge' 3 | author: "Siddhartha Jetti" 4 | date: "7/4/2019" 5 | output: rmarkdown::github_document 6 | --- 7 |
8 | # Goal 9 |
10 | The company of this challenge allows users to upload videos online, just like YouTube. 11 |
12 | This company is interested in knowing whether a video is “hot” (i.e. trending up in terms of popularity), stable or going down. Understanding this would allow them to optimize the videos promoted on the home-page and, therefore, maximize ads revenue. 13 |
14 | # Challenge Description 15 |
16 | Company XYZ is an online video streaming company, just like YouTube or Dailymotion. 17 |
18 | The Head of Product has identified a very high home page drop-off rate as a major problem for the site. That is, users come to the home-page and then leave the site without taking any action or watching any video. 19 | Since customer acquisition costs are very high, this is a huge problem: the company is spending a lot of money to acquire users who don’t generate any revenue by clicking on ads. 20 |
21 | Currently, the videos shown on the home page to new users are manually chosen. The Head of Product had this idea of creating a new recommended video section on the home page. 22 |
23 | He asked you the following: 24 |
25 | 1) Classify each video into one of these 3 categories: 26 | “Hot” - means trending up. These videos are candidates to be shown.
27 | “Stable and Popular” - video view counts are flat, but very high. These videos are candidates to be shown too. 28 | “Everything else” - these videos won’t be shown. 29 | What are the main characteristics of the “hot videos”? 30 |
31 | 2) After having identified the characteristics of the hot videos, how would you use this information from a product standpoint? 32 |
33 | # Data 34 |
35 | We have 2 tables downloadable by clicking here. 36 |
37 | The 2 tables are: 38 |
39 | video_count - provides information about how many times each video was seen each day. 40 |
41 | ## Columns: 42 |
43 | * video_id : video id, unique by video and joinable to the video id in the other table 44 | * count : total count of views for each video 45 | * date : on which day that video was watched that many times 46 |
47 | video_features - characteristics of the video. 48 |
49 | ## Columns: 50 |
51 | * video_id : video id, unique by video and joinable to the video id in the other table 52 | * video_length : length of the video in seconds 53 | * video_language : language of the video, as selected by the user when she uploaded the video 54 | * video_upload_date : when the video was uploaded 55 | * video_quality : quality of the video. It can be [ 240p, 360p, 480p, 720p, 1080p] 56 |
57 | # Problem Setup 58 |
59 | ```{r} 60 | # Load required libraries 61 | library(dplyr) 62 | library(ggplot2) 63 | library(lubridate) 64 | library(randomForest) 65 | 66 | 67 | # Read in the input data into a dataframe 68 | views <- read.csv("video_count.csv", stringsAsFactors = F) 69 | features <- read.csv("video_features.csv", stringsAsFactors = F) 70 | ``` 71 |
72 | # Question 1: 73 |
74 | Explore the views dataset 75 | ```{r} 76 | # Transform views dataset 77 | views <- views %>% 78 | mutate(date = as.Date(date)) %>% 79 | arrange(video_id, date) 80 | 81 | # Check data types of each of the columns 82 | summary(views) 83 | 84 | # Take a peek at data 85 | head(views) 86 | ``` 87 |
88 | Explore the features dataset 89 | ```{r} 90 | 91 | # Transform features dataset 92 | features <- features %>% 93 | mutate(video_upload_date = as.Date(video_upload_date)) %>% 94 | arrange(video_id) 95 | 96 | # Check data types of each of the columns 97 | summary(features) 98 | 99 | # Take a peek at data 100 | head(features) 101 | ``` 102 |
103 | Check for missing values in the views and features data. 104 | ```{r} 105 | # count of missing values by column in views dataset 106 | colSums(is.na(views)) 107 | # count of missing values by column in features dataset 108 | colSums(is.na(features)) 109 | ``` 110 | No missing values exist anywhere in the data. 111 |
112 | Check if count data exists for all the videos for all the dates. 113 | ```{r} 114 | table(views$date) 115 | ``` 116 |
117 | Clearly, the video counts are provided for 15 days starting from January 1st 2015. 118 |
119 | ```{r} 120 | length(unique(views$video_id)) 121 | ``` 122 |
123 | There are 2785 videos, and 15 days of view count data exist for every video. 124 |
125 | Here the goal is to classify videos into Hot, Stable and others based on whether videos are trending up or down. Extracting the trend from the time series of view counts is essential to accomplish this task. 126 |
127 | There are several ways to extract the trend. The simplest method is to fit a straight line that is closest to the view count data points for each video, in other words finding the line that minimizes the total sum of squared distances between the points and the line.
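As a toy illustration of this least-squares idea (made-up numbers, not the challenge data), the slope of the fitted line can be read directly from the `lm()` coefficients, and its sign tells us whether a series is trending up or down:

```{r}
# Toy illustration only: a made-up 15-day view-count series
toy_days   <- 1:15
toy_counts <- c(120, 135, 150, 170, 160, 185, 200, 220, 210, 240, 255, 270, 300, 310, 330)

# Least-sum-of-squares (ordinary least squares) fit
toy_fit <- lm(toy_counts ~ toy_days)

# The coefficient of toy_days is the slope of the trend line;
# a positive value indicates an upward trend, a negative value a downward trend
coef(toy_fit)[2]
```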
128 | 129 | Let's pick a few arbitrary videos and visualize the time series of views. 130 |
131 | ```{r} 132 | views %>% 133 | filter(video_id == 176) %>% 134 | 135 | ggplot(aes(date, count)) + 136 | geom_point() + 137 | geom_smooth(method = "lm", se = FALSE) 138 | ``` 139 |
140 | ```{r} 141 | views %>% 142 | filter(video_id == 499) %>% 143 | 144 | ggplot(aes(date, count)) + 145 | geom_point() + 146 | geom_smooth(method = "lm", se = FALSE) 147 | ``` 148 |
149 | For the selected video "176", the least-sum-of-squares line appears to be a good fit to the data. But for video "499", it looks like a poor fit. Since the goal here is not to predict future counts but only to describe the trend, the linear trendline approach can still be used. The slope of the trendline is the coefficient of the date variable in the regression equation. A positive slope indicates that views are trending up, and a negative slope indicates that the video is trending down in terms of views over time. 150 |
151 | Before finding the least-sum-of-squares fit for all the videos, let us normalize the variables so that visualizing the trendlines of various videos becomes easy. 152 |
153 | ```{r} 154 | # Function to normalize the variables 155 | normalize <- function(x) { 156 | return((x - min(x)) / (max(x) - min(x))) 157 | } 158 | 159 | # Function to extract slope and trend in the data 160 | slope_trendline <- function(y, x) { 161 | trendline <- lm(formula = y ~ x) 162 | return(trendline$coefficients[2]) 163 | } 164 | 165 | # Now group by video id, normalize variables and extract slope 166 | views_summary <- views %>% 167 | mutate(date = as.numeric(date), 168 | count_norm = normalize(count), 169 | date_norm = normalize(date)) %>% 170 | group_by(video_id) %>% 171 | summarise(avg_views_per_day = round(mean(count), digits = 0), 172 | slope = slope_trendline(count_norm, date_norm)) 173 | 174 | # Take a look at the data 175 | head(views_summary) 176 | ``` 177 |
178 | Here is the distribution of average counts per day 179 | ```{r} 180 | quantile(views_summary$avg_views_per_day, probs = seq(0, 1, by = 0.05)) 181 | views_summary %>% 182 | ggplot() + 183 | geom_histogram(bins = 30, aes(x = avg_views_per_day, y = ..density..)) + 184 | geom_density(aes(x = avg_views_per_day, y = ..density..)) 185 | ``` 186 |
187 | The distribution of slopes of trendlines 188 | ```{r} 189 | quantile(views_summary$slope, probs = seq(0, 1, by = 0.1)) 190 | views_summary %>% 191 | ggplot() + 192 | geom_histogram(bins = 30, aes(x = slope)) 193 | ``` 194 |
195 | Based on the above distributions, here are the rules that can be used to classify videos into Hot, Stable and others. 196 |
197 | * If slope >= 0.02 then the video is "Hot" 198 | * If -0.02 < slope < 0.02 and average views per day is more than 1,000,000 then the video is "Stable and Popular". 199 | * If none of the above then the video is "Everything else" 200 |
201 | ```{r} 202 | # Classifying videos based on the stated rules 203 | views_summary <- views_summary %>% 204 | mutate(category = case_when( 205 | slope >= 0.02 ~ "Hot", 206 | slope > -0.02 & slope < 0.02 & avg_views_per_day > 1000000 ~ "Stable and Popular", 207 | TRUE ~ "Everything else") 208 | ) 209 | 210 | # Frequency of different video categories 211 | table(views_summary$category) 212 | ``` 213 |
214 | # Question 2: 215 |
216 | Now, the goal is to build a model to predict if a video is "Hot" and understand the factors that affect it.
217 | 218 | Run descriptive stats on the video features dataset 219 | ```{r} 220 | # Merging the two datasets 221 | data <- features %>% 222 | left_join(views_summary, by = "video_id") %>% 223 | mutate(is_hot = as.factor(ifelse(category == "Hot", 1, 0)), 224 | days_after_upload = as.Date("2015-01-15") - video_upload_date, 225 | video_language = as.factor(video_language), 226 | video_quality = as.factor(video_quality), 227 | upload_weekday = as.factor(weekdays(video_upload_date)), 228 | upload_week = week(video_upload_date)) %>% 229 | select(-video_upload_date, -category, -slope) 230 | 231 | # Variable data types 232 | summary(data) 233 | ``` 234 |
235 | ```{r} 236 | data %>% 237 | group_by(is_hot) %>% 238 | summarise(avg_length = mean(video_length)) %>% 239 | ggplot() + 240 | geom_col(aes(x = is_hot, y = avg_length)) 241 | 242 | ggplot(data, aes(x = video_length, group = is_hot)) + 243 | geom_histogram(binwidth = 120, aes(fill = is_hot)) 244 | ``` 245 |
246 | The average length of Hot videos is much lower than that of other videos. The histogram of video length reveals that the proportion of Hot videos is much lower among long videos than among shorter ones. 247 |
248 | Average views per day by language 249 | ```{r} 250 | data %>% 251 | group_by(video_language) %>% 252 | summarise(avg_views_per_day_per_video = mean(avg_views_per_day)) %>% 253 | ggplot() + 254 | geom_col(aes(x = video_language, y = avg_views_per_day_per_video)) 255 | ``` 256 |
257 |
258 | Number of videos uploaded and proportion of hot videos by language 259 | ```{r} 260 | data %>% 261 | group_by(video_language, is_hot) %>% 262 | summarise(counts = n()) %>% 263 | mutate(freq = round(counts / sum(counts), digits = 2)) %>% 264 | ggplot(aes(x = video_language, y = counts, group = is_hot)) + 265 | geom_col(aes(fill = is_hot)) + 266 | geom_text(aes(label = paste0(freq*100, "%"))) 267 | ``` 268 |
269 | The bar chart reveals that there are far fewer German and French videos than videos in other languages. English and Chinese have the highest numbers of videos. 270 |
271 | Number of videos uploaded and proportion of hot videos by quality 272 | ```{r} 273 | data %>% 274 | group_by(video_quality, is_hot) %>% 275 | summarise(counts = n()) %>% 276 | mutate(freq = round(counts / sum(counts), digits = 2)) %>% 277 | ggplot(aes(x = video_quality, y = counts, group = is_hot)) + 278 | geom_col(aes(fill = is_hot)) + 279 | geom_text(aes(label = paste0(freq*100, "%"))) 280 | ``` 281 |
282 | Number of videos uploaded and proportion of hot videos by upload week day 283 | ```{r} 284 | data %>% 285 | group_by(upload_weekday, is_hot) %>% 286 | summarise(counts = n()) %>% 287 | mutate(freq = round(counts / sum(counts), digits = 2)) %>% 288 | ggplot(aes(x = upload_weekday, y = counts, group = is_hot)) + 289 | geom_col(aes(fill = is_hot)) + 290 | geom_text(aes(label = paste0(freq*100, "%"))) 291 | ``` 292 |
293 | ## Model Building: 294 |
295 | Random forest would be a good choice for predicting "Hot" videos as it is robust to outliers and works well with correlated, continuous and discrete variables. Also, random forest parameters are relatively easy to tune. Partial dependence plots can be used to capture insights from the model. 296 |
297 | A standard 66-34% split can be used to generate the training and test datasets.
If the training dataset is not large enough, then cross-validation can be used to arrive at the optimal parameters. 298 |
299 | ```{r} 300 | 301 | # Split the available data into training and test data sets 302 | set.seed(2019) 303 | train_sample = sample(nrow(data), size = round(nrow(data)*0.66)) 304 | train_data = data[train_sample,] 305 | test_data = data[-train_sample,] 306 | 307 | # Build Random forest model on the data with mostly default settings except for the number of trees and mtry 308 | rf.fit <- randomForest(y = train_data$is_hot, x = train_data[, -c(5,6)], ytest = test_data$is_hot, 309 | xtest = test_data[, -c(5,6)], ntree = 250, mtry = 3, keep.forest = TRUE) 310 | 311 | rf.fit 312 | 313 | # Visualize Important variables 314 | varImpPlot(rf.fit) 315 | ``` 316 |
317 | ```{r} 318 | # Order variables by importance 319 | imp_vars <- importance(rf.fit) 320 | pdp_vars <- names(sort(imp_vars[,1], decreasing = T)) 321 | 322 | # Loop through variables and build PDP in the decreasing order of variable importance 323 | for (i in seq_along(pdp_vars)) { 324 | partialPlot(rf.fit, data[,-c(5,6)], pdp_vars[i], xlab = "", 325 | main = paste("Partial Dependence on", pdp_vars[i]), which.class = 1) 326 | } 327 | 328 | ``` 329 |
330 | The variable importance plot reveals that video length is a major factor in predicting whether a video is trending or not, followed by the days-after-upload and video quality variables. The variable upload week is just a proxy for days after upload. 331 |
332 | The following are the characteristics of videos that are trending up: 333 |
334 | * Short video lengths. 335 |
336 | * Newer videos are more likely to be trending than older ones. This is expected: as videos age, they lose the novelty effect and become less effective at generating views. 337 |
338 | * Low-resolution videos (240p) have a slightly higher chance of trending on the platform than higher-resolution ones. This could be because lower-resolution videos are easier to stream in low-bandwidth situations, such as watching on mobile devices on buses and subways. 339 |
340 | * German and French videos are more likely to be trending than videos in other languages. This is not surprising, as there are so few German and French videos uploaded on the site, and the low supply in those languages could be resulting in a higher proportion of Hot videos. 341 |
342 | # Question 3: 343 |
344 | Let us dig deeper and explore the relationship between video length and views. 345 | ```{r} 346 | # Bin videos based on the video length 347 | data <- data %>% 348 | mutate(video_length_type = case_when(video_length > 0 & video_length <= 300 ~ "0-5 mins", 349 | video_length > 300 & video_length <= 600 ~ "05-10 mins", 350 | video_length > 600 & video_length <= 900 ~ "10-15 mins", 351 | TRUE ~ "15+ mins") 352 | ) 353 | 354 | data %>% 355 | group_by(video_length_type) %>% 356 | summarise(avg_views_per_day_per_video = mean(avg_views_per_day)) %>% 357 | ggplot() + 358 | geom_col(aes(x = video_length_type, y = avg_views_per_day_per_video)) 359 | ``` 360 |
361 | The above plot shows that long videos (15+ mins duration) generate a far lower number of views than short ones (less than 10 min duration); average views per day is lower by more than 35%. Also, the partial dependence plots reveal that short videos are more likely to be trending than longer ones. 362 |
363 | From the product standpoint, here are some recommendations: 364 |
365 | * The company should investigate why there are so few German and French videos uploaded.
This could be due to translation problems in the upload process or something else. 366 |
367 | * Clearly, it appears that lengthy videos generate far fewer views and are less likely to be trending than short ones. So, the company should encourage video creators to make short videos with engaging content. They could consider creating a separate "short video" section with a limit on the duration of the uploaded video, somewhere between 5-10 mins. The company should commission studies on what the optimum allowable video length should be for the short video section. 368 |
369 | * The company could also consider prompting users with a pop-up telling them that videos shorter than 10 mins are more likely to generate more views. If a user is uploading a really long video, the platform could even forbid the upload. 370 |
-------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-11-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-11-1.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-14-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-14-1.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-14-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-14-2.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-15-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-15-1.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-16-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video
Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-16-1.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-17-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-17-1.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-18-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-18-1.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-19-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-19-1.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-1.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-2.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-3.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-4.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-5.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-5.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-6.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-7.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-21-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-21-1.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-7-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-7-1.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-8-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-8-1.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # A-Collection-of-Data-Science-Take-Home-Challenges 2 | 3 | R based Solutions to the book "A Collection of Data Science Take-Home Challenges" by Giulio Palombo. 4 | -------------------------------------------------------------------------------- /Song Challenge/Song_Challenge.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Song Challenge' 3 | author: "Siddhartha Jetti" 4 | date: "8/4/2019" 5 | output: rmarkdown::github_document 6 | --- 7 | 8 | # Goal 9 | 10 | Company XYZ is a very early stage startup. They allow people to stream music from their mobile for free. Right now, they still only have songs from the Beatles in their music collection, but they are planning to expand soon. 
11 | 12 | They still have all their data in json files and they are interested in getting some basic info about their users as well as building a very preliminary song recommendation model in order to increase user engagement. 13 |
14 | # Challenge Description 15 |
16 | You are the fifth employee at company XYZ. The good news is that if the company becomes big, you will become very rich with the stocks. The bad news is that at such an early stage the data is usually very messy. All their data is stored in json format. 17 |
18 | The company CEO asked you very specific questions: 19 |
20 | * What are the top 3 and the bottom 3 states in terms of number of users? 21 |
22 | * What are the top 3 and the bottom 3 states in terms of user engagement? You can choose how to mathematically define user engagement. What the CEO cares about here is in which states users are using the product a lot/very little. 23 |
24 | * The CEO wants to send a gift to the first user who signed-up for each state. That is, the first user who signed-up from California, from Oregon, etc. Can you give him a list of those users? 25 |
26 | * Build a function that takes as an input any of the songs in the data and returns the most likely song to be listened to next. That is, if, for instance, a user is currently listening to “Eight Days A Week”, which song has the highest probability of being played right after it by the same user? This is going to be V1 of a song recommendation model. 27 |
28 | * How would you set up a test to check whether your model works well? 29 |
30 | # Data 31 |
32 | The json is: 33 | data - Each row represents a song that was listened to by a user. 34 |
35 | ## Fields: 36 |
37 | id : it is unique. 38 | user_id : user id who listened to a given song. 39 | user_state : where the user is based. 40 | user_sign_up_date : when the user signed-up. 41 | song_played : the song that was listened to. 42 | time_played : at which time the user started listening to the song (local time). 43 |
44 | # Problem Setup 45 |
46 | ```{r} 47 | # Load required libraries 48 | library(tidyverse) 49 | library(jsonlite) 50 | library(lubridate) 51 | 52 | # Read in the input data into a dataframe 53 | songs <- fromJSON("song.json") 54 | ``` 55 |
56 | # Data Exploration and checks 57 |
58 | Check data types of columns in the songs dataset 59 | ```{r} 60 | # Check data types of each of the columns 61 | str(songs) 62 | ``` 63 |
64 | ```{r} 65 | # take a peek at the data 66 | summary(songs) 67 | ``` 68 |
69 | Check for missing values in the data 70 | ```{r} 71 | # Check if any missing values exist 72 | colSums(is.na(songs)) 73 | ``` 74 |
75 | Check for duplicates in the data 76 | ```{r} 77 | # check if any duplicate id exist 78 | length(songs$id) == length(unique(songs$id)) 79 | ``` 80 |
81 | ```{r} 82 | # check if any duplicate user id exist in the data 83 | length(songs$user_id) == length(unique(songs$user_id)) 84 | ``` 85 |
86 | Clearly, there are duplicate user ids in the dataset. This is OK because a single user can listen to multiple songs. 87 | However, id appears to be unique. 88 |
89 | Check if the dates make sense. The time played for all the entries should NOT be before the sign-up date. 90 |
91 | ```{r} 92 | all(as.Date(songs$user_sign_up_date) <= as.Date(songs$time_played)) 93 | ``` 94 |
95 | Clearly, all the entries have sign-up dates before the time played. Overall, the data looks OK.
96 | 97 | 98 | # Question 1 99 |
100 | Summarize the data by user state 101 | ```{r} 102 | top3_states <- songs %>% 103 | group_by(user_state) %>% 104 | summarise(user_count = n_distinct(user_id)) %>% 105 | ungroup() %>% 106 | arrange(desc(user_count), user_state) %>% 107 | filter(row_number() <= 3) 108 | 109 | bottom3_states <- songs %>% 110 | group_by(user_state) %>% 111 | summarise(user_count = n_distinct(user_id)) %>% 112 | ungroup() %>% 113 | arrange(user_count, user_state) %>% 114 | filter(row_number() <= 3) 115 | 116 | top3_states 117 | bottom3_states 118 | ``` 119 |
120 | # Question 2 121 |
122 | Based on the given data and problem description, the only way users engage with the service is by playing songs. 123 | I define user engagement as the number of play events per user in a given period of time. I plan to use average daily user engagement, which is the average number of play events per day per user, as the metric to decide the top and bottom states for product usage. 124 |
125 | If users use the product a lot, the number of play events per day per user goes up and drives the metric up. Also, the daily user engagement rates can be used to visualize trends over time. 126 |
127 | The user engagement should be calculated using the number of user sign-ups prior to the play event. 128 |
129 | The number of user sign-ups by state and date. 130 | ```{r} 131 | total_signups_by_date <- songs %>% 132 | arrange(user_sign_up_date) %>% 133 | group_by(user_sign_up_date, user_state) %>% 134 | summarize(counts = n_distinct(user_id)) %>% 135 | ungroup() %>% 136 | arrange(user_state, user_sign_up_date) 137 | 138 | # Unique states 139 | unique_states <- songs %>% 140 | select(user_state) %>% 141 | distinct() %>% 142 | arrange(user_state) 143 | 144 | ``` 145 |
146 | The dates for which the daily engagement rate needs to be computed 147 | ```{r} 148 | required_dates <- substring(songs$time_played, 1, 10) %>% unique() %>% sort() 149 | ``` 150 |
151 | Initialize a placeholder to hold the daily engagement rate.
152 | ```{r} 153 | engagement_state_date <- data.frame(date = required_dates, stringsAsFactors = F) %>% 154 | merge(unique_states) 155 | 156 | # Merge with other dataset to get the number of play events 157 | daily_engagement_by_state_date <- songs %>% 158 | mutate(date_played = substring(time_played, 1, 10)) %>% 159 | group_by(user_state, date_played) %>% 160 | summarise(plays = n()) %>% 161 | ungroup() %>% 162 | right_join(engagement_state_date, by = c("user_state" = "user_state", "date_played" = "date")) %>% 163 | mutate(plays = ifelse(is.na(plays), 0, plays), signups_till_date = NA) 164 | 165 | head(daily_engagement_by_state_date) 166 | ``` 167 |
168 | Compute daily user engagement by state as "number of play events / number of user sign-ups till date" 169 |
170 | ```{r} 171 | # Loop through each of the entries 172 | for(i in 1:nrow(daily_engagement_by_state_date)){ 173 | tmp <- total_signups_by_date %>% 174 | filter(user_state == daily_engagement_by_state_date$user_state[i], 175 | as.Date(user_sign_up_date) <= as.Date(daily_engagement_by_state_date$date_played[i])) 176 | 177 | daily_engagement_by_state_date$signups_till_date[i] <- sum(tmp$counts) 178 | } 179 | 180 | daily_engagement_by_state_date <- daily_engagement_by_state_date %>% 181 | mutate(daily_engagement = plays/signups_till_date) 182 | 183 | head(daily_engagement_by_state_date) 184 | ``` 185 |
186 | ```{r} 187 | daily_engagement_summary <- daily_engagement_by_state_date %>% 188 | group_by(user_state) %>% 189 | summarise(avg_daily_engagement = round(mean(daily_engagement), digits = 2)) 190 | 191 | daily_engagement_summary 192 | ``` 193 |
194 | Top and bottom 3 states by average daily user engagement 195 |
196 | ```{r} 197 | top3_states_engagement <- daily_engagement_summary %>% 198 | arrange(desc(avg_daily_engagement)) %>% 199 | filter(row_number() <= 3) 200 | 201 | bottom3_states_engagement <- daily_engagement_summary %>% 202 | arrange(avg_daily_engagement) %>% 203 | filter(row_number() <= 3) 204 | 205 | top3_states_engagement 206 | bottom3_states_engagement 207 | ``` 208 |
209 | # Question 3 210 |
211 | First users by state 212 |
213 | ```{r} 214 | first_users_by_state <- songs %>% 215 | group_by(user_state) %>% 216 | arrange(user_sign_up_date) %>% 217 | filter(row_number() == 1) %>% 218 | ungroup() %>% 219 | select(user_state, user_id) %>% 220 | arrange(user_state) 221 | 222 | first_users_by_state 223 | ``` 224 |
225 | # Question 4 226 |
227 | The approach to building the song recommendation system is to use a first-order Markov chain where, for each song, we predict the most likely next song without looking at user history, taking into account only the current song. The Markov chain approach is combined with a similarity score obtained from collaborative filtering to break ties or to handle cold-start cases. 228 |
229 | The algorithm is to build a data set where, for each user and song, it gives the very next song listened to. 230 | We can then group by each song across all users and find the next song with the highest count in a given time window. Here, I choose the time window as one day. For every song, we are interested in finding the counterpart that is played right after it the most number of times, on the same day, across all the users. In cases where there is a tie or missing data, the similarity from collaborative filtering is used to give the prediction.
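Before building the full pipeline, here is a minimal toy sketch of the transition-count idea (made-up plays, not the challenge data). It uses `dplyr::lead()` within user and day, assuming the rows are already in play order, whereas the actual implementation below uses a cartesian join on the real timestamps:

```{r}
# Toy sketch only: count "song -> next song" transitions by the same user on the same day
toy_plays <- tibble::tibble(
  user_id = c(1, 1, 1, 2, 2, 2),
  date    = as.Date(rep("2015-06-01", 6)),
  song    = c("A", "B", "A", "A", "B", "C")
)

toy_transitions <- toy_plays %>%
  group_by(user_id, date) %>%
  mutate(next_song = lead(song)) %>%   # the song played right after, same user and day
  ungroup() %>%
  filter(!is.na(next_song)) %>%
  count(song, next_song, sort = TRUE)  # transition counts pooled across users

toy_transitions
```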
231 | 232 | ## Markov Chain 233 |
234 | ```{r} 235 | songs <- songs %>% 236 | mutate(k = 1) 237 | 238 | # Cartesian join with the same table and apply appropriate filter 239 | songs_joined <- songs %>% 240 | select(user_id1 = user_id, song = song_played, time_played_song1 = time_played, k) %>% 241 | full_join(songs, by = "k") %>% 242 | # Keep pairs of different songs played by the same user on the same day, first song before the second 243 | filter(user_id1 == user_id, date(ymd_hms(time_played_song1)) == date(ymd_hms(time_played)), 244 | ymd_hms(time_played_song1) < ymd_hms(time_played), song != song_played) %>% 245 | select(user_id, song, next_song = song_played) 246 | 247 | # Most likely next song based on Markov chain 248 | song_pairs <- songs_joined %>% 249 | mutate(song = toupper(song), next_song = toupper(next_song)) %>% 250 | group_by(song, next_song) %>% 251 | summarise(counts = n()) %>% 252 | ungroup() %>% 253 | arrange(song, desc(counts)) 254 | 255 | ``` 256 |
257 | Clearly, ties exist in the data. 258 |
259 | ## Collaborative Filtering 260 |
261 | Collaborative filtering is used to break ties. Each song can be imagined as a point in the n-dimensional user space. Each coordinate of the point would be the number of times the song was played by a particular user. 262 |
263 | ```{r} 264 | # Build user song matrix 265 | user_song_matrix <- songs %>% 266 | group_by(user_id, song_played) %>% 267 | summarise(nplays = n()) %>% 268 | ungroup() %>% 269 | spread(song_played, nplays) %>% 270 | mutate_all(list(~replace_na(., 0))) %>% 271 | select(-user_id) 272 | 273 | unique_songs <- colnames(user_song_matrix) 274 | ``` 275 |
276 | Cosine similarity is used to compute the similarity between two songs. The idea here is that if two songs are played by the same set of users, then they must be similar and have a high cosine similarity value. 277 |
278 | ```{r} 279 | # Define a function to compute the cosine similarity between two songs 280 | cosine_similarity <- function(x, y) { 281 | sum(x * y) / (sqrt(sum(x * x)) * sqrt(sum(y * y))) 282 | } 283 | 284 | # Define a placeholder to hold similarity between each pair of songs 285 | # similarity between a song and itself is 1 286 | song_similarity <- diag(1, nrow = ncol(user_song_matrix), ncol = ncol(user_song_matrix)) 287 | rownames(song_similarity) <- toupper(unique_songs) 288 | colnames(song_similarity) <- toupper(unique_songs) 289 | nsongs <- ncol(user_song_matrix) 290 | ``` 291 |
292 | Generate the song similarity matrix 293 | ```{r} 294 | # Loop through the columns 295 | for(i in 1:nsongs) { 296 | # Loop through the columns for each column 297 | for(j in 1:nsongs) { 298 | # Fill in placeholder with cosine similarities 299 | song_similarity[i, j] <- cosine_similarity(user_song_matrix[i], user_song_matrix[j]) 300 | } 301 | } 302 | 303 | # Process song pairs 304 | song_similarity_df <- song_similarity %>% 305 | as.data.frame() 306 | row.names(song_similarity_df) <- c() 307 | 308 | song_similarity_df$song1 <- row.names(song_similarity) 309 | song_similarity_df <- song_similarity_df %>% 310 | select(song1, 1:100) %>% 311 | gather(key = "song2", value = "similarity", -song1) %>% 312 | filter(song1 != song2) 313 | 314 | # Take a peek at the song pair similarity scores 315 | head(song_similarity_df) 316 | ``` 317 |
318 | For every song, get the next song with the most counts; if multiple songs tie for the most counts, then use the highest similarity score.
319 | ```{r} 320 | # summarize 321 | next_song <- song_pairs %>% 322 | left_join(song_similarity_df, by = c("song" = "song1", "next_song" = "song2")) %>% 323 | arrange(song, desc(counts), desc(similarity)) %>% 324 | group_by(song) %>% 325 | filter(row_number() == 1) 326 | ``` 327 |
328 | Given the number of songs, not all songs got a prediction for the next song to be played. For those cases, choose the song with the highest similarity as the most likely next song. 329 | ```{r} 330 | # Get the missing songs similarity 331 | missing_songs <- song_similarity_df %>% 332 | filter(!song1 %in% next_song$song) %>% 333 | arrange(song1, desc(similarity)) %>% 334 | group_by(song1) %>% 335 | filter(row_number() == 1) 336 | 337 | ``` 338 |
339 | Combine all the predictions. 340 | ```{r} 341 | # Combining 342 | next_song_final <- missing_songs %>% 343 | select(song = song1, next_song = song2) %>% 344 | bind_rows(next_song) %>% 345 | arrange(song) %>% 346 | select(song, next_song) 347 | 348 | next_song_final 349 | ``` 350 |
351 | Now, define the function to get the most likely next song using the above data set. 352 | ```{r} 353 | # Function to get the next song 354 | get_next_song <- function(song){ 355 | if(!toupper(song) %in% next_song_final$song){ 356 | return("Song not found in database!") 357 | } 358 | return(next_song_final$next_song[next_song_final$song == toupper(song)]) 359 | } 360 | 361 | # Test cases 362 | get_next_song("Eight Days A Week") 363 | get_next_song("XXXXXXX") 364 | ``` 365 |
366 | # Question 5 367 |
368 | Launching a song recommendation system in an existing product is a major change and is likely to introduce a lot of UI changes. When testing the song recommendation system, it is important to isolate the effect of the UI changes from the overall change in the metric before and after introducing the recommender, in order to judge the quality of the recommendation algorithm. To accomplish this, we test 3 versions. 369 |
370 | V1 - the original product without recommendations. 371 | V2 - with recommendations and associated UI changes (recommendations based on a random guess or a very rudimentary model). 372 | V3 - with recommendations based on the built algorithm and associated UI changes. 373 |
374 | Perform A/B testing on the three versions as follows. 375 |
376 | * First estimate the number of days the test needs to run for the desired effect size, significance level and statistical power. The significance level should be corrected using the Bonferroni correction, as multiple tests are involved. 377 | * Randomly split users into three groups; each group is shown one of the three versions to be tested. 378 | * Collect data on "average number of play events per user per day" for all three groups. 379 | * Here are the hypotheses to be tested. 380 | * H0 : No difference in the metric across the groups. 381 | * H1 : There is a difference in the metric between the groups. 382 | * After the test period, perform a t-test on each pair of groups and check whether you can reject or fail to reject H0 at the adjusted significance level (employing the Bonferroni correction) to judge the effect of the recommendation algorithm (a minimal sketch of this is shown below).
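As referenced in the last bullet, here is a minimal sketch of the sample-size estimate and the Bonferroni-corrected pairwise comparisons. The `test_results` data frame, its column names and the effect-size numbers are all made up for illustration; they are not part of the challenge data.

```{r}
# Rough per-group sample size for one pairwise comparison, with the significance
# level split across the 3 pairwise tests (Bonferroni: 0.05 / 3)
power.t.test(delta = 0.2, sd = 1, sig.level = 0.05 / 3, power = 0.8)

# Hypothetical post-test data: one row per user with the metric and the version seen
set.seed(1)
test_results <- data.frame(
  version            = rep(c("V1", "V2", "V3"), each = 200),
  plays_per_user_day = c(rnorm(200, 2.0, 1), rnorm(200, 2.1, 1), rnorm(200, 2.3, 1))
)

# Pairwise t-tests between the three versions with Bonferroni-adjusted p-values
pairwise.t.test(test_results$plays_per_user_day, test_results$version,
                p.adjust.method = "bonferroni", pool.sd = FALSE)
```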
383 | 384 | -------------------------------------------------------------------------------- /URL Parsing Challenge/URL Parsing Challenge.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'URL Parsing Challenge' 3 | author: "Siddhartha Jetti" 4 | date: "7/14/2019" 5 | output: rmarkdown::github_document 6 | --- 7 |
8 | # Goal 9 |
10 | Being able to efficiently parse URLs and extract info from them is a very important skill for a data scientist. 11 |
12 | Firstly, if you join a very early stage startup, it might be that a lot of data are just stored in the URLs visited by the users. And, therefore, you will have to build a system that parses the URLs, extracts fields out of them, and saves them into a table that can be easily queried (not the most fun of the jobs, but very useful!). 13 |
14 | Secondly, often using external data can help your models a lot. And a way to get external data is by scraping websites. And this is often done by being able to play with a given site URL structure (assuming it is allowed by the external site ToS). 15 |
16 | The goal of this project is to parse a sequence of URLs about user searches and extract some basic info out of it. 17 |
18 |
19 | # Challenge Description 20 |
21 | Company XYZ is an Online Travel Agent site, such as Expedia, Booking.com, etc. 22 |
23 | They haven’t invested in data science yet and all the data they have about user searches are simply stored in the URLs the users generate when they search for a hotel. If you are not familiar with URLs, you can run a search on any OTA site and see how all search parameters are always present in the URL. 24 |
25 | You are asked to: 26 |
27 | 1) Create a clean data set where each column is a field in the URL, each row is a given search and the cells are the corresponding URL values. 28 |
29 | For each search query, how many amenities were selected? 30 |
31 | 2) Often, to measure the quality of a search algorithm, data scientists use some metric based on how often users click on the second page, third page, and so on. The idea here is that a great search algorithm should return all interesting results on the first page and never force users to visit the other pages (how often do you click on the second page results when you search on Google? Almost never, right?). 32 |
33 | Create a metric based on the above idea and find the city with the worst search algorithm. 34 |
35 |
36 | # Data 37 |
38 | The file is: 39 |
40 | url_list - a list of search URLs generated by the users when searching for hotels 41 |
42 | ## Fields: 43 |
44 | * hotel.checkin : checkin date in the search query. It is mandatory 45 | * hotel.customMinimumPriceFilter : This filter allows to only return hotels whose nightly price is above a certain threshold (in USD). Useful to filter out the really bad hotels 46 | * hotel.customMaximumPriceFilter : This filter allows to only return hotels whose nightly price is below a certain threshold (in USD). Useful to filter out the hotels you can’t afford 47 | * hotel.freeCancellation : It is a check box. If the user selects it, only hotels with free cancellation are returned. 48 | * hotel.stars_5 : It is a check box. If the user selects it, 5-star hotels are returned. Multiple choices are allowed. For instance, a user can select 5 and 4 star hotels by checking this box and the one below. If no check box is selected, all hotels are returned. 49 | * hotel.stars_4 : It is a check box.
If the user selects it, 4-star hotels are returned 50 | * hotel.stars_3 : It is a check box. If the user selects it, 3-star hotels are returned 51 | * hotel.stars_2 : It is a check box. If the user selects it, 2-star hotels are returned 52 | * hotel.stars_1 : It is a check box. If the user selects it, 1-star hotels are returned 53 | * hotel.max_score : This filter allows to only return hotels whose score is below a certain threshold. Score is 1-5 with high being good (you can think of TripAdvisor score to get an idea of what it is) 54 | * hotel.min_score : This filter allows to only return hotels whose score is above a certain threshold 55 | * hotel.couponCode : If the user uses a coupon in her search, this field gets populated with “Yes” 56 | * hotel.adults : Number of adults in the search query. This is the number of adults who would stay in the same room. It is mandatory 57 | * hotel.city : Which city is the user searching for. It is mandatory 58 | * hotel.children : Is the user traveling with children? This field returns the number of children in the search query 59 | * hotel.amenities : There are a few amenities that the user can select in her search via different check boxes. The possible amenities are: 60 | + shuttle: free shuttle transportation from the airport 61 | + internet: free internet 62 | + breakfast : free breakfast 63 | + lounge : does the hotel have a lounge 64 | + yes_smoking : are there rooms where smoking is allowed 65 | + yes_pet : is it allowed to bring pets 66 | * hotel.checkout : Check out date. It is mandatory 67 | * hotel.search_page : Search page visited. 1 means the user is on the first page of results, 2 -> clicked on the second page etc. This will be used to estimate the goodness of ranking for different cities 68 |
69 |
70 | # Problem Setup 71 |
72 | ```{r} 73 | # Load required libraries 74 | library(tidyverse) 75 | library(ggplot2) 76 | 77 | # Read in the input data into a dataframe 78 | urls <- read.table("url_list.txt", stringsAsFactors = F) 79 | urls1 <- gsub("http://www.mysearchforhotels.com/shop/hotelsearch?", "", urls[,1], fixed = TRUE) 80 | 81 | ``` 82 |
83 | # Question 1: 84 |
85 | Transform the URL data 86 | ```{r} 87 | url_list <- strsplit(urls1, "&") 88 | nurls <- nrow(urls) 89 | ``` 90 |
91 | Process and clean the data 92 | ```{r} 93 | # all possible URL fields 94 | url_fields <- c("hotel.checkin", "hotel.customMinimumPriceFilter", "hotel.customMaximumPriceFilter", 95 | "hotel.freeCancellation", "hotel.stars_5", "hotel.stars_4", "hotel.stars_3", "hotel.stars_2", 96 | "hotel.stars_1", "hotel.max_score", "hotel.min_score", "hotel.couponCode", "hotel.adults", 97 | "hotel.city", "hotel.children", "hotel.amenities", "hotel.checkout", "hotel.search_page") 98 | 99 | # Initialize a data frame to hold the cleaned data in rectangular format 100 | url_data <- as.data.frame(matrix(NA, nrow = nurls, ncol = 19)) 101 | names(url_data) <- c("id", url_fields) 102 | url_data$id <- seq(1, nurls, by = 1) 103 | ``` 104 |
105 | Build the rectangular data set.
106 | ```{r} 107 | # Loop through all the URLs 108 | for(i in 1:nurls) { 109 | search <- url_list[[i]] 110 | # Loop through all the fields for every url entry 111 | for(field in url_fields){ 112 | if(any(grepl(field, search))){ 113 | search_strings <- paste(search[grep(field, search)], collapse = ',') 114 | url_data[i, field] <- gsub(paste0(field, "="), "", search_strings, fixed = TRUE) 115 | } else { url_data[i, field] <- NA } 116 | } 117 | } 118 | # Remove special characters 119 | url_data <- url_data %>% 120 | mutate(hotel.city = gsub("+", "", hotel.city, fixed = TRUE)) 121 | 122 | # Take a peek at the final data 123 | summary(url_data) 124 | head(url_data) 125 | 126 | # Check if any rows exist with all fields other than id missing 127 | any(rowSums(is.na(url_data)) == ncol(url_data[,-1])) 128 | 129 | ``` 130 |
131 | Every search query need not have all the fields specified. So, the cleaned dataset is expected to be sparse (several missing values in each row). No rows exist with all fields other than id missing. 132 |
133 |
134 | ```{r} 135 | # Check if missing values exist for search page 136 | sum(is.na(url_data$hotel.search_page)) 137 | 138 | # Check missing values for city 139 | sum(is.na(url_data$hotel.city)) 140 | ``` 141 |
142 | No missing values exist in the critical fields. The data looks good. 143 |
144 | Count the number of amenities for every search 145 | ```{r} 146 | # Frequency by categories 147 | table(url_data$hotel.amenities, useNA = "always") 148 | 149 | # Count the amenities searched by splitting at "," for searches involving multiple amenity selection 150 | url_data <- url_data %>% 151 | mutate(amenities_count = ifelse(is.na(hotel.amenities), 0, str_count(hotel.amenities, ",") + 1)) 152 | 153 | table(url_data$amenities_count, useNA = "always") 154 | ``` 155 |
156 | Clearly, more than 99% of the searches do not have any amenities selected. Less than 1% of searches have one or more amenities selected, and the highest number of amenities selected in a search is two. 157 |
158 | # Question 2: 159 |
160 | A good search algorithm should rank the most relevant search results highly and hence show interesting results on the first page. I think users having to look at the second page of search results is as bad as having to look at the 10th page, because most users don't bother to go beyond the first page and results shown on later pages may never reach the user. So, it makes sense to bin searches into two categories: one where results are found on the first page, and the other covering anything beyond the first page. 161 |
162 | The metric I choose to measure the relevance of the search results, in other words the performance of the search algorithm, is "the fraction of searches where users look beyond the first page of search results". The lower the value of this metric, the better the relevance of the search results and the search algorithm. If a large percentage of users look beyond the first page, this metric goes up and indicates that the search algorithm is not performing well.
163 | 164 | ```{r} 165 | # Frequency by searched city 166 | table(url_data$hotel.city, useNA = "always") 167 | 168 | # Order cities based on the chosen metric 169 | url_data %>% 170 | mutate(next_search_page = ifelse(hotel.search_page == "1", 0, 1)) %>% 171 | group_by(hotel.city) %>% 172 | summarize(next_search_page_fraction = mean(next_search_page)) %>% 173 | arrange(desc(next_search_page_fraction)) 174 | ``` 175 | 176 | It looks like London had the worst search algorithm with users looking beyond first search results page in about 47% of searches. Followed by New York with about 44% of searches going beyond first page. 177 | -------------------------------------------------------------------------------- /URL Parsing Challenge/URL_Parsing_Challenge.md: -------------------------------------------------------------------------------- 1 | URL Parsing Challenge 2 | ================ 3 | Siddhartha Jetti 4 | 7/14/2019 5 | 6 | # Goal 7 | 8 | Being able to efficiently parse URLs and extract info from them is a 9 | very important skill for a data scientist. 10 | 11 | Firstly, if you join a very early stage startup, it might be that a lot 12 | of data are just stored in the URLs visited by the users. And, 13 | therefore, you will have to build a system that parses the URLs, 14 | extracts fields out of it, and saves them into a table that can be 15 | easily queried (not the most fun of the jobs, but very useful\!). 16 | 17 | Secondly, often using external data can help a lot your models. And a 18 | way to get external data is by scraping websites. And this is often done 19 | by being able to play with a given site URL structure (assuming it is 20 | allowed by the external site ToS). 21 | 22 | The goal of this project is to parse a sequence of URLs about user 23 | searches and extract some basic info out of it. 24 | 25 | # Challenge Description 26 | 27 | Company XYZ is a Online Travel Agent site, such as Expedia, Booking.com, 28 | etc. 29 | 30 | They haven’t invested in data science yet and all the data they have 31 | about user searches are simply stored in the URLs the users generate 32 | when they search for a hotel. If you are not familiar with URLs, you can 33 | run a search on any OTA site and see how all search parameters are 34 | always present in the URL. 35 | 36 | You are asked to: 37 | 38 | 1) Create a clean data set where each column is a field in the URL, 39 | each row is a given search and the cells are the corresponding URL 40 | values. 41 | 42 | For each search query, how many amenities were selected? 43 | 44 | 2) Often, to measure the quality of a search algorithm, data scientists 45 | use some metric based on how often users click on the second page, 46 | third page, and so on. The idea here is that a great search 47 | algorithm should return all interesting results on the first page 48 | and never force users to visit the other pages (how often do you 49 | click on the second page results when you search on Google? Almost 50 | never, right?). 51 | 52 | Create a metric based on the above idea and find the city with the worst 53 | search algorithm. 54 | 55 | # Data 56 | 57 | The file is: 58 | 59 | url\_list - a list of search URLs generated by the users when searching 60 | for hotels 61 | 62 | ## Fields: 63 | 64 | - hotel.checkin : checkin date in the search query. It is mandatory 65 | - hotel.customMinimumPriceFilter : This filter allows to only return 66 | hotels whose nightly price is above a certain threshold (in USD). 
67 | Useful to filter out the really bad hotels 68 | - hotel.customMaximumPriceFilter : This filter allows to only return 69 | hotels whose nightly price is below a certain threshold (in USD). 70 | Useful to filter out the hotels you can’t afford 71 | - hotel.freeCancellation : It is a check box. If the user selects it, 72 | only hotels with free cancellation are returned. 73 | - hotel.stars\_5 : It is a check box. If the user selects it, 5-star 74 | hotels are returned. Multiple choices are allowed. For instance, a 75 | user can select 5 and 4 star hotels by checking this box and the one 76 | below. If no check box is selected, all hotels are returned. 77 | - hotel.stars\_4 : It is a check box. If the user selects it, 4-star 78 | hotels are returned 79 | - hotel.stars\_3 : It is a check box. If the user selects it, 3-star 80 | hotels are returned 81 | - hotel.stars\_2 : It is a check box. If the user selects it, 2-star 82 | hotels are returned 83 | - hotel.stars\_1 : It is a check box. If the user selects it, 1-star 84 | hotels are returned 85 | - hotel.max\_score : This filter allows to only return hotels whose 86 | score is below a certain threshold. Score is 1-5 with high being 87 | good (you can think of TripAdvisor score to get an idea of what it 88 | is) 89 | - hotel.min\_score : This filter allows to only return hotels whose 90 | score is above a certain threshold 91 | - hotel.couponCode : If the user uses a coupon in her search, this 92 | fields gets populated with “Yes” 93 | - hotel.adults : Number of adults in the search query. This sis the 94 | number of adults who would stay in the same room. It is mandatory 95 | - hotel.city : Which city is the user searching for. It is mandatory 96 | - hotel.children : Is the user traveling with children? This field 97 | returns the number of children in the search query 98 | - hotel.amenities : There are a few amenities that the user can select 99 | in her search via different check boxes. The possible amenities are: 100 | - shuttle: free shuttle transportation from the airport 101 | - internet: free internet 102 | - breakfast : free breakfast 103 | - lounge : does the hotel have a lounge 104 | - yes\_smoking : are there rooms where smoking is allowed 105 | - yes\_pet : is it allowed to bring pets 106 | - hotel.checkout : Check out date. It is mandatory 107 | - hotel.search\_page : Search page visited. 1 means the user in on the 108 | first page results, 2 -\> clicked on the second page etc. 
This will 109 | be used to estimate the goodness of ranking for different cities 110 | 111 | # Problem Setup 112 | 113 | ``` r 114 | # Load required libraries 115 | library(tidyverse) 116 | ``` 117 | 118 | ## Registered S3 methods overwritten by 'ggplot2': 119 | ## method from 120 | ## [.quosures rlang 121 | ## c.quosures rlang 122 | ## print.quosures rlang 123 | 124 | ## ── Attaching packages ──────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ── 125 | 126 | ## ✔ ggplot2 3.1.1 ✔ purrr 0.3.2 127 | ## ✔ tibble 2.1.1 ✔ dplyr 0.8.1 128 | ## ✔ tidyr 0.8.3 ✔ stringr 1.4.0 129 | ## ✔ readr 1.3.1 ✔ forcats 0.4.0 130 | 131 | ## ── Conflicts ─────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ── 132 | ## ✖ dplyr::filter() masks stats::filter() 133 | ## ✖ dplyr::lag() masks stats::lag() 134 | 135 | ``` r 136 | library(ggplot2) 137 | 138 | # Read in the input data into a dataframe 139 | urls <- read.table("url_list.txt", stringsAsFactors = F) 140 | urls1 <- gsub("http://www.mysearchforhotels.com/shop/hotelsearch?", "", urls[,1], fixed = TRUE) 141 | ``` 142 | 143 | # Question 1: 144 | 145 | Transform the URL data 146 | 147 | ``` r 148 | url_list <- strsplit(urls1, "&") 149 | nurls <- nrow(urls) 150 | ``` 151 | 152 | Process and clean the data 153 | 154 | ``` r 155 | # all possible URL fields 156 | url_fields <- c("hotel.checkin", "hotel.customMinimumPriceFilter", "hotel.customMaximumPriceFilter", 157 | "hotel.freeCancellation", "hotel.stars_5", "hotel.stars_4", "hotel.stars_3", "hotel.stars_2", 158 | "hotel.stars_1", "hotel.max_score", "hotel.min_score", "hotel.couponCode", "hotel.adults", 159 | "hotel.city", "hotel.children", "hotel.amenities", "hotel.checkout", "hotel.search_page") 160 | 161 | # Initialize a data frame to hold the cleaned data in rectangular format 162 | url_data <- as.data.frame(matrix(NA, nrow = nurls, ncol = 19)) 163 | names(url_data) <- c("id", url_fields) 164 | url_data$id <- seq(1, nurls, by = 1) 165 | ``` 166 | 167 | Build the rectangular data set. 168 | 169 | ``` r 170 | # Loop through all the URLs 171 | for(i in 1:nurls) { 172 | search <- url_list[[i]] 173 | # Loop through all the fields for every url entry 174 | for(field in url_fields){ 175 | if(any(grepl(field, search))){ 176 | search_strings <- paste(search[grep(field, search)], collapse = ',') 177 | url_data[i, field] <- gsub(paste0(field, "="), "", search_strings, fixed = TRUE) 178 | } else { url_data[i, field] <- NA } 179 | } 180 | } 181 | # Remove special characters 182 | url_data <- url_data %>% 183 | mutate(hotel.city = gsub("+", "", hotel.city, fixed = TRUE)) 184 | 185 | # Take a peek at the final data 186 | summary(url_data) 187 | ``` 188 | 189 | ## id hotel.checkin hotel.customMinimumPriceFilter 190 | ## Min. : 1 Length:77677 Length:77677 191 | ## 1st Qu.:19420 Class :character Class :character 192 | ## Median :38839 Mode :character Mode :character 193 | ## Mean :38839 194 | ## 3rd Qu.:58258 195 | ## Max. 
:77677 196 | ## hotel.customMaximumPriceFilter hotel.freeCancellation hotel.stars_5 197 | ## Length:77677 Length:77677 Length:77677 198 | ## Class :character Class :character Class :character 199 | ## Mode :character Mode :character Mode :character 200 | ## 201 | ## 202 | ## 203 | ## hotel.stars_4 hotel.stars_3 hotel.stars_2 204 | ## Length:77677 Length:77677 Length:77677 205 | ## Class :character Class :character Class :character 206 | ## Mode :character Mode :character Mode :character 207 | ## 208 | ## 209 | ## 210 | ## hotel.stars_1 hotel.max_score hotel.min_score 211 | ## Length:77677 Length:77677 Length:77677 212 | ## Class :character Class :character Class :character 213 | ## Mode :character Mode :character Mode :character 214 | ## 215 | ## 216 | ## 217 | ## hotel.couponCode hotel.adults hotel.city 218 | ## Length:77677 Length:77677 Length:77677 219 | ## Class :character Class :character Class :character 220 | ## Mode :character Mode :character Mode :character 221 | ## 222 | ## 223 | ## 224 | ## hotel.children hotel.amenities hotel.checkout 225 | ## Length:77677 Length:77677 Length:77677 226 | ## Class :character Class :character Class :character 227 | ## Mode :character Mode :character Mode :character 228 | ## 229 | ## 230 | ## 231 | ## hotel.search_page 232 | ## Length:77677 233 | ## Class :character 234 | ## Mode :character 235 | ## 236 | ## 237 | ## 238 | 239 | ``` r 240 | head(url_data) 241 | ``` 242 | 243 | ## id hotel.checkin hotel.customMinimumPriceFilter 244 | ## 1 1 2015-09-19 245 | ## 2 2 2015-09-14 246 | ## 3 3 2015-09-26 247 | ## 4 4 2015-09-02 248 | ## 5 5 2015-09-20 249 | ## 6 6 2015-09-14 250 | ## hotel.customMaximumPriceFilter hotel.freeCancellation hotel.stars_5 251 | ## 1 252 | ## 2 253 | ## 3 175 254 | ## 4 yes 255 | ## 5 275 256 | ## 6 yes 257 | ## hotel.stars_4 hotel.stars_3 hotel.stars_2 hotel.stars_1 hotel.max_score 258 | ## 1 yes 259 | ## 2 yes 260 | ## 3 yes 261 | ## 4 yes 262 | ## 5 263 | ## 6 264 | ## hotel.min_score hotel.couponCode hotel.adults 265 | ## 1 4 3 266 | ## 2 4 3 267 | ## 3 5 2 268 | ## 4 4 1 269 | ## 5 5 3 270 | ## 6 2 271 | ## hotel.city hotel.children hotel.amenities 272 | ## 1 NewYork,NY,UnitedStates 273 | ## 2 London,UnitedKingdom 274 | ## 3 NewYork,NY,UnitedStates 275 | ## 4 HongKong,HongKong 276 | ## 5 London,UnitedKingdom 277 | ## 6 SanFrancisco,California,UnitedStates 278 | ## hotel.checkout hotel.search_page 279 | ## 1 2015-09-20 1 280 | ## 2 2015-09-15 1 281 | ## 3 2015-09-27 1 282 | ## 4 2015-09-03 1 283 | ## 5 2015-09-29 1 284 | ## 6 2015-09-16 1 285 | 286 | ``` r 287 | # Check if any rows exist with all fields other than id missing 288 | any(rowSums(is.na(url_data)) == ncol(url_data[,-1])) 289 | ``` 290 | 291 | ## [1] FALSE 292 | 293 | Every search query need not have all the fields specified. So, the 294 | cleaned dataset is expected to be sparse (several missing values in each 295 | row). No rows exist with all fields other than id missing. 296 | 297 | ``` r 298 | # Check if missing values exist for search page 299 | sum(is.na(url_data$hotel.search_page)) 300 | ``` 301 | 302 | ## [1] 0 303 | 304 | ``` r 305 | # Check missing values for city 306 | sum(is.na(url_data$hotel.city)) 307 | ``` 308 | 309 | ## [1] 0 310 | 311 | No missing values exist in critical fields. The data looks good. 
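The `summary()` output above also shows that every field was read in as character. An optional follow-up (a sketch added here, not one of the original chunks; the `url_data_typed` name is just for illustration) is to cast the key fields to proper types so later aggregations do not rely on string coercion:

``` r
# Optional type conversions for the key fields (sketch, not run in the original analysis)
url_data_typed <- url_data %>%
  mutate(hotel.checkin = as.Date(hotel.checkin),
         hotel.checkout = as.Date(hotel.checkout),
         hotel.adults = as.integer(hotel.adults),
         hotel.search_page = as.integer(hotel.search_page))
```

The chunks below keep the character columns, which also works because R coerces types inside the comparisons.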
312 | 313 | Count number of amenities for every search 314 | 315 | ``` r 316 | # Frequency by categories 317 | table(url_data$hotel.amenities, useNA = "always") 318 | ``` 319 | 320 | ## 321 | ## breakfast breakfast,yes_pet internet 322 | ## 39 1 272 323 | ## lounge shuttle yes_pet 324 | ## 22 111 85 325 | ## yes_smoking yes_smoking,yes_pet 326 | ## 170 4 76973 327 | 328 | ``` r 329 | # Count the amenities searched by splitting at "," for searches involving multiple amenity selection 330 | url_data <- url_data %>% 331 | mutate(amenities_count = ifelse(is.na(hotel.amenities), 0, str_count(hotel.amenities, ",") + 1)) 332 | 333 | table(url_data$amenities_count, useNA = "always") 334 | ``` 335 | 336 | ## 337 | ## 0 1 2 338 | ## 76973 699 5 0 339 | 340 | Clearly, More than 99% of the searches does not have any amenity 341 | selected. Only less than 1% of searches have one or more amenities 342 | selected and highest number of amenities selected in a search is two. 343 | 344 | # Question 2: 345 | 346 | A good search algorithm should rank the most relevant search results 347 | highly and hence show interesting results in the first page. I think 348 | users having to look at second page of search results is as bad as 349 | having to look at 10th page because most users don’t bother to go beyond 350 | first page and ones shown in later pages may not reach the user. So, it 351 | makes sense to bin searches into two categories one where searches are 352 | found on first page and second category being any where else other than 353 | first page. 354 | 355 | The metric I choose to know the relavancy of search results, in other 356 | words performance of search algorithm, is “the fraction of searches with 357 | users looking beyond first search results page”. The lower the value of 358 | this metric, the better is the relavancy of search results and the 359 | search algorithm. If large percentage of users looked beyond the first 360 | page then this metric will go up and indicate that search algorithm is 361 | not performing well. 362 | 363 | ``` r 364 | # Frequency by searched city 365 | table(url_data$hotel.city, useNA = "always") 366 | ``` 367 | 368 | ## 369 | ## HongKong,HongKong London,UnitedKingdom 370 | ## 11786 28058 371 | ## NewYork,NY,UnitedStates SanFrancisco,California,UnitedStates 372 | ## 29384 8449 373 | ## 374 | ## 0 375 | 376 | ``` r 377 | # Order cities based on the chosen metric 378 | url_data %>% 379 | mutate(next_search_page = ifelse(hotel.search_page == "1", 0, 1)) %>% 380 | group_by(hotel.city) %>% 381 | summarize(next_search_page_fraction = mean(next_search_page)) %>% 382 | arrange(desc(next_search_page_fraction)) 383 | ``` 384 | 385 | ## # A tibble: 4 x 2 386 | ## hotel.city next_search_page_fraction 387 | ## 388 | ## 1 London,UnitedKingdom 0.473 389 | ## 2 NewYork,NY,UnitedStates 0.442 390 | ## 3 HongKong,HongKong 0.0892 391 | ## 4 SanFrancisco,California,UnitedStates 0.0407 392 | 393 | It looks like London had the worst search algorithm with users looking 394 | beyond first search results page in about 47% of searches. Followed by 395 | New York with about 44% of searches going beyond first page. 
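As a quick robustness check (an addition, not part of the original write-up), the ranking can be recomputed with a complementary metric, the average results-page depth per search. If it orders the cities the same way, the conclusion about London and New York does not hinge on the binary first-page cutoff:

``` r
# Average results-page depth per city (sketch; uses the url_data frame built above)
url_data %>%
  mutate(page = as.numeric(hotel.search_page)) %>%
  group_by(hotel.city) %>%
  summarize(avg_page_depth = mean(page),
            searches = n()) %>%
  arrange(desc(avg_page_depth))
```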
396 | -------------------------------------------------------------------------------- /User Referral Program/User Referral Program.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "User Referral Program" 3 | author: "Siddhartha Jetti" 4 | date: "10/12/2019" 5 | output: rmarkdown::github_document 6 | --- 7 | 8 | # Goal 9 | 10 | Almost all sites have a user referral program: you can invite new users to try a given product. Typically, after the new user completes a transaction, you get rewarded with a certain amount of money or credit to be used on the site. 11 | 12 | The goal of this challenge is to analyze the data from a referral program and draw conclusions about its effectiveness. 13 | 14 | # Challenge Description 15 | 16 | Company XYZ has started a new referral program on Oct, 31. Each user who refers a new user will get 10$ in credit when the new user buys something. 17 | 18 | The program has been running for almost a month and the Growth Product Manager wants to know if it's been successful. She is very excited cause, since the referral program started, the company saw a spike in number of users and wants you to be able to give her some data she can show to her boss. 19 | 20 | * Can you estimate the impact the program had on the site? 21 | * Based on the data, what would you suggest to do as a next step? 22 | * The referral program wasn't really tested in a rigorous way. It simply started on a given day for all users and you are drawing conclusions by looking at the data before and after the test started. What kinds of risks this approach presents? Can you think of a better way to test the referral program and measure its impact? 23 | 24 | # Data 25 | 26 | We have one dataset "referral.csv" 27 | 28 | ### Columns: 29 | * user_id : the id of the user 30 | * date : date of the purchase 31 | * country : user country based on ip address 32 | * money_spent : how much the item bought costs (in USD) 33 | * is_referral : whether the user came from the referral program(1) or not (0) 34 | * device_id : It is an identifier of the device used to make the purchase. You can assume here that for a given device, its id never changes 35 | 36 | # Problem Setup 37 | ```{r} 38 | # Load required libraries 39 | library(tidyverse) 40 | library(ggplot2) 41 | 42 | # Read in the input data into a dataframe 43 | data <- read.csv("referral.csv", stringsAsFactors = F) 44 | ``` 45 | 46 | # Data Exploration 47 | 48 | Explore data 49 | ```{r} 50 | # Check datatypes of all the variables in the dataset 51 | str(data) 52 | summary(data) 53 | ``` 54 | 55 | The range of each of the variables appear reasonable without any non-sensical values. 56 | 57 | ```{r} 58 | # check for any missing values in the dataset 59 | colSums(is.na(data)) 60 | ``` 61 | 62 | No missing values exist in the dataset. 63 | 64 | Run descriptive stats on the dataset. 65 | 66 | ```{r} 67 | # Total revenue by date 68 | data %>% 69 | mutate(is_referral = as.factor(is_referral), date = as.Date(date)) %>% 70 | ggplot(aes(x = date, y = money_spent)) + 71 | stat_summary(fun.y=sum, geom="line") + 72 | stat_summary(fun.y=sum, geom="point") + 73 | geom_vline(xintercept = as.Date("2015-10-31"), linetype = "dashed", color = "red") 74 | 75 | ``` 76 | 77 | The above plot shows a jump in daily sales after the referral program is launched on Oct 31st (dashed line in the plot). 
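Before reading too much into the jump, note that the series appears to swing heavily within each week (weekend totals are far higher than weekday totals), so part of any apparent lift can come from where the launch date falls in the weekly cycle. A quick exploratory sketch (an addition, not part of the original analysis) that colors the daily totals by weekday vs. weekend:

```{r}
# Color daily totals by day type to separate the weekly cycle from any launch effect
# (weekdays() is locale-dependent; English day names are assumed here)
data %>%
  mutate(date = as.Date(date),
         day_type = ifelse(weekdays(date) %in% c("Saturday", "Sunday"), "weekend", "weekday")) %>%
  group_by(date, day_type) %>%
  summarise(daily_sales = sum(money_spent)) %>%
  ggplot(aes(x = date, y = daily_sales)) +
  geom_line() +
  geom_point(aes(color = day_type)) +
  geom_vline(xintercept = as.Date("2015-10-31"), linetype = "dashed", color = "red")
```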
78 | 
79 | ```{r}
80 | 
81 | # Split revenue between users from referrals and non-referred users
82 | data %>%
83 |   group_by(is_referral) %>%
84 |   summarise(unique_users = n_distinct(user_id), total_sales = sum(money_spent), revenue_per_user = sum(money_spent)/n_distinct(user_id))
85 | 
86 | # Total revenue split between referrals and non-referrals
87 | data %>%
88 |   mutate(is_referral = as.factor(is_referral), date = as.Date(date)) %>%
89 |   ggplot(aes(x = date, y = money_spent, color = is_referral)) +
90 |   stat_summary(fun.y=sum, geom="line") +
91 |   stat_summary(fun.y=sum, geom="point") +
92 |   geom_vline(xintercept = as.Date("2015-10-31"), linetype = "dashed", color = "black")
93 | ```
94 | 
95 | The above plot reveals that daily sales from users who came from referrals are propping up overall sales after the launch of the program. However, due to the way the experiment is set up, the jump in sales cannot be attributed to the referral program.
96 | 
97 | # Question 1:
98 | 
99 | Can you estimate the impact the program had on the site?
100 | 
101 | Here I plan to use daily sales as the metric to assess the impact of the referral program. It is easy to think of other metrics, like the number of users who made at least one purchase or money spent per purchase. At the end of the day, though, total revenue is a more meaningful metric than the others.
102 | 
103 | Summarize the data by date to get the daily sales.
104 | 
105 | ```{r}
106 | data_daily <- data %>%
107 |   group_by(date) %>%
108 |   summarise(daily_sales = sum(money_spent))
109 | 
110 | data_daily
111 | ```
112 | 
113 | Compare the daily sales before and after the launch of the referral program. This is done by running a t-test on daily sales and checking whether the difference is statistically significant.
114 | 
115 | ```{r}
116 | t.test(data_daily$daily_sales[data_daily$date < as.Date("2015-10-31")], data_daily$daily_sales[data_daily$date >= as.Date("2015-10-31")])
117 | ```
118 | 
119 | The t-test suggests that the difference in daily sales before and after the launch of the program is not statistically significant.
120 | 
121 | Check the performance of the referral program across different user segments.
122 | 
123 | ```{r}
124 | # Compute average daily sales for each country
125 | data_daily_country <- data %>%
126 |   mutate(before_after_launch = ifelse(date < as.Date("2015-10-31"), "Before Launch", "After Launch"),
127 |          country = as.factor(country)) %>%
128 |   group_by(date, country, before_after_launch) %>%
129 |   summarise(daily_sales = sum(money_spent)) %>%
130 |   group_by(country) %>%
131 |   summarise( avg_diff_daily_sales = mean(daily_sales[before_after_launch == "Before Launch"]) - mean(daily_sales[before_after_launch == "After Launch"]),
132 |              p_value = t.test(daily_sales[before_after_launch == "Before Launch"], daily_sales[before_after_launch == "After Launch"])$p.value)
133 | 
134 | data_daily_country
135 | ```
136 | 
137 | The differences in average daily sales before and after launch are statistically significant for Mexico (MX) and China (CH). However, sales are better after launch in MX and worse in CH. Note that `avg_diff_daily_sales` is computed as before minus after, so a negative value means sales went up after launch.
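Since nine country-level t-tests are run here, one or two p-values below 0.05 would not be surprising even if there were no real effect anywhere. A small follow-up sketch (an addition, not part of the original analysis) adjusts the p-values for multiple comparisons before reading too much into MX and CH:

```{r}
# Benjamini-Hochberg adjustment of the per-country p-values computed above
data_daily_country %>%
  mutate(p_value_adj = p.adjust(p_value, method = "BH")) %>%
  arrange(p_value_adj)
```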
138 | 
139 | ```{r}
140 | # Plot average daily sales vs. country
141 | data %>%
142 |   mutate(before_after_launch = ifelse(date < as.Date("2015-10-31"), "Before Launch", "After Launch"),
143 |          country = as.factor(country)) %>%
144 |   group_by(date, country, before_after_launch) %>%
145 |   summarise(daily_sales = sum(money_spent)) %>%
146 |   group_by(country, before_after_launch) %>%
147 |   summarise(avg_daily_sales = mean(daily_sales)) %>%
148 |   ggplot(aes(x = country, y = avg_daily_sales, color = before_after_launch, group = before_after_launch)) +
149 |   geom_line() +
150 |   geom_point()
151 | 
152 | ```
153 | 
154 | The above plot reveals that the referral program is performing differently across countries.
155 | Average daily sales dropped after the launch of the program in countries like China (CH) and Germany (DE), while sales appear to increase in the other countries after the launch.
156 | 
157 | Based on the way the experiment is set up, any change in sales after the launch of the referral program cannot be attributed to it alone. There are other confounding factors, like seasonality or the launch of a marketing event, that could be affecting sales. In the US and some other countries, Thanksgiving (at the end of November every year) is one of the major holiday seasons, sales are generally expected to grow around it, and the experiment does not control for lurking variables like that.
158 | 
159 | # Question 2:
160 | 
161 | Based on the data, what would you suggest to do as a next step?
162 | 
163 | From the plot of revenue split between referrals and non-referrals, it is apparent that sales from non-referrals, which include all existing users who signed up or made purchases before Oct 31st, are down significantly after the launch of the program.
164 | 
165 | Let's check whether the drop in daily sales from non-referrals after the launch of the program is statistically significant.
166 | 
167 | ```{r}
168 | nonreferrals_before_vs_after <- data %>%
169 |   mutate(before_after_launch = ifelse(date < as.Date("2015-10-31"), "Before Launch", "After Launch")) %>%
170 |   filter(is_referral == 0) %>%
171 |   group_by(date, before_after_launch) %>%
172 |   summarise(daily_sales = sum(money_spent))
173 | 
174 | t.test(nonreferrals_before_vs_after$daily_sales[nonreferrals_before_vs_after$before_after_launch == "Before Launch"], nonreferrals_before_vs_after$daily_sales[nonreferrals_before_vs_after$before_after_launch == "After Launch"])
175 | ```
176 | 
177 | The t-test clearly suggests that the difference in sales from non-referrals before and after the launch of the program is statistically significant.
178 | 
179 | It is quite possible that the new program is cannibalizing sales from existing users. Let's check if that is the case.
180 | ```{r}
181 | # Check if multiple users are using the same device
182 | counts_deviceid <- data %>%
183 |   group_by(device_id) %>%
184 |   summarise(n_users = n_distinct(user_id), referral_types = n_distinct(is_referral)) %>%
185 |   arrange(desc(referral_types))
186 | 
187 | counts_deviceid
188 | 
189 | # Are new referred users coming from existing devices?
190 | referrals_using_existing_devices <- data %>%
191 |   filter(device_id %in% counts_deviceid$device_id[counts_deviceid$referral_types > 1], is_referral == 1) %>%
192 |   select(user_id) %>%
193 |   distinct()
194 | 
195 | nrow(referrals_using_existing_devices)
196 | ```
197 | 
198 | There are 8,571 users who show up as new referrals but made purchases on devices that had already been seen with non-referral purchases.
199 | This strongly suggests that users are referring themselves or members of the same family.
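To size the issue, the sketch below (an addition to the original analysis; `suspect_devices` is just an illustrative name) estimates what share of referral revenue comes from devices that also appear with non-referral purchases:

```{r}
# Share of referral revenue from devices already seen with non-referral purchases
suspect_devices <- counts_deviceid$device_id[counts_deviceid$referral_types > 1]

data %>%
  filter(is_referral == 1) %>%
  summarise(total_referral_revenue = sum(money_spent),
            suspect_referral_revenue = sum(money_spent[device_id %in% suspect_devices]),
            suspect_share = suspect_referral_revenue / total_referral_revenue)
```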
200 | 
201 | Although it appears that sales have increased after launch, the new program is cannibalizing sales from existing users. Users are simply creating new accounts and referring themselves or members of their family to take advantage of the $10 credit for a new referral. The jump in sales we witnessed after launch could very well have happened even if the program had not launched.
202 | 
203 | ### Next Steps :
204 | The company should come up with a robust way to test the referral program that controls for unwanted confounders.
205 | If they decide to keep the referral program, they should look into ways to prevent abuse of the program, for example by restricting the number of new user sign-ups allowed from a single device.
206 | 
207 | 
208 | # Question 3:
209 | 
210 | The referral program wasn't really tested in a rigorous way. It simply started on a given day for all users and you are drawing conclusions by looking at the data before and after the test started. What kinds of risks this approach presents? Can you think of a better way to test the referral program and measure its impact?
211 | 
212 | The way the referral program was tested is faulty. The experimental setup does not account for the seasonal nature of sales, and there could be confounding variables in play that affect sales. This approach poses the risk of a false positive: concluding that the new program is causing the jump in sales when in fact it is not.
213 | 
214 | A better way to test the referral program is testing by markets.
215 | Metric : Average daily sales
216 | 
217 | ### Experimental Setup :
218 | 
219 | * Identify pairs of markets or zones that have similar sales and demographic characteristics by looking at the historical data.
220 | * Estimate the sample size (number of days) needed to detect the expected effect size.
221 | * Within each similar pair, randomly select one market to be the test group and launch the referral program there. The other market will be the control group.
222 | * Run the test at the same time for both the test and control groups until the required sample size is reached. Running both at the same time is critical to control for seasonality and other confounders.
223 | * Check whether the difference in sales is significant using a t-test.
224 | 
225 | 
226 | 
227 | 
--------------------------------------------------------------------------------
/User Referral Program/User_Referral_Program.md:
--------------------------------------------------------------------------------
1 | User Referral Program
2 | ================
3 | Siddhartha Jetti
4 | 10/12/2019
5 | 
6 | # Goal
7 | 
8 | Almost all sites have a user referral program: you can invite new users
9 | to try a given product. Typically, after the new user completes a
10 | transaction, you get rewarded with a certain amount of money or credit
11 | to be used on the site.
12 | 
13 | The goal of this challenge is to analyze the data from a referral
14 | program and draw conclusions about its effectiveness.
15 | 
16 | # Challenge Description
17 | 
18 | Company XYZ has started a new referral program on Oct, 31. Each user who
19 | refers a new user will get 10$ in credit when the new user buys
20 | something.
21 | 
22 | The program has been running for almost a month and the Growth Product
23 | Manager wants to know if it’s been successful. She is very excited
24 | cause, since the referral program started, the company saw a spike in
25 | number of users and wants you to be able to give her some data she can
26 | show to her boss.
27 | 28 | - Can you estimate the impact the program had on the site? 29 | - Based on the data, what would you suggest to do as a next step? 30 | - The referral program wasn’t really tested in a rigorous way. It 31 | simply started on a given day for all users and you are drawing 32 | conclusions by looking at the data before and after the test 33 | started. What kinds of risks this approach presents? Can you think 34 | of a better way to test the referral program and measure its impact? 35 | 36 | # Data 37 | 38 | We have one dataset “referral.csv” 39 | 40 | ### Columns: 41 | 42 | - user\_id : the id of the user 43 | - date : date of the purchase 44 | - country : user country based on ip address 45 | - money\_spent : how much the item bought costs (in USD) 46 | - is\_referral : whether the user came from the referral program(1) or 47 | not (0) 48 | - device\_id : It is an identifier of the device used to make the 49 | purchase. You can assume here that for a given device, its id never 50 | changes 51 | 52 | # Problem Setup 53 | 54 | ``` r 55 | # Load required libraries 56 | library(tidyverse) 57 | ``` 58 | 59 | ## Registered S3 methods overwritten by 'ggplot2': 60 | ## method from 61 | ## [.quosures rlang 62 | ## c.quosures rlang 63 | ## print.quosures rlang 64 | 65 | ## ── Attaching packages ──────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ── 66 | 67 | ## ✔ ggplot2 3.1.1 ✔ purrr 0.3.2 68 | ## ✔ tibble 2.1.1 ✔ dplyr 0.8.1 69 | ## ✔ tidyr 0.8.3 ✔ stringr 1.4.0 70 | ## ✔ readr 1.3.1 ✔ forcats 0.4.0 71 | 72 | ## ── Conflicts ─────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ── 73 | ## ✖ dplyr::filter() masks stats::filter() 74 | ## ✖ dplyr::lag() masks stats::lag() 75 | 76 | ``` r 77 | library(ggplot2) 78 | 79 | # Read in the input data into a dataframe 80 | data <- read.csv("referral.csv", stringsAsFactors = F) 81 | ``` 82 | 83 | # Data Exploration 84 | 85 | Explore data 86 | 87 | ``` r 88 | # Check datatypes of all the variables in the dataset 89 | str(data) 90 | ``` 91 | 92 | ## 'data.frame': 97341 obs. of 6 variables: 93 | ## $ user_id : int 2 3 6 7 7 10 17 19 19 19 ... 94 | ## $ date : chr "2015-10-03" "2015-10-03" "2015-10-03" "2015-10-03" ... 95 | ## $ country : chr "FR" "CA" "FR" "UK" ... 96 | ## $ money_spent: int 65 54 35 73 35 36 25 69 17 29 ... 97 | ## $ is_referral: int 0 0 0 0 0 0 0 0 0 0 ... 98 | ## $ device_id : chr "EVDCJTZMVMJDG" "WUBZFTVKXGQQX" "CBAPCJRTFNUJG" "PRGXJZAJKMXRH" ... 99 | 100 | ``` r 101 | summary(data) 102 | ``` 103 | 104 | ## user_id date country money_spent 105 | ## Min. : 1 Length:97341 Length:97341 Min. : 10.00 106 | ## 1st Qu.: 2020 Class :character Class :character 1st Qu.: 27.00 107 | ## Median : 4053 Mode :character Mode :character Median : 42.00 108 | ## Mean : 6355 Mean : 44.69 109 | ## 3rd Qu.:10286 3rd Qu.: 59.00 110 | ## Max. :20000 Max. :220.00 111 | ## is_referral device_id 112 | ## Min. :0.0000 Length:97341 113 | ## 1st Qu.:0.0000 Class :character 114 | ## Median :0.0000 Mode :character 115 | ## Mean :0.2878 116 | ## 3rd Qu.:1.0000 117 | ## Max. :1.0000 118 | 119 | The range of each of the variables appear reasonable without any 120 | non-sensical values. 121 | 122 | ``` r 123 | # check for any missing values in the dataset 124 | colSums(is.na(data)) 125 | ``` 126 | 127 | ## user_id date country money_spent is_referral device_id 128 | ## 0 0 0 0 0 0 129 | 130 | No missing values exist in the dataset. 131 | 132 | Run descriptive stats on the dataset. 
133 | 134 | ``` r 135 | # Total revenue by date 136 | data %>% 137 | mutate(is_referral = as.factor(is_referral), date = as.Date(date)) %>% 138 | ggplot(aes(x = date, y = money_spent)) + 139 | stat_summary(fun.y=sum, geom="line") + 140 | stat_summary(fun.y=sum, geom="point") + 141 | geom_vline(xintercept = as.Date("2015-10-31"), linetype = "dashed", color = "red") 142 | ``` 143 | 144 | ![](User_Referral_Program_files/figure-gfm/unnamed-chunk-4-1.png) 145 | 146 | The above plot shows a jump in daily sales after the referral program is 147 | launched on Oct 31st (dashed line in the plot). 148 | 149 | ``` r 150 | # Split revenue between users from referrals and non=referred users 151 | data %>% 152 | group_by(is_referral) %>% 153 | summarise(unique_users = n_distinct(user_id), total_sales = sum(money_spent), revenue_per_user = sum(money_spent)/n_distinct(user_id)) 154 | ``` 155 | 156 | ## # A tibble: 2 x 4 157 | ## is_referral unique_users total_sales revenue_per_user 158 | ## 159 | ## 1 0 11951 3034612 254. 160 | ## 2 1 12715 1315787 103. 161 | 162 | ``` r 163 | # Total revenue split between referrals and non-referrals 164 | data %>% 165 | mutate(is_referral = as.factor(is_referral), date = as.Date(date)) %>% 166 | ggplot(aes(x = date, y = money_spent, color = is_referral)) + 167 | stat_summary(fun.y=sum, geom="line") + 168 | stat_summary(fun.y=sum, geom="point") + 169 | geom_vline(xintercept = as.Date("2015-10-31"), linetype = "dashed", color = "black") 170 | ``` 171 | 172 | ![](User_Referral_Program_files/figure-gfm/unnamed-chunk-5-1.png) 173 | 174 | The above plot reveals that daily sales from users, who came from 175 | referrals, are helping the overall sales after the launch of program. 176 | However, due to the way the experiment is set up, the jump in sales 177 | cannot be attributed to the referral program. 178 | 179 | # Question 1: 180 | 181 | Can you estimate the impact the program had on the site? 182 | 183 | Here I plan to use daily sales as the metric to assess the impact of 184 | referral program. It is easy to think of other metrics like number of 185 | users who made atleast one purchase, money spent per purchase etc. At 186 | the end of day, total revenue is much more meaningful metric than 187 | others. 188 | 189 | Summarize the data by date to get the daily sales. 190 | 191 | ``` r 192 | data_daily <- data %>% 193 | group_by(date) %>% 194 | summarise(daily_sales = sum(money_spent)) 195 | 196 | data_daily 197 | ``` 198 | 199 | ## # A tibble: 56 x 2 200 | ## date daily_sales 201 | ## 202 | ## 1 2015-10-03 128475 203 | ## 2 2015-10-04 130772 204 | ## 3 2015-10-05 46946 205 | ## 4 2015-10-06 49711 206 | ## 5 2015-10-07 45242 207 | ## 6 2015-10-08 48463 208 | ## 7 2015-10-09 48389 209 | ## 8 2015-10-10 129837 210 | ## 9 2015-10-11 128229 211 | ## 10 2015-10-12 48712 212 | ## # … with 46 more rows 213 | 214 | Compare the daily sales before and after the launch of referral program. 215 | This is done by running T-test on daily sales and check if the 216 | difference is statistically 217 | significant. 
218 | 219 | ``` r 220 | t.test(data_daily$daily_sales[data_daily$date < as.Date("2015-10-31")], data_daily$daily_sales[data_daily$date >= as.Date("2015-10-31")]) 221 | ``` 222 | 223 | ## 224 | ## Welch Two Sample t-test 225 | ## 226 | ## data: data_daily$daily_sales[data_daily$date < as.Date("2015-10-31")] and data_daily$daily_sales[data_daily$date >= as.Date("2015-10-31")] 227 | ## t = -1.1138, df = 53.026, p-value = 0.2704 228 | ## alternative hypothesis: true difference in means is not equal to 0 229 | ## 95 percent confidence interval: 230 | ## -33770.245 9655.459 231 | ## sample estimates: 232 | ## mean of x mean of y 233 | ## 71657.00 83714.39 234 | 235 | The T-test suggests that the difference in daily sales before and after 236 | the launch of program is not statistically significant. 237 | 238 | Check the performance of referral program across different user 239 | segments. 240 | 241 | ``` r 242 | # Compute average daily sales for each country 243 | data_daily_country <- data %>% 244 | mutate(before_after_launch = ifelse(date < as.Date("2015-10-31"), "Before Launch", "After Launch"), 245 | country = as.factor(country)) %>% 246 | group_by(date, country, before_after_launch) %>% 247 | summarise(daily_sales = sum(money_spent)) %>% 248 | group_by(country) %>% 249 | summarise( avg_diff_daily_sales = mean(daily_sales[before_after_launch == "Before Launch"]) - mean(daily_sales[before_after_launch == "After Launch"]), 250 | p_value = t.test(daily_sales[before_after_launch == "Before Launch"], daily_sales[before_after_launch == "After Launch"])$p.value) 251 | 252 | data_daily_country 253 | ``` 254 | 255 | ## # A tibble: 9 x 3 256 | ## country avg_diff_daily_sales p_value 257 | ## 258 | ## 1 CA -412 0.703 259 | ## 2 CH 512. 0.0139 260 | ## 3 DE 1843. 0.163 261 | ## 4 ES -2012. 0.0750 262 | ## 5 FR -3250. 0.0637 263 | ## 6 IT -2542. 0.0515 264 | ## 7 MX -2058. 0.0193 265 | ## 8 UK -2983. 0.0970 266 | ## 9 US -1156. 0.498 267 | 268 | The difference in average daily sales before and after launch are 269 | statistically significant for Mexico MX and China CH. However, sales are 270 | better after launch in MX and worse in CH. 271 | 272 | ``` r 273 | # Plot average daily sales Vs Country 274 | data %>% 275 | mutate(before_after_launch = ifelse(date < as.Date("2015-10-31"), "Before Launch", "After Launch"), 276 | country = as.factor(country)) %>% 277 | group_by(date, country, before_after_launch) %>% 278 | summarise(daily_sales = sum(money_spent)) %>% 279 | group_by(country, before_after_launch) %>% 280 | summarise(avg_daily_sales = mean(daily_sales)) %>% 281 | ggplot(aes(x = country, y = avg_daily_sales, color = before_after_launch, group = before_after_launch)) + 282 | geom_line() + 283 | geom_point() 284 | ``` 285 | 286 | ![](User_Referral_Program_files/figure-gfm/unnamed-chunk-9-1.png) 287 | 288 | The above plot reveals that the referral program is performing 289 | differently across different countries. The average daily sales dropped 290 | after the launch of program in countries like China CH and Germany DE. 291 | While sales appear to increase in other countries after the launch. 292 | 293 | Based on the way the experiment is set up, Any change in sales after the 294 | launch of referral program cannot be attributed to it alone. There are 295 | other confounding factors like seasonality or launch of some marketing 296 | event that could be affecting the sales. 
In US and in some other 297 | countries, Thanks giving (occurs in end of November every year) is one 298 | of the major holiday season and sales are generally expected to grow and 299 | the experiment is not controlling for the lurking variables like that. 300 | 301 | # Question 2: 302 | 303 | Based on the data, what would you suggest to do as a next step? 304 | 305 | From the revenue split between referrals and non-referrals plot, It is 306 | apparent that the sales from the non-referrals, which includes all 307 | existing users who signed up/made purchases before Oct 31st, after 308 | launch of the program are down significantly. 309 | 310 | Lets check if the drop in sales from the non-referrals is significant 311 | before and after launch of program is significant. 312 | 313 | ``` r 314 | nonreferrals_before_vs_after <- data %>% 315 | mutate(before_after_launch = ifelse(date < as.Date("2015-10-31"), "Before Launch", "After Launch")) %>% 316 | filter(is_referral == 0) %>% 317 | group_by(date, before_after_launch) %>% 318 | summarise(daily_sales = sum(money_spent)) 319 | 320 | t.test(nonreferrals_before_vs_after$daily_sales[nonreferrals_before_vs_after$before_after_launch == "Before Launch"], nonreferrals_before_vs_after$daily_sales[nonreferrals_before_vs_after$before_after_launch == "After Launch"]) 321 | ``` 322 | 323 | ## 324 | ## Welch Two Sample t-test 325 | ## 326 | ## data: nonreferrals_before_vs_after$daily_sales[nonreferrals_before_vs_after$before_after_launch == and nonreferrals_before_vs_after$daily_sales[nonreferrals_before_vs_after$before_after_launch == "Before Launch"] and "After Launch"] 327 | ## t = 4.3754, df = 40.085, p-value = 8.417e-05 328 | ## alternative hypothesis: true difference in means is not equal to 0 329 | ## 95 percent confidence interval: 330 | ## 18799.12 51070.88 331 | ## sample estimates: 332 | ## mean of x mean of y 333 | ## 71657 36722 334 | 335 | Clearly, T-test suggests that the difference in sales from non-referrals 336 | before and after the launch of program is statistically significant. 337 | 338 | It is quite possible that new program is cannibalizing sales from the 339 | existing users. Lets check if that is the case. 340 | 341 | ``` r 342 | # Check if multiple users are using same device 343 | counts_deviceid <- data %>% 344 | group_by(device_id) %>% 345 | summarise(n_users = n_distinct(user_id), referral_types = n_distinct(is_referral)) %>% 346 | arrange(desc(referral_types)) 347 | 348 | counts_deviceid 349 | ``` 350 | 351 | ## # A tibble: 17,887 x 3 352 | ## device_id n_users referral_types 353 | ## 354 | ## 1 AAEIJZQUXUETO 1 2 355 | ## 2 AAFWFCPHRVTTE 2 2 356 | ## 3 AALHUFWNVROUD 2 2 357 | ## 4 AANBNDYFXDDTU 2 2 358 | ## 5 AANMQLIXUDKFJ 2 2 359 | ## 6 AAPOSGLGOGVMI 2 2 360 | ## 7 AAUOIIUZHHWTL 1 2 361 | ## 8 ABABAWCOLPZXO 2 2 362 | ## 9 ABADIUGCJRDOV 2 2 363 | ## 10 ABAPCEOKZCDAH 1 2 364 | ## # … with 17,877 more rows 365 | 366 | ``` r 367 | # Are new referred users coming from existing devices 368 | referrals_using_existing_devices <- data %>% 369 | filter(device_id %in% counts_deviceid$device_id[counts_deviceid$referral_types > 1], is_referral == 1) %>% 370 | select(user_id) %>% 371 | distinct() 372 | 373 | nrow(referrals_using_existing_devices) 374 | ``` 375 | 376 | ## [1] 8571 377 | 378 | There are about 8571 users that were showing up as new referrals, but 379 | were using devices seen earlier. Clearly, users are referring themselves 380 | or members of same family. 
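One way to quantify how much this matters (a follow-up sketch, not part of the original analysis; `suspect_devices` is an illustrative name) is to repeat the before/after comparison of daily sales after dropping purchases made by referred users on devices that also appear with non-referral purchases:

``` r
# Before/after comparison of daily sales excluding suspected self-referral purchases
suspect_devices <- counts_deviceid$device_id[counts_deviceid$referral_types > 1]

daily_sales_excl_suspect <- data %>%
  filter(!(is_referral == 1 & device_id %in% suspect_devices)) %>%
  group_by(date) %>%
  summarise(daily_sales = sum(money_spent))

t.test(daily_sales_excl_suspect$daily_sales[daily_sales_excl_suspect$date < as.Date("2015-10-31")],
       daily_sales_excl_suspect$daily_sales[daily_sales_excl_suspect$date >= as.Date("2015-10-31")])
```

If the remaining difference is no longer significant, the apparent lift is largely explained by the suspected self-referrals.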
381 | 382 | Although it appears that sales have increased after launch, The new 383 | program is cannibalizing the sales from the existing users. Users are 384 | simply creating new accounts and referring themselves or members of 385 | family to take advantage of the $10 credit for a new referral. The jump 386 | in sales we witnessed after launch could very well have happened even if 387 | the program would not have launched. 388 | 389 | ### Next Steps : 390 | 391 | The company should come up with a robust way to test the referral 392 | program that controls for unwanted confounders. If they decide to launch 393 | the referral program, They should look into ways to prevent the abuse of 394 | program by restricting number new user sign-ups from a single device. 395 | 396 | # Question 3: 397 | 398 | The referral program wasn’t really tested in a rigorous way. It simply 399 | started on a given day for all users and you are drawing conclusions by 400 | looking at the data before and after the test started. What kinds of 401 | risks this approach presents? Can you think of a better way to test the 402 | referral program and measure its impact? 403 | 404 | The way the referral program was tested is faulty. The experimental set 405 | up is not accounting for the seasonal nature of sales and there could be 406 | confounding variables in play that affect the sales. This approach poses 407 | the risk of false positive, concluding that the new program is causing 408 | the jump in sales when in fact it is not. 409 | 410 | A better way to test the referral program is Testing by Markets. Metric 411 | : Average Daily sales 412 | 413 | ### Experimental Setup : 414 | 415 | - Identify the pairs of markets or zones that have similar sales and 416 | demographic characteristics by looking at the historical data. 417 | - Estimate the sample size (number of days) needed to detect the 418 | expected effect size. 419 | - Now among the similar pair, randomly select one to be test group and 420 | launch the referral program. The other will be the control group. 421 | - Run the test at the sametime for both test and contol groups until 422 | required sample size is reached. Running the test at the sametime is 423 | critical to control for the seasonality and other confounders. 424 | - Check if difference in sales is significant using a T-test. 
425 | -------------------------------------------------------------------------------- /User Referral Program/User_Referral_Program_files/figure-gfm/unnamed-chunk-4-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/User Referral Program/User_Referral_Program_files/figure-gfm/unnamed-chunk-4-1.png -------------------------------------------------------------------------------- /User Referral Program/User_Referral_Program_files/figure-gfm/unnamed-chunk-5-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/User Referral Program/User_Referral_Program_files/figure-gfm/unnamed-chunk-5-1.png -------------------------------------------------------------------------------- /User Referral Program/User_Referral_Program_files/figure-gfm/unnamed-chunk-9-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/User Referral Program/User_Referral_Program_files/figure-gfm/unnamed-chunk-9-1.png --------------------------------------------------------------------------------