├── Ads Analysis ├── Ads_analysis.Rmd ├── Ads_analysis.md └── Ads_analysis_files │ ├── figure-html │ ├── unnamed-chunk-10-1.png │ ├── unnamed-chunk-11-1.png │ └── unnamed-chunk-14-1.png │ └── figure-markdown_github │ ├── unnamed-chunk-10-1.png │ ├── unnamed-chunk-11-1.png │ └── unnamed-chunk-14-1.png ├── Clustering Grocery Items ├── Clustering_Grocery_items.Rmd ├── Clustering_Grocery_items.md └── Clustering_Grocery_items_files │ └── figure-markdown_github │ └── unnamed-chunk-14-1.png ├── Diversity in Workplace ├── Diversity in the Workplace.Rmd ├── Diversity_in_the_Workplace.md └── Diversity_in_the_Workplace_files │ └── figure-gfm │ ├── unnamed-chunk-10-1.png │ ├── unnamed-chunk-11-1.png │ ├── unnamed-chunk-12-1.png │ ├── unnamed-chunk-13-1.png │ ├── unnamed-chunk-13-2.png │ ├── unnamed-chunk-13-3.png │ ├── unnamed-chunk-13-4.png │ ├── unnamed-chunk-13-5.png │ ├── unnamed-chunk-13-6.png │ ├── unnamed-chunk-13-7.png │ ├── unnamed-chunk-13-8.png │ ├── unnamed-chunk-5-1.png │ ├── unnamed-chunk-6-1.png │ ├── unnamed-chunk-7-1.png │ ├── unnamed-chunk-8-1.png │ └── unnamed-chunk-9-1.png ├── Engagement Test ├── Engagement Test.Rmd ├── Engagement_Test.md └── Engagement_Test_files │ └── figure-gfm │ ├── unnamed-chunk-11-1.png │ └── unnamed-chunk-13-1.png ├── Funnel Analysis ├── Funnel_analysis.Rmd ├── Funnel_analysis.md └── Funnel_analysis_files │ └── figure-gfm │ ├── unnamed-chunk-12-1.png │ ├── unnamed-chunk-14-1.png │ ├── unnamed-chunk-15-1.png │ ├── unnamed-chunk-15-2.png │ ├── unnamed-chunk-15-3.png │ ├── unnamed-chunk-15-4.png │ ├── unnamed-chunk-15-5.png │ └── unnamed-chunk-9-1.png ├── JSON City Similarities ├── JSON City Similarities.Rmd ├── JSON_City_Similarities.md ├── JSON_City_Similarities_files │ └── figure-gfm │ │ ├── unnamed-chunk-10-1.png │ │ ├── unnamed-chunk-4-1.png │ │ ├── unnamed-chunk-4-2.png │ │ ├── unnamed-chunk-4-3.png │ │ ├── unnamed-chunk-4-4.png │ │ ├── unnamed-chunk-4-5.png │ │ ├── unnamed-chunk-4-6.png │ │ └── unnamed-chunk-4-7.png └── JSON_City_Similarity_files │ └── figure-gfm │ ├── unnamed-chunk-10-1.png │ ├── unnamed-chunk-4-1.png │ ├── unnamed-chunk-4-2.png │ ├── unnamed-chunk-4-3.png │ ├── unnamed-chunk-4-4.png │ ├── unnamed-chunk-4-5.png │ ├── unnamed-chunk-4-6.png │ └── unnamed-chunk-4-7.png ├── Loan granting ├── Loan granting.Rmd ├── Loan_granting.md └── Loan_granting_files │ └── figure-gfm │ ├── unnamed-chunk-10-1.png │ ├── unnamed-chunk-11-1.png │ ├── unnamed-chunk-12-1.png │ ├── unnamed-chunk-12-10.png │ ├── unnamed-chunk-12-11.png │ ├── unnamed-chunk-12-12.png │ ├── unnamed-chunk-12-2.png │ ├── unnamed-chunk-12-3.png │ ├── unnamed-chunk-12-4.png │ ├── unnamed-chunk-12-5.png │ ├── unnamed-chunk-12-6.png │ ├── unnamed-chunk-12-7.png │ ├── unnamed-chunk-12-8.png │ ├── unnamed-chunk-12-9.png │ ├── unnamed-chunk-16-1.png │ ├── unnamed-chunk-16-2.png │ ├── unnamed-chunk-16-3.png │ ├── unnamed-chunk-16-4.png │ ├── unnamed-chunk-8-1.png │ ├── unnamed-chunk-9-1.png │ └── unnamed-chunk-9-2.png ├── Online Video Challenge ├── Online Video Challenge.Rmd ├── Online_Video_Challenge.md └── Online_Video_Challenge_files │ └── figure-gfm │ ├── unnamed-chunk-10-1.png │ ├── unnamed-chunk-11-1.png │ ├── unnamed-chunk-14-1.png │ ├── unnamed-chunk-14-2.png │ ├── unnamed-chunk-15-1.png │ ├── unnamed-chunk-16-1.png │ ├── unnamed-chunk-17-1.png │ ├── unnamed-chunk-18-1.png │ ├── unnamed-chunk-19-1.png │ ├── unnamed-chunk-20-1.png │ ├── unnamed-chunk-20-2.png │ ├── unnamed-chunk-20-3.png │ ├── unnamed-chunk-20-4.png │ ├── unnamed-chunk-20-5.png │ ├── unnamed-chunk-20-6.png │ ├── 
unnamed-chunk-20-7.png │ ├── unnamed-chunk-21-1.png │ ├── unnamed-chunk-7-1.png │ └── unnamed-chunk-8-1.png ├── README.md ├── Song Challenge ├── Song_Challenge.Rmd └── Song_Challenge.md ├── URL Parsing Challenge ├── URL Parsing Challenge.Rmd └── URL_Parsing_Challenge.md └── User Referral Program ├── User Referral Program.Rmd ├── User_Referral_Program.md └── User_Referral_Program_files └── figure-gfm ├── unnamed-chunk-4-1.png ├── unnamed-chunk-5-1.png └── unnamed-chunk-9-1.png /Ads Analysis/Ads_analysis.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Ads Analysis" 3 | author: "Siddhartha Jetti" 4 | date: "July 17, 2019" 5 | output: rmarkdown::github_document 6 | --- 7 | 8 | # Goal 9 | 10 | Maybe the first industry to heavily rely on data science was the online ads industry. Data Science is used to choose which ads to show, how much to pay, optimize the ad text and the position as well as in countless of other related applications. 11 | 12 | Optimizing ads is one of the most intellectually challenging jobs a data scientist can do. It is a really complex problem given the huge (really really huge) size of the datasets as well as number of features that can be used. 13 | Moreover, companies often spend huge amounts of money in ads and a small ad optimization improvement can be worth millions of dollars for the company. 14 | 15 | The goal of this project is to look at a few ad campaigns and analyze their current performance as well as predict their future performance. 16 | 17 | 18 | # Challenge Description 19 | 20 | Company XYZ is a food delivery company. Like pretty much any other site, in order to get customers, they have been relying significantly on online ads, such as those you see on Google or Facebook. 21 | 22 | At the moment, they are running 40 different ad campaigns and want you to help them understand their performance. 23 | 24 | Specifically, you are asked to: 25 | 26 | 1) If you had to identify the 5 best ad groups, which ones would be? Which metric did you choose to identify the best ones? Why? Explain the pros of your metric as well as the possible cons. From a business perspective, choosing that metric implies that you are focusing on what? 27 | 28 | 29 | 2) For each group, predict how many ads will be shown on Dec, 15 (assume each ad group keeps following its trend). 30 | 31 | 32 | 3) Cluster ads into 3 groups: the ones whose avg_cost_per_click is going up, the ones whose avg_cost_per_click is flat and the ones whose avg_cost_per_click is going down. 33 | 34 | 35 | ## Data 36 | 37 | We have 1 table downloadable by clicking here. 38 | 39 | The table is: 40 | 41 | ad_table - aggregate information about ads 42 | 43 | ### Columns: 44 | 45 | * date : all data are aggregated by date 46 | * shown : the number of ads shown on a given day all over the web. Impressions are free. That is, companies pay only if a user clicks on the ad, not to show it 47 | * clicked : the number of clicks on the ads. This is what companies pay for. By clicking on the ad, the user is brought to the site 48 | converted : the number of conversions on the site coming from ads. To be counted, a conversion as to happen on the same day as the ad click. 49 | * avg_cost_per_click : on an average, how much it cost each of those clicks 50 | * total_revenue : how much revenue came from the conversions 51 | * ad : we have several different ad versions with different text. 
This shows which ad group we are considering 52 | 53 | 54 | # Problem Setup 55 | 56 | ```{r} 57 | # Load required libraries 58 | library(dplyr) 59 | library(ggplot2) 60 | library(lubridate) 61 | library(randomForest) 62 | 63 | # Read in the input data into a dataframe 64 | ads <- read.csv("ad_table.csv", stringsAsFactors = F) 65 | ``` 66 | 67 | # Data Exploration 68 | 69 | Explore the ads dataset 70 | ```{r} 71 | # Transform variables in ads dataset 72 | ads <- ads %>% 73 | mutate(date = as.Date(date)) %>% 74 | arrange(ad, date) 75 | 76 | # Check data types of each of the columns 77 | summary(ads) 78 | 79 | # Take a peek at data 80 | head(ads) 81 | ``` 82 | 83 | Removing rows with shown = 0 or revenue < 0 84 | 85 | ```{r} 86 | ads_cleaned <- ads %>% 87 | filter(shown > 0, clicked > 0, total_revenue >= 0) 88 | ``` 89 | 90 | Now, checking the summary of dataset 91 | 92 | ```{r} 93 | summary(ads_cleaned) 94 | ``` 95 | 96 | Check for missing values in the data. 97 | ```{r} 98 | # count of missing values by column in views dataset 99 | colSums(is.na(ads_cleaned)) 100 | ``` 101 | No missing values exist anywhere in the data. 102 | 103 | Check if data exists for all the ads for all the dates. 104 | ```{r} 105 | table(ads_cleaned$date) 106 | ``` 107 | 108 | ```{r} 109 | table(ads_cleaned$ad) 110 | ``` 111 | 112 | Overall the data looks good. 113 | 114 | # Question 1: 115 | 116 | Here the goal is to choose the best 5 ads based on the provided data. 117 | I plan on using average return on advertising budget as a metric to choose the top ad campaigns. 118 | 119 | ```{r} 120 | # calculate average daily return for every ad 121 | best_ads_avg_revenue <- ads_cleaned %>% 122 | group_by(ad) %>% 123 | summarise(avg_return = sum(total_revenue) / sum(clicked * avg_cost_per_click)) %>% 124 | arrange(desc(avg_return)) 125 | 126 | head(best_ads_avg_revenue, 5) 127 | ``` 128 | 129 | The problem with using average return as a metric is, it does not identify the ads with low average return but are consistently trending up. Arguably, the ads that are trending up are equally important to marketing teams, if not more, than the ones with high average return and trending down. So, it makes sense to look both the average return and the trend to know the complete story. 130 | 131 | ```{r} 132 | # Function to extract trend in the data 133 | slope_trendline <- function(y, x) { 134 | trendline <- lm(formula = y ~ x) 135 | return(trendline$coefficients[2]) 136 | } 137 | 138 | # Function to compute the p-valuye of x coefficient 139 | slope_p_value <- function(y, x) { 140 | trendline <- lm(formula = y ~ x) 141 | df <- data.frame(summary(trendline)$coefficients) 142 | return(df[2,4]) 143 | } 144 | 145 | # Estimate the trend over time for every ad 146 | best_ads <- ads_cleaned %>% 147 | group_by(ad) %>% 148 | mutate(date = as.numeric(date), 149 | day_return = total_revenue / (clicked * avg_cost_per_click)) %>% 150 | summarise(trend = slope_trendline(day_return, date)) %>% 151 | inner_join(best_ads_avg_revenue, by = "ad") %>% 152 | arrange(desc(avg_return)) 153 | 154 | head(best_ads, 10) 155 | ``` 156 | 157 | All the top 5 ad campaigns by average return are making money (return > 1). 158 | Among them, The ad groups 31, 16 and 14 are having a positive trend. While groups 2 and 16 have high average return but are trending down. 159 | 160 | # Question 2: 161 | 162 | Here the goal is to predict how many times the ad will be shown on a future date. 
163 | 164 | The simplest method is to fit a straight line that is closest to the data points for each ad group and use it to predict future impressions. 165 | 166 | Let's pick a few arbitrary ad groups and visualize the time series of impressions. 167 | ```{r} 168 | ads_cleaned %>% 169 | filter(ad == "ad_group_31") %>% 170 | 171 | ggplot(aes(date, shown)) + 172 | geom_point() + 173 | geom_smooth(method = "lm", se = FALSE) 174 | ``` 175 | 176 | ```{r} 177 | ads_cleaned %>% 178 | filter(ad == "ad_group_38") %>% 179 | 180 | ggplot(aes(date, shown)) + 181 | geom_point() + 182 | geom_smooth(method = "lm", se = FALSE) 183 | ``` 184 | 185 | Fitting a line to the time series of all the ad campaigns 186 | 187 | ```{r} 188 | 189 | unique_ads <- unique(ads_cleaned$ad) 190 | new <- data.frame(x = as.numeric(as.Date("2015-12-15"))) 191 | prediction <- data.frame(ad = c(), prediction = c()) 192 | 193 | for(i in unique_ads) { 194 | x <- as.numeric(ads_cleaned$date[ads_cleaned$ad == i]) 195 | y <- ads_cleaned$shown[ads_cleaned$ad == i] 196 | predicted <- round(predict(lm(y ~ x), newdata = new), digits = 0) 197 | predicted_df <- data.frame(ad = i, prediction = predicted, stringsAsFactors = F) 198 | prediction <- bind_rows(prediction, predicted_df) 199 | } 200 | 201 | head(prediction) 202 | ``` 203 | 204 | # Question 3: 205 | 206 | First fit a linear regression line to extract the trend from the data. 207 | 208 | ```{r} 209 | # Function to normalize the variables 210 | normalize <- function(x) { 211 | return((x - min(x)) / (max(x) - min(x))) 212 | } 213 | 214 | # Normalize the variables, then group by ad and extract the slope 215 | ads_cpc_summary <- ads_cleaned %>% 216 | mutate(date = as.numeric(date), 217 | cpc_norm = normalize(avg_cost_per_click), 218 | date_norm = normalize(date)) %>% 219 | group_by(ad) %>% 220 | summarise(cpc_slope = slope_trendline(cpc_norm, date_norm), 221 | cpc_slope_p_value = slope_p_value(cpc_norm, date_norm)) 222 | 223 | # Take a look at the data 224 | head(ads_cpc_summary) 225 | ``` 226 | 227 | 228 | The distribution of slopes of the cost-per-click trendlines 229 | ```{r} 230 | quantile(ads_cpc_summary$cpc_slope, probs = seq(0, 1, by = 0.05)) 231 | ads_cpc_summary %>% 232 | ggplot() + 233 | geom_histogram(bins = 30, aes(x = cpc_slope)) 234 | ``` 235 | 236 | Trends are extracted by fitting a linear regression line to the time series of average cost per click. The following rules can be used to classify ads by whether average cost per click is going up, staying flat or going down. 237 | 238 | * If the coefficient of the x-term is statistically significant and the coefficient > 0, then average cost per click is going up. 239 | * If the coefficient is statistically significant and the coefficient < 0, then average cost per click is going down. 240 | * If the coefficient is NOT statistically significant, then average cost per click is flat. 241 | 242 | The criterion for statistical significance should not be simply p < 0.05, because we are effectively running 40 different tests. To reduce the risk of incorrectly rejecting a null hypothesis due to multiple comparisons, the significance threshold of each test needs to be adjusted using the Bonferroni correction (0.05 / number of comparisons).
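As a side note (not part of the original analysis), base R's `p.adjust()` applies the Bonferroni correction directly by scaling the p-values, which is equivalent to comparing the raw p-values against 0.05 / number of tests. A minimal sketch, assuming the `ads_cpc_summary` table built above:

```{r}
# Illustrative only: Bonferroni-adjust the slope p-values and count how many
# remain significant at the 5% level.
adjusted_p <- p.adjust(ads_cpc_summary$cpc_slope_p_value, method = "bonferroni")
table(adjusted_p < 0.05)
```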
243 | 244 | ```{r} 245 | # Classifying ads based on the stated rules 246 | ads_cpc_summary <- ads_cpc_summary %>% 247 | mutate(cpc_category = case_when( 248 | cpc_slope_p_value < (0.05/40) & cpc_slope > 0 ~ "Going up", 249 | cpc_slope_p_value < (0.05/40) & cpc_slope < 0 ~ "Going down", 250 | TRUE ~ "Flat") 251 | ) 252 | 253 | # Frequency of different video categories 254 | table(ads_cpc_summary$cpc_category) 255 | ``` 256 | 257 | Employing Boniferoni correction is classifying all the campaigns as average cost per click remaining Flat. The reason being Boniferoni correction generally imposes a very stringent condition for significance. 258 | -------------------------------------------------------------------------------- /Ads Analysis/Ads_analysis_files/figure-html/unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Ads Analysis/Ads_analysis_files/figure-html/unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /Ads Analysis/Ads_analysis_files/figure-html/unnamed-chunk-11-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Ads Analysis/Ads_analysis_files/figure-html/unnamed-chunk-11-1.png -------------------------------------------------------------------------------- /Ads Analysis/Ads_analysis_files/figure-html/unnamed-chunk-14-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Ads Analysis/Ads_analysis_files/figure-html/unnamed-chunk-14-1.png -------------------------------------------------------------------------------- /Ads Analysis/Ads_analysis_files/figure-markdown_github/unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Ads Analysis/Ads_analysis_files/figure-markdown_github/unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /Ads Analysis/Ads_analysis_files/figure-markdown_github/unnamed-chunk-11-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Ads Analysis/Ads_analysis_files/figure-markdown_github/unnamed-chunk-11-1.png -------------------------------------------------------------------------------- /Ads Analysis/Ads_analysis_files/figure-markdown_github/unnamed-chunk-14-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Ads Analysis/Ads_analysis_files/figure-markdown_github/unnamed-chunk-14-1.png -------------------------------------------------------------------------------- /Clustering Grocery Items/Clustering_Grocery_items.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Clustering Grocery Items' 3 | author: "Siddhartha Jetti" 4 | 
date: "7/25/2019" 5 | output: rmarkdown::github_document 6 | --- 7 | 8 | # Goal 9 | 10 | Online shops often sell tons of different items and this can become messy very quickly! 11 | 12 | Data science can be extremely useful to automatically organize the products in categories so that they can be easily found by the customers. 13 | 14 | The goal of this challenge is to look at user purchase history and create categories of items that are likely to be bought together and, therefore, should belong to the same cluster. 15 | 16 | 17 | # Challenge Description 18 | 19 | Company XYZ is an online grocery store. In the current version of the website, they have manually grouped the items into a few categories based on their experience. 20 | 21 | However, they now have a lot of data about user purchase history. Therefore, they would like to put the data into use! 22 | 23 | This is what they asked you to do: 24 | 25 | 1)The company founder wants to meet with some of the best customers to go through a focus group with them. You are asked to send the ID of the following customers to the founder: 26 | 27 | * the customer who bought the most items overall in her lifetime 28 | 29 | * for each item, the customer who bought that product the most 30 | 31 | 2)Cluster items based on user co-purchase history. That is, create clusters of products that have the highest probability of being bought together. The goal of this is to replace the old/manually created categories with these new ones. Each item can belong to just one cluster. 32 | 33 | 34 | # Data 35 | 36 | We have 2 table downloadable by clicking on here. 37 | 38 | The 2 tables are: 39 | 40 | item_to_id - for each item, it gives the corresponding id 41 | 42 | ## Columns: 43 | 44 | * Item_name : The name of the item 45 | * Item_id : the id of the item. Can be joined to the id in the other table. It is unique by item. 46 | 47 | purchase_history - for each user purchase, the items bought 48 | 49 | ## Columns: 50 | 51 | * user_id : The id of the user. 52 | * id : comma-separated list of items bought together in that transaction. 53 | 54 | 55 | # Problem Setup 56 | 57 | ```{r} 58 | # Load required libraries 59 | library(tidyverse) 60 | library(ggplot2) 61 | 62 | # Read and process input data into a dataframe 63 | items <- read.csv("item_to_id.csv", stringsAsFactors = F) 64 | purchases <- read.csv("purchase_history.csv", stringsAsFactors = F) 65 | ``` 66 | 67 | # Data Exploration 68 | 69 | Check data types of columns in items dataset 70 | ```{r} 71 | # Check data types of each of the columns 72 | str(items) 73 | summary(items) 74 | ``` 75 | 76 | Check data types of columns in purchases dataset. 77 | ```{r} 78 | # Check data types of each of the columns 79 | str(purchases) 80 | summary(purchases) 81 | ``` 82 | 83 | Check for missing values in the data 84 | ```{r} 85 | # Check if any missing values exist 86 | colSums(is.na(items)) 87 | 88 | colSums(is.na(purchases)) 89 | ``` 90 | 91 | 92 | Check for duplicates in the data 93 | 94 | ```{r} 95 | # check if any duplicate item id exist 96 | length(items$Item_id) == length(unique(items$Item_id)) 97 | 98 | ``` 99 | 100 | ```{r} 101 | # check if any duplicate user id exist 102 | length(purchases$user_id) == length(unique(purchases$user_id)) 103 | 104 | ``` 105 | 106 | Clearly, there are duplicate user ids in purchases dataset. This is OK because single user can make multiple transactions. 107 | 108 | It is important to have an unique id for each transaction. 
Now lets create transaction id using the user id in purchases dataset. 109 | 110 | ```{r} 111 | purchases <- purchases %>% 112 | group_by(user_id) %>% 113 | mutate(transaction_id = paste0(user_id, "_", row_number())) 114 | 115 | # Take a peek at the data 116 | head(purchases) 117 | 118 | # Check if created transaction ids are unique 119 | length(purchases$transaction_id) == length(unique(purchases$transaction_id)) 120 | ``` 121 | 122 | Overall the data looks good. 123 | 124 | # Question 1 125 | 126 | ```{r} 127 | # Get maximum number of items purchased in a single transaction 128 | # This is done by counting the occurences of "," + 1 129 | max_items <- max(str_count(purchases$id, ",")) + 1 130 | ``` 131 | 132 | Now transform purchases dataset into tidy format for future use. 133 | 134 | ```{r} 135 | purchases_tidy <- purchases %>% 136 | separate(col = id, into = paste0("item", 1:max_items), sep = ",") %>% 137 | gather(key = "value", value = "item_id", -user_id, -transaction_id) %>% 138 | filter(!is.na(item_id)) %>% 139 | mutate(item_id = as.integer(item_id)) 140 | 141 | head(purchases_tidy) 142 | ``` 143 | 144 | Get user id that made the most number of purchases in the life time. 145 | 146 | ```{r} 147 | most_units_bought <- purchases_tidy %>% 148 | group_by(user_id) %>% 149 | summarise(units_bought = n()) %>% 150 | arrange(desc(units_bought)) %>% 151 | filter(row_number() == 1) 152 | 153 | # user id with most number of purchases 154 | most_units_bought 155 | ``` 156 | 157 | Get user ids with most units bought by product. 158 | ```{r} 159 | most_units_by_item <- purchases_tidy %>% 160 | group_by(item_id, user_id) %>% 161 | summarise(units_bought = n()) %>% 162 | arrange(item_id, desc(units_bought)) %>% 163 | filter(row_number() == 1) %>% 164 | inner_join(items, by = c("item_id" = "Item_id")) 165 | 166 | # user id with most number of purchases by item 167 | most_units_by_item 168 | ``` 169 | 170 | # Question 2 171 | 172 | Now the goal is to create clusters of items that have highest probability of being purchased together. 173 | 174 | Each grocery item can be imagined as a point in the n-dimensional space spun by transactions. Each coordinate of the point(n-dimensional) would be the number of units of the item purchased in transaction corresponding to the coordinate. 175 | 176 | ```{r} 177 | # Build item-transaction matrix 178 | item_transaction_matrix <- purchases_tidy %>% 179 | group_by(transaction_id, item_id) %>% 180 | summarise(nunits = n()) %>% 181 | ungroup() %>% 182 | spread(transaction_id, nunits) %>% 183 | mutate_all(list(~replace_na(., 0))) 184 | 185 | head(item_transaction_matrix) 186 | ``` 187 | 188 | To cluster the items based on transaction history, I choose to use Kmeans clustering algorithm. Here the dataset has all the variables on the same scale and pretty much same meaning. So, I expect K-means to perform well. The advantage of using K-means is that it is highly interpretable and can easily be explained. 189 | 190 | K-means algorithm chooses the clusters such a way that within cluster variance is minimum for a given number of clusters. The optimal number of clusters is determined by running kmeans with different number of clusters and plotting the Elbow curve (within cluster variance vs number of clusters) and also results should make sense from UI standpoint without containing too many clusters. 191 | 192 | For stability, Kmeans algorithm is run multiple times for each configuration of clusters. The mean of variance is used for plotting the elbow curve. 
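As a side note (not part of the original workflow), `kmeans()` can also handle repeated random starts internally through its `nstart` argument, keeping the best of the runs for a given number of centers; the explicit loop below achieves a similar effect while also recording the average within-cluster variance. A small illustration:

```{r}
# Illustrative only: 10 random starts for a single choice of k, best run kept.
illustrative_fit <- kmeans(item_transaction_matrix[, -1], centers = 5, nstart = 10)
illustrative_fit$tot.withinss
```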
193 | 194 | ```{r} 195 | # Set seed 196 | set.seed(2019) 197 | # Place holder to store within variance for several configuration of clusters 198 | within_SS <- c() 199 | # Try different number of clusters 200 | nclusters <- 2:20 201 | # 10 tries for each configuration of clusters 202 | tries <- 10 203 | 204 | # Run Kmeans for different number of clusters 205 | for(i in nclusters){ 206 | tries_within_SS <- c() 207 | # Run the Kmeans 10 times for each configuration of clusters 208 | for(try in 1:tries){ 209 | clusters <- kmeans(item_transaction_matrix[,-1], centers = i) 210 | tries_within_SS <- c(tries_within_SS, clusters$tot.withinss) 211 | } 212 | within_SS <- c(within_SS, mean(tries_within_SS)) 213 | } 214 | ``` 215 | 216 | Plotting the Elbow curve. 217 | ```{r} 218 | data.frame(k = nclusters, within_SS = within_SS) %>% 219 | ggplot(aes(x = k, y = within_SS)) + 220 | geom_point() + 221 | geom_line() + 222 | ggtitle("Within Sum of Squares vs Number of Clusters") 223 | ``` 224 | 225 | Unfortunately, the above plot does not reveal an obvious "elbow" point. But there is a slight change in gradient after k=9 or 10. So, let us choose k = 9 as the optimal number of clusters. 226 | 227 | ```{r} 228 | set.seed(2019) 229 | clusters <- kmeans(item_transaction_matrix[,-1], centers = 9) 230 | 231 | item_clusters <- data.frame(item_id = item_transaction_matrix[,1], cluster = clusters$cluster) %>% 232 | mutate(item_id = as.integer(item_id)) %>% 233 | inner_join(items, by = c("item_id" = "Item_id")) %>% 234 | group_by(cluster) %>% 235 | summarise(item_count = n(), items = paste0(Item_name, collapse = ", ")) 236 | 237 | item_clusters 238 | ``` 239 | 240 | Looking at the item clusters. 241 | 242 | * All the vegetables except lettuce are together. 243 | * All fruits are grouped together. 244 | * All kinds of meat are together. 245 | * Beverages are clustered together. 246 | * Snacks are grouped. 247 | 248 | However, cluster 6 appears to have too many items. Lets try to re-run the algorithm by increasing the number of clusters. 249 | 250 | ```{r} 251 | set.seed(2019) 252 | clusters <- kmeans(item_transaction_matrix[,-1], centers = 12) 253 | 254 | item_clusters2 <- data.frame(item_id = item_transaction_matrix[,1], cluster = clusters$cluster) %>% 255 | mutate(item_id = as.integer(item_id)) %>% 256 | inner_join(items, by = c("item_id" = "Item_id")) %>% 257 | group_by(cluster) %>% 258 | summarise(item_count = n(), items = paste0(Item_name, collapse = ", ")) 259 | 260 | item_clusters2 261 | ``` 262 | 263 | Increasing the number of clusters to 12 resulted in decreasing the maximum number of items in cluster from 22 to 17. It also resulted in breaking out the milk products into different cluster and appears to be performing better than with 9 clusters. 
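One optional sanity check for the final grouping (not in the original analysis) is the average silhouette width, which summarizes how well separated the clusters are; values closer to 1 indicate tighter, better-separated clusters. A sketch using the `cluster` package:

```{r}
# Illustrative only: average silhouette width of the 12-cluster solution,
# computed on the same item-by-transaction representation.
library(cluster)
sil <- silhouette(clusters$cluster, dist(item_transaction_matrix[, -1]))
mean(sil[, "sil_width"])
```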
-------------------------------------------------------------------------------- /Clustering Grocery Items/Clustering_Grocery_items_files/figure-markdown_github/unnamed-chunk-14-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Clustering Grocery Items/Clustering_Grocery_items_files/figure-markdown_github/unnamed-chunk-14-1.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity in the Workplace.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Diversity in the Workplace' 3 | author: "Siddhartha Jetti" 4 | date: "5/30/2019" 5 | output: rmarkdown::github_document 6 | --- 7 | 8 | # Goal 9 | 10 | Diversity, unconscious bias in the workplace and, in general, the way companies treat their employees are a very important topic. 11 | Data science can help discover potential discriminations by looking at the data and see if there are segments of employees that are treated worse. 12 | 13 | # Challenge Description 14 | 15 | There has been lots of talking about diversity in the workplace, especially in technology. The Head of HR at your company is very concerned about that and has asked you to analyze internal data about employees and see whether results suggest that the company is treating all its employees fairly or not. 16 | 17 | Specifically, she gave you the following tasks: 18 | 19 | 1) In the company there are 6 levels (described below). Identify, for each employee, her corresponding level. 20 | * Individual Contributors(IC) - they don’t manage anyone 21 | * Middle Managers(MM) - they are the direct bosses of IC 22 | * Directors(D) - they are the direct bosses of MM 23 | * VP - D direct bosses 24 | * Executives (E) - VP direct bosses 25 | * CEO - The direct boss of E. 26 | 27 | 2) How many people each employee manages? Consider that if John directly manages 2 people and these two people manage 5 people each, then we conclude that John manages 12 people. 28 | 29 | 30 | 3) Build a model to predict the salary of each employee. 31 | Describe the main factors impacting employee salaries. Do you think the company has been treating all its employees fairly? What are the next steps you would suggest to the Head of HR? 32 | 33 | # Data 34 | 35 | We have 2 tables downloadable by clicking here. 36 | 37 | The 2 tables are: 38 | 39 | company_hierarchy - info about each employee direct boss and her dept 40 | 41 | ## Columns: 42 | 43 | * employee_id : the Id of the employee. It is unique by employee and can be joined to employee id in the other table 44 | * boss id : the id of the boss of employee id. It is unique by employee and can be joined to employee id in the other table. 45 | * dept : employee id dept. There are the following departments: 46 | + Engineering (data science is under engineering) 47 | + Marketing 48 | + Sales 49 | + HR 50 | + the “CEO” dept just for the CEO, since belongs to all the above dept. 51 | 52 | employee - info about each employee 53 | 54 | ## Columns: 55 | 56 | * employee_id : the Id of the employee. 
It is unique by employee and can be joined to employee id and/or boss id in the other table 57 | * signing_bonus : whether the employee got a signing bonus when she joined the company (1 -> yes, 0 -> no) 58 | * salary : the current salary of that employee in USD 59 | * degree level: the highest degree received by the employee. 60 | * sex: Male/Female 61 | * yrs_experience: employee years of work experience 62 | 63 | 64 | # Problem Setup 65 | 66 | ```{r} 67 | # Load required libraries 68 | library(dplyr) 69 | library(ggplot2) 70 | library(randomForest) 71 | 72 | # Read in the input data into a dataframe 73 | data_hierarchy <- read.csv("company_hierarchy.csv") 74 | data_employee <- read.csv("employee.csv") 75 | ``` 76 | 77 | 78 | # Question 1: 79 | 80 | Get the count of reports and level from hierarchy table. 81 | ```{r} 82 | hierarchy <- data_hierarchy %>% 83 | filter(!is.na(boss_id)) 84 | 85 | # Function to count direct/in-direct reports and get the hierarchy level for every employee 86 | countReports <- function(df,id) { 87 | reports <- df[df$boss_id == id,] 88 | reports1 <- reports 89 | len <- nrow(reports) 90 | level <- ifelse(len == 0, 1, 2) 91 | while(1){ 92 | reports <- df[df$boss_id %in% reports$employee_id,] 93 | reports1 <- rbind(reports1, reports) 94 | if(nrow(reports) == len){ 95 | return(data.frame(employee_id = id, level = level, report_count = nrow(reports1))) 96 | break 97 | } 98 | len <- nrow(reports) 99 | level <- level + 1 100 | } 101 | } 102 | 103 | data_reports <- data.frame(employee_id = NULL, level = NULL, report_count = NULL ) 104 | emps <- nrow(data_hierarchy) 105 | 106 | # Loop through all the employess to get level and reports count. 107 | for(i in 1:emps){ 108 | data_reports <- rbind(data_reports, countReports(hierarchy, data_hierarchy$employee_id[i])) 109 | } 110 | 111 | # Join hierarchy related information with employee information to form one table 112 | data <- data_reports %>% 113 | arrange(desc(report_count), employee_id) %>% 114 | left_join(data_hierarchy, by = "employee_id") %>% 115 | select(employee_id, dept, level, report_count) %>% 116 | right_join(data_employee, by = "employee_id") %>% 117 | mutate(signing_bonus = as.factor(signing_bonus), 118 | level = as.factor(case_when( level == 3 ~ "MM", 119 | level == 4 ~ "D", 120 | level == 5 ~ "VP", 121 | level == 6 ~ "E", 122 | level == 7 ~ "CEO", 123 | TRUE ~ "IC" )) 124 | ) 125 | ``` 126 | 127 | # Question 2: 128 | 129 | Run descriptive statistics 130 | ```{r} 131 | # Check datatypes of all the variables in the dataframe 132 | str(data) 133 | ``` 134 | 135 | ```{r} 136 | # Take a peek at the data 137 | head(data) 138 | summary(data) 139 | 140 | # check for duplicate user ids 141 | length(data$user_id) == length(unique(data$user_id)) 142 | ``` 143 | 144 | Here are some observations about data: 145 | 146 | * There are total of 10000 employees in dataset and each employee id is unique 147 | * The mean and median salary of employess across the company are USD 182k and 189k respectively. 148 | 149 | ```{r} 150 | # Median is more robust to outliers than mean 151 | salary_dept = data %>% 152 | filter(!dept %in% c("CEO")) %>% 153 | group_by(dept) %>% 154 | summarise(med_salary = median(salary)) 155 | ggplot(data = salary_dept, aes(x = dept, y = med_salary))+ 156 | geom_bar(stat = "identity") 157 | 158 | ``` 159 | 160 | It looks like folks in engineering are earning way more than HR and other depts. 
This could mean that there is a bias by dept, or it could just be a spurious correlation; we can only tell after building a model. 161 | 162 | ```{r} 163 | salary_level = data %>% 164 | group_by(level) %>% 165 | summarise(med_salary = median(salary)) %>% 166 | arrange(desc(med_salary)) 167 | 168 | ggplot(data = salary_level, aes(x = level, y = med_salary))+ 169 | geom_bar(stat = "identity") 170 | ``` 171 | The above plot indicates that median salaries follow CEO > Executive > VP > Director > MM > IC, which makes sense given the increasing scope and responsibility. 172 | 173 | ```{r} 174 | salary_degree = data %>% 175 | group_by(degree_level) %>% 176 | summarise(med_salary = median(salary)) %>% 177 | arrange(desc(med_salary)) 178 | 179 | ggplot(data = salary_degree, aes(x = degree_level, y = med_salary))+ 180 | geom_bar(stat = "identity") 181 | ``` 182 | 183 | Overall, the median salary for bachelor's degree graduates is slightly lower than for high school graduates and increases with advanced degrees; PhD graduates have the highest median salary. One possible explanation for bachelor's graduates earning less than high school graduates is that the salary gains from the higher degree are offset by the loss of several years of experience (4 or more). 184 | 185 | ```{r} 186 | salary_gender = data %>% 187 | group_by(sex) %>% 188 | summarise(med_salary = median(salary)) %>% 189 | arrange(desc(med_salary)) 190 | 191 | ggplot(data = salary_gender, aes(x = sex, y = med_salary))+ 192 | geom_bar(stat = "identity") 193 | ``` 194 | 195 | The plot shows that the median salary of males is noticeably higher than that of females. However, this plot alone doesn't confirm a bias, as gender could be a proxy for other variables, such as educational degree or years of experience, that are known to determine salary. 196 | 197 | ```{r} 198 | # Plot salary vs yrs of experience 199 | ggplot(data = data, aes(x = yrs_experience, y = salary)) + 200 | geom_point() + 201 | geom_smooth(method = "loess", se = FALSE) 202 | 203 | ``` 204 | 205 | The plot indicates that salaries increase with years of experience and grow faster after 10-15 years. 206 | 207 | ```{r} 208 | # Plot salary vs number of reports 209 | data %>% 210 | filter(report_count > 0) %>% 211 | ggplot(aes(x = report_count, y = salary)) + 212 | geom_point() + 213 | geom_smooth(method = 'loess', se = FALSE) 214 | ``` 215 | 216 | # Model Building 217 | 218 | Random forest would be a good choice for predicting salary, as it is robust to outliers and works well with correlated, continuous and discrete variables. Random forests are also relatively easy to tune. Partial dependence plots can be used to capture insights from the model. 219 | 220 | A standard 66/34% split can be used to generate training and test datasets.
If training dataset is not large enough then cross validation can be used to arrive at the optimum parameters 221 | 222 | ```{r} 223 | 224 | # Training and test set split 225 | train_sample = sample(nrow(data), size = round(nrow(data)*0.66)) 226 | train_data = data[train_sample,] 227 | test_data = data[-train_sample,] 228 | 229 | # Build Random forest model on the data with mostly default settings except for class weight and #trees 230 | set.seed(2019) 231 | rf.fit <- randomForest(y=train_data$salary, x = train_data[,-c(6)], ytest = test_data$salary, 232 | xtest = test_data[,-c(6)], ntree = 250, mtry = 3, keep.forest = TRUE) 233 | 234 | rf.fit 235 | # Visualize Important variables 236 | varImpPlot(rf.fit) 237 | ``` 238 | The training and test MSE are close and looks like model is not overfitting to data. 239 | Now, Building the model without the dominating variable to see if it changes importance of other variables. 240 | 241 | ```{r} 242 | rf.fit2 <- randomForest(y=train_data$salary, x = train_data[,-c(2,6)], ytest = test_data$salary, 243 | xtest = test_data[,-c(2,6)], ntree = 250, mtry = 3, keep.forest = TRUE) 244 | 245 | rf.fit2 246 | # Visualize Important variables 247 | varImpPlot(rf.fit2) 248 | ``` 249 | 250 | After dropping the dept variable, the variance explained drops dramatically. So rf.fit is indeed optimal model. 251 | 252 | # Question 3: 253 | 254 | ```{r} 255 | # Order variables by importance 256 | imp_vars <- importance(rf.fit) 257 | pdp_vars <- names(sort(imp_vars[,1], decreasing = T)) 258 | 259 | # Loop through variables and build PDP in the decreasing order of variable importance 260 | for (i in seq_along(pdp_vars)) { 261 | partialPlot(rf.fit, data[,-6], pdp_vars[i], xlab="", 262 | main=paste("Partial Dependence on", pdp_vars[i])) 263 | } 264 | 265 | ``` 266 | 267 | # Understanding Partial Dependence Plots 268 | 269 | * Looking at the variable importance and PD plots, the main factors that impact the salaries are dept, yrs of experience and level or number of direct/indirect reports. 270 | 271 | * Although the descriptive bar chart of Salary Vs Gender show that Females are underpaid compared to Males, after controlling for variables like department, yrs of experience and number of reports no such bias in salary based on gender exists. 272 | 273 | * The PDP on dept reveals that salaries in engineering are higher than any other department and salaries in HR are the lowest. 274 | 275 | * Interestingly after controlling for dept, yrs of experience and level, educational degree does not really affect the salary. 276 | 277 | * Signing bonus and gender does not seem to affect the salary. 278 | 279 | # Conclusions & next steps 280 | 281 | * Based on the data, the people working in engineering dept are likely to get higher salaries than other depts. However, the reason for higher salaries could be long work hours, high stress levels, scarcity of engineering talent or difficulty in retaining talent etc. 282 | 283 | * The next steps for the Head of HR would be to identify the most likely reason and try to reduce the salary gap across the depts. 
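A complementary check worth adding (a sketch only, not part of the original notebook): a linear model that controls for the structural drivers of pay gives a direct, interpretable estimate of any residual gender effect, alongside the partial dependence plots above.

```{r}
# Illustrative only: the coefficient on sex estimates the gender gap after
# controlling for department, level, experience and span of control.
fairness_check <- lm(salary ~ sex + dept + level + yrs_experience + report_count, data = data)
summary(fairness_check)$coefficients
```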
284 | 285 | -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-11-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-11-1.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-12-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-12-1.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-1.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-2.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-3.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-4.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-5.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-5.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-6.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-7.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-13-8.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-5-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-5-1.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-6-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-6-1.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-7-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-7-1.png -------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-8-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-8-1.png 
-------------------------------------------------------------------------------- /Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-9-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Diversity in Workplace/Diversity_in_the_Workplace_files/figure-gfm/unnamed-chunk-9-1.png -------------------------------------------------------------------------------- /Engagement Test/Engagement Test.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Engagement Test" 3 | author: "Siddhartha Jetti" 4 | date: "October 05, 2019" 5 | output: rmarkdown::github_document 6 | --- 7 | 8 | # Goal 9 | 10 | Many sites make money by selling ads. For these sites, the number of pages visited by users on each session is one of the most important metric, if not the most important metric. 11 | 12 | Data science plays a huge role here, especially by building models to suggest personalized content. In order to check if the model is actually improving engagement, companies then run A/B tests. It is often data scientist responsibility to analyze test data and understand whether the model has been successful. The goal of this project is to look at A/B test results and draw conclusions. 13 | 14 | # Challenge Description 15 | 16 | The company of this exercise is a social network. They decided to add a feature called: Recommended Friends, i.e. they suggest people you may know. 17 | 18 | A data scientist has built a model to suggest 5 people to each user. These potential friends will be shown on the user newsfeed. At first, the model is tested just on a random subset of users to see how it performs compared to the newsfeed without the new feature. 19 | 20 | The test has been running for some time and your boss asks you to check the results. You are asked to check, for each user, the number of pages visited during their first session since the test started. If this number increased, the test is a success. 21 | 22 | Specifically, your boss wants to know: 23 | * Is the test winning? That is, should 100% of the users see the Recommended Friends feature? 24 | * Is the test performing similarly for all user segments or are there differences among different segments? 25 | * If you identified segments that responded differently to the test, can you guess the reason? Would this change your point 1 conclusions? 26 | 27 | # Data 28 | 29 | The 2 tables are: 30 | 31 | "user_table" - info about each user sign-up date 32 | 33 | ### Columns: 34 | * user_id : the Id of the user. It is unique by user and can be joined to user id in the other table 35 | * signup_date : when the user joined the social network 36 | 37 | "test_table" - data about the test results. For each user, we only consider about how many pages she visited on Jan, 2. The first session since the date when the test started. That is, if the test started on Jan 1, and user 1 visited the site on Jan, 2 and Jan, 3, we only care about how many pages she visited on Jan, 2. 38 | 39 | ### Columns: 40 | * user_id : the Id of the user 41 | * date : the date of the first session since the test started 42 | * browser : user browser during that session 43 | * test: 1 if the user saw the new feature, 0 otherwise 44 | * pages_visited: the metric we care about. 
# of pages visited in that session 45 | 46 | # Problem Setup 47 | ```{r} 48 | # Load required libraries 49 | library(tidyverse) 50 | library(ggplot2) 51 | 52 | # Read in the input data into a dataframe 53 | users <- read.csv("user_table.csv", stringsAsFactors = F) 54 | test <- read.csv("test_table.csv", stringsAsFactors = F) 55 | ``` 56 | 57 | 58 | # Data Exploration 59 | 60 | Explore users and test datasets 61 | ```{r} 62 | # Transform variables into right format 63 | users <- users %>% 64 | mutate(signup_date = as.Date(signup_date)) %>% 65 | arrange(user_id, signup_date) 66 | 67 | # Check datatypes of all the variables in the users dataset 68 | str(users) 69 | summary(users) 70 | ``` 71 | 72 | ```{r} 73 | # Transform variables into right format 74 | test <- test %>% 75 | mutate(date = as.Date(date)) 76 | 77 | # Check datatypes of all the variables in the test dataset 78 | str(test) 79 | summary(test) 80 | ``` 81 | 82 | All the columns appear to have legitimate values. 83 | 84 | ```{r} 85 | # Merge the two datasets 86 | data <- users %>% 87 | inner_join(test, by = "user_id") %>% 88 | arrange(date, user_id) 89 | 90 | # check for any missing values in the merged dataset 91 | colSums(is.na(data)) 92 | ``` 93 | 94 | No missing values exist anywhere in the data. 95 | 96 | ```{r} 97 | # Take a peek at the data 98 | head(data) 99 | 100 | # Check if duplicates of user id exist 101 | length(unique(data$user_id)) == length(data$user_id) 102 | ``` 103 | 104 | No duplicates in user id exist in the dataset. 105 | 106 | ```{r} 107 | unique(data$date) 108 | 109 | ``` 110 | 111 | Based on the sign-up date, we can know if the user is new or existing user. 112 | Let us classify the user as new, if he/she signed up after the start of test which is "2015-08-01". 113 | 114 | ```{r} 115 | data <- data %>% 116 | mutate(new_user = ifelse(signup_date >= as.Date("2015-08-01"), 1, 0)) 117 | 118 | table(data$new_user) 119 | ``` 120 | 121 | # Question 1: 122 | 123 | Is the test winning? That is, should 100% of the users see the Recommended Friends feature? 124 | 125 | First check the mean number of pages visited across test and control groups. 126 | ```{r} 127 | data_summary <- data %>% 128 | group_by(test) %>% 129 | summarise(avg_pages_visited = mean(pages_visited)) 130 | 131 | data_summary 132 | ``` 133 | 134 | ## Check randomization 135 | 136 | To decide if the new feature is resulting in higher number of pages visited, lets run an A/B test on the feature. 137 | The correctness of AB test depends hugely on assigning users to test and control groups at random. Now, Check if users of different browsers are randomly assigned to test and control groups. 138 | ```{r} 139 | data %>% 140 | group_by(browser, new_user) %>% 141 | summarise(prop_test = sum(test == 1)/n(), prop_control = sum(test == 0)/n()) 142 | 143 | ``` 144 | 145 | Based on the proportions of test and control groups, Users appears to be assigned almost randomly between test and control groups. 146 | 147 | Now, Run t.test on the test and control datasets 148 | 149 | ```{r} 150 | t.test(data$pages_visited[data$test == 0], data$pages_visited[data$test == 1]) 151 | ``` 152 | 153 | The mean of pages visited in test group is lower than the mean for the control group. 154 | The obtained p_value > 0.05 implies that the observed difference in sample means could have been happened out of random chance. Based on the above data, There is no reason to believe that the two groups are different and the feature should not be launched for all the users. 
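One more angle that may help when communicating this result (a sketch, not in the original write-up): the confidence interval of the difference in means indicates how large an effect, positive or negative, the data can rule out.

```{r}
# Illustrative only: 95% confidence interval for the difference in mean pages
# visited (control minus test).
t.test(data$pages_visited[data$test == 0], data$pages_visited[data$test == 1])$conf.int
```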
155 | 156 | # Question 2: 157 | 158 | Is the test performing similarly for all user segments or are there differences among different segments? 159 | 160 | First, let's plot the mean number of pages visited by user browser. 161 | ```{r} 162 | # Summarize data by test and control groups 163 | data_test_by_browser = data %>% 164 | group_by(browser) %>% 165 | summarize(Test = mean(pages_visited[test==1]), Control = mean(pages_visited[test==0])) 166 | 167 | data_test_by_browser 168 | 169 | # Plot the data 170 | data_test_by_browser %>% 171 | gather(key = treatment, value = mean_pages, -browser) %>% 172 | ggplot(aes(x = browser, y = mean_pages, group = treatment, color = treatment)) + 173 | geom_line() + 174 | geom_point() 175 | ``` 176 | 177 | The average number of pages visited for the test group among users coming from the Opera browser is 0. Something is wrong with the data. 178 | 179 | The average number of pages visited for new vs existing users 180 | ```{r} 181 | # Summarize data in test and control groups based on new user 182 | data %>% 183 | group_by(new_user) %>% 184 | summarize(Test = mean(pages_visited[test==1]), Control = mean(pages_visited[test==0])) 185 | ``` 186 | 187 | Check the test vs control means for different segments. 188 | 189 | ```{r} 190 | data %>% 191 | mutate(new_user = as.factor(new_user)) %>% 192 | group_by(browser, new_user) %>% 193 | summarize(test_control_ratio = mean(pages_visited[test == 1])/mean(pages_visited[test == 0])) %>% 194 | ggplot(aes(x = browser, y = test_control_ratio, color = new_user, group = new_user)) + 195 | geom_line() + 196 | geom_point() 197 | ``` 198 | 199 | The plot above reveals the following: 200 | 201 | * The average number of pages visited for the test group among users from the Opera browser is 0. 202 | 203 | * For new users, the test group consistently underperforms the control group (test-to-control ratio less than 1) for all browsers, with the exception of Opera. 204 | 205 | * For existing users, the test performs better than control (ratio greater than 1) for all browsers except Opera. 206 | 207 | Let's run the t-test by segment. 208 | 209 | ```{r} 210 | data %>% 211 | group_by(browser, new_user) %>% 212 | summarise(Test = mean(pages_visited[test == 1]), Control = mean(pages_visited[test == 0]), 213 | Diff = Test - Control, 214 | p_value = t.test(pages_visited[test == 1], pages_visited[test == 0])$p.value) 215 | ``` 216 | 217 | Based on the test-control differences and p-values, the test is clearly performing differently across segments. 218 | 219 | # Question 3: 220 | 221 | If you identified segments that responded differently to the test, can you guess the reason? Would this change your point 1 conclusions? 222 | 223 | 224 | The above results can be more or less reproduced with multiple linear regression. 225 | ```{r} 226 | 227 | # Simple linear regression 228 | model <- lm(pages_visited ~ test, data) 229 | summary(model) 230 | 231 | # Multiple linear regression controlling for browser and new user 232 | model <- lm(pages_visited ~ browser + test + new_user, data) 233 | summary(model) 234 | ``` 235 | 236 | The potential reasons for the observed differences across segments are: 237 | 238 | * The mean number of pages visited for the test group among users coming from the Opera browser being 0 doesn't make sense. Most likely there is a problem with the logging system, or a technical glitch that is preventing users from browsing after the friend recommendation feature was implemented for the Opera browser.
239 | 240 | * Clearly the friend recommendation feature is experiencing cold start problem. The new users donot have any friends and recommendation is pretty much based on a random guess. It is possible that these random friend recommendations are turning off new users and resulting in fewer number of pages visited. 241 | 242 | * The performance of feature on existing users reveals that after the user data is collected the recommendation feature is able to recommend relevant friends. 243 | 244 | The above findings would not change conclusion from question 1, as the test hypothesis when we designed the test was to test for all the users. The next step would be to fix the issue with the Opera browser and re-run the test to make decision whether or not to implement the feature for all the users. 245 | -------------------------------------------------------------------------------- /Engagement Test/Engagement_Test_files/figure-gfm/unnamed-chunk-11-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Engagement Test/Engagement_Test_files/figure-gfm/unnamed-chunk-11-1.png -------------------------------------------------------------------------------- /Engagement Test/Engagement_Test_files/figure-gfm/unnamed-chunk-13-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Engagement Test/Engagement_Test_files/figure-gfm/unnamed-chunk-13-1.png -------------------------------------------------------------------------------- /Funnel Analysis/Funnel_analysis.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Funnel Analysis' 3 | author: "Siddhartha Jetti" 4 | date: "6/2/2019" 5 | output: rmarkdown::github_document 6 | --- 7 | 8 | # Goal 9 | 10 | The goal is to perform funnel analysis for an e-commerce website. 11 | 12 | Typically, websites have a clear path to conversion: for instance, you land on the home page, then you search, select a product and buy it. At each of these steps, some users will drop off and leave the site. The sequence of pages that leads to conversion is called ‘funnel’ . 13 | 14 | Data Science can have a tremendous impact on funnel optimization. 15 | Funnel analysis allows to understand where/when our users abandon the website. It gives crucial insights on user behavior and on ways to improve the user experience as well as it often allows to discover bugs. 16 | 17 | # Challenge Description 18 | 19 | You are looking at data from an e-commerce website. The site is very simple and has just 4 pages. 20 | 21 | The first page is the home page. When you come to the site for the first time, you can only land on the home page as a first page. 22 | 23 | From the home page, the user can perform a search and land on the search page. 24 | 25 | From the search page, if the user clicks on a product, she will get to the payment page, where she is asked to provide payment information in order to buy that product. 26 | 27 | If she does decide to buy, she ends up on the confirmation page. 28 | 29 | The company CEO isn’t very happy with the company sales and, especially, sales coming from new users. 
Therefore, she asked you to investigate whether there is something wrong in the conversion funnel or, in general, if you can create hypotheses on how conversion rate could be improved. 30 | 31 | Specifically, she is interested in : 32 | 33 | 1) A full picture of funnel conversion rate for both desktop and mobile 34 | 35 | 2) Some insights on what the product team should focus on in order to improve conversion rate as well as any anything you might discover that could help improve conversion rate. 36 | 37 | # Data 38 | 39 | We have 5 tables downloadable by clicking on here. 40 | All the tables refer to only the user first experience on the site. The 5 tables are: 41 | 42 | user_table - info about the user 43 | 44 | Columns: 45 | 46 | * user_id : the Id of the user. It is unique by user and can be joined to user id in all other tables 47 | * date : the date when the user firstly landed on the site 48 | * device : user device. Can be mobile or desktop 49 | * sex : male/female 50 | 51 | home_page_table - Users who landed on the home page 52 | 53 | Columns: 54 | 55 | * user_id : the Id of the user. It is unique by user and can be joined to user id in all other tables 56 | * page : it is always home_page. 57 | 58 | search_page_table - Users who landed on the search_page 59 | 60 | Columns: 61 | 62 | * user_id : the Id of the user. It is unique by user and can be joined to user id in all other tables 63 | * page : it is always search_page 64 | 65 | payment_page_table - Users who landed on the payment_page 66 | 67 | Columns: 68 | 69 | * user_id : the Id of the user. It is unique by user and can be joined to user id in all other tables 70 | * page : it is always payment_page 71 | 72 | payment_confirmation_table - Users who landed on the payment_confirmation_table. That is, these are the users who bought the product. 73 | 74 | Columns: 75 | 76 | * user_id : the Id of the user. It is unique by user and can be joined to user id in all other tables 77 | * page : it is always payment_confirmation_page 78 | 79 | 80 | # Problem Setup 81 | 82 | ```{r} 83 | # Load required libraries 84 | library(tidyverse) 85 | library(randomForest) 86 | library(lubridate) 87 | 88 | # Read all the .csv files in the directory 89 | home_page <- read.csv("home_page_table.csv") 90 | search_page <- read.csv("search_page_table.csv") 91 | payment_page <- read.csv("payment_page_table.csv") 92 | payment_confirm_page <- read.csv("payment_confirmation_table.csv") 93 | users <- read.csv("user_table.csv") 94 | ``` 95 | 96 | # Question 1 97 | 98 | Before Merging the different datasets, its important to check the validity of data. 99 | One important check that could be done is if every user who hit search previously hit home page, every user who hit payment page should exist in previous pages and so on. 
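That funnel-ordering check is performed right below. In addition, a lightweight sketch (assuming the five data frames loaded above) to confirm that user_id is unique within each page table and to record the raw counts at each funnel step:

```{r}
# Additional sanity checks before merging (a sketch): user_id should be unique
# within every page table, and the raw counts should shrink down the funnel
page_tables <- list(home = home_page, search = search_page,
                    payment = payment_page, confirmation = payment_confirm_page)

sapply(page_tables, function(df) length(unique(df$user_id)) == nrow(df))
sapply(page_tables, nrow)
```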
100 | 101 | ```{r} 102 | # Check if all users who hit search page previously visited home page 103 | all(search_page$user_id %in% home_page$user_id) 104 | 105 | # Check if all users who hit payment page previously visited search and home pages 106 | all(payment_page$user_id %in% search_page$user_id) & all(search_page$user_id %in% home_page$user_id) 107 | 108 | # Check if all users who hit payment confirmation page previously visited payment, search and home pages 109 | all(payment_confirm_page$user_id %in% payment_page$user_id) & all(payment_page$user_id %in% search_page$user_id) & all(search_page$user_id %in% home_page$user_id) 110 | 111 | ``` 112 | 113 | Clearly all the users are going through the website in a funnel pattern with out any exceptions and data looks fine. 114 | 115 | ```{r} 116 | page_data <- home_page %>% 117 | left_join(search_page, by = "user_id") %>% 118 | rename(page1 = page.x, page2 = page.y) %>% 119 | left_join(payment_page, by = "user_id") %>% 120 | rename(page3 = page) %>% 121 | left_join(payment_confirm_page, by = "user_id") %>% 122 | rename(page4 = page) %>% 123 | left_join(users, by = "user_id") %>% 124 | mutate(week = week(date), 125 | week_day = wday(date, label = TRUE)) 126 | 127 | ``` 128 | 129 | Run descriptive statistics 130 | ```{r} 131 | head(page_data) 132 | table(page_data$device) 133 | ``` 134 | 135 | Almost two-thirds of traffic is coming from desktops and only one-third is from mobile. 136 | 137 | Overall conversion rates by page 138 | ```{r} 139 | non_conversion <- page_data %>% 140 | select(contains("page")) %>% 141 | mutate_all(is.na) %>% 142 | colMeans() 143 | 144 | conversions <- 1 - non_conversion 145 | names(conversions) <- names(non_conversion) 146 | conversions 147 | ``` 148 | 149 | * The overall conversion rates by page reveal that 50% of traffic coming to home page moved on to search page. 150 | * About 6.7% percent of home page traffic clicked on a product in search page and landed onto payment. 151 | * Only 0.5% of original traffic decided to buy and ended up on confirmation page. 152 | 153 | ```{r} 154 | page_data %>% 155 | group_by(device) %>% 156 | summarize(conversion_1_2 = mean(!is.na(page2)), conversion_2_3 = mean(!is.na(page3)), 157 | conversion_3_4 = mean(!is.na(page4)), conversion_0 = mean(!is.na(page1))) 158 | ``` 159 | 160 | The page wise conversion rates by device reveals that conversion from search to payment page is happening at a much lower rate on desktop than mobile. Also, conversion from payment page to confirmation on desktop is much lower than on mobile. 161 | 162 | # Conclusion 163 | 164 | The huge gap in search to payment conversion between desktop and mobile versions indicates that search is not working properly in desktop version. The problem could be that search in desktop version is not showing relevant results as good as mobile version and resulting in higher percentage of users leaving the site before purchasing. Given that two-thirds of traffic comes from desktop, fixing the search on desktop version would present a great opportunity for the e-commerce company. 165 | 166 | # Question 2 167 | 168 | ```{r} 169 | page_data %>% 170 | group_by(sex) %>% 171 | summarize(conversion_1_2 = mean(!is.na(page2)), conversion_2_3 = mean(!is.na(page3)), 172 | conversion_3_4 = mean(!is.na(page4))) 173 | ``` 174 | 175 | Although the conversion rate among females appears to be higher than males but its not anything shocking. 
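One caveat on the conversion tables above: conversion_2_3 and conversion_3_4 are computed relative to all users in the group, not relative to the users who actually reached the previous page. A sketch of step-conditional rates (assuming page_data from above), which can equally be grouped by sex instead of device:

```{r}
# Step-conditional funnel rates: each step divided by the users who reached the
# previous step (sketch; the tables above divide every step by all users)
page_data %>%
  group_by(device) %>%
  summarize(home_to_search     = sum(!is.na(page2)) / sum(!is.na(page1)),
            search_to_payment  = sum(!is.na(page3)) / sum(!is.na(page2)),
            payment_to_confirm = sum(!is.na(page4)) / sum(!is.na(page3)))
```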
176 | 177 | ```{r} 178 | page_data %>% 179 | group_by(week_day) %>% 180 | summarize(conversion_1_2 = mean(!is.na(page2)), conversion_2_3 = mean(!is.na(page3)), 181 | conversion_3_4 = mean(!is.na(page4))) 182 | ``` 183 | 184 | The conversion rate on Mondays is higher than on any other day of the week. 185 | 186 | ```{r} 187 | page_week_summary <- page_data %>% 188 | group_by(week) %>% 189 | summarize(conversion_1_2 = mean(!is.na(page2)), conversion_2_3 = mean(!is.na(page3)), 190 | conversion_3_4 = mean(!is.na(page4))) 191 | 192 | # Visualize by breaking down the different page conversions 193 | page_week_summary %>% 194 | gather(key = type, value = conversion, -week) %>% 195 | 196 | ggplot(aes(x = week, y = conversion, group = type)) + 197 | geom_line(aes(color = type)) 198 | ``` 199 | 200 | # Model Building 201 | 202 | Random forest would be a good choice for predicting conversion as it is robust to outliers and works well with correlated, continuous, and discrete variables. Random forest is also relatively easy to tune, and partial dependence plots can be used to extract insights from the model. 203 | 204 | A standard 66-33% split can be used to generate training and test datasets. If the training dataset is not large enough, cross-validation can be used to arrive at the optimum parameters. 205 | 206 | ```{r} 207 | 208 | # The response variable needs to be changed into a factor 209 | data <- page_data %>% 210 | mutate(converted = as.factor(ifelse(is.na(page4),0,1))) %>% 211 | select(user_id, device, sex, week_day, week, converted) 212 | 213 | # Training and test set split 214 | set.seed(2019) 215 | train_sample = sample(nrow(data), size = round(nrow(data)*0.66)) 216 | train_data = data[train_sample,] 217 | test_data = data[-train_sample,] 218 | 219 | # Build a random forest model with mostly default settings, except for the number of trees 220 | rf.fit <- randomForest(y=train_data$converted, x = train_data[, -c(6)], ytest = test_data$converted, xtest = test_data[, -c(6)], ntree = 50, mtry = 2, keep.forest = TRUE) 221 | 222 | rf.fit 223 | ``` 224 | 225 | The OOB and test error rates look good, but the model is useless because it classifies everything as class '0'. This happens because the data are hugely unbalanced. 226 | 227 | Now the strategy is to increase the weight of the minority class to push the model towards correctly predicting more class 1 events. The gain in class 1 accuracy comes at the expense of class 0, and the goal is to strike a balance between them.
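Before sweeping class weights below, it is worth noting an alternative that often works comparably well: balanced sampling within each tree via strata/sampsize. A sketch, assuming train_data and test_data as defined above:

```{r}
# Alternative to class weights: downsample the majority class within each tree
# (sketch; train_data / test_data as created above)
n_minority <- sum(train_data$converted == 1)

rf.balanced <- randomForest(y = train_data$converted, x = train_data[, -c(6)],
                            ytest = test_data$converted, xtest = test_data[, -c(6)],
                            ntree = 150, mtry = 2,
                            strata = train_data$converted,
                            sampsize = c(n_minority, n_minority),
                            keep.forest = TRUE)
rf.balanced
```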
228 | 229 | 230 | ```{r} 231 | class0_error <- c() 232 | class1_error <- c() 233 | overall_accuracy <- c() 234 | 235 | for (i in 1:9){ 236 | 237 | rf.fit <- randomForest(y=train_data$converted, x = train_data[, -c(6)], ytest = test_data$converted, xtest = test_data[, -c(6)], ntree = 150, classwt=c(i, 10-i), mtry = 2, keep.forest = TRUE) 238 | class0_error <- c(class0_error, rf.fit$test$confusion[1,3]) 239 | class1_error <- c(class1_error, rf.fit$test$confusion[2,3]) 240 | overall_accuracy <- c(overall_accuracy, (rf.fit$test$confusion[1,1]+rf.fit$test$confusion[2,2])/sum(rf.fit$test$confusion[,-3])) 241 | } 242 | 243 | errors_df <- data.frame(weights = paste0(1:9,"-",9:1), class0_error, class1_error, overall_accuracy) 244 | errors_df 245 | ``` 246 | 247 | The above table shows that the optimum class weight ratio is around 1:9. 248 | 249 | ```{r} 250 | rf.fit <- randomForest(y=train_data$converted, x = train_data[, -c(6)], ytest = test_data$converted, xtest = test_data[, -c(6)], ntree = 150, classwt=c(1,9), mtry = 2, keep.forest = TRUE) 251 | rf.fit 252 | # Visualize important variables 253 | varImpPlot(rf.fit) 254 | ``` 255 | 256 | Now, let's try to build the model without user_id. 257 | 258 | ```{r} 259 | rf.fit2 <- randomForest(y=train_data$converted, x = train_data[, -c(1,6)], ytest = test_data$converted, xtest = test_data[, -c(1,6)], ntree = 150, classwt=c(1,9), mtry = 2, keep.forest = TRUE) 260 | rf.fit2 261 | 262 | ``` 263 | 264 | There is a clear drop in overall test set accuracy when user_id is dropped. The optimum model should in fact be rf.fit. 265 | ```{r} 266 | # Visualize important variables 267 | varImpPlot(rf.fit) 268 | # Order variables by importance 269 | imp_vars <- importance(rf.fit) 270 | pdp_vars <- names(sort(imp_vars[,1], decreasing = T)) 271 | ``` 272 | 273 | After tweaking the class weights, the model is no longer predicting every event as the majority class. The improvement in class 1 error was accomplished without hurting class 0 and overall accuracy too much. The variable user_id dominates the other variables, which is why the model was also rebuilt without it above (rf.fit2) to compare the variable importance. 274 | 275 | # PDP Analysis 276 | 277 | ```{r} 278 | # Loop through variables and build PDPs in decreasing order of variable importance 279 | for (i in seq_along(pdp_vars)) { 280 | partialPlot(rf.fit, data[,-6], pdp_vars[i], xlab="", 281 | main=paste("Partial Dependence on", pdp_vars[i]), which.class=1) 282 | } 283 | 284 | ``` 285 | 286 | # Conclusion 287 | 288 | * The partial dependence plots reveal that the conversion rate is relatively higher during the first 8 weeks (January and February) of the year, goes down over the next 4 weeks, and then picks up in the later weeks. It is possible that there is some seasonality in user buying patterns. 289 | 290 | * The PD plots and descriptive charts reveal that the desktop version of the site is underperforming badly compared to the mobile version. More precisely, it is the search on the desktop version that is causing a greater proportion of users to leave without purchasing. One possible reason could be a bad UI/experience or the presence of bugs that put users off converting on desktop. The product team should immediately focus on identifying these issues with the desktop version.
291 | 292 | * Another possible reason for higher number of users and lower conversion on desktop version is that the mobile app didn't somehow become popular, so you just have very engaged users using mobile and they obviously convert at a higher rate. There are very low number of casual mobile visitors. Also it's possible that the site is spending a lot of money on ads on desktop. And these ads are attracting the wrong kind of people. 293 | 294 | * Tuesday appears to be slightly better than other days and introducing promotions could improve the overall conversion rate for the site. 295 | 296 | * The PD plots also reveal that after controlling for device and other important variables, males appear to be performing better than females and naturally marketing should focus on increasing the relative weight of this segment. 297 | -------------------------------------------------------------------------------- /Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-12-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-12-1.png -------------------------------------------------------------------------------- /Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-14-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-14-1.png -------------------------------------------------------------------------------- /Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-15-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-15-1.png -------------------------------------------------------------------------------- /Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-15-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-15-2.png -------------------------------------------------------------------------------- /Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-15-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-15-3.png -------------------------------------------------------------------------------- /Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-15-4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-15-4.png -------------------------------------------------------------------------------- /Funnel 
Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-15-5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-15-5.png -------------------------------------------------------------------------------- /Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-9-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Funnel Analysis/Funnel_analysis_files/figure-gfm/unnamed-chunk-9-1.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON City Similarities.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'JSON City Similarities' 3 | author: "Siddhartha Jetti" 4 | date: "6/30/2019" 5 | output: rmarkdown::github_document 6 | --- 7 | 8 | # Goal 9 | 10 | This is another challenge where your data is stored in a JSON file. Each row in this JSON stores info about all user searches within a session. 11 | 12 | Analyzing user behavior within the same session is often crucial. Clustering users based on their browsing behavior is probably the most important step if you want to personalize your site. 13 | 14 | The goal of this challenge is to build the foundation of personalization by identifying searches likely to happen together and cluster users based on their session searches. 15 | 16 | # Challenge Description 17 | 18 | Company XYZ is a Online Travel Agent site, such as Expedia, Booking.com, etc. 19 | 20 | They store their data in JSON files. Each row in the json shows all different cities which have been searched for by a user within the same session (as well as some other info about the user). That is, if I go to expedia and look for hotels in NY and SF within the same session, the corresponding JSON row will show my user id, some basic info about myself and the two cities. 21 | 22 | You are given the following tasks: 23 | 24 | 1) There was a bug in the code and one country didn’t get logged. It just shows up as an empty field (“”). Can you guess which country was that? How? 25 | 26 | 2) For each city, find the most likely city to be also searched for within the same session. 27 | 28 | 3) Travel sites are browsed by two kinds of users. Users who are actually planning a trip and users who just dream about a vacation. The first ones have obviously a much higher purchasing intent. 29 | Users planning a trip often search for cities close to each other, while users who search for cities far away from each other are often just dreaming about a vacation. That is, a user searching for LA, SF and Las Vegas in the same session is much more likely to book a hotel than a user searching for NY, Paris, Kuala Lumpur (makes sense, right?). 30 | Based on this idea, come up with an algorithm that clusters sessions into two groups: high intent and low intent. Explain all assumptions you make along the way. 31 | 32 | 33 | # Data 34 | 35 | The file is 36 | 37 | city_searches - a list of searches happening within the same session 38 | 39 | ## Fields: 40 | 41 | * session_id : session id. 
Unique by row 42 | * unix_timestamp : unixtime stamp of when the session started 43 | * cities : the unique cities which were searched for within the same session by a user 44 | * user : it is has the following nested fields: 45 | * user_id: the id of the user 46 | * joining_date: when the user created the account 47 | * country: where the user is based 48 | 49 | 50 | # Problem Setup 51 | 52 | ```{r} 53 | # Load required libraries 54 | library(tidyverse) 55 | library(jsonlite) 56 | library(lubridate) 57 | library(ggplot2) 58 | 59 | # Read in the input data into a dataframe 60 | data_json <- fromJSON("city_searches.json") 61 | ``` 62 | 63 | 64 | # Question 1: 65 | 66 | Transforming the JSON data into a data frame 67 | ```{r} 68 | session_id <- c() 69 | unix_timestamp <- c() 70 | cities <- c() 71 | user_id <- c() 72 | joining_date <- c() 73 | country <- c() 74 | 75 | # Converting the data in the right format 76 | for(i in 1:length(data_json$user)){ 77 | session_id <- c(session_id, data_json$session_id[[i]]) 78 | unix_timestamp <- c(unix_timestamp, data_json$unix_timestamp[[i]]) 79 | cities <- c(cities, data_json$cities[[i]]) 80 | user_id <- c(user_id, data_json$user[[i]][[1]][[1]]) 81 | joining_date <- c(joining_date, data_json$user[[i]][[1]][[2]]) 82 | country <- c(country, data_json$user[[i]][[1]][[3]]) 83 | } 84 | 85 | # Convert to a data frame 86 | data <- data.frame(session_id = session_id, unix_timestamp = unix_timestamp, cities = cities, 87 | user_id = user_id, joining_date = joining_date, country = country, stringsAsFactors = F) %>% 88 | mutate(country = ifelse(country == "", "Missing", country)) 89 | 90 | # Check data types of each of the columns 91 | summary(data) 92 | 93 | # check if any duplicate session id exist 94 | length(data$session_id) == length(unique(data$session_id)) 95 | 96 | # Check if any missing values exist 97 | colSums(is.na(data) | data == "") 98 | 99 | ``` 100 | There are no missing values in input data. Also, session id appears to be unique. 101 | 102 | ```{r} 103 | # Obtain Time and Hour of day from the time stamp 104 | data <- data %>% 105 | mutate(time = as.POSIXct(unix_timestamp, origin = "1970-01-01"), hour = hour(time)) 106 | 107 | head(data) 108 | ``` 109 | 110 | Visualizing the data. Ploting the number of searches by hour of day for each of the user countries. 111 | ```{r} 112 | 113 | countries <- unique(data$country) 114 | 115 | for(i in countries){ 116 | data_country <- data %>% 117 | filter(country == i) %>% 118 | group_by(hour) %>% 119 | summarise(sessions = n()) 120 | 121 | plot <- ggplot(data = data_country, aes(x = hour, y = sessions)) + 122 | geom_bar(stat = "identity") + 123 | ggtitle(paste("Sessions by hour of day in", i, sep=' ')) 124 | 125 | print(plot) 126 | } 127 | 128 | ``` 129 | 130 | 131 | It looks like Unix time stamp is based on one of the time zones in US. 132 | 133 | From the sessions Vs Hour of day histogram in US, it is clear that peak traffic is between 10 AM and 2PM. 134 | By assuming similar distribution of sessions by hour of day in the missing country, The session Vs hour histogram for Missing country reveals that local time differs by about 11-12 hrs from US. This hints that the missing country could be in Asia and most likely India or China. 135 | 136 | # Question 2: 137 | 138 | Each city can be imagined as a point in the n-dimensional space spun by user sessions. Each coordinate of the point(n-dimensional) would be the number of searches of the city in the session corresponding to the cordinate. 
The goal her is to build city similarity matrix and extract the most similar city to each of the city searched. The most similar cities are more likely to be searched together in a session than ones that are not. 139 | 140 | ```{r} 141 | 142 | # Find the maximum number of cities in a given session 143 | # This is done by counting the occurences of "," + 1 144 | max_cities <- max(str_count(data$cities, ",")) + 1 145 | 146 | user_city_matrix <- data %>% 147 | separate(col = cities, into = paste0("city", 1:max_cities), sep = ", ") %>% 148 | select(-user_id, -joining_date, -country, -time,-hour, -unix_timestamp) %>% 149 | gather(key = "value", value = "cities", -session_id) %>% 150 | filter(!is.na(cities)) %>% 151 | group_by(session_id, cities) %>% 152 | summarise(nsearches = n()) %>% 153 | ungroup() %>% 154 | spread(cities, nsearches) %>% 155 | mutate_all(funs(replace_na(., 0))) 156 | 157 | # n-dimensional space 158 | dim(user_city_matrix) 159 | 160 | # Take a peek at data 161 | head(user_city_matrix) 162 | ``` 163 | Each city is a point in 20022 dimensions and each cordinate is number of searches on that city in that session. 164 | Cosine similarity is used to compute similarity between two cities. Most similar cities have cosine similarity close to 1 and least similar have similarity close to 0. 165 | 166 | ```{r} 167 | 168 | user_city_matrix <- user_city_matrix %>% 169 | select(-session_id) 170 | 171 | unique_cities <- colnames(user_city_matrix) 172 | 173 | # Define a function to compute the cosine similarity between two cities 174 | cosine_similarity <- function(x, y) { 175 | sum(x * y) / (sqrt(sum(x * x)) * sqrt(sum(y * y))) 176 | } 177 | 178 | # Define a place holder to hold similarity between each pair of cities 179 | # similarity between a city and itself is 1 180 | city_similarity <- diag(1, nrow = ncol(user_city_matrix), ncol = ncol(user_city_matrix)) 181 | rownames(city_similarity) <- unique_cities 182 | colnames(city_similarity) <- unique_cities 183 | 184 | ncity <- ncol(user_city_matrix) 185 | ``` 186 | 187 | Now, compute the pair-wise city smilarities and populate the city similarity matrix. 188 | ```{r} 189 | # Generate city similarity matrix 190 | # Loop through the columns 191 | for(i in 1:ncity) { 192 | # Loop through the columns for each column 193 | for(j in 1:ncity) { 194 | # Fill in placeholder with cosine similarities 195 | city_similarity[i, j] <- cosine_similarity(user_city_matrix[i], user_city_matrix[j]) 196 | } 197 | } 198 | 199 | # Take a peek at city 200 | head(city_similarity[, 1:10]) 201 | ``` 202 | 203 | Most likely city to be searched along with a given city is the city that has the highest similarity score after itself. 204 | 205 | ```{r} 206 | likely_searches <- data.frame(City = unique_cities, stringsAsFactors = FALSE) 207 | 208 | # We are interested in the most similar city after the city itself. 209 | for(i in 1:length(unique_cities)){ 210 | cities_sorted_similarity <- names(sort(city_similarity[unique_cities[i],], decreasing = TRUE)) 211 | similarity <- sort(city_similarity[unique_cities[i],], decreasing = TRUE) 212 | city <- cities_sorted_similarity[cities_sorted_similarity != unique_cities[i]][1] 213 | likely_searches$Most_Similar[i] <- city 214 | likely_searches$Similarity_score[i] <- city_similarity[unique_cities[i], city] 215 | } 216 | 217 | head(likely_searches) 218 | ``` 219 | 220 | # Question 3: 221 | 222 | The goal is to classify multi-city search sessions into high and low intent based on the distance between searched cities. 
The straight forward way to accomplish this is by finding the geographic distance between each pair of cities and then classify session based on the obtained distance between the cities. Due to lack of data on the geographic distance between cities, An in-direct method should be employed. 223 | 224 | The cosine similarity between a pair of cities, each represented by a vector in n-dimensional user session space, tends to be higher for the pair of cities that are often searched together. Conversely, the cities that are not searched together would have lower cosine similarity. 225 | 226 | If we assume that users of online travel site have a reasonable intent to travel then lower cosine similarity between the pair of cities can be viewed as a proxy for higher distance between them. The similarity score for multi-city search sessions can be calculated as the average of cosine similarities between each pair of cities. 227 | 228 | Now let us test the assumption using few examples. 229 | ```{r} 230 | # Top 5 least similar cities with San Jose CA 231 | names(sort(city_similarity["San Jose CA", ]))[1:5] 232 | 233 | # Top 5 least similar cities with Miami FL 234 | names(sort(city_similarity["Miami FL", ]))[1:5] 235 | 236 | # Top 5 least similar cities with New York NY 237 | names(sort(city_similarity["New York NY", ]))[1:5] 238 | ``` 239 | 240 | Clearly, the least similar cities are ones from a far away state and in some cases cities from a different coast. So, Similarity can be used as a proxy for distance with an inverse relationship. 241 | 242 | Now, Computing similarity score for the multi-city sessions. 243 | ```{r} 244 | 245 | # Define a function to compute the similarity score for the session 246 | session_similarity <- function(cities){ 247 | # Get all the cities searched 248 | searched_cities <- strsplit(cities, split = ", ")[[1]] 249 | # if only one city is searched then similarity is assigned 0 250 | if(length(searched_cities) > 1){ 251 | city_pairs <- t(combn(searched_cities, 2)) 252 | similarity <- mean(city_similarity[city_pairs]) 253 | } else { similarity <- NA } 254 | } 255 | 256 | # Loop through all the sessions and assign session similarity 257 | for(i in 1:nrow(data)){ 258 | data$session_similarity_score[i] <- session_similarity(data$cities[i]) 259 | } 260 | 261 | # distribution of session similarity score among sessions with more than one city being searched 262 | quantile(data$session_similarity_score[!is.na(data$session_similarity_score)], probs = seq(0, 1, by = 0.05)) 263 | 264 | data %>% 265 | filter(!is.na(session_similarity_score)) %>% 266 | ggplot()+ 267 | geom_histogram(bins = 50, aes(x = session_similarity_score, y = ..density..))+ 268 | geom_density(aes(x = session_similarity_score, y = ..density..)) 269 | ``` 270 | 271 | From the similarity quantiles and distribution, Looks like 0.06 is a reasonable cuttoff for session similarity that classifies 25% of multi-city sessions as low intent and 75% of them as high intent. 272 | 273 | ```{r} 274 | data <- data %>% 275 | filter(!is.na(session_similarity_score)) %>% 276 | mutate(Booking_Intent = ifelse(session_similarity_score > 0.06, "High Intent", "Low Intent")) 277 | 278 | table(data$Booking_Intent) 279 | ``` 280 | 281 | For sessions with one city search, Unfortunately The data provided is not sufficient to classify them into high or low intent. However, If variables like time spent on site or clickstream behavior are provided it would be possible to classify them. 
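Instead of hand-picking the 0.06 cutoff, the two intent groups could also be found directly with a simple one-dimensional k-means on the session similarity scores (a sketch, reusing the filtered `data` frame above); the split point it implies may land near, but not exactly at, 0.06.

```{r}
# Let k-means pick the split between low- and high-similarity sessions
# (sketch; `data` is already filtered to multi-city sessions above)
set.seed(2019)
km <- kmeans(data$session_similarity_score, centers = 2, nstart = 25)

# Label the cluster with the higher mean similarity score as "High Intent"
high_cluster <- which.max(km$centers)
data$Booking_Intent_km <- ifelse(km$cluster == high_cluster, "High Intent", "Low Intent")

# Compare with the quantile-based labels
table(data$Booking_Intent, data$Booking_Intent_km)
```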
-------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-1.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-2.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-3.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-4.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-5.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-6.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-7.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarities_files/figure-gfm/unnamed-chunk-4-7.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-1.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-2.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-3.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-4.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-5.png -------------------------------------------------------------------------------- /JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-6.png -------------------------------------------------------------------------------- /JSON City 
Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/JSON City Similarities/JSON_City_Similarity_files/figure-gfm/unnamed-chunk-4-7.png -------------------------------------------------------------------------------- /Loan granting/Loan granting.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Loan granting' 3 | author: "Siddhartha Jetti" 4 | date: "6/6/2019" 5 | output: rmarkdown::github_document 6 | --- 7 | 8 | # Goal 9 | 10 | Another area where data science and machine learning play a huge role is in choosing if granting a loan. This is a particularly hot field as many start-ups feel that bank loan models can be improved. Therefore, there is space to come up with better loaning strategies that can benefit both the lender and the borrower. 11 | 12 | In this challenge, you will have access to loan data from a bank and will have to improve their model. 13 | 14 | # Challenge Description 15 | 16 | We have access to a specific bank loan data. We have data about all loans asked to the bank, whether the bank decided to grant it and, finally, whether the borrower managed to repay it. We also have info about the person asking for the loan at the moment she is asking for the loan. 17 | 18 | You have to come up with a better strategy to grant loans. Specifically you should: 19 | 20 | Build a model which is better than the bank model. Assume that: 21 | If you grant the loan and the it doesn’t get repaid, you lose 1. 22 | If you grant the loan and the it does get repaid, you gain 1 23 | If you don’t grant the loan, you gain 0. 24 | 25 | Using the rules above, compare bank profitability vs your model profitability. 26 | 27 | * Describe the impact of the most important variables on the prediction. Also, focus on the variable “is_employed”, which describes whether the borrower is employed when she asks for the loan. How does this variable impact the model? Explain why. 28 | 29 | * Are there any other variables you’d like to include in the model? 30 | 31 | # Data 32 | 33 | We have 2 table downloadable by clicking here. 34 | 35 | The 2 tables are: 36 | 37 | loan_table - general information about the loan 38 | 39 | ### Columns: 40 | 41 | * loan_id : the id of the loan. Unique by loan. Can be joined to loan id in the other table 42 | * loan_purpose : the reason for asking the loan: investment, other, business, emergency_funds, home 43 | * date : when the loan was asked 44 | * loan_granted : whether the loan was granted 45 | * loan_repaid : whether the loan was repaid. NA means that the loan was not granted 46 | 47 | borrower_table - information about the borrower 48 | 49 | ### Columns: 50 | 51 | * loan_id : the id of the the loan. Unique by loan. Can be joined to loan id in the other table 52 | * is_first_loan : did she ask for any other loans in her lifetime? 53 | * fully_repaid_previous_loans : did she pay on time all of her previous loans? If this is the first loan, it is NA 54 | * currently_repaying_other_loans : is she currently repaying any other loans? If this is the first loan, it is NA 55 | * total_credit_card_limit : total credit card monthly limit 56 | * avg_percentage_credit_card_limit_used_last_year : on an average, how much did she use of her credit card limit in the previous 12 months. 
This number can be >1 since it is possible to go above the credit card limit 57 | * saving_amount : total saving amount balance when she asked for the loan 58 | * checking_amount : total checking amount balance when she asked for the loan 59 | * is_employed : whether she is employed (1) or not (0) 60 | * yearly_salary : how much she earned in the previous year 61 | * age : her age 62 | * dependent_number : number of people she claims as dependent 63 | 64 | 65 | # Problem Setup 66 | 67 | ```{r} 68 | # Load required libraries 69 | library(dplyr) 70 | library(ggplot2) 71 | library(randomForest) 72 | library(ROCR) 73 | 74 | # Read in the input data into a dataframe 75 | borrower <- read.csv("borrower_table.csv") 76 | loan <- read.csv("loan_table.csv") 77 | ``` 78 | 79 | 80 | # Question 1: 81 | 82 | Understanding and transforming the data 83 | ```{r} 84 | # Joining both loan and borrower tables. Transform variables to right format 85 | data <- loan %>% 86 | inner_join(borrower, by = "loan_id") 87 | 88 | # Take a peak at the data 89 | str(data) 90 | summary(data) 91 | ``` 92 | 93 | Check for missing values in the data. 94 | 95 | ```{r} 96 | # count of missing values by column in the merged data 97 | colSums(is.na(data)) 98 | 99 | # is there a pattern in missing values? 100 | data %>% 101 | filter(is.na(loan_repaid)) %>% 102 | select(loan_granted) %>% 103 | distinct() 104 | 105 | data %>% 106 | filter(is.na(fully_repaid_previous_loans) | is.na(currently_repaying_other_loans)) %>% 107 | select(is_first_loan) %>% 108 | distinct() 109 | 110 | data %>% 111 | filter(is.na(avg_percentage_credit_card_limit_used_last_year)) %>% 112 | select(total_credit_card_limit) %>% 113 | distinct() 114 | ``` 115 | 116 | Clearly, There is a pattern in missing values and are not occuring at random. 117 | 118 | Missing values exist in 'fully_repaid_previous_loans' and 'currently_repaying_other_loans' variables only for the customers who are applying for the first loan. The variable 'loan_repaid' is missing only for customers who are not granted a loan. Also, 'avg_percentage_credit_card_limit_used_last_year' is missing only for customers whose credit limit is zero. All this makes sense. 119 | 120 | Imputation for missing values is required. 121 | ```{r} 122 | data %>% 123 | select(loan_repaid, fully_repaid_previous_loans, currently_repaying_other_loans, avg_percentage_credit_card_limit_used_last_year) %>% 124 | head() 125 | ``` 126 | 127 | Three of the above variables are discrete and one is continuous. 128 | Due to the non-random nature of missing values, new category should be created for missing values in categorical variables and a very high number is imputed for missing values in continuous variables. 129 | 130 | ```{r} 131 | data <- data %>% 132 | mutate(fully_repaid_previous_loans = as.factor(ifelse(is.na(fully_repaid_previous_loans), -1, fully_repaid_previous_loans)), 133 | currently_repaying_other_loans = as.factor(ifelse(is.na(currently_repaying_other_loans), -1, currently_repaying_other_loans)), 134 | avg_percentage_credit_card_limit_used_last_year = ifelse(is.na(avg_percentage_credit_card_limit_used_last_year), 9999, avg_percentage_credit_card_limit_used_last_year), 135 | is_employed = as.factor(is_employed), 136 | is_first_loan = as.factor(is_first_loan)) 137 | ``` 138 | 139 | Estimating the bank profitability based on existing lending policy. 
140 | 141 | ```{r} 142 | data %>% 143 | transmute(profitability = ifelse(loan_granted == 0, 0, loan_repaid*(loan_granted + 1) - 1)) %>% 144 | summarize(profit = sum(as.numeric(profitability))) 145 | 146 | ``` 147 | 148 | The bank profit based on the gain/loss rules is 13758. 149 | 150 | ```{r} 151 | summary(data) 152 | ``` 153 | 154 | After Imputation, no missing values exist in the dataset other than loan_repaid variable. 155 | 156 | Run descriptive stats on the input data. Clearly, variables if employed, saving_amount and salary appear to matter to determine the probability of repayment of the loan. 157 | 158 | Check if_employed vs loan_repaid 159 | ```{r} 160 | 161 | is_employed_data <- data %>% 162 | filter(loan_granted == 1) %>% 163 | mutate(loan_repaid = as.factor(loan_repaid)) %>% 164 | group_by(is_employed,loan_repaid) %>% 165 | summarize(counts = n()) 166 | 167 | ggplot(data = is_employed_data, aes(x = is_employed, y = counts, fill = loan_repaid))+ 168 | geom_bar(stat = "identity") 169 | 170 | ``` 171 | 172 | Clearly, Among the loans taken by employed majority of them are repaid, which makes sense. 173 | Check how salary effects loan repaid variable. 174 | ```{r} 175 | employed_data <- data %>% 176 | group_by(loan_repaid) %>% 177 | summarise(mean_yearly_salary = mean(yearly_salary), 178 | mean_savings = mean(saving_amount)) 179 | 180 | 181 | ggplot(data = employed_data, aes(x = loan_repaid, y = mean_yearly_salary))+ 182 | geom_bar(stat = "identity") 183 | 184 | ggplot(data = employed_data, aes(x = loan_repaid, y = mean_savings))+ 185 | geom_bar(stat = "identity") 186 | ``` 187 | 188 | The above plots reveal that loans that are repaid, reported on an average higher savings balance and higher yearly salary than the loans that are not repaid. 189 | 190 | # Model Building 191 | 192 | The gain or loss rules are as follows 193 | * If loan is granted and does not get repaid, loss 1. 194 | * If loan is granted and gets repaid, gain 1. 195 | * If loan is not granted, gain 0. 196 | 197 | Lets build a model to estimate probability of repayment of loan and eventually predict if loan gets repaid or not. Random forest would be a good candidate as it is strong with outliers and works well with correlated, continuous and discrete variables. Also, random forest is easier to optimize parameters.Partial dependence plots can be used to capture insights from the model. 198 | 199 | Define the class 0 as loan not getting repaid and class 1 as loan getting repaid. The goal is to build random forest model and find optimal cuttoff probability that maximizes profit based on the above rules. The loan will be granted only if estimated probability of repayment is more than chosen cuttoff else the loan gets denied. 200 | 201 | * False positive, FP results in -1 202 | * True positive, TP results in +1 203 | * False negative, FN results in 0 204 | * True negative, TN results in 0 205 | 206 | The model should be built using only the rows where loans are granted as we know if loan is eventually repaid or not. The rows where loan not granted are excluded from the training sample. A standard 66-33% split can be used to generate training and test datasets. 
If training dataset is not large enough then cross validation can be used to arrive at the optimum parameters 207 | 208 | ```{r} 209 | 210 | # save loans that are denied by bank for future use 211 | loans_denied <- data %>% 212 | filter(loan_granted == 0) 213 | 214 | # Training and test set split 215 | data <- data %>% 216 | filter(loan_granted == 1) %>% 217 | select(-loan_id, -date, -loan_granted) %>% 218 | mutate(loan_repaid = as.factor(loan_repaid)) 219 | 220 | set.seed(2019) 221 | train_sample = sample(nrow(data), size = round(nrow(data)*0.66)) 222 | train_data = data[train_sample,] 223 | test_data = data[-train_sample,] 224 | 225 | # Build Random forest model on the data with mostly default settings except for class weight and #trees 226 | rf.fit <- randomForest(y = train_data$loan_repaid, x = train_data[,-c(2)], ytest = test_data$loan_repaid, 227 | xtest = test_data[,-c(2)], ntree = 250, mtry = 4, keep.forest = TRUE) 228 | 229 | rf.fit 230 | # Visualize Important variables 231 | varImpPlot(rf.fit) 232 | ``` 233 | 234 | From the variable importance plot, Variables 'saving_amount' and 'checking_amount' have too much importance compared with other variables. This is happening because those variables are acting as a proxy for other variables and have in them part of the information from other variables. Lets rebuild the model by removing these to see how it changes the variable importance. 235 | 236 | ```{r} 237 | # Build Random forest model on the data with mostly default settings except for class weight and #trees 238 | # without the top variables 239 | rf.fit2 <- randomForest(y = train_data$loan_repaid, x = train_data[,-c(2,8,9)], ytest = test_data$loan_repaid, 240 | xtest = test_data[,-c(2,8,9)], ntree = 250, keep.forest = TRUE) 241 | 242 | rf.fit2 243 | # Visualize Important variables 244 | varImpPlot(rf.fit2) 245 | ``` 246 | 247 | 248 | Overall error rate, class 0 error rate and class 1 error rate look decent. The model can be used for insights. Also, training and test error are close indicating that model is not overfitting to data. 249 | 250 | ```{r} 251 | # Order variables by importance 252 | imp_vars <- importance(rf.fit) 253 | pdp_vars <- names(sort(imp_vars[,1], decreasing = T)) 254 | 255 | # Loop through variables and build PDP in the decreasing order of variable importance 256 | for (i in seq_along(pdp_vars)) { 257 | partialPlot(rf.fit, data[,-2], pdp_vars[i], xlab="", 258 | main=paste("Partial Dependence on", pdp_vars[i]), which.class=1) 259 | } 260 | 261 | ``` 262 | 263 | # Question 2 264 | 265 | Now finding the cuttoff probability that maximizes the profit. 266 | ```{r} 267 | #Compare predictions (votes) vs actual results for the test set 268 | pred = prediction(rf.fit$test$votes[,2], test_data$loan_repaid) 269 | 270 | error_cutoff = data.frame (pred@cutoffs, 271 | pred@tn, 272 | pred@fn, 273 | pred@fp, 274 | pred@tp, 275 | row.names = NULL) 276 | colnames(error_cutoff) = c("cutoff", "tn", "fn", "fp", "tp") 277 | 278 | error_cutoff <- error_cutoff %>% 279 | mutate(profit = tp-fp) %>% 280 | arrange(desc(profit)) 281 | 282 | error_cutoff[1,] 283 | ``` 284 | 285 | The cuttoff probability of 0.52 results in maximum profit based on the gain/loss rules. 286 | Now use the model "rf.fit" and the above cuttoff to score the all the loans where bank originally granted loans and compare the profit with bank's profit. 
287 | 288 | ```{r} 289 | repayment_prob <- predict(rf.fit, data[,-2], type = "prob") 290 | loan_status <- ifelse(repayment_prob[,2] > 0.52, 'Grant', 'Deny') 291 | 292 | # Number of loans granted or denied 293 | table(loan_status) 294 | 295 | # Among the granted ones, how many would be repaid or not repaid 296 | repay_status <- data$loan_repaid[which(loan_status == 'Grant')] 297 | status <- table(repay_status) 298 | 299 | # Profit with the new model 300 | status[2] - status[1] 301 | ``` 302 | 303 | Among the 47654 loans granted by the bank, the new model would grant only 30402 loans and deny 17252. 304 | Out of the 30402 granted loans, 29899 would be repaid and 503 would not be repaid. This results in a total gain of 29396, much higher than the bank's profit of 13758. 305 | 306 | The new model reveals that to maximize profit, the bank should grant a loan only if the predicted repayment probability is more than 52%. 307 | 308 | Now use the new model to score the loans that were denied by the bank. 309 | ```{r} 310 | loans_denied_repayment <- predict(rf.fit, loans_denied[,-c(1,3,4,5)], type = "prob") 311 | loan_status <- ifelse(loans_denied_repayment[,2] > 0.52, 'Grant', 'Deny') 312 | 313 | table(loan_status) 314 | ``` 315 | The fact that the bank denied these loans suggests that they are bad loans with a low probability of repayment. Not surprisingly, the new model would deny 65% of these loans and grant only 35% of them. 316 | 317 | # Question 3 318 | 319 | The following conclusions can be drawn from the variable importance plot and partial dependence plots of the full model "rf.fit". 320 | 321 | * The variables savings amount and checking amount appear to dominate the other variables in terms of variable importance. It is likely that these variables are correlated with, and act as a proxy for, other variables, carrying part of their information as well. 322 | 323 | * The variable "is_employed", which appeared to be an important variable in the descriptive stats, has very low importance in the combined model. This is counterintuitive because people who are employed have steady cash flow and are expected to have a higher chance of repayment. It can be explained by the fact that is_employed is highly correlated with savings/checking amount and perfectly correlated with yearly salary. Also, banks offer higher credit limits to employed people. Generally, people who are employed tend to have a higher saving/checking balance, yearly salary and credit limit than people who are not. After controlling for the saving/checking amount, yearly salary and credit limit variables, there is not much information left to be extracted from is_employed, and its importance appears low. The following plots provide evidence in favor of the above hypothesis.
324 | 325 | ```{r} 326 | employed_data <- data %>% 327 | group_by(is_employed) %>% 328 | summarise(mean_yearly_salary = mean(yearly_salary), 329 | mean_savings = mean(saving_amount), 330 | mean_checking = mean(checking_amount), 331 | mean_credit_limit = mean(total_credit_card_limit)) 332 | employed_data 333 | 334 | # Relation between is_employed and yearly salary 335 | ggplot(data = employed_data, aes(x = is_employed, y = mean_yearly_salary))+ 336 | geom_bar(stat = "identity") 337 | 338 | # Relation between is_employed and mean savings amount 339 | ggplot(data = employed_data, aes(x = is_employed, y = mean_savings))+ 340 | geom_bar(stat = "identity") 341 | 342 | # Relation between is_employed and mean checking 343 | ggplot(data = employed_data, aes(x = is_employed, y = mean_checking))+ 344 | geom_bar(stat = "identity") 345 | 346 | # Relation between is_employed and mean credit limit 347 | ggplot(data = employed_data, aes(x = is_employed, y = mean_credit_limit))+ 348 | geom_bar(stat = "identity") 349 | 350 | ``` 351 | 352 | * People with higher saving/checking amount, higher credit limit and higher salary perform better in terms of loan repayment. 353 | 354 | * People who are currently repaying other loans appear to have lower chance of repayment than people who are not currently repaying as they have higher financial burden. 355 | 356 | * The chance of repayment appear to be highest in middle aged people roughly around 40 yrs and appears to drop after. 357 | 358 | * Loans for emergency funds appear to be less likely to be repaid than loans taken for business or investment purpose. 359 | 360 | * People with more number of dependents are less likely to repay than people with fewer dependents. 361 | 362 | * People who repaid loans fully in the past appear to have higher chances of repaying the loans again. 363 | 364 | * After controlling for all the other variables, is first loan variable does not appear to matter much. 365 | 366 | 367 | # Question 4 368 | 369 | Suprisingly, the data provided for this exercise lacks a critical variable like loan amount (amount they are seeking through the loan). If loan amount is very high then it could make even a credit worthy person to look bad and get denied on the loan. I would include this variable in the model. 370 | 371 | Another variable that is not provided is the total existing debt (all kinds) each person has. This variable is important because higher debt would cause higher financial burden which could affect the chance of repayment. 
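To make the last point concrete, the sketch below shows how such variables could enter the model. Note that `loan_amount` and `total_existing_debt` do not exist in the provided tables; they are hypothetical columns, simulated here purely for illustration.

```{r}
# Hypothetical illustration only: loan_amount and total_existing_debt are NOT
# in the provided data, so a tiny toy data frame is simulated to show how
# debt-burden ratios could be derived and appended to the model matrix.
toy <- data.frame(yearly_salary       = c(30000, 85000, 0),
                  saving_amount       = c(2000, 15000, 500),
                  loan_amount         = c(15000, 20000, 5000),   # hypothetical column
                  total_existing_debt = c(5000, 60000, 2000))    # hypothetical column

toy %>%
  mutate(debt_to_income = total_existing_debt / pmax(yearly_salary, 1),
         loan_to_income = loan_amount / pmax(yearly_salary, 1),
         loan_to_liquid = loan_amount / pmax(saving_amount, 1))
```

If the bank could supply these fields, the derived ratios would simply be added to the feature matrix passed to randomForest alongside the existing borrower variables.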
-------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-11-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-11-1.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-1.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-10.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-11.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-11.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-12.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-12.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-2.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-3.png -------------------------------------------------------------------------------- /Loan 
granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-4.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-5.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-6.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-7.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-8.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-12-9.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-16-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-16-1.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-16-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-16-2.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-16-3.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-16-3.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-16-4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-16-4.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-8-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-8-1.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-9-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-9-1.png -------------------------------------------------------------------------------- /Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-9-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Loan granting/Loan_granting_files/figure-gfm/unnamed-chunk-9-2.png -------------------------------------------------------------------------------- /Online Video Challenge/Online Video Challenge.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Online Video Challenge' 3 | author: "Siddhartha Jetti" 4 | date: "7/4/2019" 5 | output: rmarkdown::github_document 6 | --- 7 |
8 | # Goal 9 |
10 | The company of this challenge allows users to upload videos online, just like YouTube. 11 |
12 | This company is interested in knowing whether a video is “hot” (i.e. trending up in terms of popularity), stable or going down. Understanding this would allow them to optimize the videos promoted on the home-page and, therefore, maximize ads revenue. 13 |
14 | # Challenge Description 15 |
16 | Company XYZ is an online video streaming company, just like YouTube or Dailymotion. 17 |
18 | The Head of Product has identified a very high home page drop-off rate as a major problem for the site. That is, users come to the home-page and then leave the site without taking any action or watching any video. 19 | Since customer acquisition costs are very high, this is a huge problem: the company is spending a lot of money to acquire users who don’t generate any revenue by clicking on ads. 20 |
21 | Currently, the videos shown on the home page to new users are manually chosen. The Head of Product had this idea of creating a new recommended video section on the home page. 22 |
23 | He asked you the following: 24 |
25 | 1) Classify each video into one of these 3 categories: 26 | “Hot” - means trending up. These videos are candidates to be shown.
27 | “Stable and Popular” - video view counts are flat, but very high. These videos are candidates to be shown too. 28 | “Everything else” - these videos won’t be shown. 29 | What are the main characteristics of the “hot videos”? 30 |
31 | 2) After having identified the characteristics of the hot videos, how would you use this information from a product standpoint? 32 |
33 | # Data 34 |
35 | We have 2 tables downloadable by clicking here. 36 |
37 | The 2 tables are: 38 |
39 | video_count - provides information about how many times each video was seen each day. 40 |
41 | ## Columns: 42 |
43 | * video_id : video id, unique by video and joinable to the video id in the other table 44 | * count : total count of views for each video 45 | * date : on which day that video was watched that many times 46 |
47 | video_features - characteristics of the video. 48 |
49 | ## Columns: 50 |
51 | * video_id : video id, unique by video and joinable to the video id in the other table 52 | * video_length : length of the video in seconds 53 | * video_language : language of the video, as selected by the user when she uploaded the video 54 | * video_upload_date : when the video was uploaded 55 | * video_quality : quality of the video. It can be [ 240p, 360p, 480p, 720p, 1080p] 56 |
57 | # Problem Setup 58 |
59 | ```{r} 60 | # Load required libraries 61 | library(dplyr) 62 | library(ggplot2) 63 | library(lubridate) 64 | library(randomForest) 65 | 66 | 67 | # Read in the input data into a dataframe 68 | views <- read.csv("video_count.csv", stringsAsFactors = F) 69 | features <- read.csv("video_features.csv", stringsAsFactors = F) 70 | ``` 71 |
72 | # Question 1: 73 |
74 | Explore the views dataset 75 | ```{r} 76 | # Transform views dataset 77 | views <- views %>% 78 | mutate(date = as.Date(date)) %>% 79 | arrange(video_id, date) 80 | 81 | # Check data types of each of the columns 82 | summary(views) 83 | 84 | # Take a peek at data 85 | head(views) 86 | ``` 87 |
88 | Explore the features dataset 89 | ```{r} 90 | 91 | # Transform features dataset 92 | features <- features %>% 93 | mutate(video_upload_date = as.Date(video_upload_date)) %>% 94 | arrange(video_id) 95 | 96 | # Check data types of each of the columns 97 | summary(features) 98 | 99 | # Take a peek at data 100 | head(features) 101 | ``` 102 |
103 | Check for missing values in the views and features data. 104 | ```{r} 105 | # count of missing values by column in views dataset 106 | colSums(is.na(views)) 107 | # count of missing values by column in features dataset 108 | colSums(is.na(features)) 109 | ``` 110 | No missing values exist anywhere in the data. 111 |
112 | Check if count data exists for all the videos for all the dates. 113 | ```{r} 114 | table(views$date) 115 | ``` 116 |
117 | Clearly, the video counts are provided for 15 days starting from January 1st 2015. 118 |
119 | ```{r} 120 | length(unique(views$video_id)) 121 | ``` 122 |
123 | There are 2785 videos, and 15 days of view count data exist for every video. 124 |
125 | Here the goal is to classify videos into Hot, Stable and others based on whether videos are trending up or down. Extracting the trend from the time series of view counts is essential to accomplish this task. 126 |
127 | There are several ways to extract the trend. The simplest method is to fit a straight line that is closest to the view count data points for each video, in other words finding the line that minimizes the total sum of squared distances between the points and the line.
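As a toy illustration of this least-squares idea (made-up numbers, not the challenge data), the slope of the fitted line can be read directly from the `lm()` coefficients, and its sign tells us whether a series is trending up or down:

```{r}
# Toy illustration only: a made-up 15-day view-count series
toy_days   <- 1:15
toy_counts <- c(120, 135, 150, 170, 160, 185, 200, 220, 210, 240, 255, 270, 300, 310, 330)

# Least-sum-of-squares (ordinary least squares) fit
toy_fit <- lm(toy_counts ~ toy_days)

# The coefficient of toy_days is the slope of the trend line;
# a positive value indicates an upward trend, a negative value a downward trend
coef(toy_fit)[2]
```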
128 | 129 | Let's pick a few arbitrary videos and visualize the time series of views. 130 |
131 | ```{r} 132 | views %>% 133 | filter(video_id == 176) %>% 134 | 135 | ggplot(aes(date, count)) + 136 | geom_point() + 137 | geom_smooth(method = "lm", se = FALSE) 138 | ``` 139 |
140 | ```{r} 141 | views %>% 142 | filter(video_id == 499) %>% 143 | 144 | ggplot(aes(date, count)) + 145 | geom_point() + 146 | geom_smooth(method = "lm", se = FALSE) 147 | ``` 148 |
149 | For the selected video "176", the least-sum-of-squares line appears to be a good fit to the data. But for video "499", it looks like a poor fit. Since the goal here is not to predict future counts but only to describe the trend, the linear trendline approach can still be used. The slope of the trendline is the coefficient of the date variable in the regression equation. A positive slope indicates that views are trending up, and a negative slope indicates that the video is trending down in terms of views over time. 150 |
151 | Before finding the least-sum-of-squares fit for all the videos, let us normalize the variables so that visualizing the trendlines of various videos becomes easy. 152 |
153 | ```{r} 154 | # Function to normalize the variables 155 | normalize <- function(x) { 156 | return((x - min(x)) / (max(x) - min(x))) 157 | } 158 | 159 | # Function to extract slope and trend in the data 160 | slope_trendline <- function(y, x) { 161 | trendline <- lm(formula = y ~ x) 162 | return(trendline$coefficients[2]) 163 | } 164 | 165 | # Now group by video id, normalize variables and extract slope 166 | views_summary <- views %>% 167 | mutate(date = as.numeric(date), 168 | count_norm = normalize(count), 169 | date_norm = normalize(date)) %>% 170 | group_by(video_id) %>% 171 | summarise(avg_views_per_day = round(mean(count), digits = 0), 172 | slope = slope_trendline(count_norm, date_norm)) 173 | 174 | # Take a look at the data 175 | head(views_summary) 176 | ``` 177 |
178 | Here is the distribution of average counts per day 179 | ```{r} 180 | quantile(views_summary$avg_views_per_day, probs = seq(0, 1, by = 0.05)) 181 | views_summary %>% 182 | ggplot() + 183 | geom_histogram(bins = 30, aes(x = avg_views_per_day, y = ..density..)) + 184 | geom_density(aes(x = avg_views_per_day, y = ..density..)) 185 | ``` 186 |
187 | The distribution of slopes of trendlines 188 | ```{r} 189 | quantile(views_summary$slope, probs = seq(0, 1, by = 0.1)) 190 | views_summary %>% 191 | ggplot() + 192 | geom_histogram(bins = 30, aes(x = slope)) 193 | ``` 194 |
195 | Based on the above distributions, here are the rules that can be used to classify videos into Hot, Stable and others. 196 |
197 | * If slope >= 0.02 then the video is "Hot" 198 | * If -0.02 < slope < 0.02 and average views per day is more than 1,000,000 then the video is "Stable and Popular". 199 | * If none of the above then the video is "Everything else" 200 |
201 | ```{r} 202 | # Classifying videos based on the stated rules 203 | views_summary <- views_summary %>% 204 | mutate(category = case_when( 205 | slope >= 0.02 ~ "Hot", 206 | slope > -0.02 & slope < 0.02 & avg_views_per_day > 1000000 ~ "Stable and Popular", 207 | TRUE ~ "Everything else") 208 | ) 209 | 210 | # Frequency of different video categories 211 | table(views_summary$category) 212 | ``` 213 |
214 | # Question 2: 215 |
216 | Now, the goal is to build a model to predict if a video is "Hot" and understand the factors that affect it.
217 | 218 | Run descriptive stats on the video features dataset 219 | ```{r} 220 | # Merging the two datasets 221 | data <- features %>% 222 | left_join(views_summary, by = "video_id") %>% 223 | mutate(is_hot = as.factor(ifelse(category == "Hot", 1, 0)), 224 | days_after_upload = as.Date("2015-01-15") - video_upload_date, 225 | video_language = as.factor(video_language), 226 | video_quality = as.factor(video_quality), 227 | upload_weekday = as.factor(weekdays(video_upload_date)), 228 | upload_week = week(video_upload_date)) %>% 229 | select(-video_upload_date, -category, -slope) 230 | 231 | # Variable data types 232 | summary(data) 233 | ``` 234 |
235 | ```{r} 236 | data %>% 237 | group_by(is_hot) %>% 238 | summarise(avg_length = mean(video_length)) %>% 239 | ggplot() + 240 | geom_col(aes(x = is_hot, y = avg_length)) 241 | 242 | ggplot(data, aes(x = video_length, group = is_hot)) + 243 | geom_histogram(binwidth = 120, aes(fill = is_hot)) 244 | ``` 245 |
246 | The average length of Hot videos is much lower than that of other videos. The histogram of video length reveals that the proportion of Hot videos is much lower among long videos than among shorter ones. 247 |
248 | Average views per day by language 249 | ```{r} 250 | data %>% 251 | group_by(video_language) %>% 252 | summarise(avg_views_per_day_per_video = mean(avg_views_per_day)) %>% 253 | ggplot() + 254 | geom_col(aes(x = video_language, y = avg_views_per_day_per_video)) 255 | ``` 256 |
257 |
258 | Number of videos uploaded and proportion of hot videos by language 259 | ```{r} 260 | data %>% 261 | group_by(video_language, is_hot) %>% 262 | summarise(counts = n()) %>% 263 | mutate(freq = round(counts / sum(counts), digits = 2)) %>% 264 | ggplot(aes(x = video_language, y = counts, group = is_hot)) + 265 | geom_col(aes(fill = is_hot)) + 266 | geom_text(aes(label = paste0(freq*100, "%"))) 267 | ``` 268 |
269 | The bar chart reveals that there are far fewer German and French videos than videos in other languages. English and Chinese have the highest numbers of videos. 270 |
271 | Number of videos uploaded and proportion of hot videos by quality 272 | ```{r} 273 | data %>% 274 | group_by(video_quality, is_hot) %>% 275 | summarise(counts = n()) %>% 276 | mutate(freq = round(counts / sum(counts), digits = 2)) %>% 277 | ggplot(aes(x = video_quality, y = counts, group = is_hot)) + 278 | geom_col(aes(fill = is_hot)) + 279 | geom_text(aes(label = paste0(freq*100, "%"))) 280 | ``` 281 |
282 | Number of videos uploaded and proportion of hot videos by upload week day 283 | ```{r} 284 | data %>% 285 | group_by(upload_weekday, is_hot) %>% 286 | summarise(counts = n()) %>% 287 | mutate(freq = round(counts / sum(counts), digits = 2)) %>% 288 | ggplot(aes(x = upload_weekday, y = counts, group = is_hot)) + 289 | geom_col(aes(fill = is_hot)) + 290 | geom_text(aes(label = paste0(freq*100, "%"))) 291 | ``` 292 |
293 | ## Model Building: 294 |
295 | Random forest would be a good choice for predicting "Hot" videos as it is robust to outliers and works well with correlated, continuous and discrete variables. Also, random forest parameters are relatively easy to tune. Partial dependence plots can be used to capture insights from the model. 296 |
297 | A standard 66-34% split can be used to generate the training and test datasets.
If the training dataset is not large enough, then cross-validation can be used to arrive at the optimal parameters. 298 |
299 | ```{r} 300 | 301 | # Split the available data into training and test data sets 302 | set.seed(2019) 303 | train_sample = sample(nrow(data), size = round(nrow(data)*0.66)) 304 | train_data = data[train_sample,] 305 | test_data = data[-train_sample,] 306 | 307 | # Build Random forest model on the data with mostly default settings except for the number of trees and mtry 308 | rf.fit <- randomForest(y = train_data$is_hot, x = train_data[, -c(5,6)], ytest = test_data$is_hot, 309 | xtest = test_data[, -c(5,6)], ntree = 250, mtry = 3, keep.forest = TRUE) 310 | 311 | rf.fit 312 | 313 | # Visualize Important variables 314 | varImpPlot(rf.fit) 315 | ``` 316 |
317 | ```{r} 318 | # Order variables by importance 319 | imp_vars <- importance(rf.fit) 320 | pdp_vars <- names(sort(imp_vars[,1], decreasing = T)) 321 | 322 | # Loop through variables and build PDP in the decreasing order of variable importance 323 | for (i in seq_along(pdp_vars)) { 324 | partialPlot(rf.fit, data[,-c(5,6)], pdp_vars[i], xlab = "", 325 | main = paste("Partial Dependence on", pdp_vars[i]), which.class = 1) 326 | } 327 | 328 | ``` 329 |
330 | The variable importance plot reveals that video length is a major factor in predicting whether a video is trending or not, followed by the days-after-upload and video quality variables. The variable upload week is just a proxy for days after upload. 331 |
332 | The following are the characteristics of videos that are trending up: 333 |
334 | * Short video lengths. 335 |
336 | * Newer videos are more likely to be trending than older ones. This is expected: as videos age, they lose the novelty effect and become less effective at generating views. 337 |
338 | * Low-resolution videos (240p) have a slightly higher chance of trending on the platform than higher-resolution ones. This could be because lower-resolution videos are easier to stream in low-bandwidth situations, such as watching on mobile devices on buses and subways. 339 |
340 | * German and French videos are more likely to be trending than videos in other languages. This is not surprising, as there are so few German and French videos uploaded on the site, and the low supply in those languages could be resulting in a higher proportion of Hot videos. 341 |
342 | # Question 3: 343 |
344 | Let us dig deeper and explore the relationship between video length and views. 345 | ```{r} 346 | # Bin videos based on the video length 347 | data <- data %>% 348 | mutate(video_length_type = case_when(video_length > 0 & video_length <= 300 ~ "0-5 mins", 349 | video_length > 300 & video_length <= 600 ~ "05-10 mins", 350 | video_length > 600 & video_length <= 900 ~ "10-15 mins", 351 | TRUE ~ "15+ mins") 352 | ) 353 | 354 | data %>% 355 | group_by(video_length_type) %>% 356 | summarise(avg_views_per_day_per_video = mean(avg_views_per_day)) %>% 357 | ggplot() + 358 | geom_col(aes(x = video_length_type, y = avg_views_per_day_per_video)) 359 | ``` 360 |
361 | The above plot shows that long videos (15+ mins duration) generate a far lower number of views than short ones (less than 10 min duration); average views per day is lower by more than 35%. Also, the partial dependence plots reveal that short videos are more likely to be trending than longer ones. 362 |
363 | From the product standpoint, here are some recommendations: 364 |
365 | * The company should investigate why there are so few German and French videos uploaded.
This could be due to translation problems in the upload process or something else. 366 |
367 | * Clearly, it appears that lengthy videos generate far fewer views and are less likely to be trending than short ones. So, the company should encourage video creators to make short videos with engaging content. They could consider creating a separate "short video" section with a limit on the duration of the uploaded video, somewhere between 5-10 mins. The company should commission studies on what the optimum allowable video length should be for the short video section. 368 |
369 | * The company could also consider prompting users with a pop-up telling them that videos shorter than 10 mins are more likely to generate more views. If a user is uploading a really long video, the platform could even forbid the upload. 370 |
-------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-11-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-11-1.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-14-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-14-1.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-14-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-14-2.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-15-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-15-1.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-16-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video
Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-16-1.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-17-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-17-1.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-18-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-18-1.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-19-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-19-1.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-1.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-2.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-3.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-4.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-5.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-5.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-6.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-20-7.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-21-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-21-1.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-7-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-7-1.png -------------------------------------------------------------------------------- /Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-8-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/Online Video Challenge/Online_Video_Challenge_files/figure-gfm/unnamed-chunk-8-1.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # A-Collection-of-Data-Science-Take-Home-Challenges 2 | 3 | R based Solutions to the book "A Collection of Data Science Take-Home Challenges" by Giulio Palombo. 4 | -------------------------------------------------------------------------------- /Song Challenge/Song_Challenge.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Song Challenge' 3 | author: "Siddhartha Jetti" 4 | date: "8/4/2019" 5 | output: rmarkdown::github_document 6 | --- 7 | 8 | # Goal 9 | 10 | Company XYZ is a very early stage startup. They allow people to stream music from their mobile for free. Right now, they still only have songs from the Beatles in their music collection, but they are planning to expand soon. 
11 | 12 | They still have all their data in json files and they are interested in getting some basic info about their users as well as building a very preliminary song recommendation model in order to increase user engagement. 13 |
14 | # Challenge Description 15 |
16 | You are the fifth employee at company XYZ. The good news is that if the company becomes big, you will become very rich with the stocks. The bad news is that at such an early stage the data is usually very messy. All their data is stored in json format. 17 |
18 | The company CEO asked you very specific questions: 19 |
20 | * What are the top 3 and the bottom 3 states in terms of number of users? 21 |
22 | * What are the top 3 and the bottom 3 states in terms of user engagement? You can choose how to mathematically define user engagement. What the CEO cares about here is in which states users are using the product a lot/very little. 23 |
24 | * The CEO wants to send a gift to the first user who signed-up for each state. That is, the first user who signed-up from California, from Oregon, etc. Can you give him a list of those users? 25 |
26 | * Build a function that takes as an input any of the songs in the data and returns the most likely song to be listened to next. That is, if, for instance, a user is currently listening to “Eight Days A Week”, which song has the highest probability of being played right after it by the same user? This is going to be V1 of a song recommendation model. 27 |
28 | * How would you set up a test to check whether your model works well? 29 |
30 | # Data 31 |
32 | The json is: 33 | data - Each row represents a song that was listened to by a user. 34 |
35 | ## Fields: 36 |
37 | id : it is unique. 38 | user_id : user id who listened to a given song. 39 | user_state : where the user is based. 40 | user_sign_up_date : when the user signed-up. 41 | song_played : the song that was listened to. 42 | time_played : at which time the user started listening to the song (local time). 43 |
44 | # Problem Setup 45 |
46 | ```{r} 47 | # Load required libraries 48 | library(tidyverse) 49 | library(jsonlite) 50 | library(lubridate) 51 | 52 | # Read in the input data into a dataframe 53 | songs <- fromJSON("song.json") 54 | ``` 55 |
56 | # Data Exploration and checks 57 |
58 | Check data types of columns in the songs dataset 59 | ```{r} 60 | # Check data types of each of the columns 61 | str(songs) 62 | ``` 63 |
64 | ```{r} 65 | # take a peek at the data 66 | summary(songs) 67 | ``` 68 |
69 | Check for missing values in the data 70 | ```{r} 71 | # Check if any missing values exist 72 | colSums(is.na(songs)) 73 | ``` 74 |
75 | Check for duplicates in the data 76 | ```{r} 77 | # check if any duplicate id exist 78 | length(songs$id) == length(unique(songs$id)) 79 | ``` 80 |
81 | ```{r} 82 | # check if any duplicate user id exist in the data 83 | length(songs$user_id) == length(unique(songs$user_id)) 84 | ``` 85 |
86 | Clearly, there are duplicate user ids in the dataset. This is OK because a single user can listen to multiple songs. 87 | However, id appears to be unique. 88 |
89 | Check if the dates make sense. The time played for all the entries should NOT be before the sign-up date. 90 |
91 | ```{r} 92 | all(as.Date(songs$user_sign_up_date) <= as.Date(songs$time_played)) 93 | ``` 94 |
95 | Clearly, all the entries have sign-up dates before the time played. Overall, the data looks OK.
96 | 97 | 98 | # Question 1 99 |
100 | Summarize the data by user state 101 | ```{r} 102 | top3_states <- songs %>% 103 | group_by(user_state) %>% 104 | summarise(user_count = n_distinct(user_id)) %>% 105 | ungroup() %>% 106 | arrange(desc(user_count), user_state) %>% 107 | filter(row_number() <= 3) 108 | 109 | bottom3_states <- songs %>% 110 | group_by(user_state) %>% 111 | summarise(user_count = n_distinct(user_id)) %>% 112 | ungroup() %>% 113 | arrange(user_count, user_state) %>% 114 | filter(row_number() <= 3) 115 | 116 | top3_states 117 | bottom3_states 118 | ``` 119 |
120 | # Question 2 121 |
122 | Based on the given data and problem description, the only way users engage with the service is by playing songs. 123 | I define user engagement as the number of play events per user in a given period of time. I plan to use average daily user engagement, which is the average number of play events per day per user, as the metric to decide the top and bottom states for product usage. 124 |
125 | If users use the product a lot, the number of play events per day per user goes up and drives the metric up. Also, the daily user engagement rates can be used to visualize trends over time. 126 |
127 | The user engagement should be calculated using the number of user sign-ups prior to the play event. 128 |
129 | The number of user sign-ups by state and date. 130 | ```{r} 131 | total_signups_by_date <- songs %>% 132 | arrange(user_sign_up_date) %>% 133 | group_by(user_sign_up_date, user_state) %>% 134 | summarize(counts = n_distinct(user_id)) %>% 135 | ungroup() %>% 136 | arrange(user_state, user_sign_up_date) 137 | 138 | # Unique states 139 | unique_states <- songs %>% 140 | select(user_state) %>% 141 | distinct() %>% 142 | arrange(user_state) 143 | 144 | ``` 145 |
146 | The dates for which the daily engagement rate needs to be computed 147 | ```{r} 148 | required_dates <- substring(songs$time_played, 1, 10) %>% unique() %>% sort() 149 | ``` 150 |
151 | Initialize a placeholder to hold the daily engagement rate.
152 | ```{r} 153 | engagement_state_date <- data.frame(date = required_dates, stringsAsFactors = F) %>% 154 | merge(unique_states) 155 | 156 | # Merge with other dataset to get the number of play events 157 | daily_engagement_by_state_date <- songs %>% 158 | mutate(date_played = substring(time_played, 1, 10)) %>% 159 | group_by(user_state, date_played) %>% 160 | summarise(plays = n()) %>% 161 | ungroup() %>% 162 | right_join(engagement_state_date, by = c("user_state" = "user_state", "date_played" = "date")) %>% 163 | mutate(plays = ifelse(is.na(plays), 0, plays), signups_till_date = NA) 164 | 165 | head(daily_engagement_by_state_date) 166 | ``` 167 |
168 | Compute daily user engagement by state as "number of play events / number of user sign-ups till date" 169 |
170 | ```{r} 171 | # Loop through each of the entries 172 | for(i in 1:nrow(daily_engagement_by_state_date)){ 173 | tmp <- total_signups_by_date %>% 174 | filter(user_state == daily_engagement_by_state_date$user_state[i], 175 | as.Date(user_sign_up_date) <= as.Date(daily_engagement_by_state_date$date_played[i])) 176 | 177 | daily_engagement_by_state_date$signups_till_date[i] <- sum(tmp$counts) 178 | } 179 | 180 | daily_engagement_by_state_date <- daily_engagement_by_state_date %>% 181 | mutate(daily_engagement = plays/signups_till_date) 182 | 183 | head(daily_engagement_by_state_date) 184 | ``` 185 |
186 | ```{r} 187 | daily_engagement_summary <- daily_engagement_by_state_date %>% 188 | group_by(user_state) %>% 189 | summarise(avg_daily_engagement = round(mean(daily_engagement), digits = 2)) 190 | 191 | daily_engagement_summary 192 | ``` 193 |
194 | Top and bottom 3 states by average daily user engagement 195 |
196 | ```{r} 197 | top3_states_engagement <- daily_engagement_summary %>% 198 | arrange(desc(avg_daily_engagement)) %>% 199 | filter(row_number() <= 3) 200 | 201 | bottom3_states_engagement <- daily_engagement_summary %>% 202 | arrange(avg_daily_engagement) %>% 203 | filter(row_number() <= 3) 204 | 205 | top3_states_engagement 206 | bottom3_states_engagement 207 | ``` 208 |
209 | # Question 3 210 |
211 | First users by state 212 |
213 | ```{r} 214 | first_users_by_state <- songs %>% 215 | group_by(user_state) %>% 216 | arrange(user_sign_up_date) %>% 217 | filter(row_number() == 1) %>% 218 | ungroup() %>% 219 | select(user_state, user_id) %>% 220 | arrange(user_state) 221 | 222 | first_users_by_state 223 | ``` 224 |
225 | # Question 4 226 |
227 | The approach to building the song recommendation system is to use a first-order Markov chain where, for each song, we predict the most likely next song without looking at user history, taking into account only the current song. The Markov chain approach is combined with a similarity score obtained from collaborative filtering to break ties or to handle cold-start cases. 228 |
229 | The algorithm is to build a data set where, for each user and song, it gives the very next song listened to. 230 | We can then group by each song across all users and find the next song with the highest count in a given time window. Here, I choose the time window as one day. For every song, we are interested in finding the counterpart that is played right after it the most number of times, on the same day, across all the users. In cases where there is a tie or missing data, the similarity from collaborative filtering is used to give the prediction.
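Before building the full pipeline, here is a minimal toy sketch of the transition-count idea (made-up plays, not the challenge data). It uses `dplyr::lead()` within user and day, assuming the rows are already in play order, whereas the actual implementation below uses a cartesian join on the real timestamps:

```{r}
# Toy sketch only: count "song -> next song" transitions by the same user on the same day
toy_plays <- tibble::tibble(
  user_id = c(1, 1, 1, 2, 2, 2),
  date    = as.Date(rep("2015-06-01", 6)),
  song    = c("A", "B", "A", "A", "B", "C")
)

toy_transitions <- toy_plays %>%
  group_by(user_id, date) %>%
  mutate(next_song = lead(song)) %>%   # the song played right after, same user and day
  ungroup() %>%
  filter(!is.na(next_song)) %>%
  count(song, next_song, sort = TRUE)  # transition counts pooled across users

toy_transitions
```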
231 | 232 | ## Markov Chain 233 |
234 | ```{r} 235 | songs <- songs %>% 236 | mutate(k = 1) 237 | 238 | # Cartesian join with the same table and apply appropriate filter 239 | songs_joined <- songs %>% 240 | select(user_id1 = user_id, song = song_played, time_played_song1 = time_played, k) %>% 241 | full_join(songs, by = "k") %>% 242 | # Keep pairs of different songs played by the same user on the same day, first song before the second 243 | filter(user_id1 == user_id, date(ymd_hms(time_played_song1)) == date(ymd_hms(time_played)), 244 | ymd_hms(time_played_song1) < ymd_hms(time_played), song != song_played) %>% 245 | select(user_id, song, next_song = song_played) 246 | 247 | # Most likely next song based on Markov chain 248 | song_pairs <- songs_joined %>% 249 | mutate(song = toupper(song), next_song = toupper(next_song)) %>% 250 | group_by(song, next_song) %>% 251 | summarise(counts = n()) %>% 252 | ungroup() %>% 253 | arrange(song, desc(counts)) 254 | 255 | ``` 256 |
257 | Clearly, ties exist in the data. 258 |
259 | ## Collaborative Filtering 260 |
261 | Collaborative filtering is used to break ties. Each song can be imagined as a point in the n-dimensional user space. Each coordinate of the point would be the number of times the song was played by a particular user. 262 |
263 | ```{r} 264 | # Build user song matrix 265 | user_song_matrix <- songs %>% 266 | group_by(user_id, song_played) %>% 267 | summarise(nplays = n()) %>% 268 | ungroup() %>% 269 | spread(song_played, nplays) %>% 270 | mutate_all(list(~replace_na(., 0))) %>% 271 | select(-user_id) 272 | 273 | unique_songs <- colnames(user_song_matrix) 274 | ``` 275 |
276 | Cosine similarity is used to compute the similarity between two songs. The idea here is that if two songs are played by the same set of users, then they must be similar and have a high cosine similarity value. 277 |
278 | ```{r} 279 | # Define a function to compute the cosine similarity between two songs 280 | cosine_similarity <- function(x, y) { 281 | sum(x * y) / (sqrt(sum(x * x)) * sqrt(sum(y * y))) 282 | } 283 | 284 | # Define a placeholder to hold similarity between each pair of songs 285 | # similarity between a song and itself is 1 286 | song_similarity <- diag(1, nrow = ncol(user_song_matrix), ncol = ncol(user_song_matrix)) 287 | rownames(song_similarity) <- toupper(unique_songs) 288 | colnames(song_similarity) <- toupper(unique_songs) 289 | nsongs <- ncol(user_song_matrix) 290 | ``` 291 |
292 | Generate the song similarity matrix 293 | ```{r} 294 | # Loop through the columns 295 | for(i in 1:nsongs) { 296 | # Loop through the columns for each column 297 | for(j in 1:nsongs) { 298 | # Fill in placeholder with cosine similarities 299 | song_similarity[i, j] <- cosine_similarity(user_song_matrix[i], user_song_matrix[j]) 300 | } 301 | } 302 | 303 | # Process song pairs 304 | song_similarity_df <- song_similarity %>% 305 | as.data.frame() 306 | row.names(song_similarity_df) <- c() 307 | 308 | song_similarity_df$song1 <- row.names(song_similarity) 309 | song_similarity_df <- song_similarity_df %>% 310 | select(song1, 1:100) %>% 311 | gather(key = "song2", value = "similarity", -song1) %>% 312 | filter(song1 != song2) 313 | 314 | # Take a peek at the song pair similarity scores 315 | head(song_similarity_df) 316 | ``` 317 |
318 | For every song, get the next song with the most counts; if multiple songs tie for the most counts, then use the highest similarity score.
319 | ```{r} 320 | # summarize 321 | next_song <- song_pairs %>% 322 | left_join(song_similarity_df, by = c("song" = "song1", "next_song" = "song2")) %>% 323 | arrange(song, desc(counts), desc(similarity)) %>% 324 | group_by(song) %>% 325 | filter(row_number() == 1) 326 | ``` 327 |
328 | Given the number of songs, not all songs got a prediction for the next song to be played. For those cases, choose the song with the highest similarity as the most likely next song. 329 | ```{r} 330 | # Get the missing songs similarity 331 | missing_songs <- song_similarity_df %>% 332 | filter(!song1 %in% next_song$song) %>% 333 | arrange(song1, desc(similarity)) %>% 334 | group_by(song1) %>% 335 | filter(row_number() == 1) 336 | 337 | ``` 338 |
339 | Combine all the predictions. 340 | ```{r} 341 | # Combining 342 | next_song_final <- missing_songs %>% 343 | select(song = song1, next_song = song2) %>% 344 | bind_rows(next_song) %>% 345 | arrange(song) %>% 346 | select(song, next_song) 347 | 348 | next_song_final 349 | ``` 350 |
351 | Now, define the function to get the most likely next song using the above data set. 352 | ```{r} 353 | # Function to get the next song 354 | get_next_song <- function(song){ 355 | if(!toupper(song) %in% next_song_final$song){ 356 | return("Song not found in database!") 357 | } 358 | return(next_song_final$next_song[next_song_final$song == toupper(song)]) 359 | } 360 | 361 | # Test cases 362 | get_next_song("Eight Days A Week") 363 | get_next_song("XXXXXXX") 364 | ``` 365 |
366 | # Question 5 367 |
368 | Launching a song recommendation system in an existing product is a major change and is likely to introduce a lot of UI changes. When testing the song recommendation system, it is important to isolate the effect of the UI changes from the overall change in the metric before and after introducing the recommender, in order to judge the quality of the recommendation algorithm. To accomplish this, we test 3 versions. 369 |
370 | V1 - the original product without recommendations. 371 | V2 - with recommendations and associated UI changes (recommendations based on a random guess or a very rudimentary model). 372 | V3 - with recommendations based on the built algorithm and associated UI changes. 373 |
374 | Perform A/B testing on the three versions as follows. 375 |
376 | * First estimate the number of days the test needs to run for the desired effect size, significance level and statistical power. The significance level should be corrected using the Bonferroni correction, as multiple tests are involved. 377 | * Randomly split users into three groups; each group is shown one of the three versions to be tested. 378 | * Collect data on "average number of play events per user per day" for all three groups. 379 | * Here are the hypotheses to be tested. 380 | * H0 : No difference in the metric across the groups. 381 | * H1 : There is a difference in the metric between the groups. 382 | * After the test period, perform a t-test on each pair of groups and check whether you can reject or fail to reject H0 at the adjusted significance level (employing the Bonferroni correction) to judge the effect of the recommendation algorithm (a minimal sketch of this is shown below).
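As referenced in the last bullet, here is a minimal sketch of the sample-size estimate and the Bonferroni-corrected pairwise comparisons. The `test_results` data frame, its column names and the effect-size numbers are all made up for illustration; they are not part of the challenge data.

```{r}
# Rough per-group sample size for one pairwise comparison, with the significance
# level split across the 3 pairwise tests (Bonferroni: 0.05 / 3)
power.t.test(delta = 0.2, sd = 1, sig.level = 0.05 / 3, power = 0.8)

# Hypothetical post-test data: one row per user with the metric and the version seen
set.seed(1)
test_results <- data.frame(
  version            = rep(c("V1", "V2", "V3"), each = 200),
  plays_per_user_day = c(rnorm(200, 2.0, 1), rnorm(200, 2.1, 1), rnorm(200, 2.3, 1))
)

# Pairwise t-tests between the three versions with Bonferroni-adjusted p-values
pairwise.t.test(test_results$plays_per_user_day, test_results$version,
                p.adjust.method = "bonferroni", pool.sd = FALSE)
```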
383 | 384 | -------------------------------------------------------------------------------- /URL Parsing Challenge/URL Parsing Challenge.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'URL Parsing Challenge' 3 | author: "Siddhartha Jetti" 4 | date: "7/14/2019" 5 | output: rmarkdown::github_document 6 | --- 7 |
8 | # Goal 9 |
10 | Being able to efficiently parse URLs and extract info from them is a very important skill for a data scientist. 11 |
12 | Firstly, if you join a very early stage startup, it might be that a lot of data are just stored in the URLs visited by the users. And, therefore, you will have to build a system that parses the URLs, extracts fields out of them, and saves them into a table that can be easily queried (not the most fun of the jobs, but very useful!). 13 |
14 | Secondly, often using external data can help your models a lot. And a way to get external data is by scraping websites. And this is often done by being able to play with a given site URL structure (assuming it is allowed by the external site ToS). 15 |
16 | The goal of this project is to parse a sequence of URLs about user searches and extract some basic info out of it. 17 |
18 |
19 | # Challenge Description 20 |
21 | Company XYZ is an Online Travel Agent site, such as Expedia, Booking.com, etc. 22 |
23 | They haven’t invested in data science yet and all the data they have about user searches are simply stored in the URLs the users generate when they search for a hotel. If you are not familiar with URLs, you can run a search on any OTA site and see how all search parameters are always present in the URL. 24 |
25 | You are asked to: 26 |
27 | 1) Create a clean data set where each column is a field in the URL, each row is a given search and the cells are the corresponding URL values. 28 |
29 | For each search query, how many amenities were selected? 30 |
31 | 2) Often, to measure the quality of a search algorithm, data scientists use some metric based on how often users click on the second page, third page, and so on. The idea here is that a great search algorithm should return all interesting results on the first page and never force users to visit the other pages (how often do you click on the second page results when you search on Google? Almost never, right?). 32 |
33 | Create a metric based on the above idea and find the city with the worst search algorithm. 34 |
35 |
36 | # Data 37 |
38 | The file is: 39 |
40 | url_list - a list of search URLs generated by the users when searching for hotels 41 |
42 | ## Fields: 43 |
44 | * hotel.checkin : checkin date in the search query. It is mandatory 45 | * hotel.customMinimumPriceFilter : This filter allows to only return hotels whose nightly price is above a certain threshold (in USD). Useful to filter out the really bad hotels 46 | * hotel.customMaximumPriceFilter : This filter allows to only return hotels whose nightly price is below a certain threshold (in USD). Useful to filter out the hotels you can’t afford 47 | * hotel.freeCancellation : It is a check box. If the user selects it, only hotels with free cancellation are returned. 48 | * hotel.stars_5 : It is a check box. If the user selects it, 5-star hotels are returned. Multiple choices are allowed. For instance, a user can select 5 and 4 star hotels by checking this box and the one below. If no check box is selected, all hotels are returned. 49 | * hotel.stars_4 : It is a check box.
If the user selects it, 4-star hotels are returned 50 | * hotel.stars_3 : It is a check box. If the user selects it, 3-star hotels are returned 51 | * hotel.stars_2 : It is a check box. If the user selects it, 2-star hotels are returned 52 | * hotel.stars_1 : It is a check box. If the user selects it, 1-star hotels are returned 53 | * hotel.max_score : This filter allows to only return hotels whose score is below a certain threshold. Score is 1-5 with high being good (you can think of TripAdvisor score to get an idea of what it is) 54 | * hotel.min_score : This filter allows to only return hotels whose score is above a certain threshold 55 | * hotel.couponCode : If the user uses a coupon in her search, this field gets populated with “Yes” 56 | * hotel.adults : Number of adults in the search query. This is the number of adults who would stay in the same room. It is mandatory 57 | * hotel.city : Which city is the user searching for. It is mandatory 58 | * hotel.children : Is the user traveling with children? This field returns the number of children in the search query 59 | * hotel.amenities : There are a few amenities that the user can select in her search via different check boxes. The possible amenities are: 60 | + shuttle: free shuttle transportation from the airport 61 | + internet: free internet 62 | + breakfast : free breakfast 63 | + lounge : does the hotel have a lounge 64 | + yes_smoking : are there rooms where smoking is allowed 65 | + yes_pet : is it allowed to bring pets 66 | * hotel.checkout : Check out date. It is mandatory 67 | * hotel.search_page : Search page visited. 1 means the user is on the first page of results, 2 -> clicked on the second page etc. This will be used to estimate the goodness of ranking for different cities 68 |
69 |
70 | # Problem Setup 71 |
72 | ```{r} 73 | # Load required libraries 74 | library(tidyverse) 75 | library(ggplot2) 76 | 77 | # Read in the input data into a dataframe 78 | urls <- read.table("url_list.txt", stringsAsFactors = F) 79 | urls1 <- gsub("http://www.mysearchforhotels.com/shop/hotelsearch?", "", urls[,1], fixed = TRUE) 80 | 81 | ``` 82 |
83 | # Question 1: 84 |
85 | Transform the URL data 86 | ```{r} 87 | url_list <- strsplit(urls1, "&") 88 | nurls <- nrow(urls) 89 | ``` 90 |
91 | Process and clean the data 92 | ```{r} 93 | # all possible URL fields 94 | url_fields <- c("hotel.checkin", "hotel.customMinimumPriceFilter", "hotel.customMaximumPriceFilter", 95 | "hotel.freeCancellation", "hotel.stars_5", "hotel.stars_4", "hotel.stars_3", "hotel.stars_2", 96 | "hotel.stars_1", "hotel.max_score", "hotel.min_score", "hotel.couponCode", "hotel.adults", 97 | "hotel.city", "hotel.children", "hotel.amenities", "hotel.checkout", "hotel.search_page") 98 | 99 | # Initialize a data frame to hold the cleaned data in rectangular format 100 | url_data <- as.data.frame(matrix(NA, nrow = nurls, ncol = 19)) 101 | names(url_data) <- c("id", url_fields) 102 | url_data$id <- seq(1, nurls, by = 1) 103 | ``` 104 |
105 | Build the rectangular data set.
106 | ```{r} 107 | # Loop through all the URLs 108 | for(i in 1:nurls) { 109 | search <- url_list[[i]] 110 | # Loop through all the fields for every url entry 111 | for(field in url_fields){ 112 | if(any(grepl(field, search))){ 113 | search_strings <- paste(search[grep(field, search)], collapse = ',') 114 | url_data[i, field] <- gsub(paste0(field, "="), "", search_strings, fixed = TRUE) 115 | } else { url_data[i, field] <- NA } 116 | } 117 | } 118 | # Remove special characters 119 | url_data <- url_data %>% 120 | mutate(hotel.city = gsub("+", "", hotel.city, fixed = TRUE)) 121 | 122 | # Take a peek at the final data 123 | summary(url_data) 124 | head(url_data) 125 | 126 | # Check if any rows exist with all fields other than id missing 127 | any(rowSums(is.na(url_data)) == ncol(url_data[,-1])) 128 | 129 | ``` 130 |
131 | Every search query need not have all the fields specified. So, the cleaned dataset is expected to be sparse (several missing values in each row). No rows exist with all fields other than id missing. 132 |
133 |
134 | ```{r} 135 | # Check if missing values exist for search page 136 | sum(is.na(url_data$hotel.search_page)) 137 | 138 | # Check missing values for city 139 | sum(is.na(url_data$hotel.city)) 140 | ``` 141 |
142 | No missing values exist in the critical fields. The data looks good. 143 |
144 | Count the number of amenities for every search 145 | ```{r} 146 | # Frequency by categories 147 | table(url_data$hotel.amenities, useNA = "always") 148 | 149 | # Count the amenities searched by splitting at "," for searches involving multiple amenity selection 150 | url_data <- url_data %>% 151 | mutate(amenities_count = ifelse(is.na(hotel.amenities), 0, str_count(hotel.amenities, ",") + 1)) 152 | 153 | table(url_data$amenities_count, useNA = "always") 154 | ``` 155 |
156 | Clearly, more than 99% of the searches do not have any amenities selected. Less than 1% of searches have one or more amenities selected, and the highest number of amenities selected in a search is two. 157 |
158 | # Question 2: 159 |
160 | A good search algorithm should rank the most relevant search results highly and hence show interesting results on the first page. I think users having to look at the second page of search results is as bad as having to look at the 10th page, because most users don't bother to go beyond the first page and results shown on later pages may never reach the user. So, it makes sense to bin searches into two categories: one where results are found on the first page, and the other covering anything beyond the first page. 161 |
162 | The metric I choose to measure the relevance of the search results, in other words the performance of the search algorithm, is "the fraction of searches where users look beyond the first page of search results". The lower the value of this metric, the better the relevance of the search results and the search algorithm. If a large percentage of users look beyond the first page, this metric goes up and indicates that the search algorithm is not performing well.
163 | 164 | ```{r} 165 | # Frequency by searched city 166 | table(url_data$hotel.city, useNA = "always") 167 | 168 | # Order cities based on the chosen metric 169 | url_data %>% 170 | mutate(next_search_page = ifelse(hotel.search_page == "1", 0, 1)) %>% 171 | group_by(hotel.city) %>% 172 | summarize(next_search_page_fraction = mean(next_search_page)) %>% 173 | arrange(desc(next_search_page_fraction)) 174 | ``` 175 | 176 | It looks like London had the worst search algorithm with users looking beyond first search results page in about 47% of searches. Followed by New York with about 44% of searches going beyond first page. 177 | -------------------------------------------------------------------------------- /URL Parsing Challenge/URL_Parsing_Challenge.md: -------------------------------------------------------------------------------- 1 | URL Parsing Challenge 2 | ================ 3 | Siddhartha Jetti 4 | 7/14/2019 5 | 6 | # Goal 7 | 8 | Being able to efficiently parse URLs and extract info from them is a 9 | very important skill for a data scientist. 10 | 11 | Firstly, if you join a very early stage startup, it might be that a lot 12 | of data are just stored in the URLs visited by the users. And, 13 | therefore, you will have to build a system that parses the URLs, 14 | extracts fields out of it, and saves them into a table that can be 15 | easily queried (not the most fun of the jobs, but very useful\!). 16 | 17 | Secondly, often using external data can help a lot your models. And a 18 | way to get external data is by scraping websites. And this is often done 19 | by being able to play with a given site URL structure (assuming it is 20 | allowed by the external site ToS). 21 | 22 | The goal of this project is to parse a sequence of URLs about user 23 | searches and extract some basic info out of it. 24 | 25 | # Challenge Description 26 | 27 | Company XYZ is a Online Travel Agent site, such as Expedia, Booking.com, 28 | etc. 29 | 30 | They haven’t invested in data science yet and all the data they have 31 | about user searches are simply stored in the URLs the users generate 32 | when they search for a hotel. If you are not familiar with URLs, you can 33 | run a search on any OTA site and see how all search parameters are 34 | always present in the URL. 35 | 36 | You are asked to: 37 | 38 | 1) Create a clean data set where each column is a field in the URL, 39 | each row is a given search and the cells are the corresponding URL 40 | values. 41 | 42 | For each search query, how many amenities were selected? 43 | 44 | 2) Often, to measure the quality of a search algorithm, data scientists 45 | use some metric based on how often users click on the second page, 46 | third page, and so on. The idea here is that a great search 47 | algorithm should return all interesting results on the first page 48 | and never force users to visit the other pages (how often do you 49 | click on the second page results when you search on Google? Almost 50 | never, right?). 51 | 52 | Create a metric based on the above idea and find the city with the worst 53 | search algorithm. 54 | 55 | # Data 56 | 57 | The file is: 58 | 59 | url\_list - a list of search URLs generated by the users when searching 60 | for hotels 61 | 62 | ## Fields: 63 | 64 | - hotel.checkin : checkin date in the search query. It is mandatory 65 | - hotel.customMinimumPriceFilter : This filter allows to only return 66 | hotels whose nightly price is above a certain threshold (in USD). 
67 | Useful to filter out the really bad hotels 68 | - hotel.customMaximumPriceFilter : This filter allows to only return 69 | hotels whose nightly price is below a certain threshold (in USD). 70 | Useful to filter out the hotels you can’t afford 71 | - hotel.freeCancellation : It is a check box. If the user selects it, 72 | only hotels with free cancellation are returned. 73 | - hotel.stars\_5 : It is a check box. If the user selects it, 5-star 74 | hotels are returned. Multiple choices are allowed. For instance, a 75 | user can select 5 and 4 star hotels by checking this box and the one 76 | below. If no check box is selected, all hotels are returned. 77 | - hotel.stars\_4 : It is a check box. If the user selects it, 4-star 78 | hotels are returned 79 | - hotel.stars\_3 : It is a check box. If the user selects it, 3-star 80 | hotels are returned 81 | - hotel.stars\_2 : It is a check box. If the user selects it, 2-star 82 | hotels are returned 83 | - hotel.stars\_1 : It is a check box. If the user selects it, 1-star 84 | hotels are returned 85 | - hotel.max\_score : This filter allows to only return hotels whose 86 | score is below a certain threshold. Score is 1-5 with high being 87 | good (you can think of TripAdvisor score to get an idea of what it 88 | is) 89 | - hotel.min\_score : This filter allows to only return hotels whose 90 | score is above a certain threshold 91 | - hotel.couponCode : If the user uses a coupon in her search, this 92 | fields gets populated with “Yes” 93 | - hotel.adults : Number of adults in the search query. This sis the 94 | number of adults who would stay in the same room. It is mandatory 95 | - hotel.city : Which city is the user searching for. It is mandatory 96 | - hotel.children : Is the user traveling with children? This field 97 | returns the number of children in the search query 98 | - hotel.amenities : There are a few amenities that the user can select 99 | in her search via different check boxes. The possible amenities are: 100 | - shuttle: free shuttle transportation from the airport 101 | - internet: free internet 102 | - breakfast : free breakfast 103 | - lounge : does the hotel have a lounge 104 | - yes\_smoking : are there rooms where smoking is allowed 105 | - yes\_pet : is it allowed to bring pets 106 | - hotel.checkout : Check out date. It is mandatory 107 | - hotel.search\_page : Search page visited. 1 means the user in on the 108 | first page results, 2 -\> clicked on the second page etc. 
This will 109 | be used to estimate the goodness of ranking for different cities 110 | 111 | # Problem Setup 112 | 113 | ``` r 114 | # Load required libraries 115 | library(tidyverse) 116 | ``` 117 | 118 | ## Registered S3 methods overwritten by 'ggplot2': 119 | ## method from 120 | ## [.quosures rlang 121 | ## c.quosures rlang 122 | ## print.quosures rlang 123 | 124 | ## ── Attaching packages ──────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ── 125 | 126 | ## ✔ ggplot2 3.1.1 ✔ purrr 0.3.2 127 | ## ✔ tibble 2.1.1 ✔ dplyr 0.8.1 128 | ## ✔ tidyr 0.8.3 ✔ stringr 1.4.0 129 | ## ✔ readr 1.3.1 ✔ forcats 0.4.0 130 | 131 | ## ── Conflicts ─────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ── 132 | ## ✖ dplyr::filter() masks stats::filter() 133 | ## ✖ dplyr::lag() masks stats::lag() 134 | 135 | ``` r 136 | library(ggplot2) 137 | 138 | # Read in the input data into a dataframe 139 | urls <- read.table("url_list.txt", stringsAsFactors = F) 140 | urls1 <- gsub("http://www.mysearchforhotels.com/shop/hotelsearch?", "", urls[,1], fixed = TRUE) 141 | ``` 142 | 143 | # Question 1: 144 | 145 | Transform the URL data 146 | 147 | ``` r 148 | url_list <- strsplit(urls1, "&") 149 | nurls <- nrow(urls) 150 | ``` 151 | 152 | Process and clean the data 153 | 154 | ``` r 155 | # all possible URL fields 156 | url_fields <- c("hotel.checkin", "hotel.customMinimumPriceFilter", "hotel.customMaximumPriceFilter", 157 | "hotel.freeCancellation", "hotel.stars_5", "hotel.stars_4", "hotel.stars_3", "hotel.stars_2", 158 | "hotel.stars_1", "hotel.max_score", "hotel.min_score", "hotel.couponCode", "hotel.adults", 159 | "hotel.city", "hotel.children", "hotel.amenities", "hotel.checkout", "hotel.search_page") 160 | 161 | # Initialize a data frame to hold the cleaned data in rectangular format 162 | url_data <- as.data.frame(matrix(NA, nrow = nurls, ncol = 19)) 163 | names(url_data) <- c("id", url_fields) 164 | url_data$id <- seq(1, nurls, by = 1) 165 | ``` 166 | 167 | Build the rectangular data set. 168 | 169 | ``` r 170 | # Loop through all the URLs 171 | for(i in 1:nurls) { 172 | search <- url_list[[i]] 173 | # Loop through all the fields for every url entry 174 | for(field in url_fields){ 175 | if(any(grepl(field, search))){ 176 | search_strings <- paste(search[grep(field, search)], collapse = ',') 177 | url_data[i, field] <- gsub(paste0(field, "="), "", search_strings, fixed = TRUE) 178 | } else { url_data[i, field] <- NA } 179 | } 180 | } 181 | # Remove special characters 182 | url_data <- url_data %>% 183 | mutate(hotel.city = gsub("+", "", hotel.city, fixed = TRUE)) 184 | 185 | # Take a peek at the final data 186 | summary(url_data) 187 | ``` 188 | 189 | ## id hotel.checkin hotel.customMinimumPriceFilter 190 | ## Min. : 1 Length:77677 Length:77677 191 | ## 1st Qu.:19420 Class :character Class :character 192 | ## Median :38839 Mode :character Mode :character 193 | ## Mean :38839 194 | ## 3rd Qu.:58258 195 | ## Max. 
:77677 196 | ## hotel.customMaximumPriceFilter hotel.freeCancellation hotel.stars_5 197 | ## Length:77677 Length:77677 Length:77677 198 | ## Class :character Class :character Class :character 199 | ## Mode :character Mode :character Mode :character 200 | ## 201 | ## 202 | ## 203 | ## hotel.stars_4 hotel.stars_3 hotel.stars_2 204 | ## Length:77677 Length:77677 Length:77677 205 | ## Class :character Class :character Class :character 206 | ## Mode :character Mode :character Mode :character 207 | ## 208 | ## 209 | ## 210 | ## hotel.stars_1 hotel.max_score hotel.min_score 211 | ## Length:77677 Length:77677 Length:77677 212 | ## Class :character Class :character Class :character 213 | ## Mode :character Mode :character Mode :character 214 | ## 215 | ## 216 | ## 217 | ## hotel.couponCode hotel.adults hotel.city 218 | ## Length:77677 Length:77677 Length:77677 219 | ## Class :character Class :character Class :character 220 | ## Mode :character Mode :character Mode :character 221 | ## 222 | ## 223 | ## 224 | ## hotel.children hotel.amenities hotel.checkout 225 | ## Length:77677 Length:77677 Length:77677 226 | ## Class :character Class :character Class :character 227 | ## Mode :character Mode :character Mode :character 228 | ## 229 | ## 230 | ## 231 | ## hotel.search_page 232 | ## Length:77677 233 | ## Class :character 234 | ## Mode :character 235 | ## 236 | ## 237 | ## 238 | 239 | ``` r 240 | head(url_data) 241 | ``` 242 | 243 | ## id hotel.checkin hotel.customMinimumPriceFilter 244 | ## 1 1 2015-09-19 245 | ## 2 2 2015-09-14 246 | ## 3 3 2015-09-26 247 | ## 4 4 2015-09-02 248 | ## 5 5 2015-09-20 249 | ## 6 6 2015-09-14 250 | ## hotel.customMaximumPriceFilter hotel.freeCancellation hotel.stars_5 251 | ## 1 252 | ## 2 253 | ## 3 175 254 | ## 4 yes 255 | ## 5 275 256 | ## 6 yes 257 | ## hotel.stars_4 hotel.stars_3 hotel.stars_2 hotel.stars_1 hotel.max_score 258 | ## 1 yes 259 | ## 2 yes 260 | ## 3 yes 261 | ## 4 yes 262 | ## 5 263 | ## 6 264 | ## hotel.min_score hotel.couponCode hotel.adults 265 | ## 1 4 3 266 | ## 2 4 3 267 | ## 3 5 2 268 | ## 4 4 1 269 | ## 5 5 3 270 | ## 6 2 271 | ## hotel.city hotel.children hotel.amenities 272 | ## 1 NewYork,NY,UnitedStates 273 | ## 2 London,UnitedKingdom 274 | ## 3 NewYork,NY,UnitedStates 275 | ## 4 HongKong,HongKong 276 | ## 5 London,UnitedKingdom 277 | ## 6 SanFrancisco,California,UnitedStates 278 | ## hotel.checkout hotel.search_page 279 | ## 1 2015-09-20 1 280 | ## 2 2015-09-15 1 281 | ## 3 2015-09-27 1 282 | ## 4 2015-09-03 1 283 | ## 5 2015-09-29 1 284 | ## 6 2015-09-16 1 285 | 286 | ``` r 287 | # Check if any rows exist with all fields other than id missing 288 | any(rowSums(is.na(url_data)) == ncol(url_data[,-1])) 289 | ``` 290 | 291 | ## [1] FALSE 292 | 293 | Every search query need not have all the fields specified. So, the 294 | cleaned dataset is expected to be sparse (several missing values in each 295 | row). No rows exist with all fields other than id missing. 296 | 297 | ``` r 298 | # Check if missing values exist for search page 299 | sum(is.na(url_data$hotel.search_page)) 300 | ``` 301 | 302 | ## [1] 0 303 | 304 | ``` r 305 | # Check missing values for city 306 | sum(is.na(url_data$hotel.city)) 307 | ``` 308 | 309 | ## [1] 0 310 | 311 | No missing values exist in critical fields. The data looks good. 
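The `summary()` output above also shows that every field was read in as character. An optional follow-up (a sketch added here, not one of the original chunks; the `url_data_typed` name is just for illustration) is to cast the key fields to proper types so later aggregations do not rely on string coercion:

``` r
# Optional type conversions for the key fields (sketch, not run in the original analysis)
url_data_typed <- url_data %>%
  mutate(hotel.checkin = as.Date(hotel.checkin),
         hotel.checkout = as.Date(hotel.checkout),
         hotel.adults = as.integer(hotel.adults),
         hotel.search_page = as.integer(hotel.search_page))
```

The chunks below keep the character columns, which also works because R coerces types inside the comparisons.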
312 | 313 | Count number of amenities for every search 314 | 315 | ``` r 316 | # Frequency by categories 317 | table(url_data$hotel.amenities, useNA = "always") 318 | ``` 319 | 320 | ## 321 | ## breakfast breakfast,yes_pet internet 322 | ## 39 1 272 323 | ## lounge shuttle yes_pet 324 | ## 22 111 85 325 | ## yes_smoking yes_smoking,yes_pet 326 | ## 170 4 76973 327 | 328 | ``` r 329 | # Count the amenities searched by splitting at "," for searches involving multiple amenity selection 330 | url_data <- url_data %>% 331 | mutate(amenities_count = ifelse(is.na(hotel.amenities), 0, str_count(hotel.amenities, ",") + 1)) 332 | 333 | table(url_data$amenities_count, useNA = "always") 334 | ``` 335 | 336 | ## 337 | ## 0 1 2 338 | ## 76973 699 5 0 339 | 340 | Clearly, More than 99% of the searches does not have any amenity 341 | selected. Only less than 1% of searches have one or more amenities 342 | selected and highest number of amenities selected in a search is two. 343 | 344 | # Question 2: 345 | 346 | A good search algorithm should rank the most relevant search results 347 | highly and hence show interesting results in the first page. I think 348 | users having to look at second page of search results is as bad as 349 | having to look at 10th page because most users don’t bother to go beyond 350 | first page and ones shown in later pages may not reach the user. So, it 351 | makes sense to bin searches into two categories one where searches are 352 | found on first page and second category being any where else other than 353 | first page. 354 | 355 | The metric I choose to know the relavancy of search results, in other 356 | words performance of search algorithm, is “the fraction of searches with 357 | users looking beyond first search results page”. The lower the value of 358 | this metric, the better is the relavancy of search results and the 359 | search algorithm. If large percentage of users looked beyond the first 360 | page then this metric will go up and indicate that search algorithm is 361 | not performing well. 362 | 363 | ``` r 364 | # Frequency by searched city 365 | table(url_data$hotel.city, useNA = "always") 366 | ``` 367 | 368 | ## 369 | ## HongKong,HongKong London,UnitedKingdom 370 | ## 11786 28058 371 | ## NewYork,NY,UnitedStates SanFrancisco,California,UnitedStates 372 | ## 29384 8449 373 | ## 374 | ## 0 375 | 376 | ``` r 377 | # Order cities based on the chosen metric 378 | url_data %>% 379 | mutate(next_search_page = ifelse(hotel.search_page == "1", 0, 1)) %>% 380 | group_by(hotel.city) %>% 381 | summarize(next_search_page_fraction = mean(next_search_page)) %>% 382 | arrange(desc(next_search_page_fraction)) 383 | ``` 384 | 385 | ## # A tibble: 4 x 2 386 | ## hotel.city next_search_page_fraction 387 | ## 388 | ## 1 London,UnitedKingdom 0.473 389 | ## 2 NewYork,NY,UnitedStates 0.442 390 | ## 3 HongKong,HongKong 0.0892 391 | ## 4 SanFrancisco,California,UnitedStates 0.0407 392 | 393 | It looks like London had the worst search algorithm with users looking 394 | beyond first search results page in about 47% of searches. Followed by 395 | New York with about 44% of searches going beyond first page. 
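As a quick robustness check (an addition, not part of the original write-up), the ranking can be recomputed with a complementary metric, the average results-page depth per search. If it orders the cities the same way, the conclusion about London and New York does not hinge on the binary first-page cutoff:

``` r
# Average results-page depth per city (sketch; uses the url_data frame built above)
url_data %>%
  mutate(page = as.numeric(hotel.search_page)) %>%
  group_by(hotel.city) %>%
  summarize(avg_page_depth = mean(page),
            searches = n()) %>%
  arrange(desc(avg_page_depth))
```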
396 | -------------------------------------------------------------------------------- /User Referral Program/User Referral Program.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "User Referral Program" 3 | author: "Siddhartha Jetti" 4 | date: "10/12/2019" 5 | output: rmarkdown::github_document 6 | --- 7 | 8 | # Goal 9 | 10 | Almost all sites have a user referral program: you can invite new users to try a given product. Typically, after the new user completes a transaction, you get rewarded with a certain amount of money or credit to be used on the site. 11 | 12 | The goal of this challenge is to analyze the data from a referral program and draw conclusions about its effectiveness. 13 | 14 | # Challenge Description 15 | 16 | Company XYZ has started a new referral program on Oct, 31. Each user who refers a new user will get 10$ in credit when the new user buys something. 17 | 18 | The program has been running for almost a month and the Growth Product Manager wants to know if it's been successful. She is very excited cause, since the referral program started, the company saw a spike in number of users and wants you to be able to give her some data she can show to her boss. 19 | 20 | * Can you estimate the impact the program had on the site? 21 | * Based on the data, what would you suggest to do as a next step? 22 | * The referral program wasn't really tested in a rigorous way. It simply started on a given day for all users and you are drawing conclusions by looking at the data before and after the test started. What kinds of risks this approach presents? Can you think of a better way to test the referral program and measure its impact? 23 | 24 | # Data 25 | 26 | We have one dataset "referral.csv" 27 | 28 | ### Columns: 29 | * user_id : the id of the user 30 | * date : date of the purchase 31 | * country : user country based on ip address 32 | * money_spent : how much the item bought costs (in USD) 33 | * is_referral : whether the user came from the referral program(1) or not (0) 34 | * device_id : It is an identifier of the device used to make the purchase. You can assume here that for a given device, its id never changes 35 | 36 | # Problem Setup 37 | ```{r} 38 | # Load required libraries 39 | library(tidyverse) 40 | library(ggplot2) 41 | 42 | # Read in the input data into a dataframe 43 | data <- read.csv("referral.csv", stringsAsFactors = F) 44 | ``` 45 | 46 | # Data Exploration 47 | 48 | Explore data 49 | ```{r} 50 | # Check datatypes of all the variables in the dataset 51 | str(data) 52 | summary(data) 53 | ``` 54 | 55 | The range of each of the variables appear reasonable without any non-sensical values. 56 | 57 | ```{r} 58 | # check for any missing values in the dataset 59 | colSums(is.na(data)) 60 | ``` 61 | 62 | No missing values exist in the dataset. 63 | 64 | Run descriptive stats on the dataset. 65 | 66 | ```{r} 67 | # Total revenue by date 68 | data %>% 69 | mutate(is_referral = as.factor(is_referral), date = as.Date(date)) %>% 70 | ggplot(aes(x = date, y = money_spent)) + 71 | stat_summary(fun.y=sum, geom="line") + 72 | stat_summary(fun.y=sum, geom="point") + 73 | geom_vline(xintercept = as.Date("2015-10-31"), linetype = "dashed", color = "red") 74 | 75 | ``` 76 | 77 | The above plot shows a jump in daily sales after the referral program is launched on Oct 31st (dashed line in the plot). 
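Before reading too much into the jump, note that the series appears to swing heavily within each week (weekend totals are far higher than weekday totals), so part of any apparent lift can come from where the launch date falls in the weekly cycle. A quick exploratory sketch (an addition, not part of the original analysis) that colors the daily totals by weekday vs. weekend:

```{r}
# Color daily totals by day type to separate the weekly cycle from any launch effect
# (weekdays() is locale-dependent; English day names are assumed here)
data %>%
  mutate(date = as.Date(date),
         day_type = ifelse(weekdays(date) %in% c("Saturday", "Sunday"), "weekend", "weekday")) %>%
  group_by(date, day_type) %>%
  summarise(daily_sales = sum(money_spent)) %>%
  ggplot(aes(x = date, y = daily_sales)) +
  geom_line() +
  geom_point(aes(color = day_type)) +
  geom_vline(xintercept = as.Date("2015-10-31"), linetype = "dashed", color = "red")
```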
78 | 
79 | ```{r}
80 | 
81 | # Split revenue between users from referrals and non-referred users
82 | data %>%
83 |   group_by(is_referral) %>%
84 |   summarise(unique_users = n_distinct(user_id), total_sales = sum(money_spent), revenue_per_user = sum(money_spent)/n_distinct(user_id))
85 | 
86 | # Total revenue split between referrals and non-referrals
87 | data %>%
88 |   mutate(is_referral = as.factor(is_referral), date = as.Date(date)) %>%
89 |   ggplot(aes(x = date, y = money_spent, color = is_referral)) +
90 |   stat_summary(fun.y=sum, geom="line") +
91 |   stat_summary(fun.y=sum, geom="point") +
92 |   geom_vline(xintercept = as.Date("2015-10-31"), linetype = "dashed", color = "black")
93 | ```
94 | 
95 | The above plot reveals that daily sales from users who came from referrals are propping up overall sales after the launch of the program. However, due to the way the experiment is set up, the jump in sales cannot be attributed to the referral program.
96 | 
97 | # Question 1:
98 | 
99 | Can you estimate the impact the program had on the site?
100 | 
101 | Here I plan to use daily sales as the metric to assess the impact of the referral program. It is easy to think of other metrics, like the number of users who made at least one purchase or money spent per purchase. At the end of the day, though, total revenue is a more meaningful metric than the others.
102 | 
103 | Summarize the data by date to get the daily sales.
104 | 
105 | ```{r}
106 | data_daily <- data %>%
107 |   group_by(date) %>%
108 |   summarise(daily_sales = sum(money_spent))
109 | 
110 | data_daily
111 | ```
112 | 
113 | Compare the daily sales before and after the launch of the referral program. This is done by running a t-test on daily sales and checking whether the difference is statistically significant.
114 | 
115 | ```{r}
116 | t.test(data_daily$daily_sales[data_daily$date < as.Date("2015-10-31")], data_daily$daily_sales[data_daily$date >= as.Date("2015-10-31")])
117 | ```
118 | 
119 | The t-test suggests that the difference in daily sales before and after the launch of the program is not statistically significant.
120 | 
121 | Check the performance of the referral program across different user segments.
122 | 
123 | ```{r}
124 | # Compute average daily sales for each country
125 | data_daily_country <- data %>%
126 |   mutate(before_after_launch = ifelse(date < as.Date("2015-10-31"), "Before Launch", "After Launch"),
127 |          country = as.factor(country)) %>%
128 |   group_by(date, country, before_after_launch) %>%
129 |   summarise(daily_sales = sum(money_spent)) %>%
130 |   group_by(country) %>%
131 |   summarise( avg_diff_daily_sales = mean(daily_sales[before_after_launch == "Before Launch"]) - mean(daily_sales[before_after_launch == "After Launch"]),
132 |              p_value = t.test(daily_sales[before_after_launch == "Before Launch"], daily_sales[before_after_launch == "After Launch"])$p.value)
133 | 
134 | data_daily_country
135 | ```
136 | 
137 | The differences in average daily sales before and after launch are statistically significant for Mexico (MX) and China (CH). However, sales are better after launch in MX and worse in CH. Note that `avg_diff_daily_sales` is computed as before minus after, so a negative value means sales went up after launch.
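Since nine country-level t-tests are run here, one or two p-values below 0.05 would not be surprising even if there were no real effect anywhere. A small follow-up sketch (an addition, not part of the original analysis) adjusts the p-values for multiple comparisons before reading too much into MX and CH:

```{r}
# Benjamini-Hochberg adjustment of the per-country p-values computed above
data_daily_country %>%
  mutate(p_value_adj = p.adjust(p_value, method = "BH")) %>%
  arrange(p_value_adj)
```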
138 | 
139 | ```{r}
140 | # Plot average daily sales vs. country
141 | data %>%
142 |   mutate(before_after_launch = ifelse(date < as.Date("2015-10-31"), "Before Launch", "After Launch"),
143 |          country = as.factor(country)) %>%
144 |   group_by(date, country, before_after_launch) %>%
145 |   summarise(daily_sales = sum(money_spent)) %>%
146 |   group_by(country, before_after_launch) %>%
147 |   summarise(avg_daily_sales = mean(daily_sales)) %>%
148 |   ggplot(aes(x = country, y = avg_daily_sales, color = before_after_launch, group = before_after_launch)) +
149 |   geom_line() +
150 |   geom_point()
151 | 
152 | ```
153 | 
154 | The above plot reveals that the referral program is performing differently across countries.
155 | Average daily sales dropped after the launch of the program in countries like China (CH) and Germany (DE), while sales appear to increase in the other countries after the launch.
156 | 
157 | Based on the way the experiment is set up, any change in sales after the launch of the referral program cannot be attributed to it alone. There are other confounding factors, like seasonality or the launch of a marketing event, that could be affecting sales. In the US and some other countries, Thanksgiving (at the end of November every year) is one of the major holiday seasons, sales are generally expected to grow around it, and the experiment does not control for lurking variables like that.
158 | 
159 | # Question 2:
160 | 
161 | Based on the data, what would you suggest to do as a next step?
162 | 
163 | From the plot of revenue split between referrals and non-referrals, it is apparent that sales from non-referrals, which include all existing users who signed up or made purchases before Oct 31st, are down significantly after the launch of the program.
164 | 
165 | Let's check whether the drop in daily sales from non-referrals after the launch of the program is statistically significant.
166 | 
167 | ```{r}
168 | nonreferrals_before_vs_after <- data %>%
169 |   mutate(before_after_launch = ifelse(date < as.Date("2015-10-31"), "Before Launch", "After Launch")) %>%
170 |   filter(is_referral == 0) %>%
171 |   group_by(date, before_after_launch) %>%
172 |   summarise(daily_sales = sum(money_spent))
173 | 
174 | t.test(nonreferrals_before_vs_after$daily_sales[nonreferrals_before_vs_after$before_after_launch == "Before Launch"], nonreferrals_before_vs_after$daily_sales[nonreferrals_before_vs_after$before_after_launch == "After Launch"])
175 | ```
176 | 
177 | The t-test clearly suggests that the difference in sales from non-referrals before and after the launch of the program is statistically significant.
178 | 
179 | It is quite possible that the new program is cannibalizing sales from existing users. Let's check if that is the case.
180 | ```{r}
181 | # Check if multiple users are using the same device
182 | counts_deviceid <- data %>%
183 |   group_by(device_id) %>%
184 |   summarise(n_users = n_distinct(user_id), referral_types = n_distinct(is_referral)) %>%
185 |   arrange(desc(referral_types))
186 | 
187 | counts_deviceid
188 | 
189 | # Are new referred users coming from existing devices?
190 | referrals_using_existing_devices <- data %>%
191 |   filter(device_id %in% counts_deviceid$device_id[counts_deviceid$referral_types > 1], is_referral == 1) %>%
192 |   select(user_id) %>%
193 |   distinct()
194 | 
195 | nrow(referrals_using_existing_devices)
196 | ```
197 | 
198 | There are 8,571 users who show up as new referrals but made purchases on devices that had already been seen with non-referral purchases.
199 | This strongly suggests that users are referring themselves or members of the same family.
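To size the issue, the sketch below (an addition to the original analysis; `suspect_devices` is just an illustrative name) estimates what share of referral revenue comes from devices that also appear with non-referral purchases:

```{r}
# Share of referral revenue from devices already seen with non-referral purchases
suspect_devices <- counts_deviceid$device_id[counts_deviceid$referral_types > 1]

data %>%
  filter(is_referral == 1) %>%
  summarise(total_referral_revenue = sum(money_spent),
            suspect_referral_revenue = sum(money_spent[device_id %in% suspect_devices]),
            suspect_share = suspect_referral_revenue / total_referral_revenue)
```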
200 | 
201 | Although it appears that sales have increased after launch, the new program is cannibalizing sales from existing users. Users are simply creating new accounts and referring themselves or members of their family to take advantage of the $10 credit for a new referral. The jump in sales we witnessed after launch could very well have happened even if the program had not launched.
202 | 
203 | ### Next Steps :
204 | The company should come up with a robust way to test the referral program that controls for unwanted confounders.
205 | If they decide to keep the referral program, they should look into ways to prevent abuse of the program, for example by restricting the number of new user sign-ups allowed from a single device.
206 | 
207 | 
208 | # Question 3:
209 | 
210 | The referral program wasn't really tested in a rigorous way. It simply started on a given day for all users and you are drawing conclusions by looking at the data before and after the test started. What kinds of risks this approach presents? Can you think of a better way to test the referral program and measure its impact?
211 | 
212 | The way the referral program was tested is faulty. The experimental setup does not account for the seasonal nature of sales, and there could be confounding variables in play that affect sales. This approach poses the risk of a false positive: concluding that the new program is causing the jump in sales when in fact it is not.
213 | 
214 | A better way to test the referral program is testing by markets.
215 | Metric : Average daily sales
216 | 
217 | ### Experimental Setup :
218 | 
219 | * Identify pairs of markets or zones that have similar sales and demographic characteristics by looking at the historical data.
220 | * Estimate the sample size (number of days) needed to detect the expected effect size.
221 | * Within each similar pair, randomly select one market to be the test group and launch the referral program there. The other market will be the control group.
222 | * Run the test at the same time for both the test and control groups until the required sample size is reached. Running both at the same time is critical to control for seasonality and other confounders.
223 | * Check whether the difference in sales is significant using a t-test.
224 | 
225 | 
226 | 
227 | 
--------------------------------------------------------------------------------
/User Referral Program/User_Referral_Program.md:
--------------------------------------------------------------------------------
1 | User Referral Program
2 | ================
3 | Siddhartha Jetti
4 | 10/12/2019
5 | 
6 | # Goal
7 | 
8 | Almost all sites have a user referral program: you can invite new users
9 | to try a given product. Typically, after the new user completes a
10 | transaction, you get rewarded with a certain amount of money or credit
11 | to be used on the site.
12 | 
13 | The goal of this challenge is to analyze the data from a referral
14 | program and draw conclusions about its effectiveness.
15 | 
16 | # Challenge Description
17 | 
18 | Company XYZ has started a new referral program on Oct, 31. Each user who
19 | refers a new user will get 10$ in credit when the new user buys
20 | something.
21 | 
22 | The program has been running for almost a month and the Growth Product
23 | Manager wants to know if it’s been successful. She is very excited
24 | cause, since the referral program started, the company saw a spike in
25 | number of users and wants you to be able to give her some data she can
26 | show to her boss.
27 | 28 | - Can you estimate the impact the program had on the site? 29 | - Based on the data, what would you suggest to do as a next step? 30 | - The referral program wasn’t really tested in a rigorous way. It 31 | simply started on a given day for all users and you are drawing 32 | conclusions by looking at the data before and after the test 33 | started. What kinds of risks this approach presents? Can you think 34 | of a better way to test the referral program and measure its impact? 35 | 36 | # Data 37 | 38 | We have one dataset “referral.csv” 39 | 40 | ### Columns: 41 | 42 | - user\_id : the id of the user 43 | - date : date of the purchase 44 | - country : user country based on ip address 45 | - money\_spent : how much the item bought costs (in USD) 46 | - is\_referral : whether the user came from the referral program(1) or 47 | not (0) 48 | - device\_id : It is an identifier of the device used to make the 49 | purchase. You can assume here that for a given device, its id never 50 | changes 51 | 52 | # Problem Setup 53 | 54 | ``` r 55 | # Load required libraries 56 | library(tidyverse) 57 | ``` 58 | 59 | ## Registered S3 methods overwritten by 'ggplot2': 60 | ## method from 61 | ## [.quosures rlang 62 | ## c.quosures rlang 63 | ## print.quosures rlang 64 | 65 | ## ── Attaching packages ──────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ── 66 | 67 | ## ✔ ggplot2 3.1.1 ✔ purrr 0.3.2 68 | ## ✔ tibble 2.1.1 ✔ dplyr 0.8.1 69 | ## ✔ tidyr 0.8.3 ✔ stringr 1.4.0 70 | ## ✔ readr 1.3.1 ✔ forcats 0.4.0 71 | 72 | ## ── Conflicts ─────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ── 73 | ## ✖ dplyr::filter() masks stats::filter() 74 | ## ✖ dplyr::lag() masks stats::lag() 75 | 76 | ``` r 77 | library(ggplot2) 78 | 79 | # Read in the input data into a dataframe 80 | data <- read.csv("referral.csv", stringsAsFactors = F) 81 | ``` 82 | 83 | # Data Exploration 84 | 85 | Explore data 86 | 87 | ``` r 88 | # Check datatypes of all the variables in the dataset 89 | str(data) 90 | ``` 91 | 92 | ## 'data.frame': 97341 obs. of 6 variables: 93 | ## $ user_id : int 2 3 6 7 7 10 17 19 19 19 ... 94 | ## $ date : chr "2015-10-03" "2015-10-03" "2015-10-03" "2015-10-03" ... 95 | ## $ country : chr "FR" "CA" "FR" "UK" ... 96 | ## $ money_spent: int 65 54 35 73 35 36 25 69 17 29 ... 97 | ## $ is_referral: int 0 0 0 0 0 0 0 0 0 0 ... 98 | ## $ device_id : chr "EVDCJTZMVMJDG" "WUBZFTVKXGQQX" "CBAPCJRTFNUJG" "PRGXJZAJKMXRH" ... 99 | 100 | ``` r 101 | summary(data) 102 | ``` 103 | 104 | ## user_id date country money_spent 105 | ## Min. : 1 Length:97341 Length:97341 Min. : 10.00 106 | ## 1st Qu.: 2020 Class :character Class :character 1st Qu.: 27.00 107 | ## Median : 4053 Mode :character Mode :character Median : 42.00 108 | ## Mean : 6355 Mean : 44.69 109 | ## 3rd Qu.:10286 3rd Qu.: 59.00 110 | ## Max. :20000 Max. :220.00 111 | ## is_referral device_id 112 | ## Min. :0.0000 Length:97341 113 | ## 1st Qu.:0.0000 Class :character 114 | ## Median :0.0000 Mode :character 115 | ## Mean :0.2878 116 | ## 3rd Qu.:1.0000 117 | ## Max. :1.0000 118 | 119 | The range of each of the variables appear reasonable without any 120 | non-sensical values. 121 | 122 | ``` r 123 | # check for any missing values in the dataset 124 | colSums(is.na(data)) 125 | ``` 126 | 127 | ## user_id date country money_spent is_referral device_id 128 | ## 0 0 0 0 0 0 129 | 130 | No missing values exist in the dataset. 131 | 132 | Run descriptive stats on the dataset. 
133 | 134 | ``` r 135 | # Total revenue by date 136 | data %>% 137 | mutate(is_referral = as.factor(is_referral), date = as.Date(date)) %>% 138 | ggplot(aes(x = date, y = money_spent)) + 139 | stat_summary(fun.y=sum, geom="line") + 140 | stat_summary(fun.y=sum, geom="point") + 141 | geom_vline(xintercept = as.Date("2015-10-31"), linetype = "dashed", color = "red") 142 | ``` 143 | 144 | ![](User_Referral_Program_files/figure-gfm/unnamed-chunk-4-1.png) 145 | 146 | The above plot shows a jump in daily sales after the referral program is 147 | launched on Oct 31st (dashed line in the plot). 148 | 149 | ``` r 150 | # Split revenue between users from referrals and non=referred users 151 | data %>% 152 | group_by(is_referral) %>% 153 | summarise(unique_users = n_distinct(user_id), total_sales = sum(money_spent), revenue_per_user = sum(money_spent)/n_distinct(user_id)) 154 | ``` 155 | 156 | ## # A tibble: 2 x 4 157 | ## is_referral unique_users total_sales revenue_per_user 158 | ## 159 | ## 1 0 11951 3034612 254. 160 | ## 2 1 12715 1315787 103. 161 | 162 | ``` r 163 | # Total revenue split between referrals and non-referrals 164 | data %>% 165 | mutate(is_referral = as.factor(is_referral), date = as.Date(date)) %>% 166 | ggplot(aes(x = date, y = money_spent, color = is_referral)) + 167 | stat_summary(fun.y=sum, geom="line") + 168 | stat_summary(fun.y=sum, geom="point") + 169 | geom_vline(xintercept = as.Date("2015-10-31"), linetype = "dashed", color = "black") 170 | ``` 171 | 172 | ![](User_Referral_Program_files/figure-gfm/unnamed-chunk-5-1.png) 173 | 174 | The above plot reveals that daily sales from users, who came from 175 | referrals, are helping the overall sales after the launch of program. 176 | However, due to the way the experiment is set up, the jump in sales 177 | cannot be attributed to the referral program. 178 | 179 | # Question 1: 180 | 181 | Can you estimate the impact the program had on the site? 182 | 183 | Here I plan to use daily sales as the metric to assess the impact of 184 | referral program. It is easy to think of other metrics like number of 185 | users who made atleast one purchase, money spent per purchase etc. At 186 | the end of day, total revenue is much more meaningful metric than 187 | others. 188 | 189 | Summarize the data by date to get the daily sales. 190 | 191 | ``` r 192 | data_daily <- data %>% 193 | group_by(date) %>% 194 | summarise(daily_sales = sum(money_spent)) 195 | 196 | data_daily 197 | ``` 198 | 199 | ## # A tibble: 56 x 2 200 | ## date daily_sales 201 | ## 202 | ## 1 2015-10-03 128475 203 | ## 2 2015-10-04 130772 204 | ## 3 2015-10-05 46946 205 | ## 4 2015-10-06 49711 206 | ## 5 2015-10-07 45242 207 | ## 6 2015-10-08 48463 208 | ## 7 2015-10-09 48389 209 | ## 8 2015-10-10 129837 210 | ## 9 2015-10-11 128229 211 | ## 10 2015-10-12 48712 212 | ## # … with 46 more rows 213 | 214 | Compare the daily sales before and after the launch of referral program. 215 | This is done by running T-test on daily sales and check if the 216 | difference is statistically 217 | significant. 
218 | 219 | ``` r 220 | t.test(data_daily$daily_sales[data_daily$date < as.Date("2015-10-31")], data_daily$daily_sales[data_daily$date >= as.Date("2015-10-31")]) 221 | ``` 222 | 223 | ## 224 | ## Welch Two Sample t-test 225 | ## 226 | ## data: data_daily$daily_sales[data_daily$date < as.Date("2015-10-31")] and data_daily$daily_sales[data_daily$date >= as.Date("2015-10-31")] 227 | ## t = -1.1138, df = 53.026, p-value = 0.2704 228 | ## alternative hypothesis: true difference in means is not equal to 0 229 | ## 95 percent confidence interval: 230 | ## -33770.245 9655.459 231 | ## sample estimates: 232 | ## mean of x mean of y 233 | ## 71657.00 83714.39 234 | 235 | The T-test suggests that the difference in daily sales before and after 236 | the launch of program is not statistically significant. 237 | 238 | Check the performance of referral program across different user 239 | segments. 240 | 241 | ``` r 242 | # Compute average daily sales for each country 243 | data_daily_country <- data %>% 244 | mutate(before_after_launch = ifelse(date < as.Date("2015-10-31"), "Before Launch", "After Launch"), 245 | country = as.factor(country)) %>% 246 | group_by(date, country, before_after_launch) %>% 247 | summarise(daily_sales = sum(money_spent)) %>% 248 | group_by(country) %>% 249 | summarise( avg_diff_daily_sales = mean(daily_sales[before_after_launch == "Before Launch"]) - mean(daily_sales[before_after_launch == "After Launch"]), 250 | p_value = t.test(daily_sales[before_after_launch == "Before Launch"], daily_sales[before_after_launch == "After Launch"])$p.value) 251 | 252 | data_daily_country 253 | ``` 254 | 255 | ## # A tibble: 9 x 3 256 | ## country avg_diff_daily_sales p_value 257 | ## 258 | ## 1 CA -412 0.703 259 | ## 2 CH 512. 0.0139 260 | ## 3 DE 1843. 0.163 261 | ## 4 ES -2012. 0.0750 262 | ## 5 FR -3250. 0.0637 263 | ## 6 IT -2542. 0.0515 264 | ## 7 MX -2058. 0.0193 265 | ## 8 UK -2983. 0.0970 266 | ## 9 US -1156. 0.498 267 | 268 | The difference in average daily sales before and after launch are 269 | statistically significant for Mexico MX and China CH. However, sales are 270 | better after launch in MX and worse in CH. 271 | 272 | ``` r 273 | # Plot average daily sales Vs Country 274 | data %>% 275 | mutate(before_after_launch = ifelse(date < as.Date("2015-10-31"), "Before Launch", "After Launch"), 276 | country = as.factor(country)) %>% 277 | group_by(date, country, before_after_launch) %>% 278 | summarise(daily_sales = sum(money_spent)) %>% 279 | group_by(country, before_after_launch) %>% 280 | summarise(avg_daily_sales = mean(daily_sales)) %>% 281 | ggplot(aes(x = country, y = avg_daily_sales, color = before_after_launch, group = before_after_launch)) + 282 | geom_line() + 283 | geom_point() 284 | ``` 285 | 286 | ![](User_Referral_Program_files/figure-gfm/unnamed-chunk-9-1.png) 287 | 288 | The above plot reveals that the referral program is performing 289 | differently across different countries. The average daily sales dropped 290 | after the launch of program in countries like China CH and Germany DE. 291 | While sales appear to increase in other countries after the launch. 292 | 293 | Based on the way the experiment is set up, Any change in sales after the 294 | launch of referral program cannot be attributed to it alone. There are 295 | other confounding factors like seasonality or launch of some marketing 296 | event that could be affecting the sales. 
In US and in some other 297 | countries, Thanks giving (occurs in end of November every year) is one 298 | of the major holiday season and sales are generally expected to grow and 299 | the experiment is not controlling for the lurking variables like that. 300 | 301 | # Question 2: 302 | 303 | Based on the data, what would you suggest to do as a next step? 304 | 305 | From the revenue split between referrals and non-referrals plot, It is 306 | apparent that the sales from the non-referrals, which includes all 307 | existing users who signed up/made purchases before Oct 31st, after 308 | launch of the program are down significantly. 309 | 310 | Lets check if the drop in sales from the non-referrals is significant 311 | before and after launch of program is significant. 312 | 313 | ``` r 314 | nonreferrals_before_vs_after <- data %>% 315 | mutate(before_after_launch = ifelse(date < as.Date("2015-10-31"), "Before Launch", "After Launch")) %>% 316 | filter(is_referral == 0) %>% 317 | group_by(date, before_after_launch) %>% 318 | summarise(daily_sales = sum(money_spent)) 319 | 320 | t.test(nonreferrals_before_vs_after$daily_sales[nonreferrals_before_vs_after$before_after_launch == "Before Launch"], nonreferrals_before_vs_after$daily_sales[nonreferrals_before_vs_after$before_after_launch == "After Launch"]) 321 | ``` 322 | 323 | ## 324 | ## Welch Two Sample t-test 325 | ## 326 | ## data: nonreferrals_before_vs_after$daily_sales[nonreferrals_before_vs_after$before_after_launch == and nonreferrals_before_vs_after$daily_sales[nonreferrals_before_vs_after$before_after_launch == "Before Launch"] and "After Launch"] 327 | ## t = 4.3754, df = 40.085, p-value = 8.417e-05 328 | ## alternative hypothesis: true difference in means is not equal to 0 329 | ## 95 percent confidence interval: 330 | ## 18799.12 51070.88 331 | ## sample estimates: 332 | ## mean of x mean of y 333 | ## 71657 36722 334 | 335 | Clearly, T-test suggests that the difference in sales from non-referrals 336 | before and after the launch of program is statistically significant. 337 | 338 | It is quite possible that new program is cannibalizing sales from the 339 | existing users. Lets check if that is the case. 340 | 341 | ``` r 342 | # Check if multiple users are using same device 343 | counts_deviceid <- data %>% 344 | group_by(device_id) %>% 345 | summarise(n_users = n_distinct(user_id), referral_types = n_distinct(is_referral)) %>% 346 | arrange(desc(referral_types)) 347 | 348 | counts_deviceid 349 | ``` 350 | 351 | ## # A tibble: 17,887 x 3 352 | ## device_id n_users referral_types 353 | ## 354 | ## 1 AAEIJZQUXUETO 1 2 355 | ## 2 AAFWFCPHRVTTE 2 2 356 | ## 3 AALHUFWNVROUD 2 2 357 | ## 4 AANBNDYFXDDTU 2 2 358 | ## 5 AANMQLIXUDKFJ 2 2 359 | ## 6 AAPOSGLGOGVMI 2 2 360 | ## 7 AAUOIIUZHHWTL 1 2 361 | ## 8 ABABAWCOLPZXO 2 2 362 | ## 9 ABADIUGCJRDOV 2 2 363 | ## 10 ABAPCEOKZCDAH 1 2 364 | ## # … with 17,877 more rows 365 | 366 | ``` r 367 | # Are new referred users coming from existing devices 368 | referrals_using_existing_devices <- data %>% 369 | filter(device_id %in% counts_deviceid$device_id[counts_deviceid$referral_types > 1], is_referral == 1) %>% 370 | select(user_id) %>% 371 | distinct() 372 | 373 | nrow(referrals_using_existing_devices) 374 | ``` 375 | 376 | ## [1] 8571 377 | 378 | There are about 8571 users that were showing up as new referrals, but 379 | were using devices seen earlier. Clearly, users are referring themselves 380 | or members of same family. 
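One way to quantify how much this matters (a follow-up sketch, not part of the original analysis; `suspect_devices` is an illustrative name) is to repeat the before/after comparison of daily sales after dropping purchases made by referred users on devices that also appear with non-referral purchases:

``` r
# Before/after comparison of daily sales excluding suspected self-referral purchases
suspect_devices <- counts_deviceid$device_id[counts_deviceid$referral_types > 1]

daily_sales_excl_suspect <- data %>%
  filter(!(is_referral == 1 & device_id %in% suspect_devices)) %>%
  group_by(date) %>%
  summarise(daily_sales = sum(money_spent))

t.test(daily_sales_excl_suspect$daily_sales[daily_sales_excl_suspect$date < as.Date("2015-10-31")],
       daily_sales_excl_suspect$daily_sales[daily_sales_excl_suspect$date >= as.Date("2015-10-31")])
```

If the remaining difference is no longer significant, the apparent lift is largely explained by the suspected self-referrals.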
381 | 382 | Although it appears that sales have increased after launch, The new 383 | program is cannibalizing the sales from the existing users. Users are 384 | simply creating new accounts and referring themselves or members of 385 | family to take advantage of the $10 credit for a new referral. The jump 386 | in sales we witnessed after launch could very well have happened even if 387 | the program would not have launched. 388 | 389 | ### Next Steps : 390 | 391 | The company should come up with a robust way to test the referral 392 | program that controls for unwanted confounders. If they decide to launch 393 | the referral program, They should look into ways to prevent the abuse of 394 | program by restricting number new user sign-ups from a single device. 395 | 396 | # Question 3: 397 | 398 | The referral program wasn’t really tested in a rigorous way. It simply 399 | started on a given day for all users and you are drawing conclusions by 400 | looking at the data before and after the test started. What kinds of 401 | risks this approach presents? Can you think of a better way to test the 402 | referral program and measure its impact? 403 | 404 | The way the referral program was tested is faulty. The experimental set 405 | up is not accounting for the seasonal nature of sales and there could be 406 | confounding variables in play that affect the sales. This approach poses 407 | the risk of false positive, concluding that the new program is causing 408 | the jump in sales when in fact it is not. 409 | 410 | A better way to test the referral program is Testing by Markets. Metric 411 | : Average Daily sales 412 | 413 | ### Experimental Setup : 414 | 415 | - Identify the pairs of markets or zones that have similar sales and 416 | demographic characteristics by looking at the historical data. 417 | - Estimate the sample size (number of days) needed to detect the 418 | expected effect size. 419 | - Now among the similar pair, randomly select one to be test group and 420 | launch the referral program. The other will be the control group. 421 | - Run the test at the sametime for both test and contol groups until 422 | required sample size is reached. Running the test at the sametime is 423 | critical to control for the seasonality and other confounders. 424 | - Check if difference in sales is significant using a T-test. 
425 | -------------------------------------------------------------------------------- /User Referral Program/User_Referral_Program_files/figure-gfm/unnamed-chunk-4-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/User Referral Program/User_Referral_Program_files/figure-gfm/unnamed-chunk-4-1.png -------------------------------------------------------------------------------- /User Referral Program/User_Referral_Program_files/figure-gfm/unnamed-chunk-5-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/User Referral Program/User_Referral_Program_files/figure-gfm/unnamed-chunk-5-1.png -------------------------------------------------------------------------------- /User Referral Program/User_Referral_Program_files/figure-gfm/unnamed-chunk-9-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/siddhujetty/Product-analytics-insights-collection/bc93bc92ccfc27e5d7c24afab427cd188915a8ca/User Referral Program/User_Referral_Program_files/figure-gfm/unnamed-chunk-9-1.png --------------------------------------------------------------------------------