├── Images
│   ├── after_normalization.png
│   ├── age_vs_daily_internet_usage.png
│   ├── age_vs_daily_time_spent_on_site.png
│   ├── before_normalization.png
│   ├── bivariate_cats.png
│   ├── bivariate_nums.png
│   ├── bivariate_nums2.png
│   ├── clicked_on_ad.png
│   ├── confusion_matrix_best_model.png
│   ├── corr_cats.png
│   ├── corr_nums.png
│   ├── corr_nums_target.png
│   ├── daily_internet_usage_vs_daily_spent_time_on_site.png
│   ├── data_exploration.png
│   ├── desc_categorical.png
│   ├── desc_numerical.png
│   ├── duplicates.png
│   ├── feature_importance.png
│   ├── feature_importance2.png
│   ├── header.png
│   ├── learning_curve_best_model.png
│   ├── null_values.png
│   ├── null_values_after.png
│   ├── outliers_after.png
│   ├── outliers_before.png
│   ├── uni_cats.png
│   ├── uni_nums.png
│   └── uni_nums2.png
├── Predicting_Ad_Clicks_Classification_by_Using_Machine_Learning.ipynb
├── README.md
└── requirements.txt
/Images/after_normalization.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/after_normalization.png
--------------------------------------------------------------------------------
/Images/age_vs_daily_internet_usage.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/age_vs_daily_internet_usage.png
--------------------------------------------------------------------------------
/Images/age_vs_daily_time_spent_on_site.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/age_vs_daily_time_spent_on_site.png
--------------------------------------------------------------------------------
/Images/before_normalization.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/before_normalization.png
--------------------------------------------------------------------------------
/Images/bivariate_cats.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/bivariate_cats.png
--------------------------------------------------------------------------------
/Images/bivariate_nums.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/bivariate_nums.png
--------------------------------------------------------------------------------
/Images/bivariate_nums2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/bivariate_nums2.png
--------------------------------------------------------------------------------
/Images/clicked_on_ad.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/clicked_on_ad.png
--------------------------------------------------------------------------------
/Images/confusion_matrix_best_model.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/confusion_matrix_best_model.png
--------------------------------------------------------------------------------
/Images/corr_cats.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/corr_cats.png
--------------------------------------------------------------------------------
/Images/corr_nums.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/corr_nums.png
--------------------------------------------------------------------------------
/Images/corr_nums_target.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/corr_nums_target.png
--------------------------------------------------------------------------------
/Images/daily_internet_usage_vs_daily_spent_time_on_site.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/daily_internet_usage_vs_daily_spent_time_on_site.png
--------------------------------------------------------------------------------
/Images/data_exploration.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/data_exploration.png
--------------------------------------------------------------------------------
/Images/desc_categorical.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/desc_categorical.png
--------------------------------------------------------------------------------
/Images/desc_numerical.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/desc_numerical.png
--------------------------------------------------------------------------------
/Images/duplicates.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/duplicates.png
--------------------------------------------------------------------------------
/Images/feature_importance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/feature_importance.png
--------------------------------------------------------------------------------
/Images/feature_importance2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/feature_importance2.png
--------------------------------------------------------------------------------
/Images/header.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/header.png
--------------------------------------------------------------------------------
/Images/learning_curve_best_model.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/learning_curve_best_model.png
--------------------------------------------------------------------------------
/Images/null_values.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/null_values.png
--------------------------------------------------------------------------------
/Images/null_values_after.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/null_values_after.png
--------------------------------------------------------------------------------
/Images/outliers_after.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/outliers_after.png
--------------------------------------------------------------------------------
/Images/outliers_before.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/outliers_before.png
--------------------------------------------------------------------------------
/Images/uni_cats.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/uni_cats.png
--------------------------------------------------------------------------------
/Images/uni_nums.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/uni_nums.png
--------------------------------------------------------------------------------
/Images/uni_nums2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/farrellwahyudi/Predicting-Ad-Clicks-Classification-by-Using-Machine-Learning/2c7db2e2dc82bec2a0b7d609155130a662455100/Images/uni_nums2.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Predicting Ad Clicks: Classification by Using Machine Learning
2 |
3 |
4 | ## Background
5 | An Indonesian company is interested in evaluating the performance of the advertisements on its website. This evaluation is essential, as it helps the company gauge an advertisement's reach and its ability to engage customers.
6 |
7 | By analyzing historical advertisement data and uncovering insights and trends, the company can better define its marketing objectives. In this specific case, the primary emphasis lies in developing a machine learning classification model that can accurately identify the target customer demographic.
8 |
9 | ## Problem
10 | 1. The company currently displays ads to all of its users. This "shotgun" strategy yields an ad click rate of only 50%.
11 | 2. By advertising to every user, the company spends a suboptimal amount of resources.
12 |
13 | ## Objectives
14 | 1. Create well-fit machine learning models that can reliably predict which users are likely to click on an ad.
15 | 2. Determine the best machine learning model to implement in this use case based on evaluation metrics (Recall and Accuracy), model simplicity, and prediction time.
16 | 3. Identify the factors that most influence users’ likelihood to click on an ad.
17 | 4. Provide recommendations on potential targeted-advertising strategies to the marketing and management teams, based on findings from the analyses and modeling.
18 | 5. Calculate the potential impact of model implementation on profit and click rate.
19 |
20 | ## About the Dataset
21 | The dataset was obtained from [Rakamin Academy](https://www.rakamin.com/).
22 |
23 | **Description:**
24 |
25 | - `Unnamed: 0` = ID of the customer
26 | - `Daily Time Spent on Site` = Time spent by the user on the site, in minutes
27 | - `Age` = Customer's age in years
28 | - `Area Income` = Average income of the consumer's geographical area
29 | - `Daily Internet Usage` = Average minutes per day the consumer spends on the internet
30 | - `Male` = Gender of the customer
31 | - `Timestamp` = Time at which the user clicked on the ad or closed the window
32 | - `Clicked on Ad` = Whether or not the customer clicked on the ad (target variable)
33 | - `city` = City of the consumer
34 | - `province` = Province of the consumer
35 | - `category` = Category of the advertisement
36 |
37 | **Overview:**
38 |
39 | 1. The dataset contains 1000 rows, 10 features and one `Unnamed: 0` column, which is the ID column.
40 | 2. The dataset consists of 3 data types: float64, int64 and object.
41 | 3. The `Timestamp` feature can be converted to the datetime data type.
42 | 4. The dataset contains null values in various columns.
43 |
44 | ## Data Analysis
45 | ### Univariate
46 |
47 | KDE Plot of Numerical Features
48 |
49 |
50 | Boxplot of Numerical Features
51 |
52 | **Analysis:**
53 |
54 | - `Area Income` is the only feature with a slight skew (left-skewed).
55 | - `Daily Internet Usage` is nearly uniformly distributed.
56 | - `Age` and `Daily Time Spent on Site` are nearly normally distributed.
57 |
58 | ### Bivariate
59 |
60 | KDE Plot of Numerical Features Between Users That Clicked and Didn’t Click On Ad
61 |
62 |
63 |
64 | Boxplot of Numerical Features Between Users That Clicked and Didn’t Click On Ad
65 |
66 |
67 | ### Scatterplot
68 |
69 |
70 |
71 |
72 |
73 | **Analysis:**
74 |
75 | - The more time a customer spends on the site, the less likely they are to click on an ad.
76 | - The average age of customers who clicked on an ad is 40, while the average for those who didn't is 31.
77 | - The average area income of customers who clicked on an ad is considerably lower than that of those who didn't.
78 | - Similar to time spent on site, the higher the daily internet usage, the less likely the customer is to click on an ad.
79 | - As can be seen in the last scatterplot above, there is a fairly clear separation between two clusters: one less active and the other more so. Less active customers have a higher tendency to click on an ad than more active customers.
80 |
81 | ### Multivariate (Numerical)
82 |
83 |
84 | Since the target variable is binary (clicked or didn't click), the standard Pearson correlation isn't appropriate for measuring its relationship with the numerical features. Hence, the **Point Biserial correlation** was used to analyze the correlation between the target and the numerical features.
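As a sketch of how this can be applied, `scipy.stats.pointbiserialr` computes the coefficient directly; the numbers below are invented stand-ins for the real `Daily Internet Usage` and `Clicked on Ad` columns.

```python
# Point-biserial correlation between a binary target and a numeric feature.
# Toy values; on the real data the inputs would be the dataframe columns.
from scipy import stats

daily_usage = [256.1, 193.7, 236.5, 120.4, 110.2, 132.9]  # minutes online
clicked     = [0,     0,     0,     1,     1,     1]      # binary target

r, p_value = stats.pointbiserialr(clicked, daily_usage)
print(round(r, 3))  # strongly negative: low usage goes with clicking
```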
85 |
86 |
87 |
88 | ### Multivariate (Categorical)
89 | To measure the correlation between categorical features, the standard Pearson correlation again couldn't be employed. There are numerous methods available, but in this study **Cramer's V** was used to quantify the association, and by extension the correlation, between categorical features.
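A minimal implementation of Cramer's V on top of `scipy.stats.chi2_contingency` might look like this; the city/province values are invented to mimic a perfect association.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Association between two categorical variables, scaled to [0, 1]."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))

# Toy data where city fully determines province.
city     = pd.Series(["Jakarta", "Bandung", "Jakarta", "Surabaya"] * 10)
province = pd.Series(["DKI", "Jabar", "DKI", "Jatim"] * 10)
v = cramers_v(city, province)
print(v)  # → 1.0 (perfect association)
```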
90 |
91 |
92 |
93 | **Analysis:**
94 |
95 | - The perfect correlation coefficient of 1 indicates that `city` and `province` are perfectly associated. This makes sense: if you know the city, you also know the province. Using both features in machine learning modeling is therefore redundant.
96 | - All the numerical features (especially `Daily Internet Usage` and `Daily Time Spent on Site`) correlate highly with the target variable.
98 | ## Data Preprocessing
99 | ### Handling Missing Values
100 | Missing or null values were present in several columns. They were imputed with central-tendency statistics, using the mean or median as appropriate for each column's distribution.
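A sketch of that imputation with pandas; the column names mirror the dataset, while the values and the per-column choice of mean vs. median are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Daily Time Spent on Site": [68.9, 80.2, None, 74.2],
    "Area Income":              [61833.9, None, 59785.9, 54806.2],
})

# Mean suits the roughly normal time-on-site feature; median is more
# robust for the skewed Area Income feature.
df["Daily Time Spent on Site"] = df["Daily Time Spent on Site"].fillna(
    df["Daily Time Spent on Site"].mean())
df["Area Income"] = df["Area Income"].fillna(df["Area Income"].median())

print(df.isnull().sum().sum())  # → 0
```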
101 |
102 | ### Feature Extraction
103 | The `Timestamp` feature was stored as the object data type. It was converted to the datetime data type so that its components could be extracted. Three extra features were extracted from it:
104 |
105 | 1. Month
106 | 2. Week
107 | 3. Day (Mon-Sun)
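The extraction can be sketched with the pandas datetime accessors (the timestamps below are invented):

```python
import pandas as pd

df = pd.DataFrame({"Timestamp": ["2016-03-27 00:53:11", "2016-06-03 21:43:05"]})
df["Timestamp"] = pd.to_datetime(df["Timestamp"])  # object -> datetime64

df["Month"] = df["Timestamp"].dt.month
df["Week"] = df["Timestamp"].dt.isocalendar().week
df["Day"] = df["Timestamp"].dt.day_name()  # Monday ... Sunday

print(df[["Month", "Week", "Day"]].to_string(index=False))
```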
108 |
109 | ### Handling Outliers
110 | Outliers were present in the `Area Income` feature. To make the data more "model friendly" for linear and distance-based models, these outliers were removed using the IQR method.
111 |
112 | **Before:**
113 |
114 |
115 |
116 | **After:**
117 |
118 |
119 |
120 | ### Feature Selection
121 | From the multivariate analysis it can be seen that `city` and `province` are redundant. Moreover, both features have high cardinality and correlate poorly with the target variable. As a result, both were excluded from modeling to avoid the curse of dimensionality. `Timestamp` was also excluded, since its contents had been extracted and it was no longer needed. `Unnamed: 0` (the ID) was excluded because it is an identifier column, unique for every user.
122 |
123 | ### Feature Encoding
124 | The categorical features were encoded so that the machine learning models could read their values. The `category` feature was one-hot encoded, while the rest (gender/`Male` and `Clicked on Ad`) were label encoded.
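A hypothetical sketch of that step with pandas; the category names and gender labels are invented for illustration, and a simple value mapping stands in for label encoding.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["Otomotif", "Finance", "Travel"],
    "Male": ["Perempuan", "Laki-Laki", "Laki-Laki"],
    "Clicked on Ad": ["No", "Yes", "Yes"],
})

df = pd.get_dummies(df, columns=["category"])  # one-hot encoding
df["Male"] = df["Male"].map({"Perempuan": 0, "Laki-Laki": 1})
df["Clicked on Ad"] = df["Clicked on Ad"].map({"No": 0, "Yes": 1})

print(sorted(df.columns))
```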
125 |
126 | ### Dataset Splitting
127 | The data was split into training and testing sets. This is common practice to ensure that the models are evaluated on data they have not seen, rather than simply memorizing answers. The split was 75:25, i.e. 75% training data and 25% testing data.
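The 75:25 split can be sketched with scikit-learn's `train_test_split` (stand-in arrays; the `random_state` value is arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)  # stand-in feature matrix
y = np.array([0, 1] * 500)          # balanced stand-in target

# Stratify to preserve the 50/50 class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

print(len(X_train), len(X_test))  # → 750 250
```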
128 |
129 | ## Modeling
130 | In the modeling phase, an experiment was conducted: machine learning models were trained and tested on a non-normalized/non-standardized version of the data, and the results were compared against models trained and tested on a normalized/standardized version. Hyperparameter tuning was performed in both scenarios to get the best out of each model, and the model and preprocessing method with the best results were selected. Six models were tried:
131 |
132 | - Logistic Regression
133 | - Decision Tree
134 | - Random Forest
135 | - K-Nearest Neighbors
136 | - Gradient Boosting
137 | - XGBoost
138 |
139 |
140 | Target Variable Class Balance
141 |
142 | Because the target has perfect class balance, the primary metric used is Accuracy. Recall is the secondary metric, chosen to minimize false negatives.
143 |
144 | ### Without Normalization/Standardization
145 | **Results:**
146 |
147 |
148 |
149 | **Observation:**
150 | - `Decision Tree` had the lowest fit time of all the models but the second-lowest accuracy overall.
151 | - `Gradient Boosting` had the highest accuracy and recall scores, with `XGBoost` not far behind.
152 | - On the non-normalized data, distance-based algorithms like `K-Nearest Neighbours` and linear algorithms like `Logistic Regression` suffered heavily.
153 | - `Logistic Regression` could not converge properly using newton-cg and as a result had the highest fit time of all the models, even though it is probably the simplest of them all.
154 | - `K-Nearest Neighbours` suffered in accuracy and recall, with both scores by far the lowest of all the models tested.
155 |
156 | ### Using Normalization/Standardization
157 | **Results:**
158 |
159 |
160 |
161 | **Observation:**
162 | - `K-Nearest Neighbours` had the highest fit time and elapsed time.
163 | - `Gradient Boosting` had the highest accuracy and the highest recall (tied with `Random Forest`), but `Logistic Regression`, in close second, had the highest cross-validated accuracy of all the models tested.
164 | - `Random Forest` and `XGBoost` also had nearly identical scores in close third and fourth, although `XGBoost` had the better fit and elapsed times.
165 | - With normalized data, the previously poor-performing distance-based and linear models improved dramatically.
166 | - `Logistic Regression`'s fit and elapsed times dropped significantly, making it the model with the lowest times. Its scores also improved massively, making it a close second-place model.
167 | - `K-Nearest Neighbours` also saw a massive improvement in its scores and no longer sits in last place.
168 |
169 | Taking into account not only the above metrics but also simplicity, explainability, and fit/elapsed times, the chosen model is `Logistic Regression` with normalization/standardization: it combines very high scores (especially cross-validated scores) with simplicity, explainability, and relatively quick fit and elapsed times.
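A minimal sketch of the chosen setup, with scaling and Logistic Regression chained in a Pipeline so the scaler is fit on the training data only. The synthetic data and near-default hyperparameters are stand-ins, not the tuned configuration from the study.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the ad-click data.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)  # accuracy on the held-out 25%
print(round(acc, 3))
```

Wrapping the scaler in the pipeline also prevents test-set statistics from leaking into training, which matters when cross-validating.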
170 |
171 | ### Evaluation of Selected Model (Logistic Regression)
172 | **Learning Curve:**
173 |
174 |
175 | As can be seen from the learning curve above, the tuned `Logistic Regression` model with normalization/standardization is well fitted, with no overfitting or underfitting.
176 |
177 |
178 | **Confusion Matrix:**
179 |
180 |
181 |
182 | From the test-set confusion matrix above: of the 120 people who clicked on an ad, the model correctly classified 116 and misclassified 4. Similarly, of the 128 people who did not click on an ad, the model correctly classified 124 and misclassified only 4.
183 |
184 | Based on the confusion matrix and the learning curve, `Logistic Regression` is a more than capable model for this dataset.
185 |
186 | **Feature Importance:**
187 |
188 | Since `Logistic Regression` is such a simple and explainable model, the feature importance can be read directly from the coefficient of each feature in the model.
189 |
190 | A coefficient represents the change in the log-odds for a one-unit change in the feature. A larger absolute value indicates a stronger relationship between the feature and the target variable, while the sign (positive or negative) indicates the direction of that relationship.
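That reading of the coefficients can be sketched as follows. The feature names are real, but the data and the fitted values are illustrative: the toy target is constructed so that low `Daily Internet Usage` implies a click, echoing the EDA finding.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Daily Time Spent on Site": rng.normal(65, 15, 300),
    "Daily Internet Usage": rng.normal(180, 40, 300),
    "Age": rng.normal(36, 9, 300),
})
y = (X["Daily Internet Usage"] < 180).astype(int)  # toy target

clf = LogisticRegression(max_iter=1000).fit(X, y)
importance = pd.Series(clf.coef_[0], index=X.columns)
ranked = importance.sort_values(key=abs, ascending=False)
print(ranked)  # usage dominates, with a negative sign
```

Note that coefficient magnitudes are directly comparable across features only when the features share a scale, which is another benefit of the normalization step used here.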
191 |
192 |
193 |
194 |
195 | **Analysis:**
196 |
197 | Based on the feature importance charts above, the two features with the most effect on the model are clearly `Daily Time Spent on Site` and `Daily Internet Usage`.
198 | - The lower the `Daily Time Spent on Site`, the higher the odds that the customer will click on an ad, and vice versa.
199 | - Similarly, the lower the `Daily Internet Usage`, the higher the chance that the customer will click on an ad, and vice versa.
200 | - Other important features include `Area Income` and `Age`.
201 |
202 | ## Business Recommendations
203 | Based on the insights that have been gathered in the EDA as well as the feature importance from the model, the following business recommendations are formulated.
204 |
205 | - **Content Personalization and Targeting:**
206 | Since users who spend less time on the site are more likely to click on an ad, it's essential to focus on content personalization and user engagement. Tailor content to keep users engaged but not overloaded, for example by recommending relevant content and using user data to customize the experience.
207 | - **Age-Targeted Advertising:**
208 | Older individuals are more likely to engage with ads. Therefore we can consider creating ad campaigns that are specifically designed to target and appeal to older demographics. This may include promoting products or services relevant to their age group.
209 | - **Income-Level Targeting:**
210 | Users in areas with lower income levels are more likely to click on ads. Therefore we can create ad campaigns that are budget-friendly and appealing to users with lower income. Additionally, consider tailoring the ad messaging to highlight cost-effective solutions.
211 | - **Optimize Ad Placement for Active Internet Users:**
212 | Heavy internet users are less responsive to ads. To improve ad performance, consider optimizing ad placement for users with lower internet usage or finding ways to make ads stand out to this group, such as through eye-catching visuals or unique offers.
213 |
214 | ## Potential Impact Simulation
215 |
216 | Using the original dataset's `Clicked on Ad` numbers shown above, the business simulation before and after model implementation is as follows:
217 |
218 |
219 | **Assumption:**
220 |
221 | Cost per Advertisement: Rp.1000
222 |
223 | Revenue per Ad clicked: Rp.4000
224 |
225 | **Before model implementation:**
226 |
227 | - **No. Users Advertised**:
228 | Every User = 1000
229 | - **Click Rate**:
230 | 500/1000 = 50%
231 | - **Total Cost**:
232 | No. Users Advertised x Cost per Ad = 1000 x 1000 = Rp.1,000,000
233 | - **Total Revenue**:
234 | Click Rate x No. Users Advertised x Revenue per Ad Clicked = 0.5 x 1000 x 4000 = Rp.2,000,000
235 | - **Total Profit**:
236 | Total Revenue - Total Cost = **Rp.1,000,000**
237 |
238 | **After model implementation:**
239 |
240 | - **No. Users Advertised**:
241 | (Recall x 500) + ((1 - Specificity) x 500) = (96.67% x 500) + (3.125% x 500) = 483 + 16 = 499
242 | - **Click Rate**:
243 | (Recall x 500) / No. Users Advertised = 483/499 = 96.8%
244 | - **Total Cost**:
245 | No. Users Advertised x Cost per Ad = 499 x 1000 = Rp.499,000
246 | - **Total Revenue**:
247 | Click Rate x No. Users Advertised x Revenue per Ad Clicked = 0.968 x 499 x 4000 = Rp.1,932,000
248 | - **Total Profit**:
249 | Total Revenue - Total Cost = 1,932,000 - 499,000 = **Rp.1,433,000**
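The arithmetic above can be reproduced in a few lines of Python (all figures in rupiah, using the assumptions stated earlier; 96.67% is the share of actual clickers the model catches and 3.125% the share of non-clickers it wrongly flags):

```python
COST_PER_AD = 1_000
REVENUE_PER_CLICK = 4_000

# Before: advertise to all 1000 users; 500 of them click.
users_before = 1_000
clicks_before = 500
profit_before = clicks_before * REVENUE_PER_CLICK - users_before * COST_PER_AD

# After: advertise only to predicted clickers.
true_positives = round(0.9667 * 500)            # 483 clickers reached
false_positives = round((1 - 0.96875) * 500)    # 16 non-clickers reached
users_after = true_positives + false_positives  # 499 users advertised
profit_after = true_positives * REVENUE_PER_CLICK - users_after * COST_PER_AD

print(profit_before, profit_after)  # → 1000000 1433000
```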
250 |
251 | **Conclusion:**
252 |
253 | Comparing profits and click rates before and after model implementation: with the model, the click rate rises from **50%** to **96.8%**, and profit rises from **Rp.1,000,000** to **Rp.1,433,000** (a **43.3%** increase).
254 |
255 |
256 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | pandas==1.5.3
2 | numpy==1.23.5
3 | matplotlib==3.7.1
4 | seaborn==0.13.0
5 | scipy==1.11.3
6 | scikit-learn==1.2.2
7 | xgboost==2.0.1
--------------------------------------------------------------------------------