├── README.md └── av_jobathon_code_file.ipynb /README.md: -------------------------------------------------------------------------------- 1 | **Introduction** 2 | 3 | [On February 11th, 2022, Analytics Vidhya conducted a 3-day hackathon in data science](https://datahack.analyticsvidhya.com/contest/job-a-thon-february-2022/?utm_source=datahack&utm_medium=navbar#About). The top candidates had the chance to be selected by various participating companies for a multitude of roles in the fields of data science, analytics and business intelligence. 4 | 5 | The objective of the hackathon was to develop a machine learning approach to predict the engagement between a user and a video. The training data already had an engagement score feature, a floating-point number between 0 and 5. This considerably simplified matters: in recommender systems, calculating the engagement score is often more challenging than predicting it. The challenge, therefore, was to predict this score based on certain user-related and video-related features. 6 | 7 | The list of features in the dataset is given below: 8 | 9 | | **Variable** | **Description** | 10 | | --- | --- | 11 | | row\_id | Unique identifier of the row | 12 | | user\_id | Unique identifier of the user | 13 | | category\_id | Category of the video | 14 | | video\_id | Unique identifier of the video | 15 | | age | Age of the user | 16 | | gender | Gender of the user (Male or Female) | 17 | | profession | Profession of the user (Student, Working Professional, Other) | 18 | | followers | No. of users following a particular category | 19 | | views | Total views of the videos present in the particular category | 20 | | engagement\_score | Engagement score of the video for a user | 21 | 22 | **Initial Ideas** 23 | 24 | Two main approaches were considered: 25 | 26 | - **Regression Models** : Since the engagement score was a continuous variable, one could use a regression model. There were several reasons I did not use this method: 27 | 1. The lack of features on which to build the model. The features "user\_id", "category\_id" and "video\_id" were discrete features that would need to be encoded. Since each of these features had many unique values, a simple One-Hot Encoding would not work due to the explosion in the number of dimensions. I also considered other categorical encoders, like the CatBoost encoder; however, I've found that most categorical encoders only do well when the target is a categorical variable, not a continuous one as in this case. This would leave only 5 features on which to build the model, which never felt like enough. Those 5 features also offered very little scope for feature engineering, apart from some ideas on combining gender with profession and age with profession. 28 | 2. The fact that traditional regression models have never done well as recommender systems. The now-famous Netflix Prize clearly showed the advantage collaborative filtering methods had over simple regressor models. 29 | 30 | - **Collaborative Filtering** : This was the option I eventually went with. There was enough evidence that collaborative filtering beats regression models in almost all cases. Of the many ways of implementing collaborative filtering, I finally decided on the Matrix Factorization method (SVD). 31 | 32 | **Final Model** 33 | 34 | The first few runs of collaborative filtering were not very successful, with a low r2 score on the test set. 
However, running a Grid Search over the 3 main hyperparameters – the number of factors, the number of epochs and the learning rate – soon gave the optimal SVD. The final hyperparameters were: 35 | 36 | - Number of Factors: 100 37 | - Number of Epochs: 500 38 | - Learning Rate: 0.05 39 | 40 | The r2 score on the test set at this point was 0.506, which took me to the top of the leaderboard. 41 | 42 | The next step was to try to improve the model further. I decided to model the errors of the SVD model, so that the predictions of the SVD could be further adjusted by the error estimates. After trying a few different models, I selected the Linear Regression model to predict the errors. 43 | 44 | To generate the errors, the SVD model first had to be trained on only a subset of the training set, so that its errors could be measured on a held-out validation set; the Linear Regression was then trained on this validation set. The SVD was trained on 95% of the training set; the regression, therefore, was trained on only 5% of the entire training set. The steps in the process were (a condensed sketch of this pipeline is given at the end of this README): 45 | 46 | 1. Get the engagement score predictions using the SVD model for the validation set. 47 | 2. Calculate the error. 48 | 3. Train the Linear Regression on this validation set, using the features "age", "followers", "views", "gender", "profession" and "initial\_estimate". The target variable was the error. 49 | 4. Finally, run both models on the actual test set, first the SVD, then the Linear Regression. 50 | 5. The final prediction is the difference between the initial estimates and the weighted error estimates. The error estimates were given a weight of 5%, since that was the proportion of data on which the Linear Regression model was trained. 51 | 6. The final prediction could go above 5 or below 0; in such cases, clip the prediction to 5 or 0 respectively. 52 | 53 | The final r2 score was 0.532, an increase of 0.026. 54 | 55 | **Ideas for Improvement** 56 | 57 | There are many ways I feel the model can be further improved. Some of them are: 58 | 59 | 1. **Choosing the Correct Regression Model to Predict the Error** : It was quite unexpected that a weak learner like Linear Regression did better than stronger models like Random Forest and XGBoost. I feel the main reason for this is that the dataset used to train these regressors was quite small, only 5% of the entire training set. While the linear regression model worked well with such a small dataset, the more complicated models did not. 60 | 2. **Setting the Correct Subset for the SVD** : After trying a few different values, the SVD subset was set at 95%, while the error subset was set at only 5%. The reason for setting such a high percentage was that the SVD was the more powerful algorithm and I wanted it to be as accurate as possible. However, this severely compromised the error predictor. Finding the right balance here could improve the model's performance. 61 | 3. **Selecting the Correct Weights for the Final Prediction** : The final prediction was the difference between the initial estimate and the weighted error estimate. Further analysis is needed to find the optimal weights. Ideally, the weights should not be needed at all. 62 | 4. **Feature Engineering** : The error estimator had no feature engineering at all; in fact, I removed the feature "category\_id" as well. Adding new features could potentially improve the error estimates; however, the benefits would be small, as the error estimate accounts for only 5% of the final prediction. 
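
**Pipeline Sketch**

For concreteness, below is a minimal sketch of the full pipeline, using the `surprise` and `scikit-learn` libraries as in the accompanying notebook. It is illustrative rather than authoritative: the dummy column names in `FEATURES` are inferred from the category values listed in the data description above, and the sketch assumes each (user, video) pair appears only once in the training data. See `av_jobathon_code_file.ipynb` for the actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split
from surprise.prediction_algorithms import SVD

SVD_SUBSET = 0.95  # share of the training data used to fit the SVD

train = pd.read_csv('train_0OECtn8.csv')
test = pd.read_csv('test_1zqHu22.csv')

# Fit the SVD on 95% of the training data; hold out 5% for the error model.
reader = Reader(rating_scale=(0, 5))
data = Dataset.load_from_df(train[['user_id', 'video_id', 'engagement_score']], reader)
svd_train, holdout = train_test_split(data, test_size=1 - SVD_SUBSET)
svd = SVD(n_factors=100, n_epochs=500, lr_all=0.05)
svd.fit(svd_train)

# Steps 1-2: predict on the held-out 5% and compute the signed error.
preds = svd.test(holdout)
hold_df = pd.DataFrame(
    [(p.uid, p.iid, p.r_ui, p.est) for p in preds],
    columns=['user_id', 'video_id', 'engagement_score', 'initial_estimate'],
)
hold_df['error'] = hold_df['initial_estimate'] - hold_df['engagement_score']
# Join the user/video features back onto the held-out rows.
hold_df = hold_df.merge(train.drop(columns=['engagement_score']), on=['user_id', 'video_id'])

# Step 3: train a linear model to predict the SVD's error from the remaining features.
hold_df = pd.get_dummies(hold_df, columns=['gender', 'profession'], drop_first=True)
FEATURES = ['age', 'followers', 'views', 'gender_Male', 'profession_Student',
            'profession_Working Professional', 'initial_estimate']
error_model = LinearRegression().fit(hold_df[FEATURES], hold_df['error'])

# Steps 4-6: on the test set, subtract the weighted error estimate and clip to [0, 5].
test_X = pd.get_dummies(test, columns=['gender', 'profession'], drop_first=True)
test_X['initial_estimate'] = [
    svd.predict(u, v).est for u, v in zip(test['user_id'], test['video_id'])
]
error_estimate = error_model.predict(test_X[FEATURES])
test['engagement_score'] = np.clip(
    test_X['initial_estimate'] - (1 - SVD_SUBSET) * error_estimate, 0, 5
)
```

Note that the sketch builds the held-out frame directly from the Surprise predictions before joining the features back on, which guarantees the predictions stay aligned with their rows regardless of how the split shuffled the data.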
63 | -------------------------------------------------------------------------------- /av_jobathon_code_file.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "75ad875b-1cbc-4413-beaa-4c84399696b0", 6 | "metadata": {}, 7 | "source": [ 8 | "Based on the POC, the following will be the final method:\n", 9 | "\n", 10 | "1. Train SVD on a subset of the data. \n", 11 | "2. Get the initial predictions.\n", 12 | "3. Calculate the error.\n", 13 | "4. Train a regressor model on the error.\n", 14 | "5. Calculate the final prediction by subtracting the error estimate from the initial prediction." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "id": "7566b111-cb43-4ace-af8c-7c36991a35a9", 20 | "metadata": {}, 21 | "source": [ 22 | "**Final Questions to Answer**\n", 23 | "\n", 24 | "1. On what percentage of the total records should the SVD be trained? \n", 25 | "2. What should be the hyperparameters of the SVD and the regressors? For the SVD, this is already answered from the POC.\n", 26 | "3. While training the regressors, should we include category id? If yes, which encoder should be used?\n", 27 | "4. Can we use a weighted combination to get the final prediction, rather than a simple one?" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 76, 33 | "id": "95ca3346-be82-479a-8704-124bd8e36554", 34 | "metadata": {}, 35 | "outputs": [], 36 | "source": [ 37 | "# MANY UNUSED IMPORTS. CLEAN UP\n", 38 | "import pandas as pd\n", 39 | "import numpy as np\n", 40 | "import seaborn as sb\n", 41 | "%matplotlib inline\n", 42 | "\n", 43 | "from sklearn.metrics import r2_score, accuracy_score, mean_squared_error\n", 44 | "from category_encoders.leave_one_out import LeaveOneOutEncoder\n", 45 | "from sklearn.ensemble import RandomForestRegressor\n", 46 | "from xgboost import XGBRegressor\n", 47 | "from sklearn.model_selection import train_test_split\n", 48 | "from category_encoders import CatBoostEncoder\n", 49 | "\n", 50 | "from surprise.prediction_algorithms import SVD as SVD_sp\n", 51 | "from surprise import Reader as Reader_sp\n", 52 | "from surprise import Dataset as Dataset_sp\n", 53 | "from surprise.model_selection import train_test_split as train_test_split_sp\n", 54 | "from surprise.accuracy import rmse as rmse_sp\n", 55 | "from surprise.model_selection import GridSearchCV as GridSearchCV_sp\n", 56 | "from sklearn.linear_model import LinearRegression, Lasso, Ridge" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 6, 62 | "id": "78e3bd8f-d634-4b10-a652-02694c66a5f4", 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "df = pd.read_csv('train_0OECtn8.csv')" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "id": "8af43d8d-7550-48ec-aa71-22d093503afe", 72 | "metadata": {}, 73 | "source": [ 74 | "# General Hyperparameters" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 348, 80 | "id": "085d6c9b-84a1-44cd-8653-5c9c09bbaa1c", 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "# subset of the data on which the SVD is trained. 
This means that the regressors will be trained on 1 - svd_subset\n", 85 | "svd_subset = 0.95" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "id": "ec5782b9-d9b6-4cd3-b3a4-eda2f0afd73d", 91 | "metadata": {}, 92 | "source": [ 93 | "# SVD" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 349, 99 | "id": "b9808995-23a4-4ffb-a7ea-80e7cef16f4d", 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "# Convert the dataset into one that works with Surprise\n", 104 | "df_surprise = df[['user_id', 'video_id', 'engagement_score']].copy()\n", 105 | "df_surprise = df_surprise.sample(frac=1).reset_index(drop=True)\n", 106 | "reader = Reader_sp(rating_scale=(0, 5))\n", 107 | "df_surprise = Dataset_sp.load_from_df(df_surprise, reader=reader)\n", 108 | "train_sp, test_sp = train_test_split_sp(df_surprise, train_size=svd_subset, test_size=1-svd_subset)\n" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 350, 114 | "id": "f26032e1-55d2-4f14-9ace-b652c0ed5f62", 115 | "metadata": {}, 116 | "outputs": [ 117 | { 118 | "data": { 119 | "text/plain": [ 120 | "" 121 | ] 122 | }, 123 | "execution_count": 350, 124 | "metadata": {}, 125 | "output_type": "execute_result" 126 | } 127 | ], 128 | "source": [ 129 | "# Train on the train set\n", 130 | "svd = SVD_sp(n_factors=100, n_epochs=500, lr_all=0.05)\n", 131 | "svd.fit(train_sp)\n" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 351, 137 | "id": "f43a5132-acd2-42ad-9a6a-999db6959fd0", 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "# Get the initial estimates for the test set\n", 142 | "preds_sp = svd.test(test_sp)\n", 143 | "y_pred = list(map(lambda x: x.est, preds_sp))\n", 144 | "y_true = list(map(lambda x: x.r_ui, preds_sp))" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 352, 150 | "id": "4edccdc7-45bf-4ef3-b11b-9b777d1d0f53", 151 | "metadata": {}, 152 | "outputs": [ 153 | { 154 | "data": { 155 | "text/plain": [ 156 | "0.4655336895459784" 157 | ] 158 | }, 159 | "execution_count": 352, 160 | "metadata": {}, 161 | "output_type": "execute_result" 162 | } 163 | ], 164 | "source": [ 165 | "r2_score(y_true, y_pred)" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "id": "75712213-553a-4d53-8b9a-82fe281cc8c5", 171 | "metadata": {}, 172 | "source": [ 173 | "# Error Prediction\n", 174 | "\n", 175 | "We train the regressor model on the test set created earlier." 
176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": 353, 181 | "id": "156976a8-03cd-41c1-9ec9-08db71e0bf76", 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [ 185 | "# Keep only those records from the source dataframe that are in test_sp\n", 186 | "test_sp = pd.DataFrame(test_sp)\n", 187 | "test_sp.columns = ['user_id', 'video_id', 'engagement_score']\n", 188 | "df_v2 = df.merge(test_sp, on=['user_id', 'video_id', 'engagement_score'], how='inner')" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": 354, 194 | "id": "7118ee8c-216d-4407-89cc-4e6e57d74af9", 195 | "metadata": {}, 196 | "outputs": [], 197 | "source": [ 198 | "# Add the initial estimates and the error to the new train set\n", 199 | "df_v2['initial_estimate'] = y_pred\n", 200 | "df_v2['error'] = df_v2['initial_estimate'] - df_v2['engagement_score']" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "id": "598c6d4f-a522-4586-b70c-7811dcb4f041", 206 | "metadata": {}, 207 | "source": [ 208 | "We **subtract** the error from the initial estimate to get the correct value." 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": 356, 214 | "id": "5623ea38-c662-4be2-b04d-82a318d58055", 215 | "metadata": {}, 216 | "outputs": [], 217 | "source": [ 218 | "# Drop row id and engagement score, as we cannot use those\n", 219 | "df_v2 = df_v2.drop(['row_id', 'engagement_score'], axis=1)" 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": 357, 225 | "id": "9bbce1a3-f66e-4e3f-ac89-cbf1dfd93d78", 226 | "metadata": {}, 227 | "outputs": [], 228 | "source": [ 229 | "# Dummy columns for gender and profession\n", 230 | "df_v2 = pd.get_dummies(df_v2, columns=['gender', 'profession'], drop_first=True)" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": 358, 236 | "id": "12cf0da2-1a73-4f0c-8628-c090e1da1852", 237 | "metadata": {}, 238 | "outputs": [], 239 | "source": [ 240 | "# Split df_v2 into features (X) and the target (Y, the error)\n", 241 | "X = df_v2.drop(['error'], axis=1)\n", 242 | "Y = df_v2['error'].values" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "id": "d0025aee-09d7-4881-8201-cf4fc9251d4b", 248 | "metadata": {}, 249 | "source": [ 250 | "Decided against using the category id in the training set, as no encoding method was working well. *Perhaps I need to study encoding methods in more detail*" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": 359, 256 | "id": "9d4162c2-4f89-48c5-9275-da100b4fb8ee", 257 | "metadata": {}, 258 | "outputs": [], 259 | "source": [ 260 | "# # Encode the other categorical variables\n", 261 | "# cbe = CatBoostEncoder(cols=['category_id', 'video_id', 'user_id'])\n", 262 | "# cbe.fit(X, Y)\n", 263 | "# X = cbe.transform(X)\n", 264 | "\n", 265 | "# Drop categorical variables\n", 266 | "X = X.drop(['user_id', 'category_id', 'video_id'], axis=1)" 267 | ] 268 | }, 269 | { 270 | "cell_type": "markdown", 271 | "id": "4d03e652-61c9-464b-a0e2-84e9a5a7eab3", 272 | "metadata": {}, 273 | "source": [ 274 | "**Train a Few Models** - Finally going with Linear Regression." 
275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": 360, 280 | "id": "f61303ab-5375-42eb-9918-77f7def23169", 281 | "metadata": {}, 282 | "outputs": [ 283 | { 284 | "data": { 285 | "text/plain": [ 286 | "RandomForestRegressor(n_estimators=500)" 287 | ] 288 | }, 289 | "execution_count": 360, 290 | "metadata": {}, 291 | "output_type": "execute_result" 292 | } 293 | ], 294 | "source": [ 295 | "# # Train RF\n", 296 | "# rf = RandomForestRegressor(n_estimators=500)\n", 297 | "# rf.fit(X, Y)" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": 361, 303 | "id": "bcd452af-ef2f-4fe7-bdb0-c635cb04e442", 304 | "metadata": {}, 305 | "outputs": [ 306 | { 307 | "data": { 308 | "text/plain": [ 309 | "XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n", 310 | " colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n", 311 | " gamma=0, gpu_id=-1, importance_type=None,\n", 312 | " interaction_constraints='', learning_rate=0.300000012,\n", 313 | " max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,\n", 314 | " monotone_constraints='()', n_estimators=100, n_jobs=12,\n", 315 | " num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n", 316 | " reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',\n", 317 | " validate_parameters=1, verbosity=None)" 318 | ] 319 | }, 320 | "execution_count": 361, 321 | "metadata": {}, 322 | "output_type": "execute_result" 323 | } 324 | ], 325 | "source": [ 326 | "# # Train XGBoost\n", 327 | "# xgb = XGBRegressor(n_estimators = 100)\n", 328 | "# xgb.fit(X, Y)" 329 | ] 330 | }, 331 | { 332 | "cell_type": "markdown", 333 | "id": "e0b9f95a-bdf3-4fed-9473-d46c92daf6a9", 334 | "metadata": {}, 335 | "source": [ 336 | "Use some basic error estimators" 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "execution_count": 363, 342 | "id": "697c6dd0-e260-43c9-ad07-5845d8ab7db0", 343 | "metadata": {}, 344 | "outputs": [ 345 | { 346 | "data": { 347 | "text/plain": [ 348 | "LinearRegression()" 349 | ] 350 | }, 351 | "execution_count": 363, 352 | "metadata": {}, 353 | "output_type": "execute_result" 354 | } 355 | ], 356 | "source": [ 357 | "lr = LinearRegression(fit_intercept=True)\n", 358 | "lr.fit(X, Y)" 359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": null, 364 | "id": "1b50d530-61f8-48fd-85ce-532f540eca4b", 365 | "metadata": {}, 366 | "outputs": [], 367 | "source": [] 368 | }, 369 | { 370 | "cell_type": "markdown", 371 | "id": "557995bb-3827-44e1-90c4-b33c55b947af", 372 | "metadata": {}, 373 | "source": [ 374 | "# Join All\n", 375 | "\n", 376 | "Run the entire pipeline on the test set\n" 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": 364, 382 | "id": "8719da6e-afe6-4a62-a7b2-64cb9d283a1b", 383 | "metadata": {}, 384 | "outputs": [], 385 | "source": [ 386 | "df_test = pd.read_csv('test_1zqHu22.csv')" 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": 365, 392 | "id": "791be8e4-e784-4a50-9249-7bcf5ebeb634", 393 | "metadata": {}, 394 | "outputs": [], 395 | "source": [ 396 | "df_test_predictions = df_test.drop(['row_id'], axis=1)" 397 | ] 398 | }, 399 | { 400 | "cell_type": "code", 401 | "execution_count": 366, 402 | "id": "5c5b6c97-391f-40b9-933e-68961523efe0", 403 | "metadata": {}, 404 | "outputs": [], 405 | "source": [ 406 | "df_test_predictions = pd.get_dummies(df_test_predictions, columns=['gender', 'profession'], drop_first=True)" 407 | ] 408 | }, 409 | { 410 | "cell_type": "code", 
411 | "execution_count": 367, 412 | "id": "c3af38c8-4686-4ca4-9576-3da2201324f5", 413 | "metadata": {}, 414 | "outputs": [], 415 | "source": [ 416 | "# Make the predictions with SVD\n", 417 | "initial_predictions = [svd.predict(\n", 418 | " df_test_predictions.loc[i, 'user_id'], \n", 419 | " df_test_predictions.loc[i, 'video_id']\n", 420 | " ).est for i in range(len(df_test_predictions))]\n", 421 | "\n", 422 | "df_test_predictions['initial_estimate'] = initial_predictions" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": 369, 428 | "id": "ef9ad109-617b-424b-a51e-06ae85e306f1", 429 | "metadata": {}, 430 | "outputs": [], 431 | "source": [ 432 | "# Get the error estimate using the regressor.\n", 433 | "df_test_predictions = df_test_predictions.drop(['user_id', 'category_id', 'video_id'], axis=1)\n", 434 | "error_predictions = lr.predict(df_test_predictions)\n", 435 | "df_test_predictions['error_estimate'] = error_predictions" 436 | ] 437 | }, 438 | { 439 | "cell_type": "markdown", 440 | "id": "0898bbc6-713f-4de0-984f-c0aa91d93164", 441 | "metadata": {}, 442 | "source": [ 443 | "The final engagement score is the weighted sum of the initial estimate and the error estimate. The weight is `1-svd_subset`. \n", 444 | "\n", 445 | "Currently, the svd_subset is set at 0.95, so the error gets only 5% weightage." 446 | ] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "execution_count": 372, 451 | "id": "41db19f9-1874-4822-b845-a3fad2d66b93", 452 | "metadata": {}, 453 | "outputs": [], 454 | "source": [ 455 | "df_test_predictions['engagement_score'] = df_test_predictions['initial_estimate'] - ((1-svd_subset)*(df_test_predictions['error_estimate']))" 456 | ] 457 | }, 458 | { 459 | "cell_type": "code", 460 | "execution_count": 373, 461 | "id": "2ddbcda3-6eb2-4b91-aa39-345d02e86b2c", 462 | "metadata": {}, 463 | "outputs": [], 464 | "source": [ 465 | "df_test['engagement_score'] = df_test_predictions['engagement_score'].values" 466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "execution_count": 374, 471 | "id": "5604a813-8f72-4786-a722-760cfa0ae8fb", 472 | "metadata": {}, 473 | "outputs": [], 474 | "source": [ 475 | "# In case the engagement score crosses 5 or drops below 0, get them back to the threshold.\n", 476 | "df_test.loc[df_test['engagement_score']>5, 'engagement_score'] = 5\n", 477 | "df_test.loc[df_test['engagement_score']<0, 'engagement_score'] = 0" 478 | ] 479 | }, 480 | { 481 | "cell_type": "code", 482 | "execution_count": 382, 483 | "id": "2e29bb92-d7f5-405c-898d-5bcd9d8b9a41", 484 | "metadata": {}, 485 | "outputs": [], 486 | "source": [ 487 | "df_test[['row_id', 'engagement_score']].to_csv('final_submission.csv', index=False)" 488 | ] 489 | }, 490 | { 491 | "cell_type": "code", 492 | "execution_count": null, 493 | "id": "2aae8938-e1a9-4717-8e51-56ea5798dfb5", 494 | "metadata": {}, 495 | "outputs": [], 496 | "source": [] 497 | } 498 | ], 499 | "metadata": { 500 | "kernelspec": { 501 | "display_name": "Python 3", 502 | "language": "python", 503 | "name": "python3" 504 | }, 505 | "language_info": { 506 | "codemirror_mode": { 507 | "name": "ipython", 508 | "version": 3 509 | }, 510 | "file_extension": ".py", 511 | "mimetype": "text/x-python", 512 | "name": "python", 513 | "nbconvert_exporter": "python", 514 | "pygments_lexer": "ipython3", 515 | "version": "3.9.7" 516 | } 517 | }, 518 | "nbformat": 4, 519 | "nbformat_minor": 5 520 | } 521 | --------------------------------------------------------------------------------