├── Home Project #1.ipynb ├── README.md ├── home exam 2 Nadav Har-Tuv.ipynb └── home test #3 notebook.ipynb /Home Project #1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "c2528265-14f9-491d-9d25-80e0ffef3f2f", 6 | "metadata": {}, 7 | "source": [ 8 | "## 1. Import data and dependencies" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "id": "3c76839b-cfc7-4e39-858f-dafc61ea9d29", 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import pickle \n", 19 | "import scipy\n", 20 | "from scipy import sparse\n", 21 | "import numpy as np\n", 22 | "import pandas as pd\n", 23 | "from ast import literal_eval" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 2, 29 | "id": "4075c1b6-ecb4-4633-9689-eed90937021d", 30 | "metadata": {}, 31 | "outputs": [], 32 | "source": [ 33 | "# load the data from the pkl file \n", 34 | "objects = []\n", 35 | "with (open(\"home_project.pkl\", \"rb\")) as openfile:\n", 36 | " while True:\n", 37 | " try:\n", 38 | " objects.append(pickle.load(openfile))\n", 39 | " except EOFError:\n", 40 | " break\n" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 3, 46 | "id": "0211ef36-1416-4b6d-9043-7db238b8d39d", 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "# assign names to the dataframes\n", 51 | "project_tf_idf_mat = objects[0]['project_tf_idf_mat']\n", 52 | "project_df = objects[0]['project_df']\n", 53 | "holdout_tf_idf_mat = objects[0]['holdout_tf_idf_mat']\n", 54 | "holdout_df = objects[0]['holdout_df']" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "id": "a8c4e163-186f-405c-996e-234c317f8099", 60 | "metadata": {}, 61 | "source": [ 62 | "## 2. Inspect and preprocess the data\n", 63 | "- text_features is a dictionary, it needs to be split to columns of feature\n", 64 | "- country has some missing values" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": 4, 70 | "id": "11196283-a052-484f-844a-f83c8674aa93", 71 | "metadata": {}, 72 | "outputs": [ 73 | { 74 | "data": { 75 | "text/html": [ 76 | "
\n", 77 | "\n", 90 | "\n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | "
datapoint_idinvoice_arrival_datecountryrel_doctext_features
0142402022-01-03 18:09:53.421000+00:00AUTrue{'num_of_rows': 98, 'num_of_punc_in_text_words...
1358372021-01-18 13:07:49.108000+00:00AUTrue{'num_of_rows': 19, 'num_of_punc_in_text_words...
2321652021-11-05 00:06:48.725000+00:00AUTrue{'num_of_rows': 47, 'num_of_punc_in_text_words...
3566702021-04-05 19:08:41.746000+00:00AUTrue{'num_of_rows': 9, 'num_of_punc_in_text_words'...
4383722021-02-02 13:39:24.751000+00:00AUTrue{'num_of_rows': 76, 'num_of_punc_in_text_words...
\n", 144 | "
" 145 | ], 146 | "text/plain": [ 147 | " datapoint_id invoice_arrival_date country rel_doc \\\n", 148 | "0 14240 2022-01-03 18:09:53.421000+00:00 AU True \n", 149 | "1 35837 2021-01-18 13:07:49.108000+00:00 AU True \n", 150 | "2 32165 2021-11-05 00:06:48.725000+00:00 AU True \n", 151 | "3 56670 2021-04-05 19:08:41.746000+00:00 AU True \n", 152 | "4 38372 2021-02-02 13:39:24.751000+00:00 AU True \n", 153 | "\n", 154 | " text_features \n", 155 | "0 {'num_of_rows': 98, 'num_of_punc_in_text_words... \n", 156 | "1 {'num_of_rows': 19, 'num_of_punc_in_text_words... \n", 157 | "2 {'num_of_rows': 47, 'num_of_punc_in_text_words... \n", 158 | "3 {'num_of_rows': 9, 'num_of_punc_in_text_words'... \n", 159 | "4 {'num_of_rows': 76, 'num_of_punc_in_text_words... " 160 | ] 161 | }, 162 | "execution_count": 4, 163 | "metadata": {}, 164 | "output_type": "execute_result" 165 | } 166 | ], 167 | "source": [ 168 | "# look at the data \n", 169 | "project_df.head()" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "id": "4baa07fd-9ee6-43dd-a107-7e6c1dc60e69", 175 | "metadata": {}, 176 | "source": [ 177 | "The features are in a dictionary form, they have to be split to columns in order to understand what is going on here. The dictionaries are all strings so they have to be converted to dictionaries and then they can be converted to Pandas Series and used as columns of data. The function literal_eval will convert the text to dictionaries and the function pd.Series will convert the dictionaries to columns. " 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": null, 183 | "id": "ba7b54e3-8a33-4542-b5c7-967cb8d982b0", 184 | "metadata": {}, 185 | "outputs": [], 186 | "source": [ 187 | "# apply literal_eval to the text column\n", 188 | "project_df.text_features.apply(literal_eval)" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "id": "0389af89-0b71-48d7-ac24-ad0ee80b006c", 194 | "metadata": {}, 195 | "source": [ 196 | "It not working, let's check if we can locate problematic strings." 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 5, 202 | "id": "7d59e37e-d84d-4fa3-8f80-aa011fe6b4a8", 203 | "metadata": {}, 204 | "outputs": [ 205 | { 206 | "name": "stdout", 207 | "output_type": "stream", 208 | "text": [ 209 | "22645\n", 210 | "23209\n" 211 | ] 212 | } 213 | ], 214 | "source": [ 215 | "# iterate over all strings and try to find where literal_eval is not working \n", 216 | "for i in range(len(project_df.text_features)):\n", 217 | " try:\n", 218 | " literal_eval(project_df.text_features[i])\n", 219 | " except Exception:\n", 220 | " print(i)\n", 221 | " " 222 | ] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "id": "1a5a221c-9872-47cd-b306-ab75671cfc94", 227 | "metadata": {}, 228 | "source": [ 229 | "only two strings are causing the problems, let's take a closer look at them." 
230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 6, 235 | "id": "990c3294-7436-4d8f-8564-137526948976", 236 | "metadata": {}, 237 | "outputs": [ 238 | { 239 | "data": { 240 | "text/plain": [ 241 | "\"{'num_of_rows': 1, 'num_of_punc_in_text_words': 0, 'num_of_punc_in_text_chars': 0, 'lines_made_of_symbols': 0, 'empty _spaces': 1, 'characters_in_raw_invoice': 2, 'words_raw_invoice_by_split': 0, 'ascii_characters_in_invoice': 0, 'words_capital_first': 0, 'words_all_uppercase': 0, 'alphanumeric_words': 0, 'words_repeated_characters': 0, 'web_adresses': 0, 'email_adresses': 0, 'num_of_digits': 0, 'solo_numbers': 0, 'float_point_numbers': 0, 'numbers_line_delimited': 0, 'total_number_of_numbers_in_invoice': 0, 'punc_prop': nan, 'lines_symbols_prop': nan, 'num_of_chrs_prop': inf, 'words_num_prop': nan, 'ascii_characters_to_prop': nan, 'words_capital_first_prop': nan, 'words_all_uppercase_prop': nan, 'alphanumeric_words_prop': nan, 'words_repeated_characters_prop': nan, 'digits_in_invoice_prop': nan, 'solo_numbers_prop': nan, 'float_point_numbers_prop': nan}\"" 242 | ] 243 | }, 244 | "execution_count": 6, 245 | "metadata": {}, 246 | "output_type": "execute_result" 247 | } 248 | ], 249 | "source": [ 250 | "project_df.text_features[22645]" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": 7, 256 | "id": "3f642003-b5b2-45d2-96bd-f48a54ccb170", 257 | "metadata": {}, 258 | "outputs": [ 259 | { 260 | "data": { 261 | "text/plain": [ 262 | "\"{'num_of_rows': 1, 'num_of_punc_in_text_words': 0, 'num_of_punc_in_text_chars': 0, 'lines_made_of_symbols': 0, 'empty _spaces': 0, 'characters_in_raw_invoice': 1, 'words_raw_invoice_by_split': 0, 'ascii_characters_in_invoice': 0, 'words_capital_first': 0, 'words_all_uppercase': 0, 'alphanumeric_words': 0, 'words_repeated_characters': 0, 'web_adresses': 0, 'email_adresses': 0, 'num_of_digits': 0, 'solo_numbers': 0, 'float_point_numbers': 0, 'numbers_line_delimited': 0, 'total_number_of_numbers_in_invoice': 0, 'punc_prop': nan, 'lines_symbols_prop': nan, 'num_of_chrs_prop': inf, 'words_num_prop': nan, 'ascii_characters_to_prop': nan, 'words_capital_first_prop': nan, 'words_all_uppercase_prop': nan, 'alphanumeric_words_prop': nan, 'words_repeated_characters_prop': nan, 'digits_in_invoice_prop': nan, 'solo_numbers_prop': nan, 'float_point_numbers_prop': nan}\"" 263 | ] 264 | }, 265 | "execution_count": 7, 266 | "metadata": {}, 267 | "output_type": "execute_result" 268 | } 269 | ], 270 | "source": [ 271 | "project_df.text_features[23209]" 272 | ] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "id": "4a863528-8b8b-4d28-a903-e9f9d84c31be", 277 | "metadata": {}, 278 | "source": [ 279 | "Both of these look like empty documents. I think it's safe to just remove these rows from the data." 
280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": 8, 285 | "id": "0ef14f9e-ce93-4682-9d54-096cc3e582f1", 286 | "metadata": {}, 287 | "outputs": [], 288 | "source": [ 289 | "# drop the bad rows \n", 290 | "project_df.drop([22645, 23209], axis= 0, inplace=True)" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": 9, 296 | "id": "58f61dff-c12f-4b23-8c8e-750edc73f075", 297 | "metadata": {}, 298 | "outputs": [], 299 | "source": [ 300 | "# transform text_features to columns of data and save the new dataframe.\n", 301 | "project_df = pd.concat([project_df.drop(['text_features'], axis =1), project_df.text_features.apply(literal_eval).apply(pd.Series)], axis=1)" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": 10, 307 | "id": "1aee0505-ce0e-4be9-b7c4-506d8bb94582", 308 | "metadata": {}, 309 | "outputs": [ 310 | { 311 | "name": "stdout", 312 | "output_type": "stream", 313 | "text": [ 314 | "\n", 315 | "Int64Index: 49998 entries, 0 to 49999\n", 316 | "Data columns (total 35 columns):\n", 317 | " # Column Non-Null Count Dtype \n", 318 | "--- ------ -------------- ----- \n", 319 | " 0 datapoint_id 49998 non-null int64 \n", 320 | " 1 invoice_arrival_date 49998 non-null datetime64[ns, UTC]\n", 321 | " 2 country 48155 non-null object \n", 322 | " 3 rel_doc 49998 non-null bool \n", 323 | " 4 num_of_rows 49998 non-null float64 \n", 324 | " 5 num_of_punc_in_text_words 49998 non-null float64 \n", 325 | " 6 num_of_punc_in_text_chars 49998 non-null float64 \n", 326 | " 7 lines_made_of_symbols 49998 non-null float64 \n", 327 | " 8 empty _spaces 49998 non-null float64 \n", 328 | " 9 characters_in_raw_invoice 49998 non-null float64 \n", 329 | " 10 words_raw_invoice_by_split 49998 non-null float64 \n", 330 | " 11 ascii_characters_in_invoice 49998 non-null float64 \n", 331 | " 12 words_capital_first 49998 non-null float64 \n", 332 | " 13 words_all_uppercase 49998 non-null float64 \n", 333 | " 14 alphanumeric_words 49998 non-null float64 \n", 334 | " 15 words_repeated_characters 49998 non-null float64 \n", 335 | " 16 web_adresses 49998 non-null float64 \n", 336 | " 17 email_adresses 49998 non-null float64 \n", 337 | " 18 num_of_digits 49998 non-null float64 \n", 338 | " 19 solo_numbers 49998 non-null float64 \n", 339 | " 20 float_point_numbers 49998 non-null float64 \n", 340 | " 21 numbers_line_delimited 49998 non-null float64 \n", 341 | " 22 total_number_of_numbers_in_invoice 49998 non-null float64 \n", 342 | " 23 punc_prop 49998 non-null float64 \n", 343 | " 24 lines_symbols_prop 49998 non-null float64 \n", 344 | " 25 num_of_chrs_prop 49998 non-null float64 \n", 345 | " 26 words_num_prop 49998 non-null float64 \n", 346 | " 27 ascii_characters_to_prop 49998 non-null float64 \n", 347 | " 28 words_capital_first_prop 49998 non-null float64 \n", 348 | " 29 words_all_uppercase_prop 49998 non-null float64 \n", 349 | " 30 alphanumeric_words_prop 49998 non-null float64 \n", 350 | " 31 words_repeated_characters_prop 49998 non-null float64 \n", 351 | " 32 digits_in_invoice_prop 49998 non-null float64 \n", 352 | " 33 solo_numbers_prop 49998 non-null float64 \n", 353 | " 34 float_point_numbers_prop 49998 non-null float64 \n", 354 | "dtypes: bool(1), datetime64[ns, UTC](1), float64(31), int64(1), object(1)\n", 355 | "memory usage: 13.4+ MB\n" 356 | ] 357 | } 358 | ], 359 | "source": [ 360 | "# take a look at the data, look for missing values\n", 361 | "project_df.info()\n", 362 | "\n", 363 | "# only country has missing values" 364 | ] 365 | }, 366 | 
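{ "cell_type": "markdown", "id": "safe-literal-eval-note", "metadata": {}, "source": [ "Aside (added sketch, not part of the original workflow): the two strings that broke literal_eval contain bare nan and inf tokens, which ast.literal_eval refuses to parse. If dropping those rows were undesirable, a hedged alternative is a defensive parser that falls back to a restricted eval exposing only nan and inf. The helper name safe_literal_eval below is illustrative, not from the original notebook:" ] },
{ "cell_type": "code", "execution_count": null, "id": "safe-literal-eval-sketch", "metadata": {}, "outputs": [], "source": [ "# Sketch of a defensive parser (assumption: not used in the original analysis).\n", "import numpy as np\n", "from ast import literal_eval\n", "\n", "def safe_literal_eval(s):\n", "    # literal_eval rejects bare nan/inf, so fall back to an eval call\n", "    # that exposes nothing except those two names.\n", "    try:\n", "        return literal_eval(s)\n", "    except (ValueError, SyntaxError):\n", "        return eval(s, {'__builtins__': {}}, {'nan': np.nan, 'inf': np.inf})\n", "\n", "# e.g. safe_literal_eval(\"{'punc_prop': nan, 'num_of_chrs_prop': inf}\")" ] },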
{ 367 | "cell_type": "markdown", 368 | "id": "ce6c8174-2bf5-4811-8a11-067e2c068367", 369 | "metadata": {}, 370 | "source": [ 371 | "The only variable with missing values is 'country', I will replace the missing values with 'missing'" 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": 11, 377 | "id": "305df0ab-dd9e-4f2d-86f5-665010d7a983", 378 | "metadata": {}, 379 | "outputs": [], 380 | "source": [ 381 | "# fill missing countries with 'missing'\n", 382 | "project_df['country'].fillna('missing', inplace=True)" 383 | ] 384 | }, 385 | { 386 | "cell_type": "code", 387 | "execution_count": 109, 388 | "id": "f81fcb46-db97-4cb9-8a91-883ce9f28059", 389 | "metadata": {}, 390 | "outputs": [], 391 | "source": [ 392 | "# copy the data to new dataframe so that we can always go back\n", 393 | "from copy import deepcopy\n", 394 | "df_copy = deepcopy(project_df)" 395 | ] 396 | }, 397 | { 398 | "cell_type": "code", 399 | "execution_count": 110, 400 | "id": "0f89013f-db54-42b4-a06f-0be714ff96de", 401 | "metadata": {}, 402 | "outputs": [], 403 | "source": [ 404 | "# replace countries with small number of counts with 'other'\n", 405 | "country_count = df_copy.country.value_counts()\n", 406 | "under_50 = country_count[country_count<50]\n", 407 | "df_copy.loc[df_copy[\"country\"].isin(under_50.index.tolist()), 'country'] = \"other\"" 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": 15, 413 | "id": "b7c73ab9-f9a6-4e18-97e6-e93f83d8e331", 414 | "metadata": {}, 415 | "outputs": [ 416 | { 417 | "data": { 418 | "text/plain": [ 419 | "AU 39670\n", 420 | "DK 5321\n", 421 | "missing 1843\n", 422 | "DE 731\n", 423 | "US 689\n", 424 | "other 335\n", 425 | "GB 238\n", 426 | "FR 232\n", 427 | "PL 221\n", 428 | "HU 133\n", 429 | "NZ 118\n", 430 | "AT 101\n", 431 | "NL 97\n", 432 | "LU 59\n", 433 | "IT 58\n", 434 | "CH 51\n", 435 | "NO 51\n", 436 | "ES 50\n", 437 | "Name: country, dtype: int64" 438 | ] 439 | }, 440 | "execution_count": 15, 441 | "metadata": {}, 442 | "output_type": "execute_result" 443 | } 444 | ], 445 | "source": [ 446 | "df_copy.country.value_counts()" 447 | ] 448 | }, 449 | { 450 | "cell_type": "markdown", 451 | "id": "f7f81c43-a691-4559-b609-ad5483971cbd", 452 | "metadata": {}, 453 | "source": [ 454 | "I need to remove the two problematic rows from the sparse csr matrix as well. I found the following function online, I didn't implement it myself." 
455 | ] 456 | }, 457 | { 458 | "cell_type": "code", 459 | "execution_count": 17, 460 | "id": "1ace50d8-1528-4068-b233-124fced0a511", 461 | "metadata": {}, 462 | "outputs": [], 463 | "source": [ 464 | "def delete_rows_csr(mat, indices):\n", 465 | "    \"\"\"\n", 466 | "    Remove the rows denoted by ``indices`` from the CSR sparse matrix ``mat``.\n", 467 | "    \"\"\"\n", 468 | "    if not isinstance(mat, scipy.sparse.csr_matrix):\n", 469 | "        raise ValueError(\"works only for CSR format -- use .tocsr() first\")\n", 470 | "    indices = list(indices)\n", 471 | "    mask = np.ones(mat.shape[0], dtype=bool)\n", 472 | "    mask[indices] = False\n", 473 | "    return mat[mask]" 474 | ] 475 | }, 476 | { 477 | "cell_type": "code", 478 | "execution_count": 18, 479 | "id": "9872371d-06ed-427a-8673-1c643a3405ae", 480 | "metadata": {}, 481 | "outputs": [], 482 | "source": [ 483 | "project_tf_idf_mat = delete_rows_csr(project_tf_idf_mat, [22645, 23209])" 484 | ] 485 | }, 486 | { 487 | "cell_type": "markdown", 488 | "id": "12d4636d-3a60-4eec-a237-b877d1fed978", 489 | "metadata": {}, 490 | "source": [ 491 | "Now I want to get rid of the id and date columns; I don't think they will be helpful:" 492 | ] 493 | }, 494 | { 495 | "cell_type": "code", 496 | "execution_count": 19, 497 | "id": "a3b9714e-cc62-4ed9-ad3a-38c86209fb74", 498 | "metadata": {}, 499 | "outputs": [], 500 | "source": [ 501 | "df_copy.drop(['datapoint_id', 'invoice_arrival_date'], axis = 1, inplace=True)" 502 | ] 503 | }, 504 | { 505 | "cell_type": "markdown", 506 | "id": "c9123353-7f32-4a26-8d26-10d2e4606983", 507 | "metadata": {}, 508 | "source": [ 509 | "Now convert country to dummies:" 510 | ] 511 | }, 512 | { 513 | "cell_type": "code", 514 | "execution_count": 106, 515 | "id": "753974d4-57a2-4e65-bfe7-bb7d24737c85", 516 | "metadata": {}, 517 | "outputs": [], 518 | "source": [ 519 | "df_copy = pd.get_dummies(df_copy)" 520 | ] 521 | }, 522 | { 523 | "cell_type": "code", 524 | "execution_count": 23, 525 | "id": "0331fa09-0c6f-4f78-8f01-0923215ca30a", 526 | "metadata": {}, 527 | "outputs": [ 528 | { 529 | "data": { 530 | "text/html": [ 531 | "
\n", 532 | "\n", 545 | "\n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | "
rel_docnum_of_rowsnum_of_punc_in_text_wordsnum_of_punc_in_text_charslines_made_of_symbolsempty _spacescharacters_in_raw_invoicewords_raw_invoice_by_splitascii_characters_in_invoicewords_capital_first...country_HUcountry_ITcountry_LUcountry_NLcountry_NOcountry_NZcountry_PLcountry_UScountry_missingcountry_other
0True98.00.00.00.0349.02407.0400.01954.00.0...0000000000
1True19.00.00.00.016.0172.030.0131.00.0...0000000000
2True47.00.00.00.0184.01187.0196.0951.00.0...0000000000
3True9.00.00.00.020.0119.021.088.00.0...0000000000
4True76.00.00.00.0107.0937.0133.0749.00.0...0000000000
\n", 695 | "

5 rows × 50 columns

\n", 696 | "
" 697 | ], 698 | "text/plain": [ 699 | " rel_doc num_of_rows num_of_punc_in_text_words num_of_punc_in_text_chars \\\n", 700 | "0 True 98.0 0.0 0.0 \n", 701 | "1 True 19.0 0.0 0.0 \n", 702 | "2 True 47.0 0.0 0.0 \n", 703 | "3 True 9.0 0.0 0.0 \n", 704 | "4 True 76.0 0.0 0.0 \n", 705 | "\n", 706 | " lines_made_of_symbols empty _spaces characters_in_raw_invoice \\\n", 707 | "0 0.0 349.0 2407.0 \n", 708 | "1 0.0 16.0 172.0 \n", 709 | "2 0.0 184.0 1187.0 \n", 710 | "3 0.0 20.0 119.0 \n", 711 | "4 0.0 107.0 937.0 \n", 712 | "\n", 713 | " words_raw_invoice_by_split ascii_characters_in_invoice \\\n", 714 | "0 400.0 1954.0 \n", 715 | "1 30.0 131.0 \n", 716 | "2 196.0 951.0 \n", 717 | "3 21.0 88.0 \n", 718 | "4 133.0 749.0 \n", 719 | "\n", 720 | " words_capital_first ... country_HU country_IT country_LU country_NL \\\n", 721 | "0 0.0 ... 0 0 0 0 \n", 722 | "1 0.0 ... 0 0 0 0 \n", 723 | "2 0.0 ... 0 0 0 0 \n", 724 | "3 0.0 ... 0 0 0 0 \n", 725 | "4 0.0 ... 0 0 0 0 \n", 726 | "\n", 727 | " country_NO country_NZ country_PL country_US country_missing \\\n", 728 | "0 0 0 0 0 0 \n", 729 | "1 0 0 0 0 0 \n", 730 | "2 0 0 0 0 0 \n", 731 | "3 0 0 0 0 0 \n", 732 | "4 0 0 0 0 0 \n", 733 | "\n", 734 | " country_other \n", 735 | "0 0 \n", 736 | "1 0 \n", 737 | "2 0 \n", 738 | "3 0 \n", 739 | "4 0 \n", 740 | "\n", 741 | "[5 rows x 50 columns]" 742 | ] 743 | }, 744 | "execution_count": 23, 745 | "metadata": {}, 746 | "output_type": "execute_result" 747 | } 748 | ], 749 | "source": [ 750 | "df_copy.head()" 751 | ] 752 | }, 753 | { 754 | "cell_type": "markdown", 755 | "id": "15bf347b-3a89-4b85-9176-b19e73e9397e", 756 | "metadata": {}, 757 | "source": [ 758 | "We have prepared the data, time to model" 759 | ] 760 | }, 761 | { 762 | "cell_type": "markdown", 763 | "id": "80fc408b-5048-4d8f-89f3-205222662c5a", 764 | "metadata": { 765 | "tags": [] 766 | }, 767 | "source": [ 768 | "## 3. build models\n", 769 | " - Feature and target values: X,y\n", 770 | " - combine training data with tf-idf matrix\n", 771 | " - Train test split\n", 772 | " - train a few algorithms\n", 773 | " - Deal with imbalanced classes\n", 774 | " - I will train two models: Random Forest and Gradient Boosting" 775 | ] 776 | }, 777 | { 778 | "cell_type": "code", 779 | "execution_count": 26, 780 | "id": "69688ed7-5795-44e4-be35-07c1692fe3e0", 781 | "metadata": {}, 782 | "outputs": [], 783 | "source": [ 784 | "from sklearn.model_selection import train_test_split\n", 785 | "from sklearn.pipeline import make_pipeline\n", 786 | "from sklearn.preprocessing import StandardScaler\n", 787 | "from sklearn.decomposition import TruncatedSVD" 788 | ] 789 | }, 790 | { 791 | "cell_type": "markdown", 792 | "id": "93d09f73-1663-4996-b687-f272d1b8e3a7", 793 | "metadata": {}, 794 | "source": [ 795 | "### Reduce the dimension of the tf-idf matrix using truncated SVD\n", 796 | " The tf-idf matrix is way too big to work with, I will combine it with the text features and reduce the dimension using truncated SVD. 
This is common for sparse matrices" 797 | ] 798 | }, 799 | { 800 | "cell_type": "code", 801 | "execution_count": 27, 802 | "id": "5a4c622f-1a77-468f-8793-02c034e8a017", 803 | "metadata": {}, 804 | "outputs": [], 805 | "source": [ 806 | "truncatedSVD = TruncatedSVD(150)" 807 | ] 808 | }, 809 | { 810 | "cell_type": "code", 811 | "execution_count": 28, 812 | "id": "56ca0c38-16db-4957-b8d9-31e33572b1c2", 813 | "metadata": {}, 814 | "outputs": [], 815 | "source": [ 816 | "# split to features and target data and labels\n", 817 | "y = df_copy['rel_doc']\n", 818 | "X = df_copy.drop(['rel_doc'], axis = 1)" 819 | ] 820 | }, 821 | { 822 | "cell_type": "markdown", 823 | "id": "eaf7bc6c-da93-4cb2-be20-9618c8a6b7da", 824 | "metadata": {}, 825 | "source": [ 826 | "combine the dataframe with the tf-idf matrix and save the whole thing as a sparse matrix." 827 | ] 828 | }, 829 | { 830 | "cell_type": "code", 831 | "execution_count": 29, 832 | "id": "6f0ee05b-36fa-4364-a7b0-9c6ff6cb1923", 833 | "metadata": {}, 834 | "outputs": [], 835 | "source": [ 836 | "data = sparse.hstack([project_tf_idf_mat, X])" 837 | ] 838 | }, 839 | { 840 | "cell_type": "code", 841 | "execution_count": 30, 842 | "id": "bb40926a-a27d-4ca7-b8eb-f4b56a621598", 843 | "metadata": { 844 | "tags": [] 845 | }, 846 | "outputs": [], 847 | "source": [ 848 | "# split to train and test \n", 849 | "X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.3, random_state = 1, stratify = y)" 850 | ] 851 | }, 852 | { 853 | "cell_type": "code", 854 | "execution_count": 31, 855 | "id": "e2c840e6-dc1e-4f80-b0f5-37fb0f2267f1", 856 | "metadata": {}, 857 | "outputs": [ 858 | { 859 | "data": { 860 | "text/plain": [ 861 | "TruncatedSVD(n_components=150)" 862 | ] 863 | }, 864 | "execution_count": 31, 865 | "metadata": {}, 866 | "output_type": "execute_result" 867 | } 868 | ], 869 | "source": [ 870 | "truncatedSVD.fit(X_train)\n" 871 | ] 872 | }, 873 | { 874 | "cell_type": "code", 875 | "execution_count": 38, 876 | "id": "695dea40-13b1-4b1f-83a2-43aed9acdf35", 877 | "metadata": {}, 878 | "outputs": [], 879 | "source": [ 880 | "truncated_X_train = truncatedSVD.transform(X_train)\n", 881 | "truncated_X_test = truncatedSVD.transform(X_test)" 882 | ] 883 | }, 884 | { 885 | "cell_type": "markdown", 886 | "id": "d525cfb9-5828-4bb8-8189-7d5a8b37dfdf", 887 | "metadata": {}, 888 | "source": [ 889 | "Check for imbalance" 890 | ] 891 | }, 892 | { 893 | "cell_type": "code", 894 | "execution_count": 35, 895 | "id": "d4cc6e57-46f0-4a81-9d60-f99fa36f429e", 896 | "metadata": {}, 897 | "outputs": [ 898 | { 899 | "data": { 900 | "text/plain": [ 901 | "True 0.928897\n", 902 | "False 0.071103\n", 903 | "Name: rel_doc, dtype: float64" 904 | ] 905 | }, 906 | "execution_count": 35, 907 | "metadata": {}, 908 | "output_type": "execute_result" 909 | } 910 | ], 911 | "source": [ 912 | "y.value_counts(normalize=True) \n", 913 | "# pretty imbablanced... 
" 914 | ] 915 | }, 916 | { 917 | "cell_type": "markdown", 918 | "id": "eb09b502-b0b5-408a-ad26-e790ea1fda36", 919 | "metadata": {}, 920 | "source": [ 921 | "### Setup ML pipelines" 922 | ] 923 | }, 924 | { 925 | "cell_type": "code", 926 | "execution_count": 36, 927 | "id": "5dbc625c-88ea-4986-bcd9-b924f08add16", 928 | "metadata": {}, 929 | "outputs": [], 930 | "source": [ 931 | "from sklearn.model_selection import GridSearchCV\n", 932 | "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier" 933 | ] 934 | }, 935 | { 936 | "cell_type": "markdown", 937 | "id": "1bbbf9a2-23d8-4de1-a313-da3d41c6eaa5", 938 | "metadata": {}, 939 | "source": [ 940 | "I will train a Random Forest classifier and a Gradient Boosting classsifier. Obviously, many more classifiers can be considered. " 941 | ] 942 | }, 943 | { 944 | "cell_type": "code", 945 | "execution_count": 41, 946 | "id": "2c407915-6e05-4897-a99f-920f8c5fde64", 947 | "metadata": {}, 948 | "outputs": [], 949 | "source": [ 950 | "# setup up pipelines to stack scaling and modelling \n", 951 | "pipelines = {\n", 952 | " 'rf': make_pipeline(StandardScaler(), RandomForestClassifier(random_state=1)),\n", 953 | " 'gb': make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=1))\n", 954 | "\n", 955 | " }" 956 | ] 957 | }, 958 | { 959 | "cell_type": "markdown", 960 | "id": "c2805308-3cb1-43f7-b301-73fd40dbd7fd", 961 | "metadata": {}, 962 | "source": [ 963 | "I will use grid search to optimize the n_estimators parameter for both models. Obviously, this can be done for all the parameters of the models, depending on how much time we want to invest." 964 | ] 965 | }, 966 | { 967 | "cell_type": "code", 968 | "execution_count": 42, 969 | "id": "8458553f-ce8b-4a3a-8bc6-d3a26bae7598", 970 | "metadata": {}, 971 | "outputs": [], 972 | "source": [ 973 | "# grid search for tuning parameters\n", 974 | "grid = {\n", 975 | " 'rf': {\n", 976 | " 'randomforestclassifier__n_estimators' : [100, 200, 300] \n", 977 | " },\n", 978 | " 'gb':{\n", 979 | " 'gradientboostingclassifier__n_estimators' : [100, 200, 300] \n", 980 | " }\n", 981 | "}" 982 | ] 983 | }, 984 | { 985 | "cell_type": "code", 986 | "execution_count": 44, 987 | "id": "1ad05ae9-0b41-496a-b104-c107fad9a626", 988 | "metadata": {}, 989 | "outputs": [ 990 | { 991 | "name": "stdout", 992 | "output_type": "stream", 993 | "text": [ 994 | "training the rf model\n", 995 | "training the gb model\n" 996 | ] 997 | } 998 | ], 999 | "source": [ 1000 | "# find the best hyperparameters using GridSearchCV and save the best model\n", 1001 | "\n", 1002 | "# create a blank dictionary to hold models\n", 1003 | "fit_models = {}\n", 1004 | "# loop over algorithms and choose hyperparameters using GrisSearchCV\n", 1005 | "for algo, pipeline in pipelines.items():\n", 1006 | " print(f'training the {algo} model')\n", 1007 | " model = GridSearchCV(pipeline, grid[algo], n_jobs = -1, cv=10)\n", 1008 | " model.fit(truncated_X_train, y_train)\n", 1009 | " fit_models[algo] = model" 1010 | ] 1011 | }, 1012 | { 1013 | "cell_type": "markdown", 1014 | "id": "765c8fe6-3542-42de-b9c5-eb52577a0140", 1015 | "metadata": {}, 1016 | "source": [ 1017 | "## 4. 
Evaluate performance on test partition" 1018 | ] 1019 | }, 1020 | { 1021 | "cell_type": "code", 1022 | "execution_count": 60, 1023 | "id": "cb015ace-ed29-4094-b362-0454939414a5", 1024 | "metadata": {}, 1025 | "outputs": [], 1026 | "source": [ 1027 | "from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix\n", 1028 | "from matplotlib import pyplot as plt" 1029 | ] 1030 | }, 1031 | { 1032 | "cell_type": "code", 1033 | "execution_count": 47, 1034 | "id": "0f7d1e79-224a-4935-b437-0c0cb452900a", 1035 | "metadata": {}, 1036 | "outputs": [], 1037 | "source": [ 1038 | "# transform the test dataset\n", 1039 | "truncated_X_test = truncatedSVD.transform(X_test)" 1040 | ] 1041 | }, 1042 | { 1043 | "cell_type": "markdown", 1044 | "id": "a0f9a231-1b43-47ec-bd07-3900c7f743a5", 1045 | "metadata": {}, 1046 | "source": [ 1047 | "In the interest of time I will use a threshold of 0.5 for both models. This is not ideal since the data is very imbalanced; with more time I would try to optimize this threshold." 1048 | ] 1049 | }, 1050 | { 1051 | "cell_type": "code", 1052 | "execution_count": 67, 1053 | "id": "48e51e1a-20d9-4103-a2b4-41d716efdff3", 1054 | "metadata": {}, 1055 | "outputs": [ 1056 | { 1057 | "name": "stdout", 1058 | "output_type": "stream", 1059 | "text": [ 1060 | " Confusion matrix for rf:\n", 1061 | "[[ 572 495]\n", 1062 | " [ 125 13808]]\n", 1063 | " Confusion matrix for gb:\n", 1064 | "[[ 601 466]\n", 1065 | " [ 170 13763]]\n" 1066 | ] 1067 | } 1068 | ], 1069 | "source": [ 1070 | "# look at confusion matrices for both models\n", 1071 | "\n", 1072 | "for algo, model in fit_models.items():\n", 1073 | "    yhat = model.predict(truncated_X_test)\n", 1074 | "    cm = confusion_matrix(y_test, yhat)\n", 1075 | "    print(f' Confusion matrix for {algo}:')\n", 1076 | "    print(cm)" 1077 | ] 1078 | }, 1079 | { 1080 | "cell_type": "markdown", 1081 | "id": "5132d84c-b5af-4f24-8ba9-cbe7d2e45d2b", 1082 | "metadata": {}, 1083 | "source": [ 1084 | "Both models are performing very well on the majority class; the performance on the minority class is not bad, but it can probably be improved. 
" 1085 | ] 1086 | }, 1087 | { 1088 | "cell_type": "code", 1089 | "execution_count": 68, 1090 | "id": "3573df19-8d1d-42ab-8554-a3e8e93bb7e0", 1091 | "metadata": {}, 1092 | "outputs": [ 1093 | { 1094 | "name": "stdout", 1095 | "output_type": "stream", 1096 | "text": [ 1097 | "metrics for rf: accuracy = 0.959, precision = 0.965, recall = 0.991\n", 1098 | "metrics for gb: accuracy = 0.958, precision = 0.967, recall = 0.988\n" 1099 | ] 1100 | } 1101 | ], 1102 | "source": [ 1103 | "# evaluate the performance of the models\n", 1104 | "\n", 1105 | "for algo, model in fit_models.items():\n", 1106 | " yhat = model.predict(truncated_X_test)\n", 1107 | " accuracy = np.round(accuracy_score(y_test, yhat),3)\n", 1108 | " precision = np.round(precision_score(y_test, yhat),3)\n", 1109 | " recall = np.round(recall_score(y_test, yhat),3)\n", 1110 | " print(f'metrics for {algo}: accuracy = {accuracy}, precision = {precision}, recall = {recall}')\n", 1111 | " " 1112 | ] 1113 | }, 1114 | { 1115 | "cell_type": "markdown", 1116 | "id": "35bc30b9-d431-4ca9-852f-195e0815a6ed", 1117 | "metadata": {}, 1118 | "source": [ 1119 | "### Save best model" 1120 | ] 1121 | }, 1122 | { 1123 | "cell_type": "code", 1124 | "execution_count": 69, 1125 | "id": "391334c9-9673-4fa6-9f40-beaa57496003", 1126 | "metadata": {}, 1127 | "outputs": [], 1128 | "source": [ 1129 | "with open ('RandomForestModel.pkl', 'wb') as f:\n", 1130 | " pickle.dump(fit_models['rf'], f)" 1131 | ] 1132 | }, 1133 | { 1134 | "cell_type": "markdown", 1135 | "id": "0ad5aaf4-81e0-44b9-a4d4-d65630697959", 1136 | "metadata": {}, 1137 | "source": [ 1138 | "## 5. Make predictions" 1139 | ] 1140 | }, 1141 | { 1142 | "cell_type": "code", 1143 | "execution_count": 71, 1144 | "id": "f02efc21-6db9-4fd4-9e87-014207e5752f", 1145 | "metadata": {}, 1146 | "outputs": [], 1147 | "source": [ 1148 | "# transform text_features to columns of data and save the new dataframe.\n", 1149 | "holdout_df = pd.concat([holdout_df.drop(['text_features'], axis =1), holdout_df.text_features.apply(literal_eval).apply(pd.Series)], axis=1)" 1150 | ] 1151 | }, 1152 | { 1153 | "cell_type": "code", 1154 | "execution_count": 78, 1155 | "id": "bbf3440a-f45d-4141-ac96-ff9fa14b777f", 1156 | "metadata": {}, 1157 | "outputs": [], 1158 | "source": [ 1159 | "# fill missing countries with 'missing'\n", 1160 | "holdout_df['country'].fillna('missing', inplace=True)" 1161 | ] 1162 | }, 1163 | { 1164 | "cell_type": "code", 1165 | "execution_count": 111, 1166 | "id": "722d9fb9-1bf3-4afb-ab67-70abf84105bb", 1167 | "metadata": {}, 1168 | "outputs": [], 1169 | "source": [ 1170 | "# copy the data to new dataframe so that we can always go back\n", 1171 | "from copy import deepcopy\n", 1172 | "holdout_copy = deepcopy(holdout_df)" 1173 | ] 1174 | }, 1175 | { 1176 | "cell_type": "code", 1177 | "execution_count": 112, 1178 | "id": "d72ed8b7-bab9-476a-a235-1e1ccffd5b52", 1179 | "metadata": {}, 1180 | "outputs": [], 1181 | "source": [ 1182 | "# replace countries with small number of counts with 'other'\n", 1183 | "holdout_copy.loc[holdout_copy[\"country\"].isin(under_50.index.tolist()), 'country'] = \"other\"" 1184 | ] 1185 | }, 1186 | { 1187 | "cell_type": "code", 1188 | "execution_count": 113, 1189 | "id": "82f9f445-acf9-4340-9df4-878cf062dc39", 1190 | "metadata": {}, 1191 | "outputs": [], 1192 | "source": [ 1193 | "# replace countries with small number of counts with 'other', some countries that were not in the training data\n", 1194 | "country_count = holdout_copy.country.value_counts()\n", 1195 | "under_5 = 
country_count[country_count<5]\n", 1196 | "holdout_copy.loc[holdout_copy[\"country\"].isin(under_5.index.tolist()), 'country'] = \"other\"" 1197 | ] 1198 | }, 1199 | { 1200 | "cell_type": "code", 1201 | "execution_count": 119, 1202 | "id": "88fcff75-6b11-4adc-ab39-dc9826eb94ef", 1203 | "metadata": {}, 1204 | "outputs": [], 1205 | "source": [ 1206 | "holdout_copy.drop(['datapoint_id', 'invoice_arrival_date'], axis = 1, inplace=True)" 1207 | ] 1208 | }, 1209 | { 1210 | "cell_type": "code", 1211 | "execution_count": 120, 1212 | "id": "02c7311a-cb65-4bd0-85bd-97e89078daf1", 1213 | "metadata": {}, 1214 | "outputs": [], 1215 | "source": [ 1216 | "holdout_copy = pd.get_dummies(holdout_copy)" 1217 | ] 1218 | }, 1219 | { 1220 | "cell_type": "code", 1221 | "execution_count": 122, 1222 | "id": "e37e2147-56d4-4f4f-b2d8-d7117c0bec97", 1223 | "metadata": {}, 1224 | "outputs": [], 1225 | "source": [ 1226 | "data = sparse.hstack([holdout_tf_idf_mat, holdout_copy])" 1227 | ] 1228 | }, 1229 | { 1230 | "cell_type": "code", 1231 | "execution_count": 123, 1232 | "id": "4c02111a-54a6-47e1-b875-321de0de7ad8", 1233 | "metadata": {}, 1234 | "outputs": [], 1235 | "source": [ 1236 | "truncated_data = truncatedSVD.transform(data)" 1237 | ] 1238 | }, 1239 | { 1240 | "cell_type": "code", 1241 | "execution_count": 124, 1242 | "id": "7095758b-4845-4c5d-aa59-45c5a2288336", 1243 | "metadata": {}, 1244 | "outputs": [], 1245 | "source": [ 1246 | "prediction = fit_models['rf'].predict(truncated_data)" 1247 | ] 1248 | }, 1249 | { 1250 | "cell_type": "code", 1251 | "execution_count": 138, 1252 | "id": "3ccf29b9-6f59-40d2-b50e-bd5594b91b84", 1253 | "metadata": {}, 1254 | "outputs": [], 1255 | "source": [ 1256 | "prediction_submission = pd.DataFrame([holdout_df.datapoint_id, prediction]).T" 1257 | ] 1258 | }, 1259 | { 1260 | "cell_type": "code", 1261 | "execution_count": 143, 1262 | "id": "1fb55bab-450e-4902-9f27-6c12af0b0b68", 1263 | "metadata": {}, 1264 | "outputs": [], 1265 | "source": [ 1266 | "prediction_submission.to_parquet('submission.parquet')" 1267 | ] 1268 | }, 1269 | { 1270 | "cell_type": "markdown", 1271 | "id": "fa55d3ec-dc1e-4b10-b084-da22144d3eaf", 1272 | "metadata": { 1273 | "tags": [] 1274 | }, 1275 | "source": [ 1276 | "## If I had more time\n", 1277 | "- Better preprocessing for text features: I could look at the relationships between these variables, and maybe come up with some insights that can improve the modelling\n", 1278 | "- More classifiers: I tried only two algorithms - Random Forest and Gradient Boosting; many more classifiers can be tried out, like Naive Bayes, a Neural Network, etc.\n", 1279 | "- More hyperparameter tuning: I only tuned one hyperparameter in the models using cross validation over three values. Any one of the other hyperparameters can also be tuned and improved.\n", 1280 | "- Find optimal number of components for truncated SVD: I used 150 components without any strong justification; the correct thing to do is to try different numbers and look at the explained variance for each one. 
Then we can choose a value that doesn't lose too much information.\n", 1281 | "- More attention to imbalance: The data is very imbalanced and I didn't really take this into account in the modelling, with more time I would consider resampling, class weights in the model fit and choosing a better threshold than 0.5" 1282 | ] 1283 | } 1284 | ], 1285 | "metadata": { 1286 | "kernelspec": { 1287 | "display_name": "Python 3.8.8 64-bit", 1288 | "language": "python", 1289 | "name": "python3" 1290 | }, 1291 | "language_info": { 1292 | "codemirror_mode": { 1293 | "name": "ipython", 1294 | "version": 3 1295 | }, 1296 | "file_extension": ".py", 1297 | "mimetype": "text/x-python", 1298 | "name": "python", 1299 | "nbconvert_exporter": "python", 1300 | "pygments_lexer": "ipython3", 1301 | "version": "3.8.8 (tags/v3.8.8:024d805, Feb 19 2021, 13:18:16) [MSC v.1928 64 bit (AMD64)]" 1302 | }, 1303 | "vscode": { 1304 | "interpreter": { 1305 | "hash": "14b42af619014e35325e49c45b3eb2852b785bc18d13b4dce70b076fe1a37f18" 1306 | } 1307 | } 1308 | }, 1309 | "nbformat": 4, 1310 | "nbformat_minor": 5 1311 | } 1312 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Home-assignments 2 | I will store here a bunch of home assignments that I got (without company names) and their solutions. 3 | I will not share any specific questions or company names, only my notebook and a general explanation of what I did. 4 | 5 | # Data Science Project 1 - Invoice Relevance Prediction 6 | 7 | ## Project Overview 8 | 9 | This project aims to address a common challenge in the data flow at an organization. The task involves determining the relevance of documents, specifically invoices, for tax purposes. The objective is to build a machine learning model that predicts the relevance of an invoice based on extracted features. 10 | 11 | ## Dataset 12 | 13 | The dataset comprises two sets: a training dataset (`project_df`) with labeled data and a holdout dataset (`holdout_df`) for future predictions. Features are provided in two formats: a pandas DataFrame and a sparse matrix with tf-idf scores. 14 | 15 | ## Project Tasks 16 | 17 | 1. **Building a Model:** 18 | - Developed a machine learning model to predict the relevance of invoices. 19 | - Utilized the training dataset and tf-idf scores for feature representation. 20 | 21 | 2. **Predictions:** 22 | - Applied the trained model to make predictions on the holdout dataset. 23 | 24 | 3. **Performance Evaluation:** 25 | - Assessed the model's performance on the holdout data using appropriate metrics. 26 | 27 | 28 | # Business Data Science Project 2 - Budget Allocation 29 | 30 | ## Project Overview 31 | 32 | This project addresses the challenge of optimizing ad spend budget for a company's performance marketing team. The goal is to maximize the return on investment (ROI) by recommending the best allocation of spend among different advertising networks. The task involves analyzing historical data, extracting insights, and building a model to predict next month's revenue. 33 | 34 | ## Dataset 35 | 36 | Two CSV files, "spend.csv" and "revenue.csv," contain mock data representing digital marketing spend and attributed revenues, respectively. The data includes information on the network, date, campaign, spend, purchase date, user ID, subscription details, and revenue. 37 | 38 | ## Project Tasks 39 | 40 | 1. 
**Data Examination and Visualization:** 41 | - Examined and visualized the data to gain insights that assist in further analysis. 42 | 43 | 2. **Relationship between Spend and Revenues:** 44 | - Explored and identified relationships between ad spend and revenues. 45 | 46 | 3. **Model Building:** 47 | - Constructed a model to predict next month's revenue. 48 | 49 | 4. **Answering Stakeholder Questions:** 50 | - Provided insights on how to allocate spend among different networks to maximize revenues. 51 | - Shared expectations for future revenues based on the proposed model and insights. 52 | 53 | 5. **Summary for Stakeholders:** 54 | - Composed a short summary for non-technical stakeholders, including key findings and visualizations. 55 | 56 | ## Addressing Specific Topics 57 | 58 | - **Data Examination and Usage:** 59 | - Thoroughly examined and visualized data to make informed decisions. 60 | 61 | - **Decision Explanation:** 62 | - Clearly explained decisions from both a data science and business perspective. 63 | 64 | - **Coding Skills:** 65 | - Demonstrated efficient usage of Python packages for data analysis and modeling. 66 | 67 | - **Avoided Processes/Methods Explanation:** 68 | - Explained reasons for avoiding certain processes or methods due to time constraints. 69 | 70 | # Sales Forecasting Model - Assignment #3 71 | 72 | ## Overview 73 | 74 | This project focuses on the development of a sales forecasting model for a range of products within the same category. The dataset comprises sales data, competitor sales, holiday dates, and business events, providing a comprehensive foundation for analysis and prediction. 75 | 76 | ## Project Scope 77 | 78 | ### Descriptive Statistics 79 | 80 | A meticulous exploration of descriptive statistics for both sales and competitor data was conducted to extract essential insights. This initial phase set the stage for subsequent modeling decisions. 81 | 82 | ### Sales Forecasting Model 83 | 84 | The development of a predictive model involved the selection of an algorithm tailored to the intricacies of the dataset. Standard procedures such as data cleaning and validation were rigorously applied to ensure the model's reliability. 85 | 86 | ### Model Validation 87 | 88 | Validation was a critical step to verify the accuracy and consistency of the model's predictions. Rigorous testing against known data ensured the robustness of the forecasting model. 89 | 90 | ### Interpretation and Future Steps 91 | 92 | Upon successful model implementation, sales predictions for the next 6 months were generated. The interpretation section provides an in-depth analysis of significant variables, assesses the model's strengths and weaknesses, and suggests avenues for improvement. 93 | 94 | ## Project Files 95 | 96 | - **sales_forecasting.ipynb:** Access the Jupyter Notebook to review the code, analysis, and modeling details. 97 | - **sales_forecast_predictions.xlsx:** Refer to the Excel file for a detailed breakdown of sales predictions. 98 | 99 | --------------------------------------------------------------------------------